Univariate Analysis

This section shows some additional R code for running simple statistical tests on your data (chi-square, t-test, Kruskal-Wallis) and for calculating odds ratios, as you would in the univariate analysis of a case-control study.

For this we will use the following functions:

tab_linelist
tab_univariate

Create data frame to be used for the analysis

In this AJS example, the data frame (dataset) you are currently using contains “Confirmed”, “Suspected” and “Probable” cases.

We would like to compare the exposures between “Confirmed” and “Suspected” cases ONLY, so we need to create a data frame that only includes those cases.

# Creating a data frame needed for the univariate analysis
linelist_cc <- linelist_cleaned %>%
  filter(case_def == "Confirmed" | case_def == "Suspected")
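
As a quick check of this filtering step, the sketch below applies the same idea to a small made-up data frame (toy_linelist and its values are hypothetical, not the AJS data). It uses %in%, which here gives the same result as the two == comparisons joined by |, and then counts the remaining categories to confirm that “Probable” cases were dropped.

```r
library(dplyr)

## hypothetical toy data standing in for linelist_cleaned
toy_linelist <- tibble(
  case_def = c("Confirmed", "Probable", "Suspected", "Suspected", "Confirmed")
)

## keep confirmed and suspected cases only
toy_cc <- toy_linelist %>%
  filter(case_def %in% c("Confirmed", "Suspected"))

## check which case definitions remain and how many rows each has
count(toy_cc, case_def)
```

Running count() on your own linelist_cc after filtering is a cheap way to verify that only the two intended categories are left.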

Create list of variables to change

The tab_univariate() function requires your exposure and outcome variables to be logical (TRUE/FALSE).

Thus, you will need to convert the variables you are using to TRUE/FALSE (in this example: case_def, ptvomit, ptjaundice, and patient_facility_type).

To do this, you can gather all of the binary variables into a single vector. With this, you can use mutate_at() to apply str_detect() from the stringr package to all of the binary variables to return TRUE if the elements match either Confirmed, Oui, or Inpatient.

## Create vector that specifies the variables we want to convert
binary_vars <- c("case_def", "ptvomit", "ptjaundice", "patient_facility_type")

## Apply str_detect on each of the columns to return TRUE for each element that 
## matches either Confirmed, Oui, or Inpatient
linelist_cc <- linelist_cc %>%
  mutate_at(.vars = binary_vars, 
            .funs = str_detect,
            pattern = "Confirmed|Oui|Inpatient")
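
mutate_at() still works, but in dplyr 1.0.0 and later it has been superseded by across(). The sketch below shows the same conversion written with across() on made-up data; toy_cc and the values “Non” and “Outpatient” are assumptions for illustration only. It also shows that str_detect() returns NA, not FALSE, when the original value is missing.

```r
library(dplyr)
library(stringr)

## hypothetical toy data standing in for linelist_cc
toy_cc <- tibble(
  case_def              = c("Confirmed", "Suspected", "Confirmed"),
  ptvomit               = c("Oui", "Non", NA),
  ptjaundice            = c("Oui", "Oui", "Non"),
  patient_facility_type = c("Inpatient", "Outpatient", "Outpatient")
)

binary_vars <- c("case_def", "ptvomit", "ptjaundice", "patient_facility_type")

## convert each listed column to TRUE/FALSE based on the pattern
toy_cc <- toy_cc %>%
  mutate(across(all_of(binary_vars),
                ~ str_detect(.x, "Confirmed|Oui|Inpatient")))

toy_cc
```

Missing values stay missing after the conversion, so it is worth deciding beforehand whether they should be kept as NA or recoded.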

Use Chi-square tests to check the difference in characteristics or exposures between confirmed and suspected cases for categorical variables

To compare how confirmed and suspected cases are distributed across the categories of a categorical variable, we use the chi-square test.

linelist_cc %>%
  tab_linelist(age_group, sex, strata = case_def, na.rm = FALSE) %>%
  ## call variables something more accessible for the table output
  rename(suspected_n = "FALSE n",
         confirmed_n = "TRUE n") %>%
  ## group by variable for the chisq test
  group_by(variable) %>%
  ## run chi-sq test on the contingency table
  mutate(chisq = list(broom::tidy(chisq.test(cbind(suspected_n, confirmed_n))))) %>%
  ## make results of chisq test available
  tidyr::unnest(cols = c(chisq)) %>%
  ## ungroup to be able to change the names
  ungroup() %>%
  ## get rid of duplicate var names and pvals
  mutate(variable = replace(variable, duplicated(variable), NA),
         p.value = replace(p.value, duplicated(p.value), NA)) %>%
  ## select and rename columns
  ## (names are kept unique, as recent versions of select() reject duplicates)
  select(
    variable,
    "Value" = value,
    "Suspected (n)" = suspected_n,
    "Suspected (%)" = "FALSE proportion",
    "Confirmed (n)" = confirmed_n,
    "Confirmed (%)" = "TRUE proportion",
    "P-value" = p.value
  ) %>%
  knitr::kable(digits = 1)

variable	Value	Suspected (n)	Suspected (%)	Confirmed (n)	Confirmed (%)	P-value
age_group	0-2	40	5.8	4	4	0.1
-	3-14	284	40.9	30	30	-
-	15-29	246	35.4	46	46	-
-	30-44	95	13.7	15	15	-
-	45+	29	4.2	4	4	-
-	Missing	1	0.1	1	1	-
sex	F	346	49.8	59	59	0.1
-	M	349	50.2	41	41	-
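
To see what the pipeline computes for a single variable, the sketch below runs chisq.test() directly on the counts from the sex rows of the table above. The matrix sex_tab is built by hand for illustration. Note that for a 2x2 table chisq.test() applies Yates' continuity correction by default; the resulting p-value should round to 0.1, in line with the table.

```r
## counts taken from the sex rows of the table above
sex_tab <- matrix(c(346, 349,   # suspected: F, M
                    59, 41),    # confirmed: F, M
                  nrow = 2,
                  dimnames = list(sex = c("F", "M"),
                                  case_def = c("Suspected", "Confirmed")))

## chi-square test on the 2x2 contingency table
res <- chisq.test(sex_tab)

## p-value of the test
res$p.value

## expected counts; the chi-square approximation is questionable when
## these are small (a common rule of thumb is below 5)
res$expected
```

Checking the expected counts matters for sparse rows such as the “Missing” age group, where Fisher's exact test (fisher.test()) may be a better choice.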

Use a t-test to check the difference in characteristics or exposures for continuous variables between confirmed and suspected cases (for normally distributed data)

To compare the means of continuous variables between the confirmed and suspected cases, we will use the t-test.

## run t-test
t.test(age_years ~ case_def, var.equal = TRUE, data = linelist_cc) %>%
  ## convert to a data frame
  broom::tidy() %>%
  ## select and rename columns
  select("Suspected (mean)" = estimate1,
         "Confirmed (mean)" = estimate2,
         "p-value" = p.value) %>%
  ## create a column for the variable name
  tibble::add_column(Variable = "Age (years)", .before = 1) %>%
  knitr::kable(digits = 1)

Variable	Suspected (mean)	Confirmed (mean)	p-value
Age (years)	18	19.6	0.2
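
When the grouping variable is logical, t.test() orders the groups as FALSE then TRUE, which is why estimate1 is labelled as the suspected mean and estimate2 as the confirmed mean above. The sketch below checks this on simulated data (the data frame toy and its values are hypothetical) by comparing the test's estimates with group means computed directly.

```r
set.seed(1)

## hypothetical data: 50 suspected (FALSE) and 30 confirmed (TRUE) cases
toy <- data.frame(
  case_def  = rep(c(FALSE, TRUE), times = c(50, 30)),
  age_years = c(rnorm(50, mean = 18, sd = 5), rnorm(30, mean = 20, sd = 5))
)

## same call as above, on the toy data
tt <- t.test(age_years ~ case_def, var.equal = TRUE, data = toy)

## the names show which group each estimate belongs to
names(tt$estimate)

## group means computed directly, for comparison
group_means <- tapply(toy$age_years, toy$case_def, mean)
group_means
```

Leaving out var.equal = TRUE gives Welch's t-test, which does not assume equal variances in the two groups.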

Use the Kruskal-Wallis test to check the difference in characteristics or exposures for continuous variables between confirmed and suspected cases (for non-normally distributed data)

Most of the time, your continuous variables will not be normally distributed. In that case, instead of comparing your cases and controls (confirmed and suspected cases in this example) with the t-test, use the Kruskal-Wallis test.

## first create a table with medians and standard deviation
## then a table with the kruskal value
## then bind together

medians_tab <- linelist_cc %>%
  group_by(case_def) %>%
  summarise(Median = median(age_years, na.rm = TRUE),
            SD = sd(age_years, na.rm = TRUE)) %>%
  tidyr::pivot_wider(names_from = case_def, values_from = c("Median", "SD"))

## perform the Kruskal-Wallis test and save the results
kw <- kruskal.test(age_years ~ case_def, data = linelist_cc)

medians_tab %>%
  ## add the variable and p-value columns
  tibble::add_column(Variable = "Age (years)", p.value = kw$p.value) %>%
  ## select and rename the columns in the right order
  ## (names are kept unique, as recent versions of select() reject duplicates)
  select(Variable,
         "Control (median)" = Median_FALSE,
         "Control (SD)" = SD_FALSE,
         "Case (median)" = Median_TRUE,
         "Case (SD)" = SD_TRUE,
         "p-value" = p.value) %>%
  knitr::kable(digits = 1)

Variable	Control (median)	Control (SD)	Case (median)	Case (SD)	p-value
Age (years)	15	13.2	20	13	0.1
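
With only two groups, the Kruskal-Wallis test is equivalent to the Wilcoxon rank-sum (Mann-Whitney) test based on the normal approximation without continuity correction. The sketch below illustrates this on simulated, skewed data (the data frame toy and its values are hypothetical); the two p-values should agree.

```r
set.seed(42)

## hypothetical skewed ages: 40 controls (FALSE) and 25 cases (TRUE)
toy <- data.frame(
  case_def  = rep(c(FALSE, TRUE), times = c(40, 25)),
  age_years = c(rexp(40, rate = 1 / 15), rexp(25, rate = 1 / 20))
)

## Kruskal-Wallis test, as used above
kw_toy <- kruskal.test(age_years ~ case_def, data = toy)

## Wilcoxon rank-sum test with normal approximation, no continuity correction
wx_toy <- wilcox.test(age_years ~ case_def, data = toy,
                      exact = FALSE, correct = FALSE)

## compare the two p-values
c(kruskal = kw_toy$p.value, wilcoxon = wx_toy$p.value)
```

The Kruskal-Wallis test has the advantage of also working when the grouping variable has more than two levels.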

Calculating odds ratios for a univariate case control analysis

Now you have a good sense of the variables that you might want to include in your univariate case control analysis.

Please note that the coding now switches to the use of “cases” and “controls” to highlight the type of analysis you are conducting. You need to be clear for yourself how you have defined your cases (in this example, “confirmed cases”) and your controls (in this example, “suspected cases”).

## Odds ratios
## other values are already set at the correct defaults for CC
linelist_cc %>%
  tab_univariate(case_def,                                    # select outcome variable
                 ptvomit, ptjaundice, patient_facility_type,  # select exposure variables
                 measure = "OR",                              # calculate odds ratios
                 mergeCI = TRUE,                              # paste lower and upper together
                 digits = 1)   %>%                            # limit decimal places to 1
  select("Exposure" = variable,                               # select and rename columns
         "Exposed cases" = exp_cases,
         "Unexposed cases" = unexp_cases,
         "Exposed controls" = exp_controls,
         "Unexposed controls" = unexp_controls,
         "OR (95%CI)" = est_ci) %>%
  knitr::kable(digits = 1)

Exposure	Exposed cases	Unexposed cases	Exposed controls	Unexposed controls	OR (95%CI)
ptvomit	17	21	203	279	1.1 (0.6–2.2)
ptjaundice	35	4	452	31	0.6 (0.2–1.8)
patient_facility_type	44	52	42	614	12.4 (7.4–20.6)
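
To make the calculation explicit, the sketch below recomputes the odds ratio for ptvomit by hand from the counts in the table above. The confidence interval uses the Wald method on the log scale; this is an assumption made for illustration, and tab_univariate() may compute its interval differently, although after rounding the result here agrees with the 1.1 (0.6–2.2) shown in the table.

```r
## counts from the ptvomit row of the table above
exp_cases      <- 17
unexp_cases    <- 21
exp_controls   <- 203
unexp_controls <- 279

## odds ratio: odds of exposure among cases / odds of exposure among controls
or <- (exp_cases * unexp_controls) / (unexp_cases * exp_controls)

## standard error of log(OR) (Wald method, assumed for illustration)
se_log_or <- sqrt(1 / exp_cases + 1 / unexp_cases +
                  1 / exp_controls + 1 / unexp_controls)

## 95% confidence interval, back-transformed from the log scale
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(c(OR = or, lower = ci[1], upper = ci[2]), 1)
```

An interval that includes 1, as here, means the data are compatible with no association between the exposure and being a confirmed case.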

Calculating stratified odds ratios for a univariate case control analysis

To check for possible confounding before you do a multivariable analysis, you may want to see whether the odds ratios change when you repeat the same analysis stratified by a third variable.

In this case, we give an example of calculating odds ratios for confirmed and suspected cases stratified by ‘patient_facility_type’. This allows us to see whether the ORs differ between those who were admitted to hospital and those who were not.

Note: in the current example dataset, this analysis doesn't provide much more insight. But you should check for confounding as standard practice in your outbreak data analysis.

## stratified odds ratios
linelist_cc %>%
  tab_univariate(case_def,                          # select outcome variable
                 ptvomit, ptjaundice,               # select exposure variables
                 strata = patient_facility_type,    # select stratifying variable
                 measure = "OR",                    # calculate odds ratios
                 mergeCI = TRUE,                    # paste lower and upper together
                 digits = 1,                        # limit decimal places to 1
                 woolf_test = TRUE) %>%             # calculate p val between strata
  # get rid of duplicate var names
  mutate(variable = replace(variable, duplicated(variable), NA)) %>%
  select("Exposure" = variable,                     # select and rename columns
         "Measure"  = est_type,
         "Exposed cases" = exp_cases,
         "Unexposed cases" = unexp_cases,
         "Exposed controls" = exp_controls,
         "Unexposed controls" = unexp_controls,
         "OR (95%CI)" = est_ci,
         "p-value" = p.value) %>%
  knitr::kable(digits = 1)

Exposure	Measure	Exposed cases	Unexposed cases	Exposed controls	Unexposed controls	OR (95%CI)	p-value
ptvomit	crude	17	20	195	262	1.1 (0.6–2.2)	0.7
-	patient_facility_type: TRUE	6	5	2	4	2.4 (0.3–19.0)	0.4
-	patient_facility_type: FALSE	11	15	193	258	1.0 (0.4–2.2)	1.0
-	MH	-	-	-	-	1.1 (0.5–2.3)	-
-	woolf	-	-	-	-	NA (NA–NA)	0.5
ptjaundice	crude	33	4	426	31	0.6 (0.2–1.8)	0.4
-	patient_facility_type: TRUE	11	0	7	0	NaN (NaN–NaN)	-
-	patient_facility_type: FALSE	22	4	419	31	0.4 (0.1–1.3)	0.1
-	MH	-	-	-	-	0.4 (0.1–1.3)	-
-	woolf	-	-	-	-	NA (NA–NA)	0.5
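
The stratum-specific and Mantel-Haenszel (MH) odds ratios for ptvomit can be recomputed by hand from the counts in the table above. The sketch below does this using the standard Mantel-Haenszel formula; the data frame strata is typed in manually for illustration, and the internals of tab_univariate() may differ.

```r
## counts from the two ptvomit strata in the table above
strata <- data.frame(
  stratum        = c("patient_facility_type: TRUE", "patient_facility_type: FALSE"),
  exp_cases      = c(6, 11),
  unexp_cases    = c(5, 15),
  exp_controls   = c(2, 193),
  unexp_controls = c(4, 258)
)

## total number of people in each stratum
n <- with(strata, exp_cases + unexp_cases + exp_controls + unexp_controls)

## odds ratio within each stratum
or_stratum <- with(strata,
                   (exp_cases * unexp_controls) / (unexp_cases * exp_controls))

## Mantel-Haenszel pooled odds ratio across strata
or_mh <- with(strata,
              sum(exp_cases * unexp_controls / n) /
                sum(unexp_cases * exp_controls / n))

round(c(setNames(or_stratum, strata$stratum), MH = or_mh), 1)
```

A pooled MH estimate close to the crude OR, as for ptvomit, suggests little confounding by the stratifying variable. The same formula also explains the NaN for ptjaundice in the inpatient stratum: both unexposed cells are 0, so the odds ratio is 0/0.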