This section gives some extra coding options in R in order to be able to do some simple descriptive statistics on your data (chi-square, t-test, Kruskal-Wallis) and calculate Odds Ratios as if you were to conduct a case control analysis univariate analysis.
For this we will use the following functions:
In the example of this AJS data, the current data frame (dataset) you are using contains “Confirmed”, “Suspected” and “Probable cases”.
We would like to compare the exposures between “Confirmed” and “Suspected” cases ONLY, so we need to create a data frame that only includes those cases.
# Creating a data frame needed for the univariate analysis
linelist_cc <- linelist_cleaned %>%
filter(case_def == "Confirmed" | case_def == "Suspected")
The tab_univariate()
function requires your exposure and outcomes variables to be TRUE/FALSE statements.
Thus, you will need to modify the variables you are using to be TRUE/FALSE (in this example: case definition, vomit, jaundice, and patient_facility_type).
To do this, you can gather all of the binary variables into a single vector. With this, you can use mutate_at()
to apply str_detect()
from the stringr package to all of the binary variables to return TRUE if the elements match either Confirmed, Oui, or Inpatient.
## Create vector that specifies the variables we want to convert
binary_vars <- c("case_def", "ptvomit", "ptjaundice", "patient_facility_type")
## Apply str_detect on each of the columns to return TRUE for each element that
## matches either Confirmed, Oui, or Inpatient
linelist_cc <- linelist_cc %>%
mutate_at(.vars = binary_vars,
.funs = str_detect,
pattern = "Confirmed|Oui|Inpatient")
To compare the proportion of confirmed and suspected cases exposed to certain categorical variables, we will want to compare them using the chi-square test.
linelist_cc %>%
tab_linelist(age_group, sex, strata = case_def, na.rm = FALSE) %>%
## call variables something more accessible for the table output
rename(suspected_n = "FALSE n",
confirmed_n = "TRUE n") %>%
## group by variable for the chisq test
group_by(variable) %>%
## run chi-sq test on the contingency table
mutate(binom = list(broom::tidy(chisq.test(cbind(suspected_n, confirmed_n))))) %>%
## make results of chisq test available
tidyr::unnest(cols = c(binom)) %>%
## ungroup to be able to change the names
ungroup() %>%
## get rid of duplicate var names and pvals
mutate(variable = replace(variable, duplicated(variable), NA),
p.value = replace(p.value, duplicated(p.value), NA)) %>%
## select and rename columns
select(
variable,
"Value" = value,
"Suspected (n)" = suspected_n,
"%" = "FALSE proportion",
"Confirmed (n)" = confirmed_n,
"%" = "TRUE proportion",
"P-value" = p.value
) %>%
knitr::kable(digits = 1)
variable | Value | Suspected (n) | % | Confirmed (n) | % | P-value |
---|---|---|---|---|---|---|
age_group | 0-2 | 40 | 5.8 | 4 | 4 | 0.1 |
- | 3-14 | 284 | 40.9 | 30 | 30 | - |
- | 15-29 | 246 | 35.4 | 46 | 46 | - |
- | 30-44 | 95 | 13.7 | 15 | 15 | - |
- | 45+ | 29 | 4.2 | 4 | 4 | - |
- | Missing | 1 | 0.1 | 1 | 1 | - |
sex | F | 346 | 49.8 | 59 | 59 | 0.1 |
- | M | 349 | 50.2 | 41 | 41 | - |
To compare the means of continuous variables between the confirmed and suspected cases, we will use the t-test.
## run ttest
t.test(age_years ~ case_def, var.equal = TRUE, data = linelist_cc) %>%
## convert to a data frame
broom::tidy() %>%
## select and rename columns
select("Suspected (mean)" = estimate1,
"Confirmed (mean)" = estimate2,
"p-value" = p.value) %>%
## create a column for the variable name
tibble::add_column(Variable = "Age (years)", .before = 1) %>%
knitr::kable(digits = 1)
Variable | Suspected (mean) | Confirmed (mean) | p-value |
---|---|---|---|
Age (years) | 18 | 19.6 | 0.2 |
As most of the time, your contiuous variables will not be normally distributed, instead of calculating the difference between your cases and controls (confirmed and suspected cases in this example) with the t-test, we use the Kruskal-Wallis test.
## first create a table with medians and standard deviation
## then a table with the kruskal value
## then bind together
medians_tab <- linelist_cc %>%
group_by(case_def) %>%
summarise(Median = median(age_years, na.rm = TRUE),
SD = sd(age_years, na.rm = TRUE)) %>%
tidyr::pivot_wider(names_from = case_def, values_from = c("Median", "SD"))
## perform the Kruskal-Wallace test and save the results
kw <- kruskal.test(age_years ~ case_def, data = linelist_cc)
medians_tab %>%
## add the variable and p-value columns
tibble::add_column(Variable = "Age (years)", p.value = kw$p.value) %>%
## select and rename the columns in the right order
select(Variable,
"Control (median)" = Median_FALSE,
"SD" = SD_FALSE,
"Case (median)" = Median_TRUE,
"SD" = SD_TRUE,
"p-value" = p.value) %>%
knitr::kable(digits = 1)
Variable | Control (median) | SD | Case (median) | SD | p-value |
---|---|---|---|---|---|
Age (years) | 15 | 13.2 | 20 | 13 | 0.1 |
Now you have a good sense of the variables that you might want to include in your univariate case control analysis.
Please note that the coding now switches to the use of “cases” and “controls” to highlight the type of analysis you are conducting. You need to be clear for yourself how you have defined your cases (in this example, “confirmed cases”) and your controls (in this example, “suspected cases”).
## Odds ratios
## other values are already set at the correct defaults for CC
linelist_cc %>%
tab_univariate(case_def, # select outcome variable
ptvomit, ptjaundice, patient_facility_type, # select exposure variables
measure = "OR", # calculate odds ratios
mergeCI = TRUE, # paste lower and upper together
digits = 1) %>% # limit decimal places to 1
select("Exposure" = variable, # select and rename columns
"Exposed cases" = exp_cases,
"Unexposed cases" = unexp_cases,
"Exposed controls" = exp_controls,
"Unexposed controls" = unexp_controls,
"OR (95%CI)" = est_ci) %>%
knitr::kable(digits = 1)
Exposure | Exposed cases | Unexposed cases | Exposed controls | Unexposed controls | OR (95%CI) |
---|---|---|---|---|---|
ptvomit | 17 | 21 | 203 | 279 | 1.1 (0.6–2.2) |
ptjaundice | 35 | 4 | 452 | 31 | 0.6 (0.2–1.8) |
patient_facility_type | 44 | 52 | 42 | 614 | 12.4 (7.4–20.6) |
As you will want to double check on particular confounding (before you do a multivariable analysis), you might want to check whether odds ratios change if you conduct the same analysis in a stratified manner.
In this case, we give an example of calculating Odds Ratios for confirmed cases and suspected cases in the stratum ‘patient_facility_type’. This allows us to compare whether the ORs change when comparing groups that are admitted to hospital and those that are not admitted t hospital.
Note: in the current example dataset, this analysis doesnt provide much more insight. But you should as standard practice check for confounding in your outbreak data analysis.
## stratified odds ratios
linelist_cc %>%
tab_univariate(case_def, # select outcome variable
ptvomit, ptjaundice, # select exposure variables
strata = patient_facility_type, # select stratifying variable
measure = "OR", # calculate odds ratios
mergeCI = TRUE, # paste lower and upper together
digits = 1, # limit decimal places to 1
woolf_test = TRUE) %>% # calculate p val between strata
# get rid of duplicate var names
mutate(variable = replace(variable, duplicated(variable), NA)) %>%
select("Exposure" = variable, # select and rename columns
"Measure" = est_type,
"Exposed cases" = exp_cases,
"Unexposed cases" = unexp_cases,
"Exposed controls" = exp_controls,
"Unexposed controls" = unexp_controls,
"OR (95%CI)" = est_ci,
"p-value" = p.value) %>%
knitr::kable(digits = 1)
Exposure | Measure | Exposed cases | Unexposed cases | Exposed controls | Unexposed controls | OR (95%CI) | p-value |
---|---|---|---|---|---|---|---|
ptvomit | crude | 17 | 20 | 195 | 262 | 1.1 (0.6–2.2) | 0.7 |
- | patient_facility_type: TRUE | 6 | 5 | 2 | 4 | 2.4 (0.3–19.0) | 0.4 |
- | patient_facility_type: FALSE | 11 | 15 | 193 | 258 | 1.0 (0.4–2.2) | 1.0 |
- | MH | - | - | - | - | 1.1 (0.5–2.3) | - |
- | woolf | - | - | - | - | NA (NA–NA) | 0.5 |
ptjaundice | crude | 33 | 4 | 426 | 31 | 0.6 (0.2–1.8) | 0.4 |
- | patient_facility_type: TRUE | 11 | 0 | 7 | 0 | NaN (NaN–NaN) | - |
- | patient_facility_type: FALSE | 22 | 4 | 419 | 31 | 0.4 (0.1–1.3) | 0.1 |
- | MH | - | - | - | - | 0.4 (0.1–1.3) | - |
- | woolf | - | - | - | - | NA (NA–NA) | 0.5 |