Becuase the Am Timan dataset is not yet aligned with the MSF data dictionary and what the template expects, there are several data cleaning steps we must complete.
NA
at importCurrently, missing values in the dataset are represented by a period (.). This causes problems because R expects missing values to be coded as NA
. For example, if we attempt to set the variable date_of_onset
to class Date, R gives an error because the periods are considered character values that do not fit the expected YYYY-MM-DD date format.
To fix this problem we scroll up to the read_nonDHIS_data
chunk of the template and insert an additional argument in the rio::import()
command. The argument na = "."
specifies a period as the value in the Excel sheet that R will consider to be “missing”. As the data are re-imported, all cells with that value are now changed to NA
.
Remember that after making this change to the import command you must re-run the entire script. This way, the dataset is correctly imported AND the later cleaning steps, changes to variable names, etc. are also performed on this newly-imported dataset
# The old import command
linelist_raw <- rio::import("AJS_AmTiman.xlsx", which = "linelist")
# The revised import command (don't forget the comma between arguments!)
linelist_raw <- rio::import("AJS_AmTiman.xlsx", which = "linelist", na = ".")
The next code chunk in the script, remove_unused_data
, drops observations with missing case_number
or date_of_consultation_admission
. Because the Am Timan dataset contains observations for patients seen in the community and at the hospital, these criteria may not be relevant.
We can check, using the code below, and see that of the 1447 observations, there are 0 missing case_number
or dateentry
, but 616 observations missing date_of_consultation_admission
. These 616 community-identified cases are still of interest for our report, so we will not drop them.
# Find the number of observations missing case_number, dateentry, and date_of_consultation_admission
nrow(linelist_cleaned %>% filter(is.na(case_number)))
## [1] 0
nrow(linelist_cleaned %>% filter(is.na(dateentry)))
## [1] 0
nrow(linelist_cleaned %>% filter(is.na(date_of_consultation_admission)))
## [1] 616
Notice that the command above returns the number of rows missing date_of_consultation_admission
, but does not overwrite linelist_cleaned
with anything, as there is not an assignment operator <-
.
After viewing the above results, we must comment out (#) the two lines that filter()
the linelist_cleaned
dataset - we do not want to remove those 616 observations.
# linelist_cleaned <- linelist_cleaned %>%
# filter(!is.na(case_number) & !is.na(date_of_consultation_admission))
In linelist_cleaned
there are seven variables containing dates (dateentry
, epiquestdate
, date_of_consultation_admission
, date_of_onset
, date_lab_sample_taken
, dthospitalisation
, dtdeath
).
Because the Am Timan dataset contains date variables not found in the data dictionary, we will comment out the first code lines related to DHIS-aligned data (those commands convert only date variables known to the MSF data dictionary).
Instead, we uncomment and use the commands applicable to non-DHIS2 data. You may see messages warning that a few dates are not in the correct timeframe - this is ok and these dates will be addressed in the next step.
You can read more about the guess_dates()
function from the package linelist in the R Basics - Working with Dates page.
## Non-DHIS2 data --------------------------------------------------------------
## Use this section if you did not have DHIS2 data.
# use the guess_dates() function to make a first pass at date variables.
linelist_cleaned <- linelist_cleaned %>%
mutate_at(vars(matches("date|Date")), linelist::guess_dates,
error_tolerance = 0.5)
The code above successfully converted many of our date variables (remember you can check a variable’s class like class(linelist$date_of_onset
). However, this command did not detect the date variables named with “dt”, such as dtdeath
and dthospitalization
. We must add code to convert those variables manually:
# Check class of variable
class(linelist_cleaned$dtdeath)
## [1] "character"
# Individually convert variable using the guess_dates() function from the linelist package
linelist_cleaned$dtdeath <- linelist::guess_dates(linelist_cleaned$dtdeath, error_tolerance = 0.5)
#Re-check class of variable
class(linelist_cleaned$dtdeath)
## [1] "Date"
# Repeat for variable "dtdeath"
linelist_cleaned$dthospitalisation <- linelist::guess_dates(linelist_cleaned$dthospitalisation, error_tolerance = 0.5)
Next we uncomment and modify code to correct unrealistic dates. We have browsed our data and know that there are observations with date_of_onset
outside the reasonable range:
# Check range of date_of_onset values, ignoring (removing) missing values
range(linelist_cleaned$date_of_onset, na.rm = TRUE)
## [1] "2014-11-01" "2017-12-27"
We convert dates outside the expected range (April 2016 to October 2017) to missing using case_when()
. Note that when making the assignment on the right-hand side (RHS), wrap NA
in as.Date()
.
As you modify this chunk for your own situation, remember that each left-hand side (LHS) of the ~
must be a logical statement (not just a value), and to include commas at the end of each case_when()
line (except the last one). In addition, it is best to write case_when()
lines from most specific at top to most general/catch-all at the bottom. You can read more about case_when()
on this tidyverse site.
All the right-hand side (RHS) values must be the same class (either character, numeric, etc.). So, if your other RHS values are character and you want one RHS value to be missing, you cannot just write NA
on the RHS. Instead you must use the special character version of NA
: NAcharacter
.
# Convert dates before April 2016 or after October 2017 to missing (NA)
linelist_cleaned <- mutate(linelist_cleaned,
date_of_onset = case_when(
date_of_onset < as.Date("2016-04-01") ~ as.Date(NA),
date_of_onset > as.Date("2017-10-31") ~ as.Date(NA),
TRUE ~ date_of_onset
))
We also must use the provided code to create a variable called epiweek
. Although there are already variables that give the epidemiological weeks of various data points, it is safer to calculate a new variable ourselves, AND later code chunks rely on the variable being named epiweek
. See the documentation of the package aweek for more information.
# Create variable epiweek
linelist_cleaned$epiweek <- aweek::date2week(linelist_cleaned$date_of_onset,
floor_day = TRUE,
factor = TRUE)
A note on the epiweek variable and factors:
Notice that we converted dates to NA
that were clear errors (outside the expected date range) before creating the epiweek
variable. This is because the epiweek
variable is created as a factor, and once a factor is created it is more difficult to remove factor levels. You can read more about factors and dropping unused factor levels in the R Basics - Advanced and Miscellaneous page.
To create the variable age_group
we must first clean the variable age_years
. If we look at the range of values in age_years
, we see something strange:
# See the range of age_years values, removing (excluding) NA
range(linelist_cleaned$age_years, na.rm = TRUE)
## [1] "0.08" "9"
We know there are ages older than 9 years. So we check class(linelist_cleaned$age_years)
and see that R is reading this variable as class character, not numeric!
We must convert it by adding the following command to the script:
# Convert variable age_years to numeric class
linelist_cleaned$age_years <- as.numeric(linelist_cleaned$age_years)
If we run the range()
command again, we can see that the corrected range is 0.08 to 75.
Then, the chunk contains three age-group commands marked “OPTIONAL”:
We do not want to use “add under 2 years to the age_years variable” because it assumes that we already have a variable age_months
, which we do not. Comment out (#) this code.
We also do not need to use “change those who are above or below a certain age to NA”, because we already know our range of ages and do not have any outside an expected range. Comment out (#) this code.
We do want to use “create an age_months variable from decimal years variable”, as we do have decimal years. This command will create an age_months variable that has a value if a patient is under 5 years. Uncomment and use this code.
# For patients under 5 years, the new variable age_months is based on age_years * 12
linelist_cleaned <- mutate(linelist_cleaned,
age_months = if_else(age_years < 5,
true = age_years * 12,
false = as.numeric(NA)
)
)
Now we can create the variables age_group
age_group_mon
.
First we use the code to create an age_group_mon
variable for children under 5 years based on age_months
.
## create age group variable for under 5 years based on months
linelist_cleaned$age_group_mon <- age_categories(linelist_cleaned$age_months,
breakers = c(0, 6, 9, 12, 24),
ceiling = TRUE)
…and we use the similar code below to create age_group
as groupings of the variable age_years
. The function age_categories()
is used with its breakers =
argument set to a vector c( )
of numbers: 0, 3, 15, 30, and 45. Thus, a 30-year old patient will be included in an age group named “30-44”. Read more about age_categories
by searching the Help pane.
## create an age group variable by specifying categorical breaks
linelist_cleaned$age_group <- age_categories(linelist_cleaned$age_years,
breakers = c(0, 3, 15, 30, 45))
Finally, ensure that the remaining code in create_age_group
is commented out (#).
In the next chunk (create_vars
) we can comment out/ignore the code that makes a new numeric variable obs_days
, because we do not have a date_of_exit
in our dataset.
While the template directs us to next create a DIED
variable based on exit_status
containing the characters “Dead”, we must first clean our exit_status
variable, which is currrently in French.
Add code that uses case_when()
to assign new values in a new exit_status2 variable, as in the code below. We do this by modifing code from just below in the template. The code uses the function case_when()
to re-define linelist_cleaned
as itself, but mutated to create the new variable exit_status2
. The values in exit_status2
are based on the values in exit_status
, such that when exist_status == "Décédé"
, the value in exit_status2
is “Dead”, and so on.
As you modify this chunk for your own situation, remember that each left-hand side (LHS) of the ~
must be a logical statement (not just a value), and to include commas at the end of each case_when()
line (except the last one). It is best to write case_when()
lines from most specific at top to most general/catch-all at bottom. You can read more about case_when()
at this tidyverse site.
All the right-hand side (RHS) values must be the same class (either character, numeric, etc.). So, if your other RHS values are character and you want one RHS value to be missing, you cannot just write NA
on the RHS. Instead you must use the special character version of NA
: NAcharacter
.
# Tabulate to see all possible values of exit_status
table(linelist_cleaned$exit_status, useNA = "always")
##
## - Décédé Déchargé/Guéri Echappé <NA>
## 2 14 55 2 1374
# Create exit_status2 from values in exit_status
linelist_cleaned <- linelist_cleaned %>%
mutate(exit_status2 = case_when(
exit_status == "Décédé" ~ "Dead",
exit_status == "-" ~ NA_character_,
exit_status == "Echappé" ~ "Left",
exit_status == "Déchargé/Guéri" ~ "Discharged"
))
# Tabulate the NEW exit_status2 variable to check correct assignment
table(linelist_cleaned$exit_status2, useNA = "always")
##
## Dead Discharged Left <NA>
## 14 55 2 1376
Now we can make the DIED
variable, which is referenced in later code chunks.
This command creates DIED
as a logical (TRUE
or FALSE
) variable, depending on whether each observation meets the criteria to the right of the assignment operator <-
. For a similar example, read about the %in%
operator on the R Basics - Advanced and Miscellaneous page.
However, we must modify the existing command to detect within the NEW variable exit_status2
, not exit_status
.
## Note we are directing R to look within the NEW exit_status2 variable
linelist_cleaned$DIED <- str_detect(linelist_cleaned$exit_status2, "Dead")
patient_facility_type
When we assigned our variables to match the data dictionary, we used the variable hospitalisation
as the variable patient_facility_type
. However, the values in that variable do not match those expected by the template. In the data dictionary, patient_facility_type
should have values of “Inpatient” or “Outpatient.” Currently, the values are “Oui” and “Non”. In later code chunks, analyses are restricted to observations where patient_facility_type == "Inpatient"
, thus, we must align the values to match the data dictionary.
To clean these values we uncomment and modify code also from the create_vars
chunk, found under the heading “recode character variables” (the template example is of fixing incorrect spellings of village names).
Included below are extra steps before and after the case_when()
command using table()
to verify the correct conversion of values. Remove these two commands after verifying, as otherwise their outputs will appear in the report.
# View all the values in patient_facility_type
table(linelist_cleaned$patient_facility_type, useNA = "always")
##
## Non Oui <NA>
## 699 86 662
# Convert the values
linelist_cleaned <- linelist_cleaned %>%
mutate(patient_facility_type = case_when(
patient_facility_type == "Oui" ~ "Inpatient",
patient_facility_type == "Non" ~ "Outpatient"
))
# Re-check that the values converted sucessfully
table(linelist_cleaned$patient_facility_type, useNA = "always")
##
## Inpatient Outpatient <NA>
## 86 699 662
For later steps, we will need many of our testing variables to the values “Positive” and “Negative” instead of “Pos” and “Neg”. We can uncomment and use this helper code to make that change. Some notes about this step:
View(linelist_cleaned)
.test_hepatitis_e_igm
) has values 0 and 1 that we want to change to “Positive” and “Negative”. The fct_recode()
function expects character RHS inputs (within quotes) - so put “0” and “1” in quotes, as below, and confirm the change.test_hepatitis_b
) has a value recorded as “Weakly pos”. For the purposes of this exercise we re-code this to “Positive”.## sometimes, coding is inconsistent across variables -- for example, "Yes" / "No"
## may be coded as Y, y, yes, 1 / N, n, no, 0. You can change them all at once!
## Create a list of the variables you want to change, and run the following.
## You may need to edit this code if options are different in your data.
# # create list of variables
change_test_vars <- c("hep_e_rdt", "hep_c_rdt", "hep_b_rdt", "test_hepatitis_a", "test_hepatitis_b", "test_hepatitis_c", "malaria_rdt_at_admission", "test_hepatitis_e_genotype", "test_hepatitis_e_igm", "test_hepatitis_e_virus", "hevpcr", "other_arthropod_transmitted_virus")
# standardize options
linelist_cleaned <- linelist_cleaned %>%
mutate_at(vars(change_test_vars), forcats::fct_recode,
Positive = "Pos",
Positive = "pos",
Positive = "yes",
Positive = "Yes",
Positive = "Weakly pos",
Positive = "1",
Negative = "Neg",
Negative = "neg",
Negative = "no",
Negative = "No",
Negative = "0"
)
Lastly for the chunk create_vars
we must create a case definition variable. In this use of case_when()
, the last line left-hand side (LHS) is TRUE, which serves as a catch-all for any other possible values that have not yet met the criteria of the earlier case_when()
lines.
Note how this code checks the variable hep_e_rdt
for “Positive”. The earlier cleaning steps converting hep_e_rdt
values from “Pos” to “Positive” were necessary for this case definition to properly apply.
In addition we need to change the value looked for in other_cases_in_hh
from “Yes” to “Oui”.
# You MUST modify this section to match your case definition. The below
# uses positive RDT for Confirmed and epi link only for Probable.
linelist_cleaned <- linelist_cleaned %>%
mutate(case_def = case_when(
is.na(hep_e_rdt) & is.na(other_cases_in_hh) ~ NA_character_,
hep_e_rdt == "Positive" ~ "Confirmed",
hep_e_rdt != "Positive" & other_cases_in_hh == "Oui" ~ "Probable",
TRUE ~ "Suspected"
))
Factor variables require specialized functions to manipulate (see Factor sections of the R Basics - Advanced and Miscellaneous page). The one factor variable we should clean now is sex
, with the following code:
linelist_cleaned$sex <- fct_recode(linelist_cleaned$sex,
NULL = "Unknown/unspecified")
We can comment out (#) the rest of the factor_vars
chunk, such as the code to change the order of levels in categorical variables. The Am Timan dataset does not include a variable time_to_death
and we do not need to change the order of any categorical variables.
We must add one additional cleaning step necessary for this dataset.
The variable patient_origin
(originally the variable quartier in the raw dataset) is referenced in the place analyses chunks, for example when calculating attack rates by region. In those steps, the variable patient_origin
in linelist_cleaned
is joined to the variable patient_origin
of the data frame population_data_region
(which was imported in the population and lab data steps).
However, the place names in population_data_region
are ALL CAPITAL LETTERS. This is not true in linelist_cleaned
- so the join will not register any matches. The easiest way to solve this problem is to redefine the linelist_cleaned
variable into all capital letters as well, using the base R function toupper()
, which means “to upper case”:
linelist_cleaned$patient_origin <- toupper(linelist_cleaned$patient_origin)
These variables names are stored in vectors that are created using the function c()
. These vectors of names will be referenced in later code chunks. This way if you want to run the same function over these variables you can simply use the named vector rather than typing out each variable individually.
The default template code creates two vectors - one for symptom variables (SYMPTOMS
) and one for laboratory testing variables (LABS
).
# vectors of variable names ----------------------------------------------------
# create a grouping of all symptoms
SYMPTOMS <- c("generalized_itch",
# "history_of_fever",
"fever",
"joint_pains",
"epigastric_pain_heartburn",
"nausea_anorexia",
"vomiting",
"diarrhoea",
"bleeding",
"headache",
"mental_state",
# "convulsions",
"other_symptoms"
)
# create a grouping of all lab tests
LABS <- c("hep_b_rdt",
"hep_c_rdt",
"hep_e_rdt",
"test_hepatitis_a",
"test_hepatitis_b",
"test_hepatitis_c",
# "test_hepatitis_e_igg",
"test_hepatitis_e_igm" ,
"test_hepatitis_e_genotype",
"test_hepatitis_e_virus",
"malaria_rdt_at_admission",
# "malaria_blood_film",
"dengue",
# "dengue_rdt",
"yellow_fever",
# "typhoid",
"chikungunya_onyongnyong",
# "ebola_marburg",
# "lassa_fever",
"other_arthropod_transmitted_virus"
# "other_pathogen"
)
The next code chunk, report_setup
, defines important date-related parameters used to produce the report, and filters to remove any observations with date of onset after the reporting_week
(reporting_week
is an object defined at the beginning of the script).
The second command in this chunk (shown below) uses filter()
to remove observations later than the end of the pre-defined reporting_week
. However, filter()
also removes observations with missing date_of_consultation_admission
. To avoid this we must can add | is.na(date_of_onset)
into the filter (This is read as: “keep observations where date_of_consultation_admission
is less than or equal to obs_end
, OR, if date_of_consultation_admission
is missing”).
# filter out cases after end of reporting week
linelist_cleaned <- linelist_cleaned %>%
filter(date_of_consultation_admission <= obs_end | is.na(date_of_consultation_admission))
And finally, if desired you can export the cleaned dataset for other purposes.
The command export()
is also from the package rio.
First, the object that you want to export is named (linelist_cleaned
).
Then, the function str_glue()
creates the name for the new file by combining “AmTiman_linelist_cleaned_” with the current date and finally the extention “.xlsx”. Don’t be confused - this is equivalen of writing rio::export(linelist_cleaned, "AmTiman_linelist_cleaned_2019-08-24.xlsx")
, but using str_glue()
and Sys.Date()
means the current date will always be used automatically.
## OPTIONAL: save your cleaned dataset!
## put the current date in the name so you know!
rio::export(linelist_cleaned, str_glue("AmTiman_linelist_cleaned_{Sys.Date()}.xlsx"))