R4EPIs > Training Materials > Case study walk-through > Standardise and clean data

Standardise and clean data

Standardize and clean data

Becuase the Am Timan dataset is not yet aligned with the MSF data dictionary and what the template expects, there are several data cleaning steps we must complete.

Re-code missing values from periods (.) to `NA` at import

Currently, missing values in the dataset are represented by a period (.). This causes problems because R expects missing values to be coded as NA. For example, if we attempt to set the variable date_of_onset to class Date, R gives an error because the periods are considered character values that do not fit the expected YYYY-MM-DD date format.

To fix this problem we scroll up to the read_nonDHIS_data chunk of the template and insert an additional argument in the rio::import() command. The argument na = "." specifies a period as the value in the Excel sheet that R will consider to be “missing”. As the data are re-imported, all cells with that value are now changed to NA.

Remember that after making this change to the import command you must re-run the entire script. This way, the dataset is correctly imported AND the later cleaning steps, changes to variable names, etc. are also performed on this newly-imported dataset

# The old import command
linelist_raw <- rio::import("AJS_AmTiman.xlsx", which = "linelist")

# The revised import command (don't forget the comma between arguments!)
linelist_raw <- rio::import("AJS_AmTiman.xlsx", which = "linelist", na = ".")

Drop ineligible observations

The next code chunk in the script, remove_unused_data, drops observations with missing case_number or date_of_consultation_admission. Because the Am Timan dataset contains observations for patients seen in the community and at the hospital, these criteria may not be relevant.

We can check, using the code below, and see that of the 1447 observations, there are 0 missing case_number or dateentry, but 616 observations missing date_of_consultation_admission. These 616 community-identified cases are still of interest for our report, so we will not drop them.

# Find the number of observations missing case_number, dateentry, and date_of_consultation_admission
nrow(linelist_cleaned %>% filter(is.na(case_number)))

## [1] 0

nrow(linelist_cleaned %>% filter(is.na(dateentry)))

## [1] 0

nrow(linelist_cleaned %>% filter(is.na(date_of_consultation_admission)))

## [1] 616

Notice that the command above returns the number of rows missing date_of_consultation_admission, but does not overwrite linelist_cleaned with anything, as there is not an assignment operator <-.

After viewing the above results, we must comment out (#) the two lines that filter() the linelist_cleaned dataset - we do not want to remove those 616 observations.

# linelist_cleaned <- linelist_cleaned %>% 
#   filter(!is.na(case_number) & !is.na(date_of_consultation_admission))

Standardise dates

In linelist_cleaned there are seven variables containing dates (dateentry, epiquestdate, date_of_consultation_admission, date_of_onset, date_lab_sample_taken, dthospitalisation, dtdeath).

Because the Am Timan dataset contains date variables not found in the data dictionary, we will comment out the first code lines related to DHIS-aligned data (those commands convert only date variables known to the MSF data dictionary).

Instead, we uncomment and use the commands applicable to non-DHIS2 data. You may see messages warning that a few dates are not in the correct timeframe - this is ok and these dates will be addressed in the next step.

You can read more about the guess_dates() function from the package linelist in the R Basics - Working with Dates page.

## Non-DHIS2 data --------------------------------------------------------------
## Use this section if you did not have DHIS2 data. 
# use the guess_dates() function to make a first pass at date variables.
 linelist_cleaned <- linelist_cleaned %>%
   mutate_at(vars(matches("date|Date")), linelist::guess_dates,
           error_tolerance = 0.5)

The code above successfully converted many of our date variables (remember you can check a variable’s class like class(linelist$date_of_onset). However, this command did not detect the date variables named with “dt”, such as dtdeath and dthospitalization. We must add code to convert those variables manually:

# Check class of variable 
class(linelist_cleaned$dtdeath)

## [1] "character"

# Individually convert variable using the guess_dates() function from the linelist package
linelist_cleaned$dtdeath <- linelist::guess_dates(linelist_cleaned$dtdeath, error_tolerance = 0.5)

#Re-check class of variable
class(linelist_cleaned$dtdeath)

## [1] "Date"

# Repeat for variable "dtdeath"
linelist_cleaned$dthospitalisation <- linelist::guess_dates(linelist_cleaned$dthospitalisation, error_tolerance = 0.5)

Next we uncomment and modify code to correct unrealistic dates. We have browsed our data and know that there are observations with date_of_onset outside the reasonable range:

# Check range of date_of_onset values, ignoring (removing) missing values
range(linelist_cleaned$date_of_onset, na.rm = TRUE)

## [1] "2014-11-01" "2017-12-27"

We convert dates outside the expected range (April 2016 to October 2017) to missing using case_when(). Note that when making the assignment on the right-hand side (RHS), wrap NA in as.Date().

As you modify this chunk for your own situation, remember that each left-hand side (LHS) of the ~ must be a logical statement (not just a value), and to include commas at the end of each case_when() line (except the last one). In addition, it is best to write case_when() lines from most specific at top to most general/catch-all at the bottom. You can read more about case_when() on this tidyverse site.

All the right-hand side (RHS) values must be the same class (either character, numeric, etc.). So, if your other RHS values are character and you want one RHS value to be missing, you cannot just write NA on the RHS. Instead you must use the special character version of NA : NAcharacter.

# Convert dates before April 2016 or after October 2017 to missing (NA)
linelist_cleaned <- mutate(linelist_cleaned,
                           date_of_onset = case_when(
                                 date_of_onset < as.Date("2016-04-01")  ~ as.Date(NA),
                                 date_of_onset > as.Date("2017-10-31")  ~ as.Date(NA),
                                 TRUE                                   ~ date_of_onset
                                 ))

We also must use the provided code to create a variable called epiweek. Although there are already variables that give the epidemiological weeks of various data points, it is safer to calculate a new variable ourselves, AND later code chunks rely on the variable being named epiweek. See the documentation of the package aweek for more information.

# Create variable epiweek
linelist_cleaned$epiweek <- aweek::date2week(linelist_cleaned$date_of_onset, 
                                             floor_day = TRUE, 
                                             factor = TRUE)

A note on the epiweek variable and factors:
Notice that we converted dates to NA that were clear errors (outside the expected date range) before creating the epiweek variable. This is because the epiweek variable is created as a factor, and once a factor is created it is more difficult to remove factor levels. You can read more about factors and dropping unused factor levels in the R Basics - Advanced and Miscellaneous page.

Create age group

To create the variable age_group we must first clean the variable age_years. If we look at the range of values in age_years, we see something strange:

# See the range of age_years values, removing (excluding) NA
range(linelist_cleaned$age_years, na.rm = TRUE)

## [1] "0.08" "9"

We know there are ages older than 9 years. So we check class(linelist_cleaned$age_years) and see that R is reading this variable as class character, not numeric!

We must convert it by adding the following command to the script:

# Convert variable age_years to numeric class
linelist_cleaned$age_years <- as.numeric(linelist_cleaned$age_years)

If we run the range() command again, we can see that the corrected range is 0.08 to 75.

Then, the chunk contains three age-group commands marked “OPTIONAL”:

We do not want to use “add under 2 years to the age_years variable” because it assumes that we already have a variable age_months, which we do not. Comment out (#) this code.
We also do not need to use “change those who are above or below a certain age to NA”, because we already know our range of ages and do not have any outside an expected range. Comment out (#) this code.
We do want to use “create an age_months variable from decimal years variable”, as we do have decimal years. This command will create an age_months variable that has a value if a patient is under 5 years. Uncomment and use this code.

# For patients under 5 years, the new variable age_months is based on age_years * 12 
linelist_cleaned <- mutate(linelist_cleaned,
                           age_months = if_else(age_years < 5, 
                                                true  = age_years * 12, 
                                                false = as.numeric(NA)
                                               )
                          )

Now we can create the variables age_group age_group_mon.

First we use the code to create an age_group_mon variable for children under 5 years based on age_months.

## create age group variable for under 5 years based on months
linelist_cleaned$age_group_mon <- age_categories(linelist_cleaned$age_months, 
                                                 breakers = c(0, 6, 9, 12, 24), 
                                                 ceiling = TRUE)

…and we use the similar code below to create age_group as groupings of the variable age_years. The function age_categories() is used with its breakers = argument set to a vector c( ) of numbers: 0, 3, 15, 30, and 45. Thus, a 30-year old patient will be included in an age group named “30-44”. Read more about age_categories by searching the Help pane.

## create an age group variable by specifying categorical breaks
linelist_cleaned$age_group <- age_categories(linelist_cleaned$age_years, 
                                             breakers = c(0, 3, 15, 30, 45))

Finally, ensure that the remaining code in create_age_group is commented out (#).

Create and clean outcome variables

In the next chunk (create_vars) we can comment out/ignore the code that makes a new numeric variable obs_days, because we do not have a date_of_exit in our dataset.

While the template directs us to next create a DIED variable based on exit_status containing the characters “Dead”, we must first clean our exit_status variable, which is currrently in French.

Add code that uses case_when() to assign new values in a new exit_status2 variable, as in the code below. We do this by modifing code from just below in the template. The code uses the function case_when() to re-define linelist_cleaned as itself, but mutated to create the new variable exit_status2. The values in exit_status2 are based on the values in exit_status, such that when exist_status == "Décédé", the value in exit_status2 is “Dead”, and so on.

As you modify this chunk for your own situation, remember that each left-hand side (LHS) of the ~ must be a logical statement (not just a value), and to include commas at the end of each case_when() line (except the last one). It is best to write case_when() lines from most specific at top to most general/catch-all at bottom. You can read more about case_when() at this tidyverse site.

All the right-hand side (RHS) values must be the same class (either character, numeric, etc.). So, if your other RHS values are character and you want one RHS value to be missing, you cannot just write NA on the RHS. Instead you must use the special character version of NA : NAcharacter.

# Tabulate to see all possible values of exit_status
table(linelist_cleaned$exit_status, useNA = "always")

## 
##              -         Décédé Déchargé/Guéri        Echappé           <NA> 
##              2             14             55              2           1374

# Create exit_status2 from values in exit_status
linelist_cleaned <- linelist_cleaned %>% 
  mutate(exit_status2 = case_when(
    exit_status == "Décédé"          ~ "Dead",
    exit_status == "-"               ~ NA_character_,
    exit_status == "Echappé"         ~ "Left",
    exit_status == "Déchargé/Guéri"  ~ "Discharged"
  ))

# Tabulate the NEW exit_status2 variable to check correct assignment
table(linelist_cleaned$exit_status2, useNA = "always")

## 
##       Dead Discharged       Left       <NA> 
##         14         55          2       1376

Now we can make the DIED variable, which is referenced in later code chunks.

This command creates DIED as a logical (TRUE or FALSE) variable, depending on whether each observation meets the criteria to the right of the assignment operator <-. For a similar example, read about the %in% operator on the R Basics - Advanced and Miscellaneous page.

However, we must modify the existing command to detect within the NEW variable exit_status2, not exit_status.

## Note we are directing R to look within the NEW exit_status2 variable
linelist_cleaned$DIED <- str_detect(linelist_cleaned$exit_status2, "Dead")

Re-code values in `patient_facility_type`

When we assigned our variables to match the data dictionary, we used the variable hospitalisation as the variable patient_facility_type. However, the values in that variable do not match those expected by the template. In the data dictionary, patient_facility_type should have values of “Inpatient” or “Outpatient.” Currently, the values are “Oui” and “Non”. In later code chunks, analyses are restricted to observations where patient_facility_type == "Inpatient", thus, we must align the values to match the data dictionary.

To clean these values we uncomment and modify code also from the create_vars chunk, found under the heading “recode character variables” (the template example is of fixing incorrect spellings of village names).

Included below are extra steps before and after the case_when() command using table() to verify the correct conversion of values. Remove these two commands after verifying, as otherwise their outputs will appear in the report.

# View all the values in patient_facility_type
table(linelist_cleaned$patient_facility_type, useNA = "always")

## 
##  Non  Oui <NA> 
##  699   86  662

# Convert the values
linelist_cleaned <- linelist_cleaned %>% 
  mutate(patient_facility_type = case_when(
    patient_facility_type == "Oui" ~ "Inpatient",
    patient_facility_type == "Non" ~ "Outpatient"
  ))

# Re-check that the values converted sucessfully
table(linelist_cleaned$patient_facility_type, useNA = "always")

## 
##  Inpatient Outpatient       <NA> 
##         86        699        662

Recode values in lab test variables

For later steps, we will need many of our testing variables to the values “Positive” and “Negative” instead of “Pos” and “Neg”. We can uncomment and use this helper code to make that change. Some notes about this step:

You may receive warnings about unknown levels in some variables - this is generally okay but check the changes visually to be sure View(linelist_cleaned).
One variable (test_hepatitis_e_igm) has values 0 and 1 that we want to change to “Positive” and “Negative”. The fct_recode() function expects character RHS inputs (within quotes) - so put “0” and “1” in quotes, as below, and confirm the change.
One variable (test_hepatitis_b) has a value recorded as “Weakly pos”. For the purposes of this exercise we re-code this to “Positive”.

## sometimes, coding is inconsistent across variables -- for example, "Yes" / "No"
## may be coded as Y, y, yes, 1 / N, n, no, 0. You can change them all at once!
## Create a list of the variables you want to change, and run the following.
## You may need to edit this code if options are different in your data.

# # create list of variables
change_test_vars <- c("hep_e_rdt", "hep_c_rdt", "hep_b_rdt", "test_hepatitis_a", "test_hepatitis_b", "test_hepatitis_c", "malaria_rdt_at_admission", "test_hepatitis_e_genotype", "test_hepatitis_e_igm", "test_hepatitis_e_virus", "hevpcr", "other_arthropod_transmitted_virus")

# standardize options
  linelist_cleaned <- linelist_cleaned %>%  
       mutate_at(vars(change_test_vars), forcats::fct_recode,   
       Positive = "Pos",
       Positive = "pos",
       Positive = "yes",
       Positive = "Yes",
       Positive = "Weakly pos",
       Positive = "1",
       Negative = "Neg",
       Negative = "neg",
       Negative = "no",
       Negative = "No",
       Negative = "0"
     )

Lastly for the chunk create_vars we must create a case definition variable. In this use of case_when(), the last line left-hand side (LHS) is TRUE, which serves as a catch-all for any other possible values that have not yet met the criteria of the earlier case_when() lines.

Note how this code checks the variable hep_e_rdt for “Positive”. The earlier cleaning steps converting hep_e_rdt values from “Pos” to “Positive” were necessary for this case definition to properly apply.

In addition we need to change the value looked for in other_cases_in_hh from “Yes” to “Oui”.

# You MUST modify this section to match your case definition. The below
# uses positive RDT for Confirmed and epi link only for Probable.

linelist_cleaned <- linelist_cleaned %>%
  mutate(case_def = case_when(
    is.na(hep_e_rdt) & is.na(other_cases_in_hh)           ~ NA_character_,
    hep_e_rdt == "Positive"                               ~ "Confirmed",
    hep_e_rdt != "Positive" & other_cases_in_hh == "Oui"  ~ "Probable",
    TRUE                                                  ~ "Suspected"
  ))

Factor variables

Factor variables require specialized functions to manipulate (see Factor sections of the R Basics - Advanced and Miscellaneous page). The one factor variable we should clean now is sex, with the following code:

linelist_cleaned$sex <- fct_recode(linelist_cleaned$sex, 
                                   NULL = "Unknown/unspecified")

We can comment out (#) the rest of the factor_vars chunk, such as the code to change the order of levels in categorical variables. The Am Timan dataset does not include a variable time_to_death and we do not need to change the order of any categorical variables.

Cleaning patient origin

We must add one additional cleaning step necessary for this dataset.

The variable patient_origin (originally the variable quartier in the raw dataset) is referenced in the place analyses chunks, for example when calculating attack rates by region. In those steps, the variable patient_origin in linelist_cleaned is joined to the variable patient_origin of the data frame population_data_region (which was imported in the population and lab data steps).

However, the place names in population_data_region are ALL CAPITAL LETTERS. This is not true in linelist_cleaned - so the join will not register any matches. The easiest way to solve this problem is to redefine the linelist_cleaned variable into all capital letters as well, using the base R function toupper(), which means “to upper case”:

linelist_cleaned$patient_origin <- toupper(linelist_cleaned$patient_origin)

Gather common variables for later analysis

These variables names are stored in vectors that are created using the function c(). These vectors of names will be referenced in later code chunks. This way if you want to run the same function over these variables you can simply use the named vector rather than typing out each variable individually.

The default template code creates two vectors - one for symptom variables (SYMPTOMS) and one for laboratory testing variables (LABS).

# vectors of variable names ----------------------------------------------------
# create a grouping of all symptoms 

SYMPTOMS <- c("generalized_itch", 
#             "history_of_fever", 
              "fever",
              "joint_pains",
              "epigastric_pain_heartburn",
              "nausea_anorexia",
              "vomiting", 
              "diarrhoea",
              "bleeding", 
              "headache",
              "mental_state",
#              "convulsions",
              "other_symptoms"  
              )

# create a grouping of all lab tests 
LABS <- c("hep_b_rdt", 
          "hep_c_rdt",
          "hep_e_rdt",
          "test_hepatitis_a",
          "test_hepatitis_b",
          "test_hepatitis_c",
#          "test_hepatitis_e_igg",
          "test_hepatitis_e_igm" ,
          "test_hepatitis_e_genotype",
          "test_hepatitis_e_virus",
          "malaria_rdt_at_admission",
#          "malaria_blood_film", 
          "dengue",
#          "dengue_rdt", 
          "yellow_fever",
#          "typhoid",
          "chikungunya_onyongnyong", 
#          "ebola_marburg",
#          "lassa_fever",
          "other_arthropod_transmitted_virus" 
#          "other_pathogen"
          )

Final setup for report

The next code chunk, report_setup, defines important date-related parameters used to produce the report, and filters to remove any observations with date of onset after the reporting_week (reporting_week is an object defined at the beginning of the script).

The second command in this chunk (shown below) uses filter() to remove observations later than the end of the pre-defined reporting_week. However, filter() also removes observations with missing date_of_consultation_admission. To avoid this we must can add | is.na(date_of_onset) into the filter (This is read as: “keep observations where date_of_consultation_admission is less than or equal to obs_end, OR, if date_of_consultation_admission is missing”).

# filter out cases after end of reporting week
linelist_cleaned <- linelist_cleaned %>% 
  filter(date_of_consultation_admission <= obs_end | is.na(date_of_consultation_admission))

Export if desired

And finally, if desired you can export the cleaned dataset for other purposes.

The command export() is also from the package rio.

First, the object that you want to export is named (linelist_cleaned).

Then, the function str_glue() creates the name for the new file by combining “AmTiman_linelist_cleaned_” with the current date and finally the extention “.xlsx”. Don’t be confused - this is equivalen of writing rio::export(linelist_cleaned, "AmTiman_linelist_cleaned_2019-08-24.xlsx"), but using str_glue() and Sys.Date() means the current date will always be used automatically.

## OPTIONAL: save your cleaned dataset! 
## put the current date in the name so you know!
rio::export(linelist_cleaned, str_glue("AmTiman_linelist_cleaned_{Sys.Date()}.xlsx"))

Standardise and clean data

Standardize and clean data

Re-code missing values from periods (.) to NA at import

Drop ineligible observations

Standardise dates

Create age group

Create and clean outcome variables

Re-code values in patient_facility_type

Recode values in lab test variables

Factor variables

Cleaning patient origin

Gather common variables for later analysis

Final setup for report

Export if desired

Re-code missing values from periods (.) to `NA` at import

Re-code values in `patient_facility_type`