1 Introduction

In this tutorial, we will be looking at how to reshape data with complex structure, so that it is in tidy (long) format and ready for analysis. You will be working with two examples:

Hepatitis C patient follow-up data
Patient referral data

Patient follow-up data: challenges

Each row of the data set represents one consultation visit by the patient
This means there are multiple rows of data for each patient
The data needs to summarised so that each patient is counted only once
There are many columns of data, repeated for each visit

Referral data: challenges

The data is in two separate files, which need to be combined
Each file contains data for each month in a separate sheet
The rows of data from the different sheets need to be combined together

1.1 Load packages

As usual, we will begin by loading the packages needed for this session:

# Check if pacman is installed, install and load if not:
if (!require("pacman")) {
  install.packages("pacman") }

# Use pacman to check for and load the other required packages:
pacman::p_load(
  here, 
  rio, 
  skimr,
  janitor,
  gtsummary,
  flextable,
  tidyverse)

1.2 Import data

Now we can import the Hepatitis C data:

# Import the hepatitis C patient follow-up data:
dat <- import(here("data", "Hep C", "MultiCentricCohortHepC_R_session_3.csv"))

1.3 Explore data

We can see that it is a large data set, but it would be useful to briefly check the structure and see what it looks like in the viewer:

# Get a summary of the data structure:
str(dat)

## 'data.frame':    4416 obs. of  302 variables:
##  $ V1                        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pat_id                    : int  20000 20000 20000 20000 20000 20000 20000 20000 20000 20000 ...
##  $ redcap_event_name         : chr  "initial_arm_1" "02_arm_1" "03_arm_1" "04_arm_1" ...
##  $ visit_num                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ visit_num_last            : int  13 13 13 13 13 13 13 13 13 13 ...
##  $ outc_end                  : chr  "N" "N" "N" "N" ...
##  $ outc_end_reason           : chr  NA NA NA NA ...
##  $ pat_id_key_calc           : int  68 NA NA NA NA NA NA NA NA NA ...
##  $ pat_id_key                : int  68 NA NA NA NA NA NA NA NA NA ...
##  $ pat_enrolment_dt          : IDate, format: "2016-03-21" NA ...
##  $ pat_enrolment_site        : int  3 NA NA NA NA NA NA NA NA NA ...
##  $ pat_cst_sign              : chr  "NS" "" "" "" ...
##  $ pat_cst_phone             : chr  "NS" "" "" "" ...
##  $ pat_cst_home              : chr  "NS" "" "" "" ...
##  $ pat_cst_contact           : chr  "NS" "" "" "" ...
##  $ pat_id_rh_src             : chr  "SF" "" "" "" ...
##  $ pat_id_rh_src_x           : logi  NA NA NA NA NA NA ...
##  $ pat_id_rh                 : chr  "HTA000773" "" "" "" ...
##  $ pat_id_rx_src             : chr  "" "" "" "" ...
##  $ pat_id_rx                 : chr  "" "" "" "" ...
##  $ chs_ser_dt                : chr  "18-05-2015" "" "" "" ...
##  $ chs_ser_type              : chr  "OQK" "" "" "" ...
##  $ chs_ser_result            : chr  "P" "" "" "" ...
##  $ chs_vir_dt                : chr  "01-10-2015" "" "" "" ...
##  $ chs_vir_result            : int  1270000 NA NA NA NA NA NA NA NA NA ...
##  $ spe_pak_vlqlt             : logi  NA NA NA NA NA NA ...
##  $ spe_uzb_vl_type           : logi  NA NA NA NA NA NA ...
##  $ dem_dob_know              : int  0 NA NA NA NA NA NA NA NA NA ...
##  $ dem_dob                   : IDate, format: NA NA ...
##  $ dem_age_val               : int  42 NA NA NA NA NA NA NA NA NA ...
##  $ dem_age                   : num  42 NA NA NA NA NA NA NA NA NA ...
##  $ dem_sex                   : int  1 NA NA NA NA NA NA NA NA NA ...
##  $ dem_nationality           : chr  "MMR" "" "" "" ...
##  $ dem_nationality_x         : logi  NA NA NA NA NA NA ...
##  $ dem_work                  : chr  "U" "" "" "" ...
##  $ dem_work_x                : chr  "" "" "" "" ...
##  $ spe_daw_work              : logi  NA NA NA NA NA NA ...
##  $ spe_daw_work_x            : logi  NA NA NA NA NA NA ...
##  $ spe_yan_work              : chr  "" "" "" "" ...
##  $ spe_yan_work_x            : chr  "" "" "" "" ...
##  $ dem_marital               : chr  "M" "" "" "" ...
##  $ dem_marital_x             : chr  "" "" "" "" ...
##  $ bsl_height                : chr  "159" "" "" "" ...
##  $ bsl_hcv_tx_done           : chr  "N" "" "" "" ...
##  $ bsl_hcv_tx_completion_dt_y: logi  NA NA NA NA NA NA ...
##  $ bsl_hcv_tx_peg_ifn        : logi  NA NA NA NA NA NA ...
##  $ bsl_hcv_tx_ribavirin      : logi  NA NA NA NA NA NA ...
##  $ bsl_hcv_tx_sofosbuvir     : logi  NA NA NA NA NA NA ...
##  $ bsl_hcv_tx_daclatasvir    : logi  NA NA NA NA NA NA ...
##  $ bsl_hcv_tx_ledipasvir     : logi  NA NA NA NA NA NA ...
##  $ bsl_hcv_tx_telaprevir     : logi  NA NA NA NA NA NA ...
##  $ bsl_hcv_tx_boceprevir     : logi  NA NA NA NA NA NA ...
##  $ bsl_hcv_tx_oth_1          : logi  NA NA NA NA NA NA ...
##  $ bsl_hcv_tx_oth_1_x        : logi  NA NA NA NA NA NA ...
##  $ bsl_hcv_tx_oth_2          : logi  NA NA NA NA NA NA ...
##  $ bsl_hcv_tx_oth_2_x        : logi  NA NA NA NA NA NA ...
##  $ bsl_hcv_tx_outc           : logi  NA NA NA NA NA NA ...
##  $ bsl_hbv_ser_vacci         : chr  "DN" "" "" "" ...
##  $ bsl_hbv_ser_done          : chr  "Y" "" "" "" ...
##  $ bsl_hbv_ser_dt            : chr  "10-2015" "" "" "" ...
##  $ bsl_hbv_ser_result        : chr  "N" "" "" "" ...
##  $ cme_hiv_test_done         : chr  "Y" "" "" "" ...
##  $ cme_hiv_test_year         : chr  "2006" "" "" "" ...
##  $ cme_hiv_test_result       : chr  "P" "" "" "" ...
##  $ cme_hiv_followed          : chr  "Y" "" "" "" ...
##  $ cme_hiv_cd4_dt            : chr  "09-2009" "" "" "" ...
##  $ cme_hiv_cd4_result        : chr  "114" "" "" "" ...
##  $ cme_hiv_stage             : chr  "3" "" "" "" ...
##  $ cme_hiv_tx_done           : chr  "Y" "" "" "" ...
##  $ cme_hiv_tx_dt_art         : chr  "11-2009" "" "" "" ...
##  $ spe_uzb_hiv_dt            : logi  NA NA NA NA NA NA ...
##  $ spe_uzb_hiv_vl            : logi  NA NA NA NA NA NA ...
##  $ cme_hiv_tx_dt             : chr  "01-2013" "" "" "" ...
##  $ cme_hiv_tx_regimen        : chr  "X" "" "" "" ...
##  $ cme_hiv_tx_drug_1         : int  1 NA NA NA NA NA NA NA NA NA ...
##  $ cme_hiv_tx_drug_2         : int  13 NA NA NA NA NA NA NA NA NA ...
##  $ cme_hiv_tx_drug_3         : int  4 NA NA NA NA NA NA NA NA NA ...
##  $ cme_hiv_tx_drug_4         : int  90 NA NA NA NA NA NA NA NA NA ...
##  $ cme_oth_tb                : chr  "N" "" "" "" ...
##  $ cme_oth_hbv               : logi  NA NA NA NA NA NA ...
##  $ cme_oth_card              : logi  NA NA NA NA NA NA ...
##  $ cme_oth_hbp               : logi  NA NA NA NA NA NA ...
##  $ cme_oth_thalassemia       : chr  "N" "" "" "" ...
##  $ cme_oth_fluke             : chr  "N" "" "" "" ...
##  $ cme_oth_renal_ipmt        : chr  "N" "" "" "" ...
##  $ cme_oth_diabete           : chr  "N" "" "" "" ...
##  $ cme_dep_phq9              : chr  "2" "" "" "" ...
##  $ rsk_fc_invasive_proc      : chr  "N" "" "" "" ...
##  $ rsk_fc_blood_trs          : chr  "N" "" "" "" ...
##  $ spe_pak_inj               : logi  NA NA NA NA NA NA ...
##  $ rsk_fc_worker_health      : chr  "N" "" "" "" ...
##  $ rsk_fc_prisoner           : chr  "YP" "" "" "" ...
##  $ rsk_fc_partner_hcv        : chr  "DK" "" "" "" ...
##  $ spe_pak_child             : logi  NA NA NA NA NA NA ...
##  $ rsk_fc_transgender        : chr  "N" "" "" "" ...
##  $ rsk_fc_msm                : chr  "N" "" "" "" ...
##  $ rsk_fc_worker_sex         : chr  "N" "" "" "" ...
##  $ spe_pak_cage              : logi  NA NA NA NA NA NA ...
##  $ add_alc_audit             : chr  "4" "" "" "" ...
##   [list output truncated]

# List the column names:
names(dat)

##   [1] "V1"                         "pat_id"                    
##   [3] "redcap_event_name"          "visit_num"                 
##   [5] "visit_num_last"             "outc_end"                  
##   [7] "outc_end_reason"            "pat_id_key_calc"           
##   [9] "pat_id_key"                 "pat_enrolment_dt"          
##  [11] "pat_enrolment_site"         "pat_cst_sign"              
##  [13] "pat_cst_phone"              "pat_cst_home"              
##  [15] "pat_cst_contact"            "pat_id_rh_src"             
##  [17] "pat_id_rh_src_x"            "pat_id_rh"                 
##  [19] "pat_id_rx_src"              "pat_id_rx"                 
##  [21] "chs_ser_dt"                 "chs_ser_type"              
##  [23] "chs_ser_result"             "chs_vir_dt"                
##  [25] "chs_vir_result"             "spe_pak_vlqlt"             
##  [27] "spe_uzb_vl_type"            "dem_dob_know"              
##  [29] "dem_dob"                    "dem_age_val"               
##  [31] "dem_age"                    "dem_sex"                   
##  [33] "dem_nationality"            "dem_nationality_x"         
##  [35] "dem_work"                   "dem_work_x"                
##  [37] "spe_daw_work"               "spe_daw_work_x"            
##  [39] "spe_yan_work"               "spe_yan_work_x"            
##  [41] "dem_marital"                "dem_marital_x"             
##  [43] "bsl_height"                 "bsl_hcv_tx_done"           
##  [45] "bsl_hcv_tx_completion_dt_y" "bsl_hcv_tx_peg_ifn"        
##  [47] "bsl_hcv_tx_ribavirin"       "bsl_hcv_tx_sofosbuvir"     
##  [49] "bsl_hcv_tx_daclatasvir"     "bsl_hcv_tx_ledipasvir"     
##  [51] "bsl_hcv_tx_telaprevir"      "bsl_hcv_tx_boceprevir"     
##  [53] "bsl_hcv_tx_oth_1"           "bsl_hcv_tx_oth_1_x"        
##  [55] "bsl_hcv_tx_oth_2"           "bsl_hcv_tx_oth_2_x"        
##  [57] "bsl_hcv_tx_outc"            "bsl_hbv_ser_vacci"         
##  [59] "bsl_hbv_ser_done"           "bsl_hbv_ser_dt"            
##  [61] "bsl_hbv_ser_result"         "cme_hiv_test_done"         
##  [63] "cme_hiv_test_year"          "cme_hiv_test_result"       
##  [65] "cme_hiv_followed"           "cme_hiv_cd4_dt"            
##  [67] "cme_hiv_cd4_result"         "cme_hiv_stage"             
##  [69] "cme_hiv_tx_done"            "cme_hiv_tx_dt_art"         
##  [71] "spe_uzb_hiv_dt"             "spe_uzb_hiv_vl"            
##  [73] "cme_hiv_tx_dt"              "cme_hiv_tx_regimen"        
##  [75] "cme_hiv_tx_drug_1"          "cme_hiv_tx_drug_2"         
##  [77] "cme_hiv_tx_drug_3"          "cme_hiv_tx_drug_4"         
##  [79] "cme_oth_tb"                 "cme_oth_hbv"               
##  [81] "cme_oth_card"               "cme_oth_hbp"               
##  [83] "cme_oth_thalassemia"        "cme_oth_fluke"             
##  [85] "cme_oth_renal_ipmt"         "cme_oth_diabete"           
##  [87] "cme_dep_phq9"               "rsk_fc_invasive_proc"      
##  [89] "rsk_fc_blood_trs"           "spe_pak_inj"               
##  [91] "rsk_fc_worker_health"       "rsk_fc_prisoner"           
##  [93] "rsk_fc_partner_hcv"         "spe_pak_child"             
##  [95] "rsk_fc_transgender"         "rsk_fc_msm"                
##  [97] "rsk_fc_worker_sex"          "spe_pak_cage"              
##  [99] "add_alc_audit"              "spe_man_alcohol"           
## [101] "add_drg_use"                "add_drg_opioid"            
## [103] "add_drg_coke"               "add_drg_amphet"            
## [105] "add_drg_other"              "add_drg_how"               
## [107] "spe_uzb_drg_how"            "spe_man_assist"            
## [109] "add_drg_ost"                "add_drg_syr"               
## [111] "interview_complete"         "vst_id_key"                
## [113] "vst_site"                   "vst_type___c"              
## [115] "vst_type___b"               "vst_type___r"              
## [117] "vst_dt"                     "vst_dt_bio"                
## [119] "vst_interviewer"            "cli_pulse"                 
## [121] "cli_weight"                 "cli_height_child"          
## [123] "cli_bmi"                    "cli_liver_ascite"          
## [125] "cli_liver_encephalo"        "cmn_drug_consumpt"         
## [127] "cmn_ost"                    "spe_man_oh_cons"           
## [129] "cmn_new"                    "cmn_hiv_infection"         
## [131] "cmn_hbv_infection"          "cmn_tb"                    
## [133] "cmn_diabete"                "cmn_malaria"               
## [135] "cmn_dengue"                 "cmn_other"                 
## [137] "cmn_other_x"                "bio_result_hemato"         
## [139] "bio_result_hemato_dt"       "bio_bld_rbc"               
## [141] "bio_bld_wbc"                "bio_bld_neutro_val"        
## [143] "bio_bld_neutro_unit"        "bio_bld_neutro"            
## [145] "bio_bld_platelet"           "bio_bld_hb_val"            
## [147] "bio_bld_hb_unit"            "bio_bld_hb"                
## [149] "bio_bld_coag"               "bio_result_bld_dt"         
## [151] "bio_bld_qt"                 "bio_bld_inr"               
## [153] "bio_bld_inr_cat"            "bio_bld_bioch"             
## [155] "bio_result_bc_dt"           "bio_bc_alt"                
## [157] "bio_bc_ast"                 "bio_bc_ast_uln"            
## [159] "bio_bc_alb_val"             "bio_bc_alb_unit"           
## [161] "bio_bc_alb"                 "bio_bc_alb_cat"            
## [163] "bio_bc_bilit_val"           "bio_bc_bilit_unit"         
## [165] "bio_bc_bilit"               "bio_bc_bilit_cat"          
## [167] "bio_bc_bilid_val"           "bio_bc_bilid_unit"         
## [169] "bio_bc_bilid"               "bio_bc_glu_val"            
## [171] "bio_bc_glu_unit"            "bio_bc_glu"                
## [173] "bio_bc_crea_val"            "bio_bc_crea_unit"          
## [175] "bio_bc_crea"                "bio_bc_crcl"               
## [177] "bio_bc_tsh_val"             "bio_bc_tsh_unit"           
## [179] "bio_bc_tsh"                 "bio_hc_results"            
## [181] "bio_hc_vl_dt"               "bio_hc_vl"                 
## [183] "bio_hc_vl_und"              "bio_hc_gen_dt"             
## [185] "bio_hc_gen_type"            "bio_hc_gen_type_sub"       
## [187] "bio_hc_fs_dt"               "bio_hc_fs_kpa"             
## [189] "bio_hc_fs_valid"            "bio_hc_fs_stage"           
## [191] "bio_hc_lb_dt"               "bio_hc_lb_score"           
## [193] "bio_hc_liver_apri"          "bio_hc_liver_apri_calc"    
## [195] "bio_hc_liver_plug"          "bio_hc_liver_plug_calc"    
## [197] "bio_hiv_results"            "bio_hiv_dt"                
## [199] "bio_ht_cd4"                 "bio_ht_vl"                 
## [201] "bio_ht_vl_und"              "bio_ht"                    
## [203] "bio_var_results"            "bio_mc_pregnancy_dt"       
## [205] "bio_mc_pregnancy"           "bio_mc_proteinurie_dt"     
## [207] "bio_mc_proteinurie"         "bio_mc_hbs_ag_dt"          
## [209] "bio_mc_hbs_ag"              "bio_mc_hbsab_dt"           
## [211] "bio_mc_hbsab"               "bio_mc_hbv_hbe_dt"         
## [213] "bio_mc_hbv_hbe"             "bio_mc_hbv_vl_dt"          
## [215] "bio_mc_hbv_vl"              "bio_mc_hcv_afp_dt"         
## [217] "bio_mc_hcv_afp"             "bio_mc_us_dt"              
## [219] "bio_mc_us_nl"               "bio_mc_us_hcc"             
## [221] "bio_mc_endo"                "bio_mc_endo_var"           
## [223] "bio_mc_endo_bb"             "hiv_care_positive"         
## [225] "hiv_care_stage"             "hiv_care_outcome"          
## [227] "hiv_care_tx_regimen"        "hiv_care_tx_drug_1"        
## [229] "hiv_care_tx_drug_2"         "hiv_care_tx_drug_3"        
## [231] "hiv_care_tx_drug_4"         "outc_eligible"             
## [233] "outc_eligible_reason"       "outc_eligible_reason_x"    
## [235] "outc_tx_decision"           "outc_refill_warning"       
## [237] "outc_modif_reason_x"        "outc_stop_reason_x"        
## [239] "outc_protocole"             "outc_protocole_reason"     
## [241] "outc_protocole_x"           "outc_tx_drug_1"            
## [243] "outc_tx_drug_2"             "outc_tx_drug_3"            
## [245] "outc_child_drug_1"          "outc_child_drug1_dos"      
## [247] "outc_child_drug_2"          "outc_child_drug2_dos"      
## [249] "outc_child_drug_3"          "outc_child_drug3_dos"      
## [251] "outc_tx_drug_wk"            "outc_tx_drug_wk_x"         
## [253] "outc_tx_pill"               "spe_yan_pill"              
## [255] "spe_yan_pill_cs"            "ade_np_depression"         
## [257] "ade_np_mania"               "ade_np_headache"           
## [259] "ade_oc_eye_disorder"        "ade_oc_loss_hearing"       
## [261] "ade_gi_jaundice"            "ade_gi_diarrhea"           
## [263] "ade_gi_nausea"              "ade_sd_rash"               
## [265] "ade_sd_photo"               "ade_sd_hairloss"           
## [267] "ade_mr_cought"              "ade_mr_arthralgia"         
## [269] "ade_mr_cardio"              "ade_em_diabete"            
## [271] "ade_em_thyroid"             "ade_bt_anemia"             
## [273] "ade_bt_thrombocytopenia"    "ade_bt_neutropenia"        
## [275] "ade_bt_alt"                 "ade_mr_other"              
## [277] "ade_mr_other_other"         "ade_preg"                  
## [279] "ade_sae"                    "spe_hbv_ttt"               
## [281] "spe_hbv_tx_decision"        "spe_hbv_modif_reason"      
## [283] "spe_hbv_stop_reason"        "spe_hbv_ttt_drug"          
## [285] "spe_hbv_ttt_other"          "outc_next_visit"           
## [287] "spe_man_cond_couns"         "spe_man_couns_foc"         
## [289] "spe_man_fn"                 "spe_man_adh"               
## [291] "spe_man_dep"                "spe_man_couns"             
## [293] "spe_man_cond_priority"      "spe_man_priority_dt"       
## [295] "spe_man_priority"           "spe_pp_mobility"           
## [297] "spe_pp_selfcare"            "spe_pp_activities"         
## [299] "spe_pp_pain"                "spe_pp_anxiety"            
## [301] "spe_pp_health"              "visit_complete"

# View the data in the viewer:
View(head(dat))

2 Reshaping and summarising

In this section, we will reshape, count and summarise the data in some tables.

2.1 Visit counts

It would be useful to know how many consultations each patient had. We can number each row for the same patient sequentially, then get the maximum visit number per patient to represent the total number of consultations they had. To do this, we will group_by() patient ID:

dat <- dat %>%
  
  # First we make sure blank cells are coded as NA:
  mutate(
    outc_end        = na_if(outc_end, ""),
    outc_end_reason = na_if(outc_end_reason, "")
    ) %>% 
  
  # Now we can group by patient ID:
  group_by(pat_id) %>%
  
  # Then we calculate the number of visits per patient:
  mutate(
    # Here we use n() to create a sequential number for each visit
    visit_num = 1:n(),
    # Then we get the maximum visit number for each patient:
    visit_num_last = max(visit_num, na.rm = TRUE)
    ) %>%
  
  # Now we ungroup as subsequent operations will not be grouped by patient:
  ungroup %>%
  
  # Finally, we use select to reorder the columns:
  select(pat_id, redcap_event_name, visit_num, visit_num_last, everything())

2.2 Dealing with data from multiple consultations

How do we deal with all the data from the different follow-up consultations? This depends on the desired analysis.

Suppose we are only interested in the final outcome at the end of patient follow-up. This means we can filter the data and keep only the rows that correspond to the last visit for each patient:

# Make a new subset of the data:
dat_last_consultation <- dat %>%
  # Keep only the rows corresponding to the last visit for each patient:
  filter(visit_num == visit_num_last) %>%
  # Select a few key variables for summarizing:
  select(pat_id, 
         redcap_event_name, 
         visit_num, 
         visit_num_last, 
         outc_end, 
         outc_end_reason)


# Check whether patients reached the end of their follow-up:
dat_last_consultation %>%
  tabyl(outc_end) %>%
  qflextable

outc_end	n	percent	valid_percent
N	1	0.002169197	0.002178649
Y	458	0.993492408	0.997821351
	2	0.004338395

# Summarise reasons for exiting the program:
dat_last_consultation %>%
  tabyl(outc_end_reason) %>%
  qflextable

outc_end_reason	n	percent	valid_percent
C	396	0.859002169	0.864628821
D	19	0.041214751	0.041484716
L	23	0.049891540	0.050218341
S	2	0.004338395	0.004366812
TO	18	0.039045553	0.039301310
	3	0.006507592

# Combine whether program complete and reasons for exit
dat_last_consultation %>%
  tabyl(outc_end_reason, outc_end) %>%
  qflextable

outc_end_reason	N	Y	NA_
C	0	396	0
D	0	19	0
L	0	23	0
S	0	2	0
TO	0	18	0
	1	0	2

It may also be useful to summarise information from the first consultation (at enrollment into the program):

# Create subset for first consultation
dat_first_consultation <- dat %>%
  # Keep only rows for first visit for each patient:
  filter(visit_num == 1) %>% 
  # Select patient ID and site of enrollment:
  select(pat_id, pat_enrolment_site)

# Create frequency table of enrollment sites:
dat_first_consultation %>%
  tabyl(pat_enrolment_site) %>%
  qflextable

pat_enrolment_site	n	percent
3	460	0.997830803
43	1	0.002169197

Now we have two different subsets, but both of them should contain the same patient IDs. We can join the two datasets together to create a summary table with information from both the first and the last consultation.

There are a couple of different ways to join data sets. We could use bind_cols() (which will literally just bind the columns from the two data sets together). However, beware: this strategy won’t check to see if the two data sets are in the same order. There is a risk that information from two different patients could be combined in the same row, if the data sets are not in exactly the same order. Have a look at the help for more information (type ?bind_cols in your console).

Reminder:

Check that:

The two data sets are sorted in the same order
The two data sets have the same number of observations
The two data sets contain the same patient IDs

# Create a new data set by binding the columns of the two data sets together:
dat_last_first_1 <-
  bind_cols(
    dat_last_consultation,
    dat_first_consultation
    )

A better way to combine the two data sets would be to use the dplyr::left_join() function. This function uses an identifier column that is common to both data sets to match up the rows between them. In this case, we have a patient ID column in both data sets which we can use to match up the rows and join the two data sets together. We choose a left_join() because this this will retain everything in the primary data set (first one in the join) and only rows from the second data set that have a match by patient ID in the first. Type ?left_join in your console to see the help page and learn more. You can also have a look at the Epidemiologist R handbook chapter on joining data.

# Create a new data set by joining the two data sets together:
dat_last_first_2 <- left_join(
  dat_last_consultation,  # This is the primary data set
  dat_first_consultation, # This is the secondary data set
  by = "pat_id")          # This is the column to match rows with

# Create a frequency table of patient enrollment site from the joined data:
dat_last_first_2 %>%
  tabyl(pat_enrolment_site) %>%
  qflextable

pat_enrolment_site	n	percent
3	460	0.997830803
43	1	0.002169197

# Create a frequency table of patient enrollment site by outcome status:
dat_last_first_2 %>%
  tabyl(pat_enrolment_site, outc_end) %>%
  qflextable

pat_enrolment_site	N	Y	NA_
3	1	457	2
43	0	1	0

# Use gtsummary package instead to make the same cross-tabulation:
dat_last_first_2 %>%
  # Select columns to use in summary table:
  select(pat_enrolment_site, outc_end) %>%
  tbl_summary(
    # Stratify by outcome status:
    by = outc_end,
    # Include row percentages:
    percent = "row",
    # Make nice label for enrollment site:
    label = list(pat_enrolment_site = "Enrolment site")) %>%
  # Add totals:
  add_overall() %>%
  # Update the header to reflect the statistics used:
  modify_header(
    update = list(all_stat_cols() ~ "**{level}**<br>N = {n}",
                  stat_0 ~ "**Overall**<br>N = {N}")) %>%
  # Make the labels bold:
  bold_labels()

Characteristic	Overall N = 459¹	N N = 1¹	Y N = 458¹
Enrolment site
3	458 (100%)	1 (0.2%)	457 (100%)
43	1 (100%)	0 (0%)	1 (100%)
¹ n (%)

2.3 Dealing with data on multiple excel sheets

In the above section, we looked at how to combine two data sets column-wise. However, for our second problem (referral data), we have data for each month in a different excel sheet. These data need to be combined by row (they should all have the same column headers). The function we can use for this is row_bind() (type ?row_bind in your console to learn more):

# create database of consultations for patient 20000
dat_20000 <- dat %>% filter(pat_id == 20000)

# Create database of consultations for patient 20001
dat_20001 <- dat %>% filter(pat_id == 20001)

# Bind the two data sets together by row:
dat_20000_20001 <- bind_rows(
  dat_20000, # database 1
  dat_20001, # database 2
  .id = "df_origin" # this column identifies the source database for each row
)

3 Exercise

Your turn:

In the data folder for this session, you will find some additional data sets; one is for people living with HIV and has a very similar structure to the Hepatitis C data (lots of follow-up visits for each patient). The other data sets are the referral data introduced earlier.

Using the HIV data in the HIV Tout Survey sub-folder, make summary tables of the first and last consultations (hint: use the same strategies that were used for the Hepatitis C data above)
Using the referral data in the Referral pathway sub-folder (choose one of the two files) combine all the data for each month into a single data.frame representing one year (hint: use bind_rows() after importing the data)
Bonus task: what would be a more efficient way of:
1. Importing data from many sheets at once and
2. Binding together many data sets at once by row? (hint: read the Epidemiologist R handbook section on using purrr::map() here for suggestions on how to do this)

R coaching session 3 - pivots and joins

Gregoire Falq

02 April 2024