In this tutorial, we will be looking at how to reshape data with complex structure, so that it is in tidy (long) format and ready for analysis. You will be working with two examples:
Patient follow-up data: challenges
Referral data: challenges
As usual, we will begin by loading the packages needed for this session:
# Check if pacman is installed, install and load if not:
if (!require("pacman")) {
install.packages("pacman") }
# Use pacman to check for and load the other required packages:
pacman::p_load(
here,
rio,
skimr,
janitor,
gtsummary,
flextable,
tidyverse)
Now we can import the Hepatitis C data:
# Import the hepatitis C patient follow-up data:
dat <- import(here("data", "Hep C", "MultiCentricCohortHepC_R_session_3.csv"))
We can see that it is a large data set, but it would be useful to briefly check the structure and see what it looks like in the viewer:
# Get a summary of the data structure:
str(dat)
## 'data.frame': 4416 obs. of 302 variables:
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ pat_id : int 20000 20000 20000 20000 20000 20000 20000 20000 20000 20000 ...
## $ redcap_event_name : chr "initial_arm_1" "02_arm_1" "03_arm_1" "04_arm_1" ...
## $ visit_num : int 1 2 3 4 5 6 7 8 9 10 ...
## $ visit_num_last : int 13 13 13 13 13 13 13 13 13 13 ...
## $ outc_end : chr "N" "N" "N" "N" ...
## $ outc_end_reason : chr NA NA NA NA ...
## $ pat_id_key_calc : int 68 NA NA NA NA NA NA NA NA NA ...
## $ pat_id_key : int 68 NA NA NA NA NA NA NA NA NA ...
## $ pat_enrolment_dt : IDate, format: "2016-03-21" NA ...
## $ pat_enrolment_site : int 3 NA NA NA NA NA NA NA NA NA ...
## $ pat_cst_sign : chr "NS" "" "" "" ...
## $ pat_cst_phone : chr "NS" "" "" "" ...
## $ pat_cst_home : chr "NS" "" "" "" ...
## $ pat_cst_contact : chr "NS" "" "" "" ...
## $ pat_id_rh_src : chr "SF" "" "" "" ...
## $ pat_id_rh_src_x : logi NA NA NA NA NA NA ...
## $ pat_id_rh : chr "HTA000773" "" "" "" ...
## $ pat_id_rx_src : chr "" "" "" "" ...
## $ pat_id_rx : chr "" "" "" "" ...
## $ chs_ser_dt : chr "18-05-2015" "" "" "" ...
## $ chs_ser_type : chr "OQK" "" "" "" ...
## $ chs_ser_result : chr "P" "" "" "" ...
## $ chs_vir_dt : chr "01-10-2015" "" "" "" ...
## $ chs_vir_result : int 1270000 NA NA NA NA NA NA NA NA NA ...
## $ spe_pak_vlqlt : logi NA NA NA NA NA NA ...
## $ spe_uzb_vl_type : logi NA NA NA NA NA NA ...
## $ dem_dob_know : int 0 NA NA NA NA NA NA NA NA NA ...
## $ dem_dob : IDate, format: NA NA ...
## $ dem_age_val : int 42 NA NA NA NA NA NA NA NA NA ...
## $ dem_age : num 42 NA NA NA NA NA NA NA NA NA ...
## $ dem_sex : int 1 NA NA NA NA NA NA NA NA NA ...
## $ dem_nationality : chr "MMR" "" "" "" ...
## $ dem_nationality_x : logi NA NA NA NA NA NA ...
## $ dem_work : chr "U" "" "" "" ...
## $ dem_work_x : chr "" "" "" "" ...
## $ spe_daw_work : logi NA NA NA NA NA NA ...
## $ spe_daw_work_x : logi NA NA NA NA NA NA ...
## $ spe_yan_work : chr "" "" "" "" ...
## $ spe_yan_work_x : chr "" "" "" "" ...
## $ dem_marital : chr "M" "" "" "" ...
## $ dem_marital_x : chr "" "" "" "" ...
## $ bsl_height : chr "159" "" "" "" ...
## $ bsl_hcv_tx_done : chr "N" "" "" "" ...
## $ bsl_hcv_tx_completion_dt_y: logi NA NA NA NA NA NA ...
## $ bsl_hcv_tx_peg_ifn : logi NA NA NA NA NA NA ...
## $ bsl_hcv_tx_ribavirin : logi NA NA NA NA NA NA ...
## $ bsl_hcv_tx_sofosbuvir : logi NA NA NA NA NA NA ...
## $ bsl_hcv_tx_daclatasvir : logi NA NA NA NA NA NA ...
## $ bsl_hcv_tx_ledipasvir : logi NA NA NA NA NA NA ...
## $ bsl_hcv_tx_telaprevir : logi NA NA NA NA NA NA ...
## $ bsl_hcv_tx_boceprevir : logi NA NA NA NA NA NA ...
## $ bsl_hcv_tx_oth_1 : logi NA NA NA NA NA NA ...
## $ bsl_hcv_tx_oth_1_x : logi NA NA NA NA NA NA ...
## $ bsl_hcv_tx_oth_2 : logi NA NA NA NA NA NA ...
## $ bsl_hcv_tx_oth_2_x : logi NA NA NA NA NA NA ...
## $ bsl_hcv_tx_outc : logi NA NA NA NA NA NA ...
## $ bsl_hbv_ser_vacci : chr "DN" "" "" "" ...
## $ bsl_hbv_ser_done : chr "Y" "" "" "" ...
## $ bsl_hbv_ser_dt : chr "10-2015" "" "" "" ...
## $ bsl_hbv_ser_result : chr "N" "" "" "" ...
## $ cme_hiv_test_done : chr "Y" "" "" "" ...
## $ cme_hiv_test_year : chr "2006" "" "" "" ...
## $ cme_hiv_test_result : chr "P" "" "" "" ...
## $ cme_hiv_followed : chr "Y" "" "" "" ...
## $ cme_hiv_cd4_dt : chr "09-2009" "" "" "" ...
## $ cme_hiv_cd4_result : chr "114" "" "" "" ...
## $ cme_hiv_stage : chr "3" "" "" "" ...
## $ cme_hiv_tx_done : chr "Y" "" "" "" ...
## $ cme_hiv_tx_dt_art : chr "11-2009" "" "" "" ...
## $ spe_uzb_hiv_dt : logi NA NA NA NA NA NA ...
## $ spe_uzb_hiv_vl : logi NA NA NA NA NA NA ...
## $ cme_hiv_tx_dt : chr "01-2013" "" "" "" ...
## $ cme_hiv_tx_regimen : chr "X" "" "" "" ...
## $ cme_hiv_tx_drug_1 : int 1 NA NA NA NA NA NA NA NA NA ...
## $ cme_hiv_tx_drug_2 : int 13 NA NA NA NA NA NA NA NA NA ...
## $ cme_hiv_tx_drug_3 : int 4 NA NA NA NA NA NA NA NA NA ...
## $ cme_hiv_tx_drug_4 : int 90 NA NA NA NA NA NA NA NA NA ...
## $ cme_oth_tb : chr "N" "" "" "" ...
## $ cme_oth_hbv : logi NA NA NA NA NA NA ...
## $ cme_oth_card : logi NA NA NA NA NA NA ...
## $ cme_oth_hbp : logi NA NA NA NA NA NA ...
## $ cme_oth_thalassemia : chr "N" "" "" "" ...
## $ cme_oth_fluke : chr "N" "" "" "" ...
## $ cme_oth_renal_ipmt : chr "N" "" "" "" ...
## $ cme_oth_diabete : chr "N" "" "" "" ...
## $ cme_dep_phq9 : chr "2" "" "" "" ...
## $ rsk_fc_invasive_proc : chr "N" "" "" "" ...
## $ rsk_fc_blood_trs : chr "N" "" "" "" ...
## $ spe_pak_inj : logi NA NA NA NA NA NA ...
## $ rsk_fc_worker_health : chr "N" "" "" "" ...
## $ rsk_fc_prisoner : chr "YP" "" "" "" ...
## $ rsk_fc_partner_hcv : chr "DK" "" "" "" ...
## $ spe_pak_child : logi NA NA NA NA NA NA ...
## $ rsk_fc_transgender : chr "N" "" "" "" ...
## $ rsk_fc_msm : chr "N" "" "" "" ...
## $ rsk_fc_worker_sex : chr "N" "" "" "" ...
## $ spe_pak_cage : logi NA NA NA NA NA NA ...
## $ add_alc_audit : chr "4" "" "" "" ...
## [list output truncated]
# List the column names:
names(dat)
## [1] "V1" "pat_id"
## [3] "redcap_event_name" "visit_num"
## [5] "visit_num_last" "outc_end"
## [7] "outc_end_reason" "pat_id_key_calc"
## [9] "pat_id_key" "pat_enrolment_dt"
## [11] "pat_enrolment_site" "pat_cst_sign"
## [13] "pat_cst_phone" "pat_cst_home"
## [15] "pat_cst_contact" "pat_id_rh_src"
## [17] "pat_id_rh_src_x" "pat_id_rh"
## [19] "pat_id_rx_src" "pat_id_rx"
## [21] "chs_ser_dt" "chs_ser_type"
## [23] "chs_ser_result" "chs_vir_dt"
## [25] "chs_vir_result" "spe_pak_vlqlt"
## [27] "spe_uzb_vl_type" "dem_dob_know"
## [29] "dem_dob" "dem_age_val"
## [31] "dem_age" "dem_sex"
## [33] "dem_nationality" "dem_nationality_x"
## [35] "dem_work" "dem_work_x"
## [37] "spe_daw_work" "spe_daw_work_x"
## [39] "spe_yan_work" "spe_yan_work_x"
## [41] "dem_marital" "dem_marital_x"
## [43] "bsl_height" "bsl_hcv_tx_done"
## [45] "bsl_hcv_tx_completion_dt_y" "bsl_hcv_tx_peg_ifn"
## [47] "bsl_hcv_tx_ribavirin" "bsl_hcv_tx_sofosbuvir"
## [49] "bsl_hcv_tx_daclatasvir" "bsl_hcv_tx_ledipasvir"
## [51] "bsl_hcv_tx_telaprevir" "bsl_hcv_tx_boceprevir"
## [53] "bsl_hcv_tx_oth_1" "bsl_hcv_tx_oth_1_x"
## [55] "bsl_hcv_tx_oth_2" "bsl_hcv_tx_oth_2_x"
## [57] "bsl_hcv_tx_outc" "bsl_hbv_ser_vacci"
## [59] "bsl_hbv_ser_done" "bsl_hbv_ser_dt"
## [61] "bsl_hbv_ser_result" "cme_hiv_test_done"
## [63] "cme_hiv_test_year" "cme_hiv_test_result"
## [65] "cme_hiv_followed" "cme_hiv_cd4_dt"
## [67] "cme_hiv_cd4_result" "cme_hiv_stage"
## [69] "cme_hiv_tx_done" "cme_hiv_tx_dt_art"
## [71] "spe_uzb_hiv_dt" "spe_uzb_hiv_vl"
## [73] "cme_hiv_tx_dt" "cme_hiv_tx_regimen"
## [75] "cme_hiv_tx_drug_1" "cme_hiv_tx_drug_2"
## [77] "cme_hiv_tx_drug_3" "cme_hiv_tx_drug_4"
## [79] "cme_oth_tb" "cme_oth_hbv"
## [81] "cme_oth_card" "cme_oth_hbp"
## [83] "cme_oth_thalassemia" "cme_oth_fluke"
## [85] "cme_oth_renal_ipmt" "cme_oth_diabete"
## [87] "cme_dep_phq9" "rsk_fc_invasive_proc"
## [89] "rsk_fc_blood_trs" "spe_pak_inj"
## [91] "rsk_fc_worker_health" "rsk_fc_prisoner"
## [93] "rsk_fc_partner_hcv" "spe_pak_child"
## [95] "rsk_fc_transgender" "rsk_fc_msm"
## [97] "rsk_fc_worker_sex" "spe_pak_cage"
## [99] "add_alc_audit" "spe_man_alcohol"
## [101] "add_drg_use" "add_drg_opioid"
## [103] "add_drg_coke" "add_drg_amphet"
## [105] "add_drg_other" "add_drg_how"
## [107] "spe_uzb_drg_how" "spe_man_assist"
## [109] "add_drg_ost" "add_drg_syr"
## [111] "interview_complete" "vst_id_key"
## [113] "vst_site" "vst_type___c"
## [115] "vst_type___b" "vst_type___r"
## [117] "vst_dt" "vst_dt_bio"
## [119] "vst_interviewer" "cli_pulse"
## [121] "cli_weight" "cli_height_child"
## [123] "cli_bmi" "cli_liver_ascite"
## [125] "cli_liver_encephalo" "cmn_drug_consumpt"
## [127] "cmn_ost" "spe_man_oh_cons"
## [129] "cmn_new" "cmn_hiv_infection"
## [131] "cmn_hbv_infection" "cmn_tb"
## [133] "cmn_diabete" "cmn_malaria"
## [135] "cmn_dengue" "cmn_other"
## [137] "cmn_other_x" "bio_result_hemato"
## [139] "bio_result_hemato_dt" "bio_bld_rbc"
## [141] "bio_bld_wbc" "bio_bld_neutro_val"
## [143] "bio_bld_neutro_unit" "bio_bld_neutro"
## [145] "bio_bld_platelet" "bio_bld_hb_val"
## [147] "bio_bld_hb_unit" "bio_bld_hb"
## [149] "bio_bld_coag" "bio_result_bld_dt"
## [151] "bio_bld_qt" "bio_bld_inr"
## [153] "bio_bld_inr_cat" "bio_bld_bioch"
## [155] "bio_result_bc_dt" "bio_bc_alt"
## [157] "bio_bc_ast" "bio_bc_ast_uln"
## [159] "bio_bc_alb_val" "bio_bc_alb_unit"
## [161] "bio_bc_alb" "bio_bc_alb_cat"
## [163] "bio_bc_bilit_val" "bio_bc_bilit_unit"
## [165] "bio_bc_bilit" "bio_bc_bilit_cat"
## [167] "bio_bc_bilid_val" "bio_bc_bilid_unit"
## [169] "bio_bc_bilid" "bio_bc_glu_val"
## [171] "bio_bc_glu_unit" "bio_bc_glu"
## [173] "bio_bc_crea_val" "bio_bc_crea_unit"
## [175] "bio_bc_crea" "bio_bc_crcl"
## [177] "bio_bc_tsh_val" "bio_bc_tsh_unit"
## [179] "bio_bc_tsh" "bio_hc_results"
## [181] "bio_hc_vl_dt" "bio_hc_vl"
## [183] "bio_hc_vl_und" "bio_hc_gen_dt"
## [185] "bio_hc_gen_type" "bio_hc_gen_type_sub"
## [187] "bio_hc_fs_dt" "bio_hc_fs_kpa"
## [189] "bio_hc_fs_valid" "bio_hc_fs_stage"
## [191] "bio_hc_lb_dt" "bio_hc_lb_score"
## [193] "bio_hc_liver_apri" "bio_hc_liver_apri_calc"
## [195] "bio_hc_liver_plug" "bio_hc_liver_plug_calc"
## [197] "bio_hiv_results" "bio_hiv_dt"
## [199] "bio_ht_cd4" "bio_ht_vl"
## [201] "bio_ht_vl_und" "bio_ht"
## [203] "bio_var_results" "bio_mc_pregnancy_dt"
## [205] "bio_mc_pregnancy" "bio_mc_proteinurie_dt"
## [207] "bio_mc_proteinurie" "bio_mc_hbs_ag_dt"
## [209] "bio_mc_hbs_ag" "bio_mc_hbsab_dt"
## [211] "bio_mc_hbsab" "bio_mc_hbv_hbe_dt"
## [213] "bio_mc_hbv_hbe" "bio_mc_hbv_vl_dt"
## [215] "bio_mc_hbv_vl" "bio_mc_hcv_afp_dt"
## [217] "bio_mc_hcv_afp" "bio_mc_us_dt"
## [219] "bio_mc_us_nl" "bio_mc_us_hcc"
## [221] "bio_mc_endo" "bio_mc_endo_var"
## [223] "bio_mc_endo_bb" "hiv_care_positive"
## [225] "hiv_care_stage" "hiv_care_outcome"
## [227] "hiv_care_tx_regimen" "hiv_care_tx_drug_1"
## [229] "hiv_care_tx_drug_2" "hiv_care_tx_drug_3"
## [231] "hiv_care_tx_drug_4" "outc_eligible"
## [233] "outc_eligible_reason" "outc_eligible_reason_x"
## [235] "outc_tx_decision" "outc_refill_warning"
## [237] "outc_modif_reason_x" "outc_stop_reason_x"
## [239] "outc_protocole" "outc_protocole_reason"
## [241] "outc_protocole_x" "outc_tx_drug_1"
## [243] "outc_tx_drug_2" "outc_tx_drug_3"
## [245] "outc_child_drug_1" "outc_child_drug1_dos"
## [247] "outc_child_drug_2" "outc_child_drug2_dos"
## [249] "outc_child_drug_3" "outc_child_drug3_dos"
## [251] "outc_tx_drug_wk" "outc_tx_drug_wk_x"
## [253] "outc_tx_pill" "spe_yan_pill"
## [255] "spe_yan_pill_cs" "ade_np_depression"
## [257] "ade_np_mania" "ade_np_headache"
## [259] "ade_oc_eye_disorder" "ade_oc_loss_hearing"
## [261] "ade_gi_jaundice" "ade_gi_diarrhea"
## [263] "ade_gi_nausea" "ade_sd_rash"
## [265] "ade_sd_photo" "ade_sd_hairloss"
## [267] "ade_mr_cought" "ade_mr_arthralgia"
## [269] "ade_mr_cardio" "ade_em_diabete"
## [271] "ade_em_thyroid" "ade_bt_anemia"
## [273] "ade_bt_thrombocytopenia" "ade_bt_neutropenia"
## [275] "ade_bt_alt" "ade_mr_other"
## [277] "ade_mr_other_other" "ade_preg"
## [279] "ade_sae" "spe_hbv_ttt"
## [281] "spe_hbv_tx_decision" "spe_hbv_modif_reason"
## [283] "spe_hbv_stop_reason" "spe_hbv_ttt_drug"
## [285] "spe_hbv_ttt_other" "outc_next_visit"
## [287] "spe_man_cond_couns" "spe_man_couns_foc"
## [289] "spe_man_fn" "spe_man_adh"
## [291] "spe_man_dep" "spe_man_couns"
## [293] "spe_man_cond_priority" "spe_man_priority_dt"
## [295] "spe_man_priority" "spe_pp_mobility"
## [297] "spe_pp_selfcare" "spe_pp_activities"
## [299] "spe_pp_pain" "spe_pp_anxiety"
## [301] "spe_pp_health" "visit_complete"
# View the data in the viewer:
View(head(dat))
In this section, we will reshape, count and summarise the data in some tables.
It would be useful to know how many consultations each patient had. We can number each row for the same patient sequentially, then get the maximum visit number per patient to represent the total number of consultations they had. To do this, we will group_by()
patient ID:
dat <- dat %>%
# First we make sure blank cells are coded as NA:
mutate(
outc_end = na_if(outc_end, ""),
outc_end_reason = na_if(outc_end_reason, "")
) %>%
# Now we can group by patient ID:
group_by(pat_id) %>%
# Then we calculate the number of visits per patient:
mutate(
# Here we use n() to create a sequential number for each visit
visit_num = 1:n(),
# Then we get the maximum visit number for each patient:
visit_num_last = max(visit_num, na.rm = TRUE)
) %>%
# Now we ungroup as subsequent operations will not be grouped by patient:
ungroup %>%
# Finally, we use select to reorder the columns:
select(pat_id, redcap_event_name, visit_num, visit_num_last, everything())
How do we deal with all the data from the different follow-up consultations? This depends on the desired analysis.
Suppose we are only interested in the final outcome at the end of patient follow-up. This means we can filter the data and keep only the rows that correspond to the last visit for each patient:
# Make a new subset of the data:
dat_last_consultation <- dat %>%
# Keep only the rows corresponding to the last visit for each patient:
filter(visit_num == visit_num_last) %>%
# Select a few key variables for summarizing:
select(pat_id,
redcap_event_name,
visit_num,
visit_num_last,
outc_end,
outc_end_reason)
# Check whether patients reached the end of their follow-up:
dat_last_consultation %>%
tabyl(outc_end) %>%
qflextable
outc_end | n | percent | valid_percent |
---|---|---|---|
N | 1 | 0.002169197 | 0.002178649 |
Y | 458 | 0.993492408 | 0.997821351 |
2 | 0.004338395 |
# Summarise reasons for exiting the program:
dat_last_consultation %>%
tabyl(outc_end_reason) %>%
qflextable
outc_end_reason | n | percent | valid_percent |
---|---|---|---|
C | 396 | 0.859002169 | 0.864628821 |
D | 19 | 0.041214751 | 0.041484716 |
L | 23 | 0.049891540 | 0.050218341 |
S | 2 | 0.004338395 | 0.004366812 |
TO | 18 | 0.039045553 | 0.039301310 |
3 | 0.006507592 |
# Combine whether program complete and reasons for exit
dat_last_consultation %>%
tabyl(outc_end_reason, outc_end) %>%
qflextable
outc_end_reason | N | Y | NA_ |
---|---|---|---|
C | 0 | 396 | 0 |
D | 0 | 19 | 0 |
L | 0 | 23 | 0 |
S | 0 | 2 | 0 |
TO | 0 | 18 | 0 |
1 | 0 | 2 |
It may also be useful to summarise information from the first consultation (at enrollment into the program):
# Create subset for first consultation
dat_first_consultation <- dat %>%
# Keep only rows for first visit for each patient:
filter(visit_num == 1) %>%
# Select patient ID and site of enrollment:
select(pat_id, pat_enrolment_site)
# Create frequency table of enrollment sites:
dat_first_consultation %>%
tabyl(pat_enrolment_site) %>%
qflextable
pat_enrolment_site | n | percent |
---|---|---|
3 | 460 | 0.997830803 |
43 | 1 | 0.002169197 |
Now we have two different subsets, but both of them should contain the same patient IDs. We can join the two datasets together to create a summary table with information from both the first and the last consultation.
There are a couple of different ways to join data sets. We could use bind_cols()
(which will literally just bind the columns from the two data sets together). However, beware: this strategy won’t check to see if the two data sets are in the same order. There is a risk that information from two different patients could be combined in the same row, if the data sets are not in exactly the same order. Have a look at the help for more information (type ?bind_cols
in your console).
Reminder:
Check that:
# Create a new data set by binding the columns of the two data sets together:
dat_last_first_1 <-
bind_cols(
dat_last_consultation,
dat_first_consultation
)
A better way to combine the two data sets would be to use the dplyr::left_join()
function. This function uses an identifier column that is common to both data sets to match up the rows between them. In this case, we have a patient ID column in both data sets which we can use to match up the rows and join the two data sets together. We choose a left_join()
because this this will retain everything in the primary data set (first one in the join) and only rows from the second data set that have a match by patient ID in the first. Type ?left_join
in your console to see the help page and learn more. You can also have a look at the Epidemiologist R handbook chapter on joining data.
# Create a new data set by joining the two data sets together:
dat_last_first_2 <- left_join(
dat_last_consultation, # This is the primary data set
dat_first_consultation, # This is the secondary data set
by = "pat_id") # This is the column to match rows with
# Create a frequency table of patient enrollment site from the joined data:
dat_last_first_2 %>%
tabyl(pat_enrolment_site) %>%
qflextable
pat_enrolment_site | n | percent |
---|---|---|
3 | 460 | 0.997830803 |
43 | 1 | 0.002169197 |
# Create a frequency table of patient enrollment site by outcome status:
dat_last_first_2 %>%
tabyl(pat_enrolment_site, outc_end) %>%
qflextable
pat_enrolment_site | N | Y | NA_ |
---|---|---|---|
3 | 1 | 457 | 2 |
43 | 0 | 1 | 0 |
# Use gtsummary package instead to make the same cross-tabulation:
dat_last_first_2 %>%
# Select columns to use in summary table:
select(pat_enrolment_site, outc_end) %>%
tbl_summary(
# Stratify by outcome status:
by = outc_end,
# Include row percentages:
percent = "row",
# Make nice label for enrollment site:
label = list(pat_enrolment_site = "Enrolment site")) %>%
# Add totals:
add_overall() %>%
# Update the header to reflect the statistics used:
modify_header(
update = list(all_stat_cols() ~ "**{level}**<br>N = {n}",
stat_0 ~ "**Overall**<br>N = {N}")) %>%
# Make the labels bold:
bold_labels()
Characteristic | Overall N = 4591 |
N N = 11 |
Y N = 4581 |
---|---|---|---|
Enrolment site | |||
3 | 458 (100%) | 1 (0.2%) | 457 (100%) |
43 | 1 (100%) | 0 (0%) | 1 (100%) |
1 n (%) |
In the above section, we looked at how to combine two data sets column-wise. However, for our second problem (referral data), we have data for each month in a different excel sheet. These data need to be combined by row (they should all have the same column headers). The function we can use for this is row_bind()
(type ?row_bind
in your console to learn more):
# create database of consultations for patient 20000
dat_20000 <- dat %>% filter(pat_id == 20000)
# Create database of consultations for patient 20001
dat_20001 <- dat %>% filter(pat_id == 20001)
# Bind the two data sets together by row:
dat_20000_20001 <- bind_rows(
dat_20000, # database 1
dat_20001, # database 2
.id = "df_origin" # this column identifies the source database for each row
)
Your turn:
In the data
folder for this session, you will find some additional data sets; one is for people living with HIV and has a very similar structure to the Hepatitis C data (lots of follow-up visits for each patient). The other data sets are the referral data introduced earlier.
HIV Tout Survey
sub-folder, make summary tables of the first and last consultations (hint: use the same strategies that were used for the Hepatitis C data above)Referral pathway
sub-folder (choose one of the two files) combine all the data for each month into a single data.frame representing one year (hint: use bind_rows()
after importing the data)purrr::map()
here for suggestions on how to do this)