1 Introduction:

In this session, we will work with a dataset from the Epidemiologist R handbook chapter on univariable and multivariable regression, to first create some descriptive statistics in summary tables in different ways, then perform univariable and multivariable regression on a binary outcome measure (patient status at exit: recovered or died) to determine risk factors for death.

1.1 Packages and functions required:

You will need the following packages. In the project folder, we have also included some handy wrapper functions for creating publication-ready summary tables and plots of the results of your statistical analysis. You can import these R functions to your environment by using the source() command as shown in the chunk below:

# Ensures the package "pacman" is installed
if (!require("pacman")) install.packages("pacman")

# Check for, install and load the remaining required packages:
pacman::p_load(
  
  # Project and file management
  #############################
  here,     # file paths relative to R project root folder
  rio,      # import/export of many types of data
  openxlsx, # special functions for handling Excel workbooks
  
  # Package install and management
  ################################
  remotes,  # install from github
  
  # General data management
  #########################
  skimr,        # data exploration
  epikit,       # useful epi functions
  zscorer,      # Calculate z scores from weight and height
  
  # Tables and statistics  
  #######################
  janitor,      # tables and data cleaning
  gtsummary,    # making descriptive and statistical tables
  corrr,        # correlation matrices
  caret,        # other useful correlation functions
  broom,        # extract model results summary and convert to a table
  broom.mixed,  # more functions to extract and tidy model results
  parameters,   # Alternative package for tidying up model results 
  ztable,       # Make heatmap table
  
  # Models
  #########
  glmnet,         # needed to carry out variable selection
  lme4,           # for multivariable regression with random effects
  lmtest,         # for calculating likelihood ratios
  
  
  # Plots - general
  #################
  #ggplot2,         # included in tidyverse
  cowplot,          # combining plots  
  RColorBrewer,     # color scales
  gghighlight,      # highlight a subset
  ggrepel,          # smart labels
  ggExtra,          # fancy plots  
  tsibble,          # epiweeks
  viridis,          # colorblind-friendly scales
  scales,           # helper functions
  apyramid,         # age and sex pyramids
  see,              # forest plots and other statistical plots
  ggforestplot,     # Plot a forest plot of model results
  
  # Routine reports
  #################
  rmarkdown,        # produce PDFs, Word, Powerpoints, & HTML files
  
  # Tables for presentation
  #########################
  knitr,            # R Markdown report generation and html tables
  flextable,        # Publication-ready tables
  
  # Tidyverse:
  # dplyr, tidyr, tibble, readr, purrr, forcats, stringr, lubridate, ggplot2
  ##############
  tidyverse        # 9 packages for tidy data wrangling and presentation
  #tidylog         # Logs changes in structure of data in pipe chains
  
)        

# Install latex with tinytex package:
#tinytex::install_tinytex()


##########################################
# Set all flextable backgrounds to white:
set_flextable_defaults(background.color = "white")

1.2 Importing data:

You can import the dataset for this session from the data sub-folder in this project with rio and here packages as shown below. Note that the data is already clean, so we can dive straight into some descriptive statistics. Because the data is saved in .rds format, the column classes in the clean data are preserved.

# Import data:
linelist <- import(here("data", "linelist_cleaned.rds"))

1.3 Data exploration:

The data set is a line listing of COVID-19 patients. The outcome we are interested in is whether they had recovered or died by the time they exited the hospital (recorded in the column called outcome). There are a number of potential risk factors for death as an outcome that we can look at;

demographics (age or age group, sex, hospital)
patient characteristics (height and weight or BMI)
time between symptom onset and hospital admission (days_onset_hosp)
symptoms (e.g. fever, cough, temperature)
PCR cycle threshold as a proxy for viral load (smaller number = higher load)
Transmission context (source or ID of infector, generation time)

First of all, we can familiarise ourselves with the data using the skim() function:

# Summarise dataset with skim:
skim(linelist)

(#tab:data_exploration)Data summary
Name	linelist
Number of rows	5888
Number of columns	30
_______________________
Column type frequency:
character	13
Date	4
factor	2
numeric	11
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
case_id	0	1.00	6	6	5888
outcome	1323	0.78	5	7	2
gender	278	0.95	1	1	2
age_unit	0	1.00	5	6	2
hospital	0	1.00	5	36	6
infector	2088	0.65	6	6	2697
source	2088	0.65	5	7	2
fever	249	0.96	2	3	2
chills	249	0.96	2	3	2
cough	249	0.96	2	3	2
aches	249	0.96	2	3	2
vomit	249	0.96	2	3	2
time_admission	765	0.87	5	5	1072

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date_infection	2087	0.65	2014-03-19	2015-04-27	2014-10-11	359
date_onset	256	0.96	2014-04-07	2015-04-30	2014-10-23	367
date_hospitalisation	0	1.00	2014-04-17	2015-04-30	2014-10-23	363
date_outcome	936	0.84	2014-04-19	2015-06-04	2014-11-01	371

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
age_cat	86	0.99	FALSE	8	0-4: 1095, 5-9: 1095, 20-: 1073, 10-: 941
age_cat5	86	0.99	FALSE	17	0-4: 1095, 5-9: 1095, 10-: 941, 15-: 743

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
generation	0	1.00	16.56	5.79	0.00	13.00	16.00	20.00	37.00	▁▆▇▂▁
age	86	0.99	16.07	12.62	0.00	6.00	13.00	23.00	84.00	▇▅▁▁▁
age_years	86	0.99	16.02	12.64	0.00	6.00	13.00	23.00	84.00	▇▅▁▁▁
lon	0	1.00	-13.23	0.02	-13.27	-13.25	-13.23	-13.22	-13.21	▅▃▃▆▇
lat	0	1.00	8.47	0.01	8.45	8.46	8.47	8.48	8.49	▅▇▇▇▆
wt_kg	0	1.00	52.64	18.58	-11.00	41.00	54.00	66.00	111.00	▁▃▇▅▁
ht_cm	0	1.00	124.96	49.52	4.00	91.00	129.00	159.00	295.00	▂▅▇▂▁
ct_blood	0	1.00	21.21	1.69	16.00	20.00	22.00	22.00	26.00	▁▃▇▃▁
temp	149	0.97	38.56	0.98	35.20	38.20	38.80	39.20	40.80	▁▂▂▇▁
bmi	0	1.00	46.89	55.39	-1200.00	24.56	32.12	50.01	1250.00	▁▁▇▁▁
days_onset_hosp	256	0.96	2.06	2.26	0.00	1.00	1.00	3.00	22.00	▇▁▁▁▁

We can see that symptoms are currently recorded as character variables, but would be better represented as binary logical variables:

# Convert symptoms to binary variables:
linelist <- linelist %>% 
  # Convert no and yes to FALSE and TRUE, respectively:
  mutate(across(.cols = c(fever, chills, cough, aches, vomit), 
                .fns = ~case_when(.x == "yes" ~ TRUE, 
                                  .x == "no" ~ FALSE,
                                  .default = NA)))

We can also take a quick look at our outcome variable, outcome with the janitor::tabyl() function:

# Summary table of outcome:
tabyl(linelist, outcome)

##  outcome    n   percent valid_percent
##    Death 2582 0.4385190     0.5656079
##  Recover 1983 0.3367867     0.4343921
##     <NA> 1323 0.2246943            NA

We can see there are some missing values for outcome. These can be removed:

# Remove rows with missing values for outcome:
linelist <- linelist %>% 
  
  # Remove rows where outcome is NA:
  filter(!is.na(outcome))

The data is pretty clean already, but there are a few other things we could do to prepare it for analysis:

Missing hospitals have been coded as missing but should be NA
Gender is coded as f and m but would be better as female and male
Create a binary case definition, where ‘death’ is the outcome of interest

linelist <- linelist %>% 
  
  # Recode missing hospital as NA:
  mutate(hospital = na_if(hospital, "Missing")) %>% 

  # Recode gender:
  mutate(gender = case_match(
    gender, 
    "f" ~ "Female", 
    "m" ~ "Male", 
    .default = gender)) %>% 
  
  # Case definition:
  mutate(casedef = case_when(
    outcome == "Death" ~ TRUE, 
    outcome == "Recover" ~ FALSE,
    .default = NA
  ))

Now that our data is ready, we can begin to summarise it with some basic statistics.

2 Statistical analysis

2.1 Exercise 1

Descriptive tables with tabyl() and flextable()

We have already seen how easy it is to get a quick summary table with the tabyl() function from the {janitor} package. The function will provide us with the number and proportion of cases for each value in the variable.

However, we would probably want to tidy the basic output up before putting such a table in a report. Fortunately the {janitor} package has a number of adorn_ functions that do exactly this. We can convert the proportions to percentages, adjust how many decimal places are shown, and add row and column totals. We can also suppress missing values.

With the {flextable} package, we can convert our table to a nicely formatted table ready for inclusion in a report. We can adjust the way the header rows appear, give the column names nice labels, merge cells, make font bold or italic, change the background colour of cells, add borders and many more features. For a detailed introduction to using flextable, see the Tables for presentation section in the Epidemiologist R handbook.

Here is an example with the outcome variable:

# Create basic summary table with tabyl:
outcome_tab <- linelist %>% 
  
  # Basic table with missing values supressed:
  tabyl(outcome, show_na = FALSE) %>% 
  
  # Add Column totals (the default):
  adorn_totals() %>% 
  
  # Format proportions as a percentage with two digits after the decimal point:
  adorn_pct_formatting(digits = 2) %>% 
  
  # Change column names for merging later:
  rename(
    Outcome = outcome,
    Cases_n = n, 
    Cases_percent = percent) %>% 
  
  # Convert to a flextable:
  qflextable() %>% 
  
  # Change the theme to something nicer:
  theme_booktabs() %>% 
  
  # Add a line above the 'total' row (under second row):
  hline(i = 2, border = fp_border_default()) %>% 

  # Separate header:
  separate_header(
    opts = c("span-top", "center-hspan")
  ) %>% 
  
  # Give the column names nice labels:
  set_header_labels(
    Outcome = "Outcome",
    Cases_n = "n",
    Cases_percent = "%"
  ) %>%
  
  # Make the header row bold:
  bold(part = "header") %>% 
  
  # Right-align case numbers and percent columns:
  align(j = 2:3, align = "right", part = "all") %>%
  
  # Center align cases header title:
  align(i = 1, j = 2:3, align = "center", part = "header") %>% 
  
  # Make bottom row bold:
  bold(i = 3) %>% 
  
  # Make bottom row shaded light grey:
  bg(i = 3, bg = "lightgrey") %>% 
  
  # Autofit column width to size of header text:
  set_table_properties(layout = "autofit")


# Print the table:
outcome_tab

Outcome	Cases
Outcome	n	%
Death	2,582	56.56%
Recover	1,983	43.44%
Total	4,565	100.00%

Your turn:

Using the code above as a guide, try creating a summary table with:

the pecentage of cases in each hospital stratified by gender.
format the percentages so that there are no digits after the decimal point.
Add row totals so that the column headers are hospital, female, male, total.
Make the background colour of the row with other hospitals orange.

Click for hints

You can cross-tabulate with tabyl() by adding two column names instead of one.
To convert the nunbers in the cells to percent, use adorn_percent() before formatting.
To conditionally format a row, provide a logical condition after the tilde, e,g. bold(i = ~ Hospital == "Interesting value").
Don’t forget to count the number of rows (i = ...) or columns (j = ...) to indicate which ones you want to format .
Use ranges to refer to a group of columns or rows, e.g. 1:5

Click for solution

# Create basic summary table with tabyl:
hospgender_tab <- linelist %>% 
  
  # Basic crosstab of hospital stratified by gender with NAs supressed:
  tabyl(hospital, gender, show_na = FALSE) %>% 
  
  # Add row totals:
  adorn_totals(where = "col") %>%
  
  # Convert numbers to percent:
  adorn_percentages() %>% 
  
  # Format proportions as a percentage with two digits after the decimal point:
  adorn_pct_formatting(digits = 0) %>% 
  
  # Change column names for merging later:
  rename(
    Hospital = hospital,
    Cases_female = Female, 
    Cases_male = Male) %>% 
  
  # Convert to a flextable:
  qflextable() %>% 
  
  # Change the theme to something nicer:
  theme_booktabs() %>% 
  
  # Separate header:
  separate_header(
    opts = c("span-top", "center-hspan")
  ) %>% 
  
  # Give the column names nice labels:
  set_header_labels(
    Hospital = "Hospital",
    Cases_female = "Female",
    Cases_male = "Male"
  ) %>%
  
  # Make the header row bold:
  bold(part = "header") %>% 
  
  # Right-align case numbers and percent columns:
  align(j = 2:4, align = "right", part = "all") %>%
  
  # Center align cases header title:
  align(i = 1, j = 2:4, align = "center", part = "header") %>% 
  
  # Make bottom row bold:
  bold(i = 3) %>% 
  
  # Make bottom row shaded light grey:
  bg(i = ~ Hospital == "Other", bg = "orange") %>% 
  
  # Autofit column width to size of header text:
  set_table_properties(layout = "autofit")


# Print the table:
hospgender_tab

Hospital	Cases		Total
Hospital	Female	Male	Total
Central Hospital	47%	53%	100%
Military Hospital	49%	51%	100%
Other	48%	52%	100%
Port Hospital	52%	48%	100%
St. Mark's Maternity Hospital (SMMH)	52%	48%	100%

2.2 Exercise 2

Descriptive summary statistics with {dplyr}

Sometimes counts and percentages aren’t enough. If you need to calculate other simple summary metrics, such as medians or means for numeric variables, this can easily be done with the {dplyr} function summarise(). This is particularly useful when you want to calculate summary statistics for groups within your data.

Suppose we want to calculate the earliest, median and latest onset date by gender. We could do this with dplyr as below (once again we will use {flextable} to tidy up the aesthetics):

# Create summary statistics by gender:
gendersumtab <- linelist %>% 
  
  # Rename gender:
  rename(Sex = gender) %>% 
  
  # Drop missing values:
  drop_na(Sex, date_onset) %>% 
  
  # Define contents of summary table:
  summarise(
    # Group results by gender:
    .by = Sex,
    # Calculate number of cases:
    Cases = n(),
    # Calculate min onset date:
    Onset_min = min(date_onset, na.rm = TRUE),
    # Calculate median onset date:
    Onset_med = median(date_onset, na.rm = TRUE),
    # Calculate maximum onset date:
    Onset_max = max(date_onset, na.rm = TRUE)
    ) %>% 
  
  # Convert to flextable:
  qflextable() %>% 
  
  # Change theme:
  theme_booktabs() %>% 
  
  # Separate header:
  separate_header(
    opts = c("span-top", "center-hspan")
  ) %>% 
  
  # Give the column names nice labels:
  set_header_labels(
    Onset_min = "Earliest",
    Onset_med = "Median",
    Onset_max = "Latest"
  ) %>% 
  
  # Autofit column width to size of header text:
  set_table_properties(layout = "autofit")

# Print table:
gendersumtab

Sex	Cases	Onset
Sex	Cases	Earliest	Median	Latest
Female	2,076	2014-04-21	2014-10-21	2015-04-30
Male	2,077	2014-04-21	2014-10-21	2015-04-30

Notes:

For anyone who is more used to using american spelling, both the american spellings (often with a ‘z’) and the English spellings (often with an ‘s’) will work for function calls in tidyverse r packages interchangeably. Thus, summarize() and summarise() will both work.
In the code above, we passed the name of the column we want to stratify by (gender in this case) to the .by = ... argument of the summarise() function. Alternatively, we could use the group_by() function before summarise(). This will have the same effect on the summary as using the .by argument, but beware - the whole data set will remain grouped by the stratifying variable until you use the ungroup() function. See below for example:

# Create summary statistics by gender:
groupedll <- linelist %>% 
  
  # Drop missing values:
  drop_na(gender, date_onset) %>% 
  
  # Create a new variable, max age before grouping by anything:
  mutate(maxage = max(age_years, na.rm = TRUE)) %>% 
  
  # Now group the data by gender:
  group_by(gender) %>% 
  
  # Now create the new variable max age again, this time after the grouping:
  mutate(maxagegrouped = max(age_years, na.rm = TRUE))


# Are the two variables the same? 
# No, because one is calculated after grouping by gender and the other is not:

# Here is what max age before grouping looks like:
groupedll %>% tabyl(maxage, gender)

##  maxage Female Male
##      84   2076 2077

# And here is what it looks like after grouping by gender:
groupedll %>% tabyl(maxagegrouped, gender)

##  maxagegrouped Female Male
##             52   2076    0
##             84      0 2077

Here we can see that before grouping, there is only one maximum age for the whole data set (84 years). After grouping by gender, we see that the maximum age for females is 52 years, and for males is 84 years. If we now use the ungroup() function and try creating the same variable again, we should return to a single value (84 years) for the whole data set:

ungroupedll <- groupedll %>% 
  
  # Remove the grouping from the data:
  ungroup() %>% 
  
  # Now create the max age variable again:
  mutate(maxageungrouped = max(age_years, na.rm = TRUE))

  # Now create the table again:
  ungroupedll %>% tabyl(maxageungrouped, gender)

##  maxageungrouped Female Male
##               84   2076 2077

  # We can also show that the two variables 
  # before grouping and after un-grouping are identical:
  identical(ungroupedll$maxage, ungroupedll$maxageungrouped)

## [1] TRUE

The group_by() function can be very useful if you want to calculate something based on grouping by another variable, and you are intending to add the calculations as a new variable to your original data set. If on the other hand, you are creating a summary table, it may be safer to use the .by = ... argument of the summarise() function as this won’t affect the data set itself, only the new table of data created by the summarise() function.

Your turn:

Using the code above as a guide, try creating a summary table with:

the number of cases per hospital
exclude (filter out) hospitals that are other (name not specified)
the earliest, median and latest onset date by hospital
the median age by hospital.

Click for hints

You can use the filter() command to filter out hospitals: filter(!hospital %in% "Other")
Remeber to drop_na() for all the variables you will use to make the table (hospital, date_onset, age)
Use hospital as the grouping variable in the .by argument of summarise()
Give summary statistics that are based on the same variable (date_onset in this case) the same prefix so that the separate_header() function can be used to group them automatically in flextable

Click for solution

# Create summary statistics by gender:
hospsumtab <- linelist %>% 
  
  # Filter out hospital = 'other'
  filter(!hospital %in% "Other") %>% 

  # Rename Hospital:
  rename(Hospital = hospital) %>% 
  
  # Drop missing values:
  drop_na(Hospital, date_onset, age_years) %>% 
  
  # Define contents of summary table:
  summarise(
    # Group results by gender:
    .by = Hospital,
    # Calculate number of cases:
    Cases = n(),
    # Calculate min onset date:
    Onset_min = min(date_onset, na.rm = TRUE),
    # Calculate median onset date:
    Onset_med = median(date_onset, na.rm = TRUE),
    # Calculate maximum onset date:
    Onset_max = max(date_onset, na.rm = TRUE),
    # Calculate median age:
    Age_med = median(age_years, na.rm = TRUE)
    ) %>% 
  
  # Convert to flextable:
  qflextable() %>% 
  
  # Change theme:
  theme_booktabs() %>% 
  
  # Separate header:
  separate_header(
    opts = c("span-top", "center-hspan")
  ) %>% 
  
  # Give the column names nice labels:
  set_header_labels(
    Onset_min = "Earliest",
    Onset_med = "Median",
    Onset_max = "Latest",
    Age_med = "Median"
  ) %>% 
  
  # Autofit column width to size of header text:
  set_table_properties(layout = "autofit")

# Print table:
hospsumtab

Hospital	Cases	Onset			Age
Hospital	Cases	Earliest	Median	Latest	Median
St. Mark's Maternity Hospital (SMMH)	312	2014-05-01	2014-10-25	2015-04-27	12
Military Hospital	672	2014-05-09	2014-10-27	2015-04-28	14
Port Hospital	1,287	2014-05-03	2014-10-24	2015-04-30	13
Central Hospital	337	2014-05-01	2014-10-26	2015-04-28	13

2.3 Exercise 3

Demonstration of {gtsummary} for summary statistics

Often, the summary statistics we want to create are presented in the same format. The {gtsummary} package was designed to make tabulating summary statistics in a ready-to-publish format really easy.

The package will automatically calculate appropriate summary statistics depending on whether the variable to summarise is numeric, categorical or logical. To demonstrate this, we will make a table stratified by gender with a summary of age in years (numeric variable), hospital (categorical variable) and fever (a logical variable).

As for the other techniques, there are built in arguments to handle missing values.

We will make a descriptive summary table using the tbl_summary() function:

# Create summary table:
gtsumtab <- linelist %>% 
  
  # Filter out hospital = 'other'
  filter(!hospital %in% "Other") %>% 
  
  # Create the summary table:
  tbl_summary(
    # Select which variables to include:
    include = c(gender, age_years, hospital, fever), 
    # Select which variable to stratify by:
    by = gender, 
    # Specify how to handle missing values:
    missing = "no", 
    # Specify labels for columns:
    label = list(
      age_years ~ "Age (years)",
      hospital ~ "Hospital", 
      fever ~ "Has fever"
    ),
    # Specify sorting by frequency:
    sort = all_categorical() ~ "frequency"
  ) %>% 
  
  # Add overall summary figures:
  add_overall() %>% 
  
  # Format variable labels:
  bold_labels() %>% 
  italicize_labels() %>% 
  
  # Tidy up column headers:
  modify_header(
    label = "**Characteristic**",
    stat_0 = "**Overall**\nN = {N}",
    stat_1 = "**Female**\nN = {n}",
    stat_2 = "**Male**\nN = {n}"
    ) %>% 
  
  # Format as flextable:
  as_flex_table()
  

# Print the table:
gtsumtab

Characteristic	Overall N = 37061	Female N = 18661	Male N = 18401
Age (years)	13 (6, 23)	10 (5, 18)	17 (8, 28)
Hospital
Port Hospital	1,306 (50%)	673 (51%)	633 (48%)
Military Hospital	677 (26%)	329 (25%)	348 (27%)
Central Hospital	343 (13%)	162 (12%)	181 (14%)
St. Mark's Maternity Hospital (SMMH)	311 (12%)	162 (12%)	149 (11%)
Has fever	2,902 (82%)	1,453 (82%)	1,449 (82%)
1Median (IQR); n (%)

2.4 Excercise 4

Univariable analysis with {gtsummary}

In the dataset, we have a number of potential risk factors for death as an outcome. We can calculate odds ratios for individual risk factors and display the results collectively with the gtsummary::tbl_uvregression() function.

This function is actually a wrapper for the base r glm() function (to create generalised linear models). It is worth understanding how the base r model functions work, since we will need to use them later to create a multivariable model. The syntax for the glm() function for a univariable model (one outcome measure regressed against one explanatory variable) is as follows:

formula = outcome_measure ~ explanatory_variable
family = error_distribution(link = model_link_function) (to see the different options, type ?family())
data = name_of_dataset

Note that the tilde ~ is used to regress the outcome measure against explanatory variables. You will find the tilde on your keyboard in different places, depending on your keyboard language and layout. A good source for checking where to find it for a given keyboard layout is here.

Lets have a look at how to perform a GLM to obtain the odds of dying if the patient had a cough, compared to patients with no cough. We will use the binary logical case definition we prepared earlier ( casedef death = TRUE, recover = FALSE) as our outcome measure. We construct the arguments of the glm() function as follows:

formula = casedef ~ cough
family = binomial(link = "logit") (this will give odds ratios)
data = linelist (or alternatively, we can pipe the data set in as below)

Note: When the data set is not the first argument that a function takes, you need to name all the other arguments in the function (that don’t have default values or that you want to specify a value for) in order for R to understand what to do with the dataset when you pipe it in. In the example below, we already specified the formula and family arguments by name, so R automatically assigns the piped in data set to the data argument as that is the only mandatory argument left for which a value had not been assigned yet.

# Specify the data set:
model_cough <- linelist %>% 
  # Creat the model:
  glm(formula = casedef ~ cough, family = binomial(link = "logit"))

# Look at a summary of the results:
summary(model_cough)

## 
## Call:
## glm(formula = casedef ~ cough, family = binomial(link = "logit"), 
##     data = .)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.15569    0.08156   1.909   0.0563 .
## coughTRUE    0.11680    0.08795   1.328   0.1842  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5978.8  on 4363  degrees of freedom
## Residual deviance: 5977.0  on 4362  degrees of freedom
##   (201 observations deleted due to missingness)
## AIC: 5981
## 
## Number of Fisher Scoring iterations: 3

We can see from the p value (last column in the summary results) that having a cough is not significantly associated with death as an outcome (p > 0.05). However, the output is a little difficult to read, since the estimates are not expressed as odds ratios and we also don’t have confidence intervals (though we could work them out from the standard errors). In order to obtain odds ratios, we need to get the exponent of the estimate. We can do this by applying the exp() function to the second model coefficient (we are interested in the coefficient for coughTRUE, not the intercept):

# Convert the estimate for coughTRUE to odds ratio:
exp(model_cough$coefficients[2])

## coughTRUE 
##    1.1239

This tells us that the odds of dying with a cough are 1.1239 but we still need to obtain the 95% confidence intervals (we already know that they will cross 1 because the p value is not significant). The {broom} package has functions that tidy output from many different types of models available in base R into a table. This can include 95% confidence intervals, and also gives the option to display the exponent of the estimates (i.e. to display odds ratios).

The broom::tidy() function for GLM models takes the following arguments (see ?broom::tidy.glm):

x - the model object that we created with glm (in this case model_cough)
conf.int - logical, do we want to include confidence intervals (we do)
conf.level - what level of confidence interval to include (default is 95%)
exponentiate - logical, do we want the exponent of the estimates (we do)

Here is how the tidy table of results looks for our model:

mctidy <- model_cough %>% 
  
  tidy(
    conf.int = TRUE,    # We do want confidence intervals in the output
    conf.level = 0.95,  # We want 95% confidence intervals (the default)
    exponentiate = TRUE # We want the exponents (display results as odds ratios)
  )

# Print the tidied up model results:
mctidy

## # A tibble: 2 × 7
##   term        estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)     1.17    0.0816      1.91  0.0563    0.996      1.37
## 2 coughTRUE       1.12    0.0880      1.33  0.184     0.946      1.34

Now we have everything we need:

The odds of dying if you have a cough are 1.12
The 95% CI for these odds are 0.946 - 1.34 (crosses 1, as predicted)
The p value for having a cough is 0.184 (not significant)

This is already a much more helpful way of reviewing the model results. However, what if we want to conduct univariable analysis for several different risk factors at once, and put all the results in the same table? This is where the gtsummary::tbl_uvregression() function comes into its own. It does the following things:

takes a list of explanatory variables as input and iterates through them;
uses the base r glm() function to create the model for each variable;
tidies up the results into a table using broom::tidy()
combines the results for each explanatory variable into a single table;
formats the table into a publication-ready format with {flextable}

As an example, we will calculate the odds of dying for four risk factors:

different age categories age_cat (this is a multi-level factor)
gender (this is a factor with two levels)
fever (this is a binary logical variable)
chills (this is a binary logical variable)

Since tbl_uvregression() combines several different functions into one, the arguments from each of the separate functions are available, but they are given different names. The arguments we need are listed below, along with the values we will choose:

method = glm: run a generalised linear model
y = casedef: use casedef (death TRUE/FALSE) as the outcome variable
family = binomial(link = "logit"): this will compute odds ratios
exponentiate = TRUE: this will antilog the coefficients so that they are returned as odds ratios
hidden_n = TRUE: this will hide the totals column (not needed as we will join it to the summary table later)
label = varlabels: give variables nice names

For binary logical variables (TRUE/FALSE), FALSE will be used as the reference value and odds ratios will be calculated for when the value is TRUE.

For two-level factors (such as gender), the first factor level (by alphabetic order this is Female) will be used as the reference value and the other value (Male) will be compared to the reference. For multi-level factors (such as age category), the first level will be used as a reference (in this case 0-4 year olds) and all the other factor levels will be compared to it.

If you want to change which factor level is used as the reference, you can use the stats::relevel() function (part of base R) and specify which level you want to use with the ref = ... argument, e.g. to use the 70+ age group as a reference, within a piped workflow, we would do:

linelist %>% mutate(age_cat = relevel(ref = "70+"))

# Create logical case definition:
uvtab <- linelist %>% 
  
  # Select variables to calculate OR on:
  select(casedef, age_cat, gender, fever, chills) %>% 
  
  # Drop missing values:
  drop_na() %>% 
  
  # Run the univariable analysis:
  tbl_uvregression(                                   
      method = glm,
      y = casedef,                 
      method.args = list(family = binomial(link = "logit")), 
      exponentiate = TRUE, 
      add_estimate_to_reference_rows = TRUE,
      hide_n = TRUE,
      label = list(
        age_cat ~ "Age category (years)",
        gender ~ "Sex",
        fever ~ "Has fever", 
        chills ~ "Has chills"
      )
    ) %>%
    bold_p() %>% 
    modify_table_styling(
      columns = ci,
      rows = reference_row %in% TRUE,
      missing_symbol = "Ref."
    )

# Print the table:
uvtab

Characteristic	OR¹	95% CI¹	p-value
Age category (years)
0-4	1.00	Ref.
5-9	0.94	0.77, 1.15	0.5
10-14	1.15	0.93, 1.42	0.2
15-19	0.99	0.80, 1.24	>0.9
20-29	1.03	0.84, 1.26	0.8
30-49	1.07	0.85, 1.33	0.6
50-69	0.68	0.41, 1.13	0.13
70+	0.53	0.07, 3.20	0.5
Sex
Female	1.00	Ref.
Male	1.00	0.88, 1.13	>0.9
Has fever
FALSE	1.00	Ref.
TRUE	1.00	0.85, 1.17	>0.9
Has chills
FALSE	1.00	Ref.
TRUE	1.03	0.89, 1.21	0.7
¹ OR = Odds Ratio, CI = Confidence Interval

Your turn:

Using the code above as a guide, perform univariable analysis and try creating a table of the results using gtsummary::tbl_uvregression() with:

case definition as above (death as outcome)
calculate odds ratios for all symptoms in the data set
what symptoms are most associated with death as an outcome?

Click for hints

You can select different columns to include by changing the list of columns within the select() command
Remember to change the variable labels too;
Remember to remove missing values before piping in to tbl_uvregression()

Click for solution

symuvtab <- linelist %>% 
  
  # Select variables to calculate OR on:
  select(casedef, fever, chills, cough, aches, vomit, temp) %>% 
  
  # Drop missing values:
  drop_na() %>% 
  
  # Run the univariable analysis:
  tbl_uvregression(                                   
      method = glm,
      y = casedef,                 
      method.args = list(family = binomial(link = "logit")), 
      exponentiate = TRUE, 
      add_estimate_to_reference_rows = TRUE,
      hide_n = TRUE,
      label = list(
        fever ~ "Has fever", 
        chills ~ "Has chills", 
        cough ~ "Has cough",
        aches ~ "Has myalgia",
        vomit ~ "Has vomited", 
        temp ~ "Temperature (degrees centigrade)"
      )
    ) %>%
    bold_p() %>% 
    modify_table_styling(
      columns = ci,
      rows = reference_row %in% TRUE,
      missing_symbol = "Ref."
    )


# Print the table:
symuvtab

Characteristic	OR¹	95% CI¹	p-value
Has fever
FALSE	1.00	Ref.
TRUE	1.05	0.90, 1.22	0.6
Has chills
FALSE	1.00	Ref.
TRUE	1.08	0.93, 1.26	0.3
Has cough
FALSE	1.00	Ref.
TRUE	1.12	0.94, 1.34	0.2
Has myalgia
FALSE	1.00	Ref.
TRUE	0.95	0.78, 1.17	0.7
Has vomited
FALSE	1.00	Ref.
TRUE	1.10	0.98, 1.24	0.12
Temperature (degrees centigrade)	1.01	0.95, 1.08	0.7
¹ OR = Odds Ratio, CI = Confidence Interval

The results show that none of the symptoms listed significantly increase the odds of death occurring over recovery.

2.5 Exercise 5

Multivariable analysis with {gtsummary}

So far, we have explored how to perform a univariable analysis on individual risk factors in our data set. However, it may be more useful to look at the impact of specific risk factors when other risk factors are also taken into account. To assess this, we can construct a multivariable model.

The {gtsummary} package doesn’t have a single wrapper function to create multivariable models from scratch, because the individual variables can be combined in the model in different ways. The main combinations you will come across are:

linear relationships between the variables (where they are added together with a + symbol) and
interaction terms (where two variables are not completely independent and their interaction with each other is represented by an asterisk *).

For example, if we wanted to include height and weight in our model, we know they are not completely independent of each other, so it would be better to represent them with an interaction term. Suppose we wanted to look at age, gender, height and weight (where age and gender are independent of each other but height and weight are not) we would do:

formula = casedef ~ age + gender + (wt_kg * ht_cm)

This will show us how age, gender and weight for height impact on the odds of dying:

# Create a model based on patient demographics:
model_demographics <- linelist %>% 
  
  # Specify the model formula, type and link function:
  glm(
    formula = casedef ~ age + gender + (wt_kg * ht_cm), 
    family = binomial(link = "logit")
    ) %>% 
  
  # Tidy up and display the model results in a table:
    tidy(
    conf.int = TRUE,    # We do want confidence intervals in the output
    conf.level = 0.95,  # We want 95% confidence intervals (the default)
    exponentiate = TRUE # We want the exponents (display results as odds ratios)
  )

# Print the results:
model_demographics

## # A tibble: 6 × 7
##   term        estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)    1.14  0.168         0.755   0.450    0.817      1.58
## 2 age            0.991 0.00682      -1.28    0.201    0.978      1.00
## 3 genderMale     0.986 0.0681       -0.205   0.838    0.863      1.13
## 4 wt_kg          1.00  0.00449       0.796   0.426    0.995      1.01
## 5 ht_cm          1.00  0.00211       0.219   0.827    0.996      1.00
## 6 wt_kg:ht_cm    1.00  0.0000362     0.108   0.914    1.00       1.00

As you can see, variables separated by a plus symbol have individual results. Variables combined in an interaction term have both individual results and a combined result for the interaction.

Now that we have the syntax for creating a multivariable model, lets try to make a model with variables that are more likely to have an impact on death as an outcome:

age (include apriori - impacts other things such as BMI, exposures etc.)
gender (include apriori - potentially impacts other things)
bmi (are people with very low or very high BMI more likely to die?)
hospital (did some hospitals have worse outcomes than others?)
source (did exposure at funerals increase the odds of death?)
ct_blood (are people with high viral load - early CT - more likely to die?)
temp (are people with higher temperatures more likely to die?)
days_onset_hosp (are people who were hospitalised later more likely to die?)

First, we construct the model using the glm() function. Because we are not looking at any interaction terms, we can use simpler syntax, piping in the variables we want to include in the model and using the dot . as a placeholder for each of the variables, which will all be added to the model with the default approach (additive, represented by the + symbol):

# Pipe in the data set:
model_rf <- linelist %>% 
  
  # Filter out hospital = 'other'
  filter(!hospital %in% "Other") %>% 
  
  # Change the reference value for source to "other":
  mutate(source = relevel(x = factor(source), ref = "other")) %>% 
  
  # Select variables to put in the model:
  select(
    casedef, 
    age_years, 
    gender, 
    bmi,
    hospital, 
    source, 
    ct_blood, 
    temp, 
    days_onset_hosp
    ) %>% 
  
  # Construct the model:
  glm(formula = casedef ~ ., family = binomial(link = "logit"))

To view the results, we will use the gtsummary::tbl_regression() function:

# Pipe in the model:
mrf_res <- model_rf %>% 
  
  # Create the table of results:
  tbl_regression(
    # Exponentiate to get Odds Ratios
    exponentiate = TRUE, 
    # Add estimate to reference values
    add_estimate_to_reference_rows = TRUE, 
    # Specify labels for columns:
    label = list(
      age_years ~ "Age (years)",
      gender ~ "Sex",
      bmi ~ "Body mass index",
      hospital ~ "Hospital", 
      source = "Exposure event",
      ct_blood ~ "Virus PCR cycle threshold",
      temp ~ "Temperature (degrees centegrade)",
      days_onset_hosp ~ "Time to hospitalisation (days)"
    )) %>% 
  
  # Make significant results bold:
  bold_p() %>%
  
  # Add a custom label to indicate which value was used as a reference:
  modify_table_styling(
    columns = ci,
    rows = reference_row %in% TRUE,
    missing_symbol = "Ref."
    )

# Print the results:
mrf_res

Characteristic	OR¹	95% CI¹	p-value
Age (years)	0.99	0.98, 1.00	0.2
Sex
Female	1.00	Ref.
Male	1.14	0.92, 1.40	0.2
Body mass index	1.00	1.00, 1.00	0.2
Hospital
Central Hospital	1.00	Ref.
Military Hospital	1.03	0.74, 1.43	0.9
Port Hospital	1.00	0.74, 1.36	>0.9
St. Mark's Maternity Hospital (SMMH)	1.52	1.02, 2.28	0.042
Exposure event
other	1.00	Ref.
funeral	0.91	0.69, 1.21	0.5
Virus PCR cycle threshold	0.92	0.85, 0.99	0.025
Temperature (degrees centegrade)	1.04	0.94, 1.15	0.5
Time to hospitalisation (days)	0.87	0.82, 0.92	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

The table shows that we have three significant results from our model, one of which is a risk factor (odds ratio above 1) and two of which are protective (odds ratio less than 1):

hospital: Being at St Mark’s maternity hospital increases the odds of dying
ct_blood: having a higher PCR cycle threshold (=lower viral load) is protective against death
days_onset_hosp: longer periods between symptom onset and hospitalisation are protective against death

The first result (being at St Mark’s maternity hospital) might also be a proxy for another risk factor: pregnancy. We don’t have data on whether or not patients were pregnant, but we could investigate if this is likely by creating a new variable to indicate which patients are adult women of child-bearing age:

linelist <- linelist %>% 
  
  # Create variable to show whether patient is a woman of child bearing age
  # (female and aged between 12 and 45 years)
  mutate(wcba = case_when(
    gender == "Female" & between(x = age_years, left = 12, right = 45) ~ TRUE, 
    gender == "Female" & !between(x = age_years, left = 12, right = 45) ~ FALSE,
    gender == "Male" ~ FALSE, 
    .default = NA
  ))

# Create table of results:
wcbatab <- linelist %>% 

  tabyl(hospital, wcba) %>% 
  
  adorn_percentages() %>% 
  
  adorn_pct_formatting()

# Print the table:
wcbatab

##                              hospital FALSE  TRUE  NA_
##                      Central Hospital 75.4% 20.4% 4.2%
##                     Military Hospital 73.6% 22.0% 4.4%
##                                 Other 73.0% 22.2% 4.8%
##                         Port Hospital 73.5% 22.2% 4.3%
##  St. Mark's Maternity Hospital (SMMH) 72.6% 23.1% 4.3%
##                                  <NA> 73.5% 21.5% 5.0%

Surprisingly, the proportion of patients that are women of child-bearing age at St Mark’s maternity hospital is only slightly higher than the other (non-maternity) hospitals. If we view the data for that hospital, we can see that there are a number of patients with unlikely demographics, such as adult males. It is possible that the hospital was repurposed during the crisis and had other (general) wards as well - or alternatively this could be an artefact because we are after all looking at fake data ;-) Other reasons for a particular hospital being flagged might be if their procedures for dealing with very ill cases were not as good as the other hospitals, if they received more patients earlier on in the epidemic when procedures were not as well known, or due to the hospital being a referral centre for more acutely ill cases.

The second result (higher PCR cycle threshold being protective against death) is as we would expect, since high Ct is a proxy for low viral load and we would expect people with higher viral load to be more at risk of death.

The third significant result (reaching the hospital later in the episode of illness being protective against death) is a bit more nuanced - it is possible that people with very short periods between onset and hospitalisation were more accutely ill and therefore more at risk of dying, than those hospitalised later in their illness. If this were a real analysis, this result would need some further investigation and consideration of potential confounding factors (particularly one that could indicate severity of illness).

For now, there is one more thing that we can do to present our results in a way that would make them easier to interpret; we can use the gtsummary::tbl_merge() function to merge a table of descriptive statistics with analytic results. First we create the table of descriptive statistics for the variables in our model, as we did in exercise 3:

# Create summary table:
mrfdesctab <- linelist %>% 
  
  # Filter out hospital = 'other'
  filter(!hospital %in% "Other") %>% 
  
  # Change the reference value for source to "other":
  mutate(source = relevel(x = factor(source), ref = "other")) %>% 
  
  # Create the summary table:
  tbl_summary(
    # Select which variables to include:
    include = c(
      casedef, 
      age_years, 
      gender, 
      bmi,
      hospital, 
      source, 
      ct_blood, 
      temp, 
      days_onset_hosp), 
    # Select which variable to stratify by:
    by = casedef, 
    # Specify how to handle missing values:
    missing = "no", 
    # Specify labels for columns:
    label = list(
      age_years ~ "Age (years)",
      gender ~ "Sex",
      bmi ~ "Body mass index",
      hospital ~ "Hospital", 
      source = "Exposure event",
      ct_blood ~ "Virus PCR cycle threshold",
      temp ~ "Temperature (degrees centegrade)",
      days_onset_hosp ~ "Time to hospitalisation (days)"
    ),
    # Specify sorting by frequency:
    sort = all_categorical() ~ "frequency"
  ) %>% 
  
  # Add overall summary figures:
  add_overall() %>% 
  
  # Format variable labels:
  bold_labels() %>% 
  italicize_labels() %>% 
  
  # Tidy up column headers:
  modify_header(
    label = "**Characteristic**",
    stat_0 = "**Overall**\nN = {N}",
    stat_1 = "**Recovered**\nN = {n}",
    stat_2 = "**Died**\nN = {n}"
    )


# Print the table:
mrfdesctab

Characteristic	Overall N = 3880¹	Recovered N = 1693¹	Died N = 2187¹
Age (years)	13 (6, 23)	13 (6, 23)	13 (6, 23)
Sex
Female	1,866 (50%)	822 (51%)	1,044 (50%)
Male	1,840 (50%)	804 (49%)	1,036 (50%)
Body mass index	32 (25, 50)	32 (25, 50)	32 (25, 50)
Hospital
Port Hospital	1,364 (50%)	579 (49%)	785 (50%)
Military Hospital	708 (26%)	309 (26%)	399 (25%)
Central Hospital	358 (13%)	165 (14%)	193 (12%)
St. Mark's Maternity Hospital (SMMH)	325 (12%)	126 (11%)	199 (13%)
Exposure event
other	2,136 (85%)	915 (84%)	1,221 (86%)
funeral	375 (15%)	177 (16%)	198 (14%)
Virus PCR cycle threshold	22.00 (20.00, 22.00)	21.00 (20.00, 22.00)	22.00 (20.00, 22.00)
Temperature (degrees centegrade)	38.80 (38.20, 39.20)	38.80 (38.20, 39.23)	38.80 (38.20, 39.20)
Time to hospitalisation (days)	1.00 (1.00, 3.00)	1.00 (1.00, 3.00)	1.00 (1.00, 3.00)
¹ Median (IQR); n (%)

Now we can merge the two tables:

# Create the merged table:
final_restab <- tbl_merge(
  # List the tables to merge:
  tbls = list(mrfdesctab, mrf_res),
  # Give each table its own heading:
  tab_spanner = c("**Descriptive summary**", "**Multivariable model**")
)

# Print the merged table:
final_restab

Characteristic	Descriptive summary			Multivariable model
Characteristic	Overall N = 3880¹	Recovered N = 1693¹	Died N = 2187¹	OR²	95% CI²	p-value
Age (years)	13 (6, 23)	13 (6, 23)	13 (6, 23)	0.99	0.98, 1.00	0.2
Sex
Female	1,866 (50%)	822 (51%)	1,044 (50%)	1.00	Ref.
Male	1,840 (50%)	804 (49%)	1,036 (50%)	1.14	0.92, 1.40	0.2
Body mass index	32 (25, 50)	32 (25, 50)	32 (25, 50)	1.00	1.00, 1.00	0.2
Hospital
Port Hospital	1,364 (50%)	579 (49%)	785 (50%)	1.00	0.74, 1.36	>0.9
Military Hospital	708 (26%)	309 (26%)	399 (25%)	1.03	0.74, 1.43	0.9
Central Hospital	358 (13%)	165 (14%)	193 (12%)	1.00	Ref.
St. Mark's Maternity Hospital (SMMH)	325 (12%)	126 (11%)	199 (13%)	1.52	1.02, 2.28	0.042
Exposure event
other	2,136 (85%)	915 (84%)	1,221 (86%)	1.00	Ref.
funeral	375 (15%)	177 (16%)	198 (14%)	0.91	0.69, 1.21	0.5
Virus PCR cycle threshold	22.00 (20.00, 22.00)	21.00 (20.00, 22.00)	22.00 (20.00, 22.00)	0.92	0.85, 0.99	0.025
Temperature (degrees centegrade)	38.80 (38.20, 39.20)	38.80 (38.20, 39.23)	38.80 (38.20, 39.20)	1.04	0.94, 1.15	0.5
Time to hospitalisation (days)	1.00 (1.00, 3.00)	1.00 (1.00, 3.00)	1.00 (1.00, 3.00)	0.87	0.82, 0.92	<0.001
¹ Median (IQR); n (%)
² OR = Odds Ratio, CI = Confidence Interval

3 Further reading

For more detail on how to calculate and display descriptive statistics, see the following chapters of the Epidemiologist R handbook:

For details on how to conduct univariate and multivariable regression analysis, see:

Chapter 19: Univariate and multivariable regression

R coaching session 4 - statistics

Amy Mikhail

02 April 2024

1 Introduction:

1.1 Packages and functions required:

1.2 Importing data:

1.3 Data exploration:

2 Statistical analysis

2.1 Exercise 1

2.2 Exercise 2

2.3 Exercise 3

2.4 Excercise 4

2.5 Exercise 5

3 Further reading