In this session, we will work with a dataset from the Epidemiologist R handbook chapter on univariable and multivariable regression, to first create some descriptive statistics in summary tables in different ways, then perform univariable and multivariable regression on a binary outcome measure (patient status at exit: recovered or died) to determine risk factors for death.
You will need the following packages. In the project folder, we have also included some handy wrapper functions for creating publication-ready summary tables and plots of the results of your statistical analysis. You can import these R functions to your environment by using the source()
command as shown in the chunk below:
# Ensures the package "pacman" is installed
if (!require("pacman")) install.packages("pacman")
# Check for, install and load the remaining required packages:
pacman::p_load(
# Project and file management
#############################
here, # file paths relative to R project root folder
rio, # import/export of many types of data
openxlsx, # special functions for handling Excel workbooks
# Package install and management
################################
remotes, # install from github
# General data management
#########################
skimr, # data exploration
epikit, # useful epi functions
zscorer, # Calculate z scores from weight and height
# Tables and statistics
#######################
janitor, # tables and data cleaning
gtsummary, # making descriptive and statistical tables
corrr, # correlation matrices
caret, # other useful correlation functions
broom, # extract model results summary and convert to a table
broom.mixed, # more functions to extract and tidy model results
parameters, # Alternative package for tidying up model results
ztable, # Make heatmap table
# Models
#########
glmnet, # needed to carry out variable selection
lme4, # for multivariable regression with random effects
lmtest, # for calculating likelihood ratios
# Plots - general
#################
#ggplot2, # included in tidyverse
cowplot, # combining plots
RColorBrewer, # color scales
gghighlight, # highlight a subset
ggrepel, # smart labels
ggExtra, # fancy plots
tsibble, # epiweeks
viridis, # colorblind-friendly scales
scales, # helper functions
apyramid, # age and sex pyramids
see, # forest plots and other statistical plots
ggforestplot, # Plot a forest plot of model results
# Routine reports
#################
rmarkdown, # produce PDFs, Word, Powerpoints, & HTML files
# Tables for presentation
#########################
knitr, # R Markdown report generation and html tables
flextable, # Publication-ready tables
# Tidyverse:
# dplyr, tidyr, tibble, readr, purrr, forcats, stringr, lubridate, ggplot2
##############
tidyverse # 9 packages for tidy data wrangling and presentation
#tidylog # Logs changes in structure of data in pipe chains
)
# Install latex with tinytex package:
#tinytex::install_tinytex()
##########################################
# Set all flextable backgrounds to white:
set_flextable_defaults(background.color = "white")
You can import the dataset for this session from the data
sub-folder in this project with rio
and here
packages as shown below. Note that the data is already clean, so we can dive straight into some descriptive statistics. Because the data is saved in .rds
format, the column classes in the clean data are preserved.
# Import data:
linelist <- import(here("data", "linelist_cleaned.rds"))
The data set is a line listing of COVID-19 patients. The outcome we are interested in is whether they had recovered or died by the time they exited the hospital (recorded in the column called outcome
). There are a number of potential risk factors for death as an outcome that we can look at;
days_onset_hosp
)First of all, we can familiarise ourselves with the data using the skim()
function:
# Summarise dataset with skim:
skim(linelist)
Name | linelist |
Number of rows | 5888 |
Number of columns | 30 |
_______________________ | |
Column type frequency: | |
character | 13 |
Date | 4 |
factor | 2 |
numeric | 11 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
case_id | 0 | 1.00 | 6 | 6 | 0 | 5888 | 0 |
outcome | 1323 | 0.78 | 5 | 7 | 0 | 2 | 0 |
gender | 278 | 0.95 | 1 | 1 | 0 | 2 | 0 |
age_unit | 0 | 1.00 | 5 | 6 | 0 | 2 | 0 |
hospital | 0 | 1.00 | 5 | 36 | 0 | 6 | 0 |
infector | 2088 | 0.65 | 6 | 6 | 0 | 2697 | 0 |
source | 2088 | 0.65 | 5 | 7 | 0 | 2 | 0 |
fever | 249 | 0.96 | 2 | 3 | 0 | 2 | 0 |
chills | 249 | 0.96 | 2 | 3 | 0 | 2 | 0 |
cough | 249 | 0.96 | 2 | 3 | 0 | 2 | 0 |
aches | 249 | 0.96 | 2 | 3 | 0 | 2 | 0 |
vomit | 249 | 0.96 | 2 | 3 | 0 | 2 | 0 |
time_admission | 765 | 0.87 | 5 | 5 | 0 | 1072 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date_infection | 2087 | 0.65 | 2014-03-19 | 2015-04-27 | 2014-10-11 | 359 |
date_onset | 256 | 0.96 | 2014-04-07 | 2015-04-30 | 2014-10-23 | 367 |
date_hospitalisation | 0 | 1.00 | 2014-04-17 | 2015-04-30 | 2014-10-23 | 363 |
date_outcome | 936 | 0.84 | 2014-04-19 | 2015-06-04 | 2014-11-01 | 371 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
age_cat | 86 | 0.99 | FALSE | 8 | 0-4: 1095, 5-9: 1095, 20-: 1073, 10-: 941 |
age_cat5 | 86 | 0.99 | FALSE | 17 | 0-4: 1095, 5-9: 1095, 10-: 941, 15-: 743 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
generation | 0 | 1.00 | 16.56 | 5.79 | 0.00 | 13.00 | 16.00 | 20.00 | 37.00 | ▁▆▇▂▁ |
age | 86 | 0.99 | 16.07 | 12.62 | 0.00 | 6.00 | 13.00 | 23.00 | 84.00 | ▇▅▁▁▁ |
age_years | 86 | 0.99 | 16.02 | 12.64 | 0.00 | 6.00 | 13.00 | 23.00 | 84.00 | ▇▅▁▁▁ |
lon | 0 | 1.00 | -13.23 | 0.02 | -13.27 | -13.25 | -13.23 | -13.22 | -13.21 | ▅▃▃▆▇ |
lat | 0 | 1.00 | 8.47 | 0.01 | 8.45 | 8.46 | 8.47 | 8.48 | 8.49 | ▅▇▇▇▆ |
wt_kg | 0 | 1.00 | 52.64 | 18.58 | -11.00 | 41.00 | 54.00 | 66.00 | 111.00 | ▁▃▇▅▁ |
ht_cm | 0 | 1.00 | 124.96 | 49.52 | 4.00 | 91.00 | 129.00 | 159.00 | 295.00 | ▂▅▇▂▁ |
ct_blood | 0 | 1.00 | 21.21 | 1.69 | 16.00 | 20.00 | 22.00 | 22.00 | 26.00 | ▁▃▇▃▁ |
temp | 149 | 0.97 | 38.56 | 0.98 | 35.20 | 38.20 | 38.80 | 39.20 | 40.80 | ▁▂▂▇▁ |
bmi | 0 | 1.00 | 46.89 | 55.39 | -1200.00 | 24.56 | 32.12 | 50.01 | 1250.00 | ▁▁▇▁▁ |
days_onset_hosp | 256 | 0.96 | 2.06 | 2.26 | 0.00 | 1.00 | 1.00 | 3.00 | 22.00 | ▇▁▁▁▁ |
We can see that symptoms are currently recorded as character variables, but would be better represented as binary logical variables:
# Convert symptoms to binary variables:
linelist <- linelist %>%
# Convert no and yes to FALSE and TRUE, respectively:
mutate(across(.cols = c(fever, chills, cough, aches, vomit),
.fns = ~case_when(.x == "yes" ~ TRUE,
.x == "no" ~ FALSE,
.default = NA)))
We can also take a quick look at our outcome variable, outcome
with the janitor::tabyl()
function:
# Summary table of outcome:
tabyl(linelist, outcome)
## outcome n percent valid_percent
## Death 2582 0.4385190 0.5656079
## Recover 1983 0.3367867 0.4343921
## <NA> 1323 0.2246943 NA
We can see there are some missing values for outcome. These can be removed:
# Remove rows with missing values for outcome:
linelist <- linelist %>%
# Remove rows where outcome is NA:
filter(!is.na(outcome))
The data is pretty clean already, but there are a few other things we could do to prepare it for analysis:
NA
f
and m
but would be better as female
and male
linelist <- linelist %>%
# Recode missing hospital as NA:
mutate(hospital = na_if(hospital, "Missing")) %>%
# Recode gender:
mutate(gender = case_match(
gender,
"f" ~ "Female",
"m" ~ "Male",
.default = gender)) %>%
# Case definition:
mutate(casedef = case_when(
outcome == "Death" ~ TRUE,
outcome == "Recover" ~ FALSE,
.default = NA
))
Now that our data is ready, we can begin to summarise it with some basic statistics.
Descriptive tables with tabyl()
and flextable()
We have already seen how easy it is to get a quick summary table with the tabyl()
function from the {janitor}
package. The function will provide us with the number and proportion of cases for each value in the variable.
However, we would probably want to tidy the basic output up before putting such a table in a report. Fortunately the {janitor}
package has a number of adorn_
functions that do exactly this. We can convert the proportions to percentages, adjust how many decimal places are shown, and add row and column totals. We can also suppress missing values.
With the {flextable}
package, we can convert our table to a nicely formatted table ready for inclusion in a report. We can adjust the way the header rows appear, give the column names nice labels, merge cells, make font bold or italic, change the background colour of cells, add borders and many more features. For a detailed introduction to using flextable, see the Tables for presentation section in the Epidemiologist R handbook.
Here is an example with the outcome
variable:
# Create basic summary table with tabyl:
outcome_tab <- linelist %>%
# Basic table with missing values supressed:
tabyl(outcome, show_na = FALSE) %>%
# Add Column totals (the default):
adorn_totals() %>%
# Format proportions as a percentage with two digits after the decimal point:
adorn_pct_formatting(digits = 2) %>%
# Change column names for merging later:
rename(
Outcome = outcome,
Cases_n = n,
Cases_percent = percent) %>%
# Convert to a flextable:
qflextable() %>%
# Change the theme to something nicer:
theme_booktabs() %>%
# Add a line above the 'total' row (under second row):
hline(i = 2, border = fp_border_default()) %>%
# Separate header:
separate_header(
opts = c("span-top", "center-hspan")
) %>%
# Give the column names nice labels:
set_header_labels(
Outcome = "Outcome",
Cases_n = "n",
Cases_percent = "%"
) %>%
# Make the header row bold:
bold(part = "header") %>%
# Right-align case numbers and percent columns:
align(j = 2:3, align = "right", part = "all") %>%
# Center align cases header title:
align(i = 1, j = 2:3, align = "center", part = "header") %>%
# Make bottom row bold:
bold(i = 3) %>%
# Make bottom row shaded light grey:
bg(i = 3, bg = "lightgrey") %>%
# Autofit column width to size of header text:
set_table_properties(layout = "autofit")
# Print the table:
outcome_tab
Outcome | Cases | |
---|---|---|
n | % | |
Death | 2,582 | 56.56% |
Recover | 1,983 | 43.44% |
Total | 4,565 | 100.00% |
Your turn:
Using the code above as a guide, try creating a summary table with:
hospital
stratified by gender
.other
hospitals orange.tabyl()
by adding two column names instead of one.adorn_percent()
before formatting.bold(i = ~ Hospital == "Interesting value")
.i = ...
) or columns (j = ...
) to indicate which ones you want to format .1:5
# Create basic summary table with tabyl:
hospgender_tab <- linelist %>%
# Basic crosstab of hospital stratified by gender with NAs supressed:
tabyl(hospital, gender, show_na = FALSE) %>%
# Add row totals:
adorn_totals(where = "col") %>%
# Convert numbers to percent:
adorn_percentages() %>%
# Format proportions as a percentage with two digits after the decimal point:
adorn_pct_formatting(digits = 0) %>%
# Change column names for merging later:
rename(
Hospital = hospital,
Cases_female = Female,
Cases_male = Male) %>%
# Convert to a flextable:
qflextable() %>%
# Change the theme to something nicer:
theme_booktabs() %>%
# Separate header:
separate_header(
opts = c("span-top", "center-hspan")
) %>%
# Give the column names nice labels:
set_header_labels(
Hospital = "Hospital",
Cases_female = "Female",
Cases_male = "Male"
) %>%
# Make the header row bold:
bold(part = "header") %>%
# Right-align case numbers and percent columns:
align(j = 2:4, align = "right", part = "all") %>%
# Center align cases header title:
align(i = 1, j = 2:4, align = "center", part = "header") %>%
# Make bottom row bold:
bold(i = 3) %>%
# Make bottom row shaded light grey:
bg(i = ~ Hospital == "Other", bg = "orange") %>%
# Autofit column width to size of header text:
set_table_properties(layout = "autofit")
# Print the table:
hospgender_tab
Hospital | Cases | Total | |
---|---|---|---|
Female | Male | ||
Central Hospital | 47% | 53% | 100% |
Military Hospital | 49% | 51% | 100% |
Other | 48% | 52% | 100% |
Port Hospital | 52% | 48% | 100% |
St. Mark's Maternity Hospital (SMMH) | 52% | 48% | 100% |
Descriptive summary statistics with {dplyr}
Sometimes counts and percentages aren’t enough. If you need to calculate other simple summary metrics, such as medians or means for numeric variables, this can easily be done with the {dplyr}
function summarise()
. This is particularly useful when you want to calculate summary statistics for groups within your data.
Suppose we want to calculate the earliest, median and latest onset date by gender. We could do this with dplyr as below (once again we will use {flextable}
to tidy up the aesthetics):
# Create summary statistics by gender:
gendersumtab <- linelist %>%
# Rename gender:
rename(Sex = gender) %>%
# Drop missing values:
drop_na(Sex, date_onset) %>%
# Define contents of summary table:
summarise(
# Group results by gender:
.by = Sex,
# Calculate number of cases:
Cases = n(),
# Calculate min onset date:
Onset_min = min(date_onset, na.rm = TRUE),
# Calculate median onset date:
Onset_med = median(date_onset, na.rm = TRUE),
# Calculate maximum onset date:
Onset_max = max(date_onset, na.rm = TRUE)
) %>%
# Convert to flextable:
qflextable() %>%
# Change theme:
theme_booktabs() %>%
# Separate header:
separate_header(
opts = c("span-top", "center-hspan")
) %>%
# Give the column names nice labels:
set_header_labels(
Onset_min = "Earliest",
Onset_med = "Median",
Onset_max = "Latest"
) %>%
# Autofit column width to size of header text:
set_table_properties(layout = "autofit")
# Print table:
gendersumtab
Sex | Cases | Onset | ||
---|---|---|---|---|
Earliest | Median | Latest | ||
Female | 2,076 | 2014-04-21 | 2014-10-21 | 2015-04-30 |
Male | 2,077 | 2014-04-21 | 2014-10-21 | 2015-04-30 |
Notes:
summarize()
and summarise()
will both work..by = ...
argument of the summarise()
function. Alternatively, we could use the group_by()
function before summarise()
. This will have the same effect on the summary as using the .by
argument, but beware - the whole data set will remain grouped by the stratifying variable until you use the ungroup()
function. See below for example:# Create summary statistics by gender:
groupedll <- linelist %>%
# Drop missing values:
drop_na(gender, date_onset) %>%
# Create a new variable, max age before grouping by anything:
mutate(maxage = max(age_years, na.rm = TRUE)) %>%
# Now group the data by gender:
group_by(gender) %>%
# Now create the new variable max age again, this time after the grouping:
mutate(maxagegrouped = max(age_years, na.rm = TRUE))
# Are the two variables the same?
# No, because one is calculated after grouping by gender and the other is not:
# Here is what max age before grouping looks like:
groupedll %>% tabyl(maxage, gender)
## maxage Female Male
## 84 2076 2077
# And here is what it looks like after grouping by gender:
groupedll %>% tabyl(maxagegrouped, gender)
## maxagegrouped Female Male
## 52 2076 0
## 84 0 2077
Here we can see that before grouping, there is only one maximum age for the whole data set (84 years). After grouping by gender, we see that the maximum age for females is 52 years, and for males is 84 years. If we now use the ungroup()
function and try creating the same variable again, we should return to a single value (84 years) for the whole data set:
ungroupedll <- groupedll %>%
# Remove the grouping from the data:
ungroup() %>%
# Now create the max age variable again:
mutate(maxageungrouped = max(age_years, na.rm = TRUE))
# Now create the table again:
ungroupedll %>% tabyl(maxageungrouped, gender)
## maxageungrouped Female Male
## 84 2076 2077
# We can also show that the two variables
# before grouping and after un-grouping are identical:
identical(ungroupedll$maxage, ungroupedll$maxageungrouped)
## [1] TRUE
The group_by()
function can be very useful if you want to calculate something based on grouping by another variable, and you are intending to add the calculations as a new variable to your original data set. If on the other hand, you are creating a summary table, it may be safer to use the .by = ...
argument of the summarise()
function as this won’t affect the data set itself, only the new table of data created by the summarise()
function.
Your turn:
Using the code above as a guide, try creating a summary table with:
hospital
other
(name not specified)filter()
command to filter out hospitals: filter(!hospital %in% "Other")
drop_na()
for all the variables you will use to make the table (hospital, date_onset, age).by
argument of summarise()
date_onset
in this case) the same prefix so that the separate_header()
function can be used to group them automatically in flextable
# Create summary statistics by gender:
hospsumtab <- linelist %>%
# Filter out hospital = 'other'
filter(!hospital %in% "Other") %>%
# Rename Hospital:
rename(Hospital = hospital) %>%
# Drop missing values:
drop_na(Hospital, date_onset, age_years) %>%
# Define contents of summary table:
summarise(
# Group results by gender:
.by = Hospital,
# Calculate number of cases:
Cases = n(),
# Calculate min onset date:
Onset_min = min(date_onset, na.rm = TRUE),
# Calculate median onset date:
Onset_med = median(date_onset, na.rm = TRUE),
# Calculate maximum onset date:
Onset_max = max(date_onset, na.rm = TRUE),
# Calculate median age:
Age_med = median(age_years, na.rm = TRUE)
) %>%
# Convert to flextable:
qflextable() %>%
# Change theme:
theme_booktabs() %>%
# Separate header:
separate_header(
opts = c("span-top", "center-hspan")
) %>%
# Give the column names nice labels:
set_header_labels(
Onset_min = "Earliest",
Onset_med = "Median",
Onset_max = "Latest",
Age_med = "Median"
) %>%
# Autofit column width to size of header text:
set_table_properties(layout = "autofit")
# Print table:
hospsumtab
Hospital | Cases | Onset | Age | ||
---|---|---|---|---|---|
Earliest | Median | Latest | Median | ||
St. Mark's Maternity Hospital (SMMH) | 312 | 2014-05-01 | 2014-10-25 | 2015-04-27 | 12 |
Military Hospital | 672 | 2014-05-09 | 2014-10-27 | 2015-04-28 | 14 |
Port Hospital | 1,287 | 2014-05-03 | 2014-10-24 | 2015-04-30 | 13 |
Central Hospital | 337 | 2014-05-01 | 2014-10-26 | 2015-04-28 | 13 |
Demonstration of {gtsummary}
for summary statistics
Often, the summary statistics we want to create are presented in the same format. The {gtsummary}
package was designed to make tabulating summary statistics in a ready-to-publish format really easy.
The package will automatically calculate appropriate summary statistics depending on whether the variable to summarise is numeric, categorical or logical. To demonstrate this, we will make a table stratified by gender with a summary of age in years (numeric variable), hospital (categorical variable) and fever (a logical variable).
As for the other techniques, there are built in arguments to handle missing values.
We will make a descriptive summary table using the tbl_summary()
function:
# Create summary table:
gtsumtab <- linelist %>%
# Filter out hospital = 'other'
filter(!hospital %in% "Other") %>%
# Create the summary table:
tbl_summary(
# Select which variables to include:
include = c(gender, age_years, hospital, fever),
# Select which variable to stratify by:
by = gender,
# Specify how to handle missing values:
missing = "no",
# Specify labels for columns:
label = list(
age_years ~ "Age (years)",
hospital ~ "Hospital",
fever ~ "Has fever"
),
# Specify sorting by frequency:
sort = all_categorical() ~ "frequency"
) %>%
# Add overall summary figures:
add_overall() %>%
# Format variable labels:
bold_labels() %>%
italicize_labels() %>%
# Tidy up column headers:
modify_header(
label = "**Characteristic**",
stat_0 = "**Overall**\nN = {N}",
stat_1 = "**Female**\nN = {n}",
stat_2 = "**Male**\nN = {n}"
) %>%
# Format as flextable:
as_flex_table()
# Print the table:
gtsumtab
Characteristic | Overall | Female | Male |
---|---|---|---|
Age (years) | 13 (6, 23) | 10 (5, 18) | 17 (8, 28) |
Hospital | |||
Port Hospital | 1,306 (50%) | 673 (51%) | 633 (48%) |
Military Hospital | 677 (26%) | 329 (25%) | 348 (27%) |
Central Hospital | 343 (13%) | 162 (12%) | 181 (14%) |
St. Mark's Maternity Hospital (SMMH) | 311 (12%) | 162 (12%) | 149 (11%) |
Has fever | 2,902 (82%) | 1,453 (82%) | 1,449 (82%) |
1Median (IQR); n (%) |
Univariable analysis with {gtsummary}
In the dataset, we have a number of potential risk factors for death as an outcome. We can calculate odds ratios for individual risk factors and display the results collectively with the gtsummary::tbl_uvregression()
function.
This function is actually a wrapper for the base r glm()
function (to create generalised linear models). It is worth understanding how the base r model functions work, since we will need to use them later to create a multivariable model. The syntax for the glm()
function for a univariable model (one outcome measure regressed against one explanatory variable) is as follows:
formula = outcome_measure ~ explanatory_variable
family = error_distribution(link = model_link_function)
(to see the different options, type ?family()
)data = name_of_dataset
Note that the tilde ~
is used to regress the outcome measure against explanatory variables. You will find the tilde on your keyboard in different places, depending on your keyboard language and layout. A good source for checking where to find it for a given keyboard layout is here.
Lets have a look at how to perform a GLM to obtain the odds of dying if the patient had a cough, compared to patients with no cough. We will use the binary logical case definition we prepared earlier ( casedef
death = TRUE, recover = FALSE) as our outcome measure. We construct the arguments of the glm()
function as follows:
formula = casedef ~ cough
family = binomial(link = "logit")
(this will give odds ratios)data = linelist
(or alternatively, we can pipe the data set in as below)Note: When the data set is not the first argument that a function takes, you need to name all the other arguments in the function (that don’t have default values or that you want to specify a value for) in order for R to understand what to do with the dataset when you pipe it in. In the example below, we already specified the formula
and family
arguments by name, so R automatically assigns the piped in data set to the data
argument as that is the only mandatory argument left for which a value had not been assigned yet.
# Specify the data set:
model_cough <- linelist %>%
# Creat the model:
glm(formula = casedef ~ cough, family = binomial(link = "logit"))
# Look at a summary of the results:
summary(model_cough)
##
## Call:
## glm(formula = casedef ~ cough, family = binomial(link = "logit"),
## data = .)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.15569 0.08156 1.909 0.0563 .
## coughTRUE 0.11680 0.08795 1.328 0.1842
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5978.8 on 4363 degrees of freedom
## Residual deviance: 5977.0 on 4362 degrees of freedom
## (201 observations deleted due to missingness)
## AIC: 5981
##
## Number of Fisher Scoring iterations: 3
We can see from the p value (last column in the summary results) that having a cough is not significantly associated with death as an outcome (p > 0.05). However, the output is a little difficult to read, since the estimates are not expressed as odds ratios and we also don’t have confidence intervals (though we could work them out from the standard errors). In order to obtain odds ratios, we need to get the exponent of the estimate. We can do this by applying the exp()
function to the second model coefficient (we are interested in the coefficient for coughTRUE
, not the intercept):
# Convert the estimate for coughTRUE to odds ratio:
exp(model_cough$coefficients[2])
## coughTRUE
## 1.1239
This tells us that the odds of dying with a cough are 1.1239
but we still need to obtain the 95% confidence intervals (we already know that they will cross 1 because the p value is not significant). The {broom}
package has functions that tidy output from many different types of models available in base R into a table. This can include 95% confidence intervals, and also gives the option to display the exponent of the estimates (i.e. to display odds ratios).
The broom::tidy()
function for GLM models takes the following arguments (see ?broom::tidy.glm
):
x
- the model object that we created with glm (in this case model_cough
)conf.int
- logical, do we want to include confidence intervals (we do)conf.level
- what level of confidence interval to include (default is 95%)exponentiate
- logical, do we want the exponent of the estimates (we do)Here is how the tidy table of results looks for our model:
mctidy <- model_cough %>%
tidy(
conf.int = TRUE, # We do want confidence intervals in the output
conf.level = 0.95, # We want 95% confidence intervals (the default)
exponentiate = TRUE # We want the exponents (display results as odds ratios)
)
# Print the tidied up model results:
mctidy
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.17 0.0816 1.91 0.0563 0.996 1.37
## 2 coughTRUE 1.12 0.0880 1.33 0.184 0.946 1.34
Now we have everything we need:
1.12
0.946 - 1.34
(crosses 1, as predicted)0.184
(not significant)This is already a much more helpful way of reviewing the model results. However, what if we want to conduct univariable analysis for several different risk factors at once, and put all the results in the same table? This is where the gtsummary::tbl_uvregression()
function comes into its own. It does the following things:
glm()
function to create the model for each variable;broom::tidy()
{flextable}
As an example, we will calculate the odds of dying for four risk factors:
age_cat
(this is a multi-level factor)gender
(this is a factor with two levels)fever
(this is a binary logical variable)chills
(this is a binary logical variable)Since tbl_uvregression()
combines several different functions into one, the arguments from each of the separate functions are available, but they are given different names. The arguments we need are listed below, along with the values we will choose:
method = glm
: run a generalised linear modely = casedef
: use casedef (death TRUE/FALSE) as the outcome variablefamily = binomial(link = "logit")
: this will compute odds ratiosexponentiate = TRUE
: this will antilog the coefficients so that they are returned as odds ratioshidden_n = TRUE
: this will hide the totals column (not needed as we will join it to the summary table later)label = varlabels
: give variables nice namesFor binary logical variables (TRUE/FALSE), FALSE
will be used as the reference value and odds ratios will be calculated for when the value is TRUE
.
For two-level factors (such as gender), the first factor level (by alphabetic order this is Female
) will be used as the reference value and the other value (Male
) will be compared to the reference. For multi-level factors (such as age category), the first level will be used as a reference (in this case 0-4 year olds) and all the other factor levels will be compared to it.
If you want to change which factor level is used as the reference, you can use the stats::relevel()
function (part of base R) and specify which level you want to use with the ref = ...
argument, e.g. to use the 70+ age group as a reference, within a piped workflow, we would do:
linelist %>% mutate(age_cat = relevel(ref = "70+"))
# Create logical case definition:
uvtab <- linelist %>%
# Select variables to calculate OR on:
select(casedef, age_cat, gender, fever, chills) %>%
# Drop missing values:
drop_na() %>%
# Run the univariable analysis:
tbl_uvregression(
method = glm,
y = casedef,
method.args = list(family = binomial(link = "logit")),
exponentiate = TRUE,
add_estimate_to_reference_rows = TRUE,
hide_n = TRUE,
label = list(
age_cat ~ "Age category (years)",
gender ~ "Sex",
fever ~ "Has fever",
chills ~ "Has chills"
)
) %>%
bold_p() %>%
modify_table_styling(
columns = ci,
rows = reference_row %in% TRUE,
missing_symbol = "Ref."
)
# Print the table:
uvtab
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
Age category (years) | |||
0-4 | 1.00 | Ref. | |
5-9 | 0.94 | 0.77, 1.15 | 0.5 |
10-14 | 1.15 | 0.93, 1.42 | 0.2 |
15-19 | 0.99 | 0.80, 1.24 | >0.9 |
20-29 | 1.03 | 0.84, 1.26 | 0.8 |
30-49 | 1.07 | 0.85, 1.33 | 0.6 |
50-69 | 0.68 | 0.41, 1.13 | 0.13 |
70+ | 0.53 | 0.07, 3.20 | 0.5 |
Sex | |||
Female | 1.00 | Ref. | |
Male | 1.00 | 0.88, 1.13 | >0.9 |
Has fever | |||
FALSE | 1.00 | Ref. | |
TRUE | 1.00 | 0.85, 1.17 | >0.9 |
Has chills | |||
FALSE | 1.00 | Ref. | |
TRUE | 1.03 | 0.89, 1.21 | 0.7 |
1 OR = Odds Ratio, CI = Confidence Interval |
Your turn:
Using the code above as a guide, perform univariable analysis and try creating a table of the results using gtsummary::tbl_uvregression()
with:
select()
commandtbl_uvregression()
symuvtab <- linelist %>%
# Select variables to calculate OR on:
select(casedef, fever, chills, cough, aches, vomit, temp) %>%
# Drop missing values:
drop_na() %>%
# Run the univariable analysis:
tbl_uvregression(
method = glm,
y = casedef,
method.args = list(family = binomial(link = "logit")),
exponentiate = TRUE,
add_estimate_to_reference_rows = TRUE,
hide_n = TRUE,
label = list(
fever ~ "Has fever",
chills ~ "Has chills",
cough ~ "Has cough",
aches ~ "Has myalgia",
vomit ~ "Has vomited",
temp ~ "Temperature (degrees centigrade)"
)
) %>%
bold_p() %>%
modify_table_styling(
columns = ci,
rows = reference_row %in% TRUE,
missing_symbol = "Ref."
)
# Print the table:
symuvtab
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
Has fever | |||
FALSE | 1.00 | Ref. | |
TRUE | 1.05 | 0.90, 1.22 | 0.6 |
Has chills | |||
FALSE | 1.00 | Ref. | |
TRUE | 1.08 | 0.93, 1.26 | 0.3 |
Has cough | |||
FALSE | 1.00 | Ref. | |
TRUE | 1.12 | 0.94, 1.34 | 0.2 |
Has myalgia | |||
FALSE | 1.00 | Ref. | |
TRUE | 0.95 | 0.78, 1.17 | 0.7 |
Has vomited | |||
FALSE | 1.00 | Ref. | |
TRUE | 1.10 | 0.98, 1.24 | 0.12 |
Temperature (degrees centigrade) | 1.01 | 0.95, 1.08 | 0.7 |
1 OR = Odds Ratio, CI = Confidence Interval |
The results show that none of the symptoms listed significantly increase the odds of death occurring over recovery.
Multivariable analysis with {gtsummary}
So far, we have explored how to perform a univariable analysis on individual risk factors in our data set. However, it may be more useful to look at the impact of specific risk factors when other risk factors are also taken into account. To assess this, we can construct a multivariable model.
The {gtsummary}
package doesn’t have a single wrapper function to create multivariable models from scratch, because the individual variables can be combined in the model in different ways. The main combinations you will come across are:
+
symbol) and*
).For example, if we wanted to include height and weight in our model, we know they are not completely independent of each other, so it would be better to represent them with an interaction term. Suppose we wanted to look at age, gender, height and weight (where age and gender are independent of each other but height and weight are not) we would do:
formula = casedef ~ age + gender + (wt_kg * ht_cm)
This will show us how age, gender and weight for height impact on the odds of dying:
# Create a model based on patient demographics:
model_demographics <- linelist %>%
# Specify the model formula, type and link function:
glm(
formula = casedef ~ age + gender + (wt_kg * ht_cm),
family = binomial(link = "logit")
) %>%
# Tidy up and display the model results in a table:
tidy(
conf.int = TRUE, # We do want confidence intervals in the output
conf.level = 0.95, # We want 95% confidence intervals (the default)
exponentiate = TRUE # We want the exponents (display results as odds ratios)
)
# Print the results:
model_demographics
## # A tibble: 6 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.14 0.168 0.755 0.450 0.817 1.58
## 2 age 0.991 0.00682 -1.28 0.201 0.978 1.00
## 3 genderMale 0.986 0.0681 -0.205 0.838 0.863 1.13
## 4 wt_kg 1.00 0.00449 0.796 0.426 0.995 1.01
## 5 ht_cm 1.00 0.00211 0.219 0.827 0.996 1.00
## 6 wt_kg:ht_cm 1.00 0.0000362 0.108 0.914 1.00 1.00
As you can see, variables separated by a plus symbol have individual results. Variables combined in an interaction term have both individual results and a combined result for the interaction.
Now that we have the syntax for creating a multivariable model, lets try to make a model with variables that are more likely to have an impact on death as an outcome:
First, we construct the model using the glm()
function. Because we are not looking at any interaction terms, we can use simpler syntax, piping in the variables we want to include in the model and using the dot .
as a placeholder for each of the variables, which will all be added to the model with the default approach (additive, represented by the +
symbol):
# Pipe in the data set:
model_rf <- linelist %>%
# Filter out hospital = 'other'
filter(!hospital %in% "Other") %>%
# Change the reference value for source to "other":
mutate(source = relevel(x = factor(source), ref = "other")) %>%
# Select variables to put in the model:
select(
casedef,
age_years,
gender,
bmi,
hospital,
source,
ct_blood,
temp,
days_onset_hosp
) %>%
# Construct the model:
glm(formula = casedef ~ ., family = binomial(link = "logit"))
To view the results, we will use the gtsummary::tbl_regression()
function:
# Pipe in the model:
mrf_res <- model_rf %>%
# Create the table of results:
tbl_regression(
# Exponentiate to get Odds Ratios
exponentiate = TRUE,
# Add estimate to reference values
add_estimate_to_reference_rows = TRUE,
# Specify labels for columns:
label = list(
age_years ~ "Age (years)",
gender ~ "Sex",
bmi ~ "Body mass index",
hospital ~ "Hospital",
source = "Exposure event",
ct_blood ~ "Virus PCR cycle threshold",
temp ~ "Temperature (degrees centegrade)",
days_onset_hosp ~ "Time to hospitalisation (days)"
)) %>%
# Make significant results bold:
bold_p() %>%
# Add a custom label to indicate which value was used as a reference:
modify_table_styling(
columns = ci,
rows = reference_row %in% TRUE,
missing_symbol = "Ref."
)
# Print the results:
mrf_res
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
Age (years) | 0.99 | 0.98, 1.00 | 0.2 |
Sex | |||
Female | 1.00 | Ref. | |
Male | 1.14 | 0.92, 1.40 | 0.2 |
Body mass index | 1.00 | 1.00, 1.00 | 0.2 |
Hospital | |||
Central Hospital | 1.00 | Ref. | |
Military Hospital | 1.03 | 0.74, 1.43 | 0.9 |
Port Hospital | 1.00 | 0.74, 1.36 | >0.9 |
St. Mark's Maternity Hospital (SMMH) | 1.52 | 1.02, 2.28 | 0.042 |
Exposure event | |||
other | 1.00 | Ref. | |
funeral | 0.91 | 0.69, 1.21 | 0.5 |
Virus PCR cycle threshold | 0.92 | 0.85, 0.99 | 0.025 |
Temperature (degrees centegrade) | 1.04 | 0.94, 1.15 | 0.5 |
Time to hospitalisation (days) | 0.87 | 0.82, 0.92 | <0.001 |
1 OR = Odds Ratio, CI = Confidence Interval |
The table shows that we have three significant results from our model, one of which is a risk factor (odds ratio above 1) and two of which are protective (odds ratio less than 1):
hospital
: Being at St Mark’s maternity hospital increases the odds of dyingct_blood
: having a higher PCR cycle threshold (=lower viral load) is protective against deathdays_onset_hosp
: longer periods between symptom onset and hospitalisation are protective against deathThe first result (being at St Mark’s maternity hospital) might also be a proxy for another risk factor: pregnancy. We don’t have data on whether or not patients were pregnant, but we could investigate if this is likely by creating a new variable to indicate which patients are adult women of child-bearing age:
linelist <- linelist %>%
# Create variable to show whether patient is a woman of child bearing age
# (female and aged between 12 and 45 years)
mutate(wcba = case_when(
gender == "Female" & between(x = age_years, left = 12, right = 45) ~ TRUE,
gender == "Female" & !between(x = age_years, left = 12, right = 45) ~ FALSE,
gender == "Male" ~ FALSE,
.default = NA
))
# Create table of results:
wcbatab <- linelist %>%
tabyl(hospital, wcba) %>%
adorn_percentages() %>%
adorn_pct_formatting()
# Print the table:
wcbatab
## hospital FALSE TRUE NA_
## Central Hospital 75.4% 20.4% 4.2%
## Military Hospital 73.6% 22.0% 4.4%
## Other 73.0% 22.2% 4.8%
## Port Hospital 73.5% 22.2% 4.3%
## St. Mark's Maternity Hospital (SMMH) 72.6% 23.1% 4.3%
## <NA> 73.5% 21.5% 5.0%
Surprisingly, the proportion of patients that are women of child-bearing age at St Mark’s maternity hospital is only slightly higher than the other (non-maternity) hospitals. If we view the data for that hospital, we can see that there are a number of patients with unlikely demographics, such as adult males. It is possible that the hospital was repurposed during the crisis and had other (general) wards as well - or alternatively this could be an artefact because we are after all looking at fake data ;-) Other reasons for a particular hospital being flagged might be if their procedures for dealing with very ill cases were not as good as the other hospitals, if they received more patients earlier on in the epidemic when procedures were not as well known, or due to the hospital being a referral centre for more acutely ill cases.
The second result (higher PCR cycle threshold being protective against death) is as we would expect, since high Ct is a proxy for low viral load and we would expect people with higher viral load to be more at risk of death.
The third significant result (reaching the hospital later in the episode of illness being protective against death) is a bit more nuanced - it is possible that people with very short periods between onset and hospitalisation were more accutely ill and therefore more at risk of dying, than those hospitalised later in their illness. If this were a real analysis, this result would need some further investigation and consideration of potential confounding factors (particularly one that could indicate severity of illness).
For now, there is one more thing that we can do to present our results in a way that would make them easier to interpret; we can use the gtsummary::tbl_merge()
function to merge a table of descriptive statistics with analytic results. First we create the table of descriptive statistics for the variables in our model, as we did in exercise 3:
# Create summary table:
mrfdesctab <- linelist %>%
# Filter out hospital = 'other'
filter(!hospital %in% "Other") %>%
# Change the reference value for source to "other":
mutate(source = relevel(x = factor(source), ref = "other")) %>%
# Create the summary table:
tbl_summary(
# Select which variables to include:
include = c(
casedef,
age_years,
gender,
bmi,
hospital,
source,
ct_blood,
temp,
days_onset_hosp),
# Select which variable to stratify by:
by = casedef,
# Specify how to handle missing values:
missing = "no",
# Specify labels for columns:
label = list(
age_years ~ "Age (years)",
gender ~ "Sex",
bmi ~ "Body mass index",
hospital ~ "Hospital",
source = "Exposure event",
ct_blood ~ "Virus PCR cycle threshold",
temp ~ "Temperature (degrees centegrade)",
days_onset_hosp ~ "Time to hospitalisation (days)"
),
# Specify sorting by frequency:
sort = all_categorical() ~ "frequency"
) %>%
# Add overall summary figures:
add_overall() %>%
# Format variable labels:
bold_labels() %>%
italicize_labels() %>%
# Tidy up column headers:
modify_header(
label = "**Characteristic**",
stat_0 = "**Overall**\nN = {N}",
stat_1 = "**Recovered**\nN = {n}",
stat_2 = "**Died**\nN = {n}"
)
# Print the table:
mrfdesctab
Characteristic | Overall N = 38801 | Recovered N = 16931 | Died N = 21871 |
---|---|---|---|
Age (years) | 13 (6, 23) | 13 (6, 23) | 13 (6, 23) |
Sex | |||
Female | 1,866 (50%) | 822 (51%) | 1,044 (50%) |
Male | 1,840 (50%) | 804 (49%) | 1,036 (50%) |
Body mass index | 32 (25, 50) | 32 (25, 50) | 32 (25, 50) |
Hospital | |||
Port Hospital | 1,364 (50%) | 579 (49%) | 785 (50%) |
Military Hospital | 708 (26%) | 309 (26%) | 399 (25%) |
Central Hospital | 358 (13%) | 165 (14%) | 193 (12%) |
St. Mark's Maternity Hospital (SMMH) | 325 (12%) | 126 (11%) | 199 (13%) |
Exposure event | |||
other | 2,136 (85%) | 915 (84%) | 1,221 (86%) |
funeral | 375 (15%) | 177 (16%) | 198 (14%) |
Virus PCR cycle threshold | 22.00 (20.00, 22.00) | 21.00 (20.00, 22.00) | 22.00 (20.00, 22.00) |
Temperature (degrees centegrade) | 38.80 (38.20, 39.20) | 38.80 (38.20, 39.23) | 38.80 (38.20, 39.20) |
Time to hospitalisation (days) | 1.00 (1.00, 3.00) | 1.00 (1.00, 3.00) | 1.00 (1.00, 3.00) |
1 Median (IQR); n (%) |
Now we can merge the two tables:
# Create the merged table:
final_restab <- tbl_merge(
# List the tables to merge:
tbls = list(mrfdesctab, mrf_res),
# Give each table its own heading:
tab_spanner = c("**Descriptive summary**", "**Multivariable model**")
)
# Print the merged table:
final_restab
Characteristic | Descriptive summary | Multivariable model | ||||
---|---|---|---|---|---|---|
Overall N = 38801 | Recovered N = 16931 | Died N = 21871 | OR2 | 95% CI2 | p-value | |
Age (years) | 13 (6, 23) | 13 (6, 23) | 13 (6, 23) | 0.99 | 0.98, 1.00 | 0.2 |
Sex | ||||||
Female | 1,866 (50%) | 822 (51%) | 1,044 (50%) | 1.00 | Ref. | |
Male | 1,840 (50%) | 804 (49%) | 1,036 (50%) | 1.14 | 0.92, 1.40 | 0.2 |
Body mass index | 32 (25, 50) | 32 (25, 50) | 32 (25, 50) | 1.00 | 1.00, 1.00 | 0.2 |
Hospital | ||||||
Port Hospital | 1,364 (50%) | 579 (49%) | 785 (50%) | 1.00 | 0.74, 1.36 | >0.9 |
Military Hospital | 708 (26%) | 309 (26%) | 399 (25%) | 1.03 | 0.74, 1.43 | 0.9 |
Central Hospital | 358 (13%) | 165 (14%) | 193 (12%) | 1.00 | Ref. | |
St. Mark's Maternity Hospital (SMMH) | 325 (12%) | 126 (11%) | 199 (13%) | 1.52 | 1.02, 2.28 | 0.042 |
Exposure event | ||||||
other | 2,136 (85%) | 915 (84%) | 1,221 (86%) | 1.00 | Ref. | |
funeral | 375 (15%) | 177 (16%) | 198 (14%) | 0.91 | 0.69, 1.21 | 0.5 |
Virus PCR cycle threshold | 22.00 (20.00, 22.00) | 21.00 (20.00, 22.00) | 22.00 (20.00, 22.00) | 0.92 | 0.85, 0.99 | 0.025 |
Temperature (degrees centegrade) | 38.80 (38.20, 39.20) | 38.80 (38.20, 39.23) | 38.80 (38.20, 39.20) | 1.04 | 0.94, 1.15 | 0.5 |
Time to hospitalisation (days) | 1.00 (1.00, 3.00) | 1.00 (1.00, 3.00) | 1.00 (1.00, 3.00) | 0.87 | 0.82, 0.92 | <0.001 |
1 Median (IQR); n (%) | ||||||
2 OR = Odds Ratio, CI = Confidence Interval |
For more detail on how to calculate and display descriptive statistics, see the following chapters of the Epidemiologist R handbook:
For details on how to conduct univariate and multivariable regression analysis, see: