Also plays well with labelled data
The gtsummary
package in R creates amazing publication / presentation / whatever-you-need-it-for ready tables of summary statistics. Try it out!
A colleague learning R just told me that he spent 45 minutes searching for a summary table function and couldn’t quite find anything that met his needs. When I showed him gtsummary
in 5 minutes, his reaction was all
This blog post is to promote gtsummary
and make it more searchable for those still seeking the one table to rule them all. The gtsummary
documentation is excellent so I won’t cover all of its awesome functionality, but I will add a bit of my specific experience.
If you are still searching for your favorite table package, here are two round up resources:
How to make beautiful tables in R by R for the Rest of Us (2019).
My favourite R package for: summarising data by Dabbling with data (2018)
I’ll demonstrate with the Youth Risk Behavior Surveillance System (YRBSS) data; my previous post Leveraging labelled data in R has more background details. You can download the .sav
data from my github repo.
This material was developed using:
Software / package | Version |
---|---|
R | 4.0.5 |
RStudio | 1.4.1103 |
tidyverse |
1.3.0 |
here |
1.0.1 |
haven |
2.3.1 |
labelled |
2.5.0 |
gtsummary |
1.3.5 |
writexl |
1.4.0 |
First, I import the data with haven::read_sav()
.
Then I keep just two years of data for comparison on three question.
Here is a quick preview of the data.
# print data for viewing----
dat
# A tibble: 30,389 x 5
record year stheight q12 q69
<dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl>
1 1349684 2015 NA NA 3 [Stay the same weight]
2 1349685 2015 NA 1 [0 days] 3 [Stay the same weight]
3 1349686 2015 NA NA NA
4 1349687 2015 NA 1 [0 days] 1 [Lose weight]
5 1349688 2015 NA NA 2 [Gain weight]
6 1349689 2015 NA 1 [0 days] 3 [Stay the same weight]
7 1349690 2015 NA NA 2 [Gain weight]
8 1349691 2015 NA 1 [0 days] 1 [Lose weight]
9 1349692 2015 NA 3 [2 or 3 days] 1 [Lose weight]
10 1349693 2015 NA 1 [0 days] 3 [Stay the same weight]
# ... with 30,379 more rows
I recommend reading Leveraging labelled data in R for more details on labelled data, but for now here is a quick preview of the metadata.
# create data dictionary ----
dictionary <- labelled::look_for(dat, details = TRUE)
# view select items from dictionary ----
dictionary %>%
dplyr::select(variable, label, value_labels) %>%
knitr::kable()
variable | label | value_labels |
---|---|---|
record | Record ID | |
year | 4-digit Year of survey | |
stheight | Height in meters | |
q12 | Weapon carrying | [1] 0 days; [2] 1 day; [3] 2 or 3 days; [4] 4 or 5 days; [5] 6 or more days |
q69 | Weight loss | [1] Lose weight; [2] Gain weight; [3] Stay the same weight; [4] Not trying to do anything |
To get a quick summary table, first retain only the variables you want to see in the summary table. In this case, I removed record
to avoid summary statistics of an ID. Next, because the imported data has value labels from the .sav
data file, I converted those variables to a factor for summarizing.
dat %>%
# remove from summary table ----
dplyr::select(-record) %>%
# convert labelled values to a factor ----
mutate_at(vars(matches("q")), haven::as_factor) %>%
# create a table with columns for year ----
gtsummary::tbl_summary(by = year)
Characteristic | 2015, N = 15,6241 | 2017, N = 14,7651 |
---|---|---|
Height in meters | 1.68 (1.60, 1.75) | 1.68 (1.60, 1.75) |
Unknown | 1,266 | 1,619 |
Weapon carrying | ||
0 days | 11,897 (82%) | 10,027 (85%) |
1 day | 494 (3.4%) | 349 (3.0%) |
2 or 3 days | 580 (4.0%) | 398 (3.4%) |
4 or 5 days | 202 (1.4%) | 166 (1.4%) |
6 or more days | 1,250 (8.7%) | 798 (6.8%) |
Unknown | 1,201 | 3,027 |
Weight loss | ||
Lose weight | 6,676 (48%) | 5,462 (47%) |
Gain weight | 2,439 (18%) | 2,247 (19%) |
Stay the same weight | 2,354 (17%) | 1,833 (16%) |
Not trying to do anything | 2,375 (17%) | 1,999 (17%) |
Unknown | 1,780 | 3,224 |
1
Statistics presented: Median (IQR); n (%)
|
So easy to obtain, and so readable!
And wait - did you see that?! The raw data had variable names of q12
, stheight
, and q69
but the table printed the variable label! (I previously tweeted about the awesome package pairing of haven
and gtsummary
.) If your data does not come with handy labels, you can create them with the label
option in tbl_summary
or with the var_label
function in the labelled
package.
Here are a few modifications you might be interested in trying to customize your table, including adding an overall column, custom statistic formatting, and table styling. Note that there is an overall N that corresponds to the number of observations, and each each variable can have its own N that corresponds to the number of non-missing observations for that variable.
dat %>%
# remove from summary table
dplyr::select(-record) %>%
# covert labelled values to a factor ----
mutate_at(vars(matches("q")), haven::as_factor) %>%
tbl_summary(by = year,
# custom statistic formats ----
statistic = list(all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{p}% ({n} / {N})")
) %>%
add_overall() %>% # add an overall column
add_n() %>% # add column with total number of non-missing observations
add_p() %>% # test for a difference between groups
bold_labels()
Characteristic | N | Overall, N = 30,3891 | 2015, N = 15,6241 | 2017, N = 14,7651 | p-value2 |
---|---|---|---|---|---|
Height in meters | 27,504 | 1.69 (0.10) | 1.69 (0.10) | 1.69 (0.10) | 0.12 |
Unknown | 2,885 | 1,266 | 1,619 | ||
Weapon carrying | 26,161 | <0.001 | |||
0 days | 84% (21,924 / 26,161) | 82% (11,897 / 14,423) | 85% (10,027 / 11,738) | ||
1 day | 3.2% (843 / 26,161) | 3.4% (494 / 14,423) | 3.0% (349 / 11,738) | ||
2 or 3 days | 3.7% (978 / 26,161) | 4.0% (580 / 14,423) | 3.4% (398 / 11,738) | ||
4 or 5 days | 1.4% (368 / 26,161) | 1.4% (202 / 14,423) | 1.4% (166 / 11,738) | ||
6 or more days | 7.8% (2,048 / 26,161) | 8.7% (1,250 / 14,423) | 6.8% (798 / 11,738) | ||
Unknown | 4,228 | 1,201 | 3,027 | ||
Weight loss | 25,385 | <0.001 | |||
Lose weight | 48% (12,138 / 25,385) | 48% (6,676 / 13,844) | 47% (5,462 / 11,541) | ||
Gain weight | 18% (4,686 / 25,385) | 18% (2,439 / 13,844) | 19% (2,247 / 11,541) | ||
Stay the same weight | 16% (4,187 / 25,385) | 17% (2,354 / 13,844) | 16% (1,833 / 11,541) | ||
Not trying to do anything | 17% (4,374 / 25,385) | 17% (2,375 / 13,844) | 17% (1,999 / 11,541) | ||
Unknown | 5,004 | 1,780 | 3,224 | ||
1
Statistics presented: Mean (SD); % (n / N)
2
Statistical tests performed: Wilcoxon rank-sum test; chi-square test of independence
|
In addition, gtsummary
makes an educated guess on how to summarize your data and which statistical test to use. Pay attention to the footnote on the statistical tests performed and adjust if needed with the test
argument in the add_p
function..
One default I frequently correct is treatment of discrete numeric values. For example, consider a rating scale with possible values of 1, 2, 3, … 7, but in which respondents only select values of 3, 4, 5. Making an educated guess and only seeing three unique values, gtsummary
will treat this as a categorical variable and return frequencies of those values; however, you may still want a mean. You can override these potentially undesirable defaults in gtsummary
. 🙌
This data set does not have a great example of this, so I’ll make one. I am going to pretend that q12
is numeric for demonstration,
dat %>%
# keep q12 for summary table ----
dplyr::select(q12) %>%
# convert labelled values to numeric by removing value labels ----
mutate_at(vars(q12), haven::zap_labels) %>%
# subtract one to make pretend example more realistic ----
mutate_at(vars(q12), ~ . - 1) %>%
# create a table with columns for year ----
gtsummary::tbl_summary()
Characteristic | N = 30,3891 |
---|---|
Weapon carrying | |
0 | 21,924 (84%) |
1 | 843 (3.2%) |
2 | 978 (3.7%) |
3 | 368 (1.4%) |
4 | 2,048 (7.8%) |
Unknown | 4,228 |
1
Statistics presented: n (%)
|
In this table, a 4 represents carrying a weapon to school 4 days in the past month. Notice the default is to provide a frequency summary as if it is a categorical variable due the few values present. If instead you want a mean to summarize this variable, specify the variable as continuous with the type
argument in tbl_summary
.
dat %>%
# keep q12 for summary table ----
dplyr::select(q12) %>%
# convert labelled values to numeric by removing value labels ----
mutate_at(vars(q12), haven::zap_labels) %>%
# subtract one to make pretend example more realistic ----
mutate_at(vars(q12), ~ . - 1) %>%
# create a table with columns for year ----
gtsummary::tbl_summary(
# treat q12 variable as continuous ----
type = list(q12 ~ "continuous"),
# custom statistic format ----
statistic = list(all_continuous() ~ "{mean} ({sd})")
)
Characteristic | N = 30,3891 |
---|---|
Weapon carrying | 0.46 (1.16) |
Unknown | 4,228 |
1
Statistics presented: Mean (SD)
|
Before we part, let’s make sure we understand how the package handles missing data and our options.
dat %>%
dplyr::select(q69) %>%
# convert labelled values to a factor ----
mutate_at(vars(matches("q")), haven::as_factor) %>%
gtsummary::tbl_summary()
Characteristic | N = 30,3891 |
---|---|
Weight loss | |
Lose weight | 12,138 (48%) |
Gain weight | 4,686 (18%) |
Stay the same weight | 4,187 (16%) |
Not trying to do anything | 4,374 (17%) |
Unknown | 5,004 |
1
Statistics presented: n (%)
|
This summary table tells us that there are 30,389 records in the years 2015 and 2017; 25,385 of them have non-missing values and 5,004 have missing values. The percents shown in the table are the percent of the non-missing base N; that is, 12,138 / 25,385 is 48%.
You can suppress printing the count of unknown values if you like, which does not change any of other numbers in your table.
dat %>%
dplyr::select(q69) %>%
# convert labelled values to a factor ----
mutate_at(vars(q69), haven::as_factor) %>%
gtsummary::tbl_summary(
# supress printing count of unknown values ----
missing = "no"
)
Characteristic | N = 30,3891 |
---|---|
Weight loss | |
Lose weight | 12,138 (48%) |
Gain weight | 4,686 (18%) |
Stay the same weight | 4,187 (16%) |
Not trying to do anything | 4,374 (17%) |
1
Statistics presented: n (%)
|
If you want the N at the top of the column to reflect the N for non-missing observations, I would remove those in your data cleaning process.
dat %>%
dplyr::select(q69) %>%
# convert labelled values to a factor ----
mutate_at(vars(q69), haven::as_factor) %>%
# remove all observations with missing values ----
drop_na() %>%
gtsummary::tbl_summary()
Characteristic | N = 25,3851 |
---|---|
Weight loss | |
Lose weight | 12,138 (48%) |
Gain weight | 4,686 (18%) |
Stay the same weight | 4,187 (16%) |
Not trying to do anything | 4,374 (17%) |
1
Statistics presented: n (%)
|
On the other hand, if the missing data represents a valid category that you want counted, you could replace the missing values (which would shift the percentages).
dat %>%
dplyr::select(q69) %>%
# convert labelled values to a factor ----
mutate_at(vars(q69), haven::as_factor) %>%
# replace missing values for factor ----
mutate_at(vars(q69), forcats::fct_explicit_na, "Missing weight action") %>%
gtsummary::tbl_summary()
Characteristic | N = 30,3891 |
---|---|
Weight loss | |
Lose weight | 12,138 (40%) |
Gain weight | 4,686 (15%) |
Stay the same weight | 4,187 (14%) |
Not trying to do anything | 4,374 (14%) |
Missing weight action | 5,004 (16%) |
1
Statistics presented: n (%)
|
gtsummay
has a variety of supported outputs, including html and word. I work with many who prefer to see tables in excel, and this an output not directly supported in gtsummary
. (No shade here, I’m waiting patiently for this to happen in the gt
package. 😁) However, gtsummary
does have a work around available with the gtsummary::as_tibble()
function. Save the table as an object, convert to a tibble, and then export.
And now you can export to excel.
# 3 - export to excel ----
writexl::write_xlsx(gt_table, here::here( "_posts", "2021-07-14-polished-summary-tables-in-r-with-gtsummary", "example_gtsummary.xlsx"))
You can download the exported output example_gtsummary.xlsx.
The gtsummary
package has readable output that is easily customizable. My quest for the best presentation ready R table is over! 🥳
Daniel Sjoberg + team rock for their work on gtsummary
! Thanks to my colleague Patrick Freeman for reviewing this post and providing feedback, and thanks to new learneRs who inspire me to write. 💜
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Pileggi (2021, July 14). PIPING HOT DATA: Polished summary tables in R with gtsummary. Retrieved from https://www.pipinghotdata.com/posts/2021-07-14-polished-summary-tables-in-r-with-gtsummary/
BibTeX citation
@misc{pileggi2021polished, author = {Pileggi, Shannon}, title = {PIPING HOT DATA: Polished summary tables in R with gtsummary}, url = {https://www.pipinghotdata.com/posts/2021-07-14-polished-summary-tables-in-r-with-gtsummary/}, year = {2021} }