Exercises
4 Grouping and tables
Download the materials for this week’s exercises here.
Yup, still the same old package and data, though we’ll be introducing another package today!
library(tidyverse)
nlsy <- read_csv("nlsy_cc.csv")
4.1 Using pipes
Compare these two ways of making the same dataset:
nlsy2 <- mutate(nlsy,
only = case_when(
nsibs == 0 ~ "yes",
TRUE ~ "no")
)
nlsy3 <- select(nlsy2,
id, contains("sleep"), only)
only_kids <- filter(nlsy3,
only == "yes")
vs:
only_kids <- filter(
select(
mutate(nlsy,
only = case_when(
nsibs == 0 ~ "yes",
TRUE ~ "no"
)
),
id, contains("sleep"), only
),
only == "yes"
)
vs:
only_kids <- nlsy %>%
mutate(only = case_when(
nsibs == 0 ~ "yes",
TRUE ~ "no"
)
) %>%
select(id, contains("sleep"), only) %>%
filter(only == "yes")
All 3 techniques – making new datasets, nesting functions, and using pipes – can be useful, and which one is easier to write and read depends on the situation. Today we’ll practice using pipes.
Translate code
Consider the following instructions for making a dataset:
- Create a variable called
slp_cat_wkday
categorizing weekday sleep. - Subset the dataset to only include people with missing values.
- Remove non-sleep-related variables from the dataset.
- Call your finished dataset
missing_sleep
.
I’ve followed these instructions here.
nlsy2 <- mutate(nlsy, slp_cat_wkdy = case_when(
sleep_wkdy < 5 ~ "little",
sleep_wkdy < 7 ~ "some",
sleep_wkdy < 9 ~ "ideal",
sleep_wkdy < 12 ~ "lots",
TRUE ~ NA_character_))
missing_sleep <- filter(nlsy2, is.na(slp_cat_wkdy))
missing_sleep <- select(missing_sleep, starts_with("slp"), contains("sleep"))
missing_sleep
#> # A tibble: 3 x 3
#> slp_cat_wkdy sleep_wkdy sleep_wknd
#> <chr> <dbl> <dbl>
#> 1 <NA> 12 12
#> 2 <NA> 12 12
#> 3 <NA> 13 13
Now follow the instructions using pipes.
missing_sleep <- nlsy %>% ...
Order of operations
Experiment with switching up the “order of operations”. Can you complete the instructions in a different order and get the same result? Can you think of a situation when you might not be able to do so?
4.2 Summary statistics
We can use the summarize()
function to calculate summary statistics:
summarize(nlsy,
med_age_bir = median(age_bir),
cor_sleep = cor(sleep_wkdy, sleep_wknd),
ten_pctle_inc = quantile(income, probs = 0.1),
ninety_pctle_inc = quantile(income, probs = 0.9))
#> # A tibble: 1 x 4
#> med_age_bir cor_sleep ten_pctle_inc ninety_pctle_inc
#> <dbl> <dbl> <dbl> <dbl>
#> 1 22 0.710 3177. 33024.
It always helps to read the documentation when you learn a new function! Run this in your RStudio session.
help(summarize)
Note from the documentation that you can also spell the function summarise()
. Even though I consistently use American spellings in every other aspect of my life, you will find me spelling it like this often!
nlsy %>% summarise(q.1 = quantile(age_bir, probs = 0.1),
q.2 = quantile(age_bir, probs = 0.2),
q.3 = quantile(age_bir, probs = 0.3),
q.4 = quantile(age_bir, probs = 0.4),
q.5 = quantile(age_bir, probs = 0.5))
#> # A tibble: 1 x 5
#> q.1 q.2 q.3 q.4 q.5
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 17 18 20 21 22
You can use summarize()
at the end (or the beginning, or middle, depending on what you’re doing!) of a pipe:
nlsy %>%
mutate(age_bir_stand = (age_bir - mean(age_bir)) / sd(age_bir)) %>%
filter(sex == 1) %>%
summarize(mean_men = mean(age_bir_stand))
#> # A tibble: 1 x 1
#> mean_men
#> <dbl>
#> 1 0.283
Correlation
Among the only children, find the correlation between hours of sleep on weekdays and weekends.
Proportions
Create a variable that is 1 if an observation has an income between 20,000 and 30,000, and 0 otherwise. Calculate the proportion of people in the dataset who fit that criterion.
summary()
reimagined
Recreate the summary()
function using summarize()
(i.e., produce all the same statistics) for a variable of your choice.
4.3 Stratified summary statistics
Note the grouping information that gets printed out when you use group_by()
:
# grouped dataset
nlsy %>%
mutate(income_stand = (income - mean(income))/sd(income)) %>%
select(id, region, income_stand, race_eth, sex) %>%
group_by(race_eth)
#> # A tibble: 1,205 x 5
#> # Groups: race_eth [3]
#> id region income_stand race_eth sex
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3 1 0.533 3 2
#> 2 6 1 1.48 3 1
#> 3 8 1 -0.605 3 2
#> 4 16 1 2.45 3 2
#> 5 18 3 -0.809 3 1
#> 6 20 1 2.60 3 2
#> 7 27 1 0.353 3 2
#> 8 49 1 0.646 3 2
#> 9 57 1 0.600 3 2
#> 10 67 1 1.48 3 1
#> # … with 1,195 more rows
Now functions within summarize()
will be applied within the groups:
nlsy %>%
mutate(income_stand = (income - mean(income))/sd(income)) %>%
group_by(region) %>%
summarize(mean_inc = mean(income_stand),
sd_inc = sd(income_stand))
#> # A tibble: 4 x 3
#> region mean_inc sd_inc
#> <dbl> <dbl> <dbl>
#> 1 1 0.186 1.17
#> 2 2 0.106 0.958
#> 3 3 -0.0891 1.03
#> 4 4 -0.145 0.810
(The message is just telling us that the new dataset is no longer grouped, and that we can change that using the .groups =
argument.)
You can group by multiple variables:
nlsy %>%
group_by(region, sex) %>%
summarize(mean_inc = mean(income),
sd_inc = sd(income))
#> # A tibble: 8 x 4
#> # Groups: region [4]
#> region sex mean_inc sd_inc
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 20393. 16929.
#> 2 1 2 15928. 14442.
#> 3 2 1 19488. 14885.
#> 4 2 2 14699. 10613.
#> 5 3 1 15137. 14592.
#> 6 3 2 13293. 12879.
#> 7 4 1 12315. 9244.
#> 8 4 2 14001. 11621.
When you group by multiple variables and then summarize, the default is to “peel off” the last grouping layer and leave the rest, as the message describes. This can be confusing/annoying if you don’t realize it, so you might want to use .groups = "drop"
.
Here are three different ways to do the same thing (a common “problem” in R!):
nlsy %>%
group_by(sex) %>%
summarize(n = n())
#> # A tibble: 2 x 2
#> sex n
#> <dbl> <int>
#> 1 1 501
#> 2 2 704
nlsy %>%
group_by(sex) %>%
tally()
#> # A tibble: 2 x 2
#> sex n
#> <dbl> <int>
#> 1 1 501
#> 2 2 704
nlsy %>%
count(sex)
#> # A tibble: 2 x 2
#> sex n
#> <dbl> <int>
#> 1 1 501
#> 2 2 704
You can use any one of them to calculate proportions:
nlsy %>%
group_by(sex) %>%
summarize(n = n()) %>%
mutate(prop = n / sum(n))
#> # A tibble: 2 x 3
#> sex n prop
#> <dbl> <int> <dbl>
#> 1 1 501 0.416
#> 2 2 704 0.584
By region
Find the median income per region. Before doing so, make sure that you’ve made region into a factor variable with appropriate names so we can easily read your results.
By a new variable
Calculate and compare the median income for people who sleep at least 8 hours on weekdays and those who sleep less.
Proportions by region
Among the women (sex = 2
), calculate the proportion who live in each region.
4.4 Tables with the tableone
package
If you have never installed the tableone
package, you’ll need to do that first:
install.packages("tableone")
Then load the library and create a table 1!
library(tableone) # in general, put at the top of your script
tab1 <- CreateTableOne(
data = nlsy,
vars = c("eyesight", "nsibs", "race_eth",
"sex", "region", "income", "age_bir"),
strata = "glasses",
factorVars = c("eyesight", "race_eth", "sex", "region")
)
tab1
#> Stratified by glasses
#> 0 1 p test
#> n 581 624
#> eyesight (%) 0.333
#> 1 220 (37.9) 254 (40.7)
#> 2 185 (31.8) 200 (32.1)
#> 3 122 (21.0) 127 (20.4)
#> 4 46 ( 7.9) 32 ( 5.1)
#> 5 8 ( 1.4) 11 ( 1.8)
#> nsibs (mean (SD)) 4.03 (2.65) 3.85 (2.47) 0.210
#> race_eth (%) <0.001
#> 1 103 (17.7) 108 (17.3)
#> 2 177 (30.5) 130 (20.8)
#> 3 301 (51.8) 386 (61.9)
#> sex = 2 (%) 301 (51.8) 403 (64.6) <0.001
#> region (%) 0.230
#> 1 102 (17.6) 104 (16.7)
#> 2 146 (25.1) 187 (30.0)
#> 3 211 (36.3) 200 (32.1)
#> 4 122 (21.0) 133 (21.3)
#> income (mean (SD)) 13852.78 (12011.51) 16626.22 (14329.16) <0.001
#> age_bir (mean (SD)) 23.07 (5.91) 23.80 (6.05) 0.036
Not only are there many possible arguments to the CreateTableOne()
function, but also for the print()
function when applied to a TableOne
object. You can look up the documentation for both:
print(tab1, catDigits = 2, contDigits = 2, test = FALSE, smd = TRUE)
#> Stratified by glasses
#> 0 1 SMD
#> n 581 624
#> eyesight (%) 0.123
#> 1 220 (37.87) 254 (40.71)
#> 2 185 (31.84) 200 (32.05)
#> 3 122 (21.00) 127 (20.35)
#> 4 46 ( 7.92) 32 ( 5.13)
#> 5 8 ( 1.38) 11 ( 1.76)
#> nsibs (mean (SD)) 4.03 (2.65) 3.85 (2.47) 0.072
#> race_eth (%) 0.234
#> 1 103 (17.73) 108 (17.31)
#> 2 177 (30.46) 130 (20.83)
#> 3 301 (51.81) 386 (61.86)
#> sex = 2 (%) 301 (51.81) 403 (64.58) 0.261
#> region (%) 0.120
#> 1 102 (17.56) 104 (16.67)
#> 2 146 (25.13) 187 (29.97)
#> 3 211 (36.32) 200 (32.05)
#> 4 122 (21.00) 133 (21.31)
#> income (mean (SD)) 13852.78 (12011.51) 16626.22 (14329.16) 0.210
#> age_bir (mean (SD)) 23.07 (5.91) 23.80 (6.05) 0.121
help("CreateTableOne")
help("print.TableOne")
You can use this code to help get your table 1 in an easy-to-share format:
table1_to_share <- tab1 %>%
print() %>%
as_tibble(rownames = "id")
write_csv(table1_to_share, "table1.csv")
Check out some other packages for tables as well: https://github.com/kaz-yos/tableone#similar-or-complementary-projects.
Prepare a dataset
Apply the following inclusion criteria to the NLSY dataset to make a analysis dataset: from region 1 or 4, with at least 2 siblings, and doesn’t wear glasses.
Make a stratification variable
Make a variable in that dataset that categorizes people in quartiles (split at 25th, 50th, and 75% percentiles of the new dataset) by income. Make sure the categories have descriptive names.
Make table 1
Make a table 1 for this new dataset, stratified by the new income variable. Make sure they show up in the correct order in your table. Include p-values testing across strata but only print 2 digits for them. Perform an exact test for the region comparison.