Exercises

4 Grouping and tables

Download the materials for this week’s exercises here.

Yup, still the same old package and data, though we’ll be introducing another package today!

library(tidyverse)
nlsy <- read_csv("nlsy_cc.csv")

4.1 Using pipes

Compare these two ways of making the same dataset:

nlsy2 <- mutate(nlsy, 
                only = case_when(
                  nsibs == 0 ~ "yes",
                  TRUE ~ "no")
                )
nlsy3 <- select(nlsy2, 
                id, contains("sleep"), only)
only_kids <- filter(nlsy3, 
                    only == "yes")

vs:

only_kids <- filter(
  select(
    mutate(nlsy,
      only = case_when(
        nsibs == 0 ~ "yes",
        TRUE ~ "no"
      )
    ),
    id, contains("sleep"), only
  ),
  only == "yes"
)

vs:

only_kids <- nlsy %>%
  mutate(only = case_when(
                nsibs == 0 ~ "yes",
                TRUE ~ "no"
                )
         ) %>%
  select(id, contains("sleep"), only) %>%
  filter(only == "yes")

All 3 techniques – making new datasets, nesting functions, and using pipes – can be useful, and which one is easier to write and read depends on the situation. Today we’ll practice using pipes.

Translate code

Consider the following instructions for making a dataset:

  1. Create a variable called slp_cat_wkday categorizing weekday sleep.
  2. Subset the dataset to only include people with missing values.
  3. Remove non-sleep-related variables from the dataset.
  4. Call your finished dataset missing_sleep.

I’ve followed these instructions here.

nlsy2 <- mutate(nlsy, slp_cat_wkdy = case_when(
  sleep_wkdy < 5 ~ "little",
  sleep_wkdy < 7 ~ "some",
  sleep_wkdy < 9 ~ "ideal",
  sleep_wkdy < 12 ~ "lots",
  TRUE ~ NA_character_))
missing_sleep <- filter(nlsy2, is.na(slp_cat_wkdy))
missing_sleep <- select(missing_sleep, starts_with("slp"), contains("sleep"))
missing_sleep
#>  # A tibble: 3 x 3
#>    slp_cat_wkdy sleep_wkdy sleep_wknd
#>    <chr>             <dbl>      <dbl>
#>  1 <NA>                 12         12
#>  2 <NA>                 12         12
#>  3 <NA>                 13         13

Now follow the instructions using pipes.

missing_sleep <- nlsy %>% ...

Order of operations

Experiment with switching up the “order of operations”. Can you complete the instructions in a different order and get the same result? Can you think of a situation when you might not be able to do so?

4.2 Summary statistics

We can use the summarize() function to calculate summary statistics:

summarize(nlsy, 
          med_age_bir = median(age_bir),
          cor_sleep = cor(sleep_wkdy, sleep_wknd),
          ten_pctle_inc = quantile(income, probs = 0.1),
          ninety_pctle_inc = quantile(income, probs = 0.9))
#>  # A tibble: 1 x 4
#>    med_age_bir cor_sleep ten_pctle_inc ninety_pctle_inc
#>          <dbl>     <dbl>         <dbl>            <dbl>
#>  1          22     0.710         3177.           33024.

It always helps to read the documentation when you learn a new function! Run this in your RStudio session.

help(summarize)

Note from the documentation that you can also spell the function summarise(). Even though I consistently use American spellings in every other aspect of my life, you will find me spelling it like this often!

nlsy %>% summarise(q.1 = quantile(age_bir, probs = 0.1),
                   q.2 = quantile(age_bir, probs = 0.2),
                   q.3 = quantile(age_bir, probs = 0.3),
                   q.4 = quantile(age_bir, probs = 0.4),
                   q.5 = quantile(age_bir, probs = 0.5))
#>  # A tibble: 1 x 5
#>      q.1   q.2   q.3   q.4   q.5
#>    <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1    17    18    20    21    22

You can use summarize() at the end (or the beginning, or middle, depending on what you’re doing!) of a pipe:

nlsy %>%
  mutate(age_bir_stand = (age_bir - mean(age_bir)) / sd(age_bir)) %>%
  filter(sex == 1) %>%
  summarize(mean_men = mean(age_bir_stand))
#>  # A tibble: 1 x 1
#>    mean_men
#>       <dbl>
#>  1    0.283

Correlation

Among the only children, find the correlation between hours of sleep on weekdays and weekends.

Proportions

Create a variable that is 1 if an observation has an income between 20,000 and 30,000, and 0 otherwise. Calculate the proportion of people in the dataset who fit that criterion.

summary() reimagined

Recreate the summary() function using summarize() (i.e., produce all the same statistics) for a variable of your choice.

4.3 Stratified summary statistics

Note the grouping information that gets printed out when you use group_by():

# grouped dataset
nlsy %>%
  mutate(income_stand = (income - mean(income))/sd(income)) %>%
  select(id, region, income_stand, race_eth, sex) %>%
  group_by(race_eth)
#>  # A tibble: 1,205 x 5
#>  # Groups:   race_eth [3]
#>        id region income_stand race_eth   sex
#>     <dbl>  <dbl>        <dbl>    <dbl> <dbl>
#>   1     3      1        0.533        3     2
#>   2     6      1        1.48         3     1
#>   3     8      1       -0.605        3     2
#>   4    16      1        2.45         3     2
#>   5    18      3       -0.809        3     1
#>   6    20      1        2.60         3     2
#>   7    27      1        0.353        3     2
#>   8    49      1        0.646        3     2
#>   9    57      1        0.600        3     2
#>  10    67      1        1.48         3     1
#>  # … with 1,195 more rows

Now functions within summarize() will be applied within the groups:

nlsy %>%
  mutate(income_stand = (income - mean(income))/sd(income)) %>%
  group_by(region) %>%
  summarize(mean_inc = mean(income_stand),
            sd_inc = sd(income_stand))
#>  `summarise()` ungrouping output (override with `.groups` argument)
#>  # A tibble: 4 x 3
#>    region mean_inc sd_inc
#>     <dbl>    <dbl>  <dbl>
#>  1      1   0.186   1.17 
#>  2      2   0.106   0.958
#>  3      3  -0.0891  1.03 
#>  4      4  -0.145   0.810

(The message is just telling us that the new dataset is no longer grouped, and that we can change that using the .groups = argument.)

You can group by multiple variables:

nlsy %>%
  group_by(region, sex) %>%
  summarize(mean_inc = mean(income),
            sd_inc = sd(income))
#>  `summarise()` regrouping output by 'region' (override with `.groups` argument)
#>  # A tibble: 8 x 4
#>  # Groups:   region [4]
#>    region   sex mean_inc sd_inc
#>     <dbl> <dbl>    <dbl>  <dbl>
#>  1      1     1   20393. 16929.
#>  2      1     2   15928. 14442.
#>  3      2     1   19488. 14885.
#>  4      2     2   14699. 10613.
#>  5      3     1   15137. 14592.
#>  6      3     2   13293. 12879.
#>  7      4     1   12315.  9244.
#>  8      4     2   14001. 11621.

When you group by multiple variables and then summarize, the default is to “peel off” the last grouping layer and leave the rest, as the message describes. This can be confusing/annoying if you don’t realize it, so you might want to use .groups = "drop".

Here are three different ways to do the same thing (a common “problem” in R!):

nlsy %>%
  group_by(sex) %>%
  summarize(n = n())
#>  # A tibble: 2 x 2
#>      sex     n
#>    <dbl> <int>
#>  1     1   501
#>  2     2   704

nlsy %>%
  group_by(sex) %>%
  tally()
#>  # A tibble: 2 x 2
#>      sex     n
#>    <dbl> <int>
#>  1     1   501
#>  2     2   704

nlsy %>% 
  count(sex)
#>  # A tibble: 2 x 2
#>      sex     n
#>    <dbl> <int>
#>  1     1   501
#>  2     2   704

You can use any one of them to calculate proportions:

nlsy %>%
  group_by(sex) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n))
#>  # A tibble: 2 x 3
#>      sex     n  prop
#>    <dbl> <int> <dbl>
#>  1     1   501 0.416
#>  2     2   704 0.584

By region

Find the median income per region. Before doing so, make sure that you’ve made region into a factor variable with appropriate names so we can easily read your results.

By a new variable

Calculate and compare the median income for people who sleep at least 8 hours on weekdays and those who sleep less.

Proportions by region

Among the women (sex = 2), calculate the proportion who live in each region.

4.4 Tables with the tableone package

If you have never installed the tableone package, you’ll need to do that first:

install.packages("tableone")

Then load the library and create a table 1!

library(tableone) # in general, put at the top of your script

tab1 <- CreateTableOne(
  data = nlsy,
  vars = c("eyesight", "nsibs", "race_eth",
           "sex", "region", "income", "age_bir"),
  strata = "glasses",
  factorVars = c("eyesight", "race_eth", "sex", "region")
)

tab1
#>                       Stratified by glasses
#>                        0                   1                   p      test
#>    n                        581                 624                       
#>    eyesight (%)                                                 0.333     
#>       1                     220 (37.9)          254 (40.7)                
#>       2                     185 (31.8)          200 (32.1)                
#>       3                     122 (21.0)          127 (20.4)                
#>       4                      46 ( 7.9)           32 ( 5.1)                
#>       5                       8 ( 1.4)           11 ( 1.8)                
#>    nsibs (mean (SD))       4.03 (2.65)         3.85 (2.47)      0.210     
#>    race_eth (%)                                                <0.001     
#>       1                     103 (17.7)          108 (17.3)                
#>       2                     177 (30.5)          130 (20.8)                
#>       3                     301 (51.8)          386 (61.9)                
#>    sex = 2 (%)              301 (51.8)          403 (64.6)     <0.001     
#>    region (%)                                                   0.230     
#>       1                     102 (17.6)          104 (16.7)                
#>       2                     146 (25.1)          187 (30.0)                
#>       3                     211 (36.3)          200 (32.1)                
#>       4                     122 (21.0)          133 (21.3)                
#>    income (mean (SD))  13852.78 (12011.51) 16626.22 (14329.16) <0.001     
#>    age_bir (mean (SD))    23.07 (5.91)        23.80 (6.05)      0.036

Not only are there many possible arguments to the CreateTableOne() function, but also for the print() function when applied to a TableOne object. You can look up the documentation for both:

print(tab1, catDigits = 2, contDigits = 2, test = FALSE, smd = TRUE)
#>                       Stratified by glasses
#>                        0                   1                   SMD   
#>    n                        581                 624                  
#>    eyesight (%)                                                 0.123
#>       1                     220 (37.87)         254 (40.71)          
#>       2                     185 (31.84)         200 (32.05)          
#>       3                     122 (21.00)         127 (20.35)          
#>       4                      46 ( 7.92)          32 ( 5.13)          
#>       5                       8 ( 1.38)          11 ( 1.76)          
#>    nsibs (mean (SD))       4.03 (2.65)         3.85 (2.47)      0.072
#>    race_eth (%)                                                 0.234
#>       1                     103 (17.73)         108 (17.31)          
#>       2                     177 (30.46)         130 (20.83)          
#>       3                     301 (51.81)         386 (61.86)          
#>    sex = 2 (%)              301 (51.81)         403 (64.58)     0.261
#>    region (%)                                                   0.120
#>       1                     102 (17.56)         104 (16.67)          
#>       2                     146 (25.13)         187 (29.97)          
#>       3                     211 (36.32)         200 (32.05)          
#>       4                     122 (21.00)         133 (21.31)          
#>    income (mean (SD))  13852.78 (12011.51) 16626.22 (14329.16)  0.210
#>    age_bir (mean (SD))    23.07 (5.91)        23.80 (6.05)      0.121

help("CreateTableOne")
help("print.TableOne")

You can use this code to help get your table 1 in an easy-to-share format:

table1_to_share <- tab1 %>% 
  print() %>% 
  as_tibble(rownames = "id")
write_csv(table1_to_share, "table1.csv")

Check out some other packages for tables as well: https://github.com/kaz-yos/tableone#similar-or-complementary-projects.

Prepare a dataset

Apply the following inclusion criteria to the NLSY dataset to make a analysis dataset: from region 1 or 4, with at least 2 siblings, and doesn’t wear glasses.

Make a stratification variable

Make a variable in that dataset that categorizes people in quartiles (split at 25th, 50th, and 75% percentiles of the new dataset) by income. Make sure the categories have descriptive names.

Make table 1

Make a table 1 for this new dataset, stratified by the new income variable. Make sure they show up in the correct order in your table. Include p-values testing across strata but only print 2 digits for them. Perform an exact test for the region comparison.

Share your table 1

Using the code above as a starting point, save your table 1 as a csv file.

Open up the csv file in another program (Excel, Google Sheets, etc.) and check it out!