Lab

6 Functions

You can either download the lab as an RMarkdown file here, or copy and paste the code as we go into a .R script. Either way, save it into the 06-week folder where you completed the exercises! Since we worked on file structure this week, put it in a new labs folder nested within that folder.

6.1 File paths

We talked this week about making file paths work in R. This is especially important if you’re sharing your code with everyone. If you set your working directory via setwd() or hard-code another path into your code, it’s definitely not going to work on my computer.

A famous blog post about this (well, as famous as blog posts about R can be) can be read here.

How did you make these file paths work?

library(tidyverse)
library(knitr)
nlsy <- read_csv("nlsy.csv")
# this function adds a static image
include_graphics("figure1.jpg")
read_rds("table1.rds") %>%
  kable() # this function prints nicer tables in R Markdown

By the way, you can change the directory in which an R Markdown document is knitted using the arrow by the Knit button. You can also use the settings to change the default. But if you use the here package it won’t matter what you do!

One thing that is helpful to know when you’re working interactively, is how to navigate around file paths. If you need to go up a level, you use two dots to do so. So in the code below, we go down two levels from the top-level directory, then have to go up two levels and down two different levels to access the data. See why project directories and the here package make everything easier?!

proj_wd <- getwd()
setwd("figures/super important")
nlsy <- read_csv("../../data/raw/nlsy.csv")
# ahhh take us back to where we started from!
setwd(proj_wd)

6.2 Missing data

The nlsy.csv file includes all 12,687 participants of the NLSY-79. Read in the data once without specifying the values that indicate missingness. Explore the data and find them all. Then read in the data again, using the na = argument in read_csv() to read them in as NA’s.

# skip = 1 means to skip the first row, which were the original col names
nlsy <- read_csv(here::here("data", "raw", "nlsy.csv"), skip = 1, col_names = c(
  "glasses", "eyesight", "sleep_wkdy", "sleep_wknd",
  "id", "nsibs", "samp", "race_eth", "sex", "region",
  "income", "res_1980", "res_2002", "age_bir"
))

How can we explore the data to get an idea of what values mean that variable is missing?

summary(nlsy)
#>      glasses          eyesight          sleep_wkdy       sleep_wknd    
#>   Min.   :-4.000   Min.   :-4.00000   Min.   :-4.000   Min.   :-4.000  
#>   1st Qu.:-4.000   1st Qu.:-4.00000   1st Qu.:-4.000   1st Qu.:-4.000  
#>   Median : 0.000   Median : 1.00000   Median : 6.000   Median : 6.000  
#>   Mean   :-1.026   Mean   : 0.04004   Mean   : 2.523   Mean   : 2.886  
#>   3rd Qu.: 1.000   3rd Qu.: 2.00000   3rd Qu.: 7.000   3rd Qu.: 8.000  
#>   Max.   : 1.000   Max.   : 5.00000   Max.   :19.000   Max.   :24.000  
#>         id            nsibs             samp           race_eth    
#>   Min.   :    1   Min.   :-3.000   Min.   : 1.000   Min.   :1.000  
#>   1st Qu.: 3172   1st Qu.: 2.000   1st Qu.: 5.000   1st Qu.:2.000  
#>   Median : 6344   Median : 3.000   Median : 9.000   Median :3.000  
#>   Mean   : 6344   Mean   : 3.844   Mean   : 8.174   Mean   :2.434  
#>   3rd Qu.: 9515   3rd Qu.: 5.000   3rd Qu.:12.000   3rd Qu.:3.000  
#>   Max.   :12686   Max.   :29.000   Max.   :20.000   Max.   :3.000  
#>        sex            region           income         res_1980      
#>   Min.   :1.000   Min.   :-4.000   Min.   :   -3   Min.   :-5.0000  
#>   1st Qu.:1.000   1st Qu.: 2.000   1st Qu.: 2853   1st Qu.:-4.0000  
#>   Median :1.000   Median : 3.000   Median : 7955   Median :-4.0000  
#>   Mean   :1.495   Mean   : 2.424   Mean   :11819   Mean   : 0.5317  
#>   3rd Qu.:2.000   3rd Qu.: 3.000   3rd Qu.:18000   3rd Qu.:11.0000  
#>   Max.   :2.000   Max.   : 4.000   Max.   :75001   Max.   :16.0000  
#>      res_2002         age_bir       
#>   Min.   :-5.000   Min.   :-998.00  
#>   1st Qu.:-5.000   1st Qu.:  -5.00  
#>   Median :11.000   Median :  -5.00  
#>   Mean   : 4.914   Mean   : -96.36  
#>   3rd Qu.:11.000   3rd Qu.:  23.00  
#>   Max.   :19.000   Max.   :  52.00
ggplot(nlsy) +
  geom_histogram(aes(age_bir))

count(nlsy, nsibs)
#>  # A tibble: 25 x 2
#>     nsibs     n
#>     <dbl> <int>
#>   1    -3    14
#>   2    -2     3
#>   3    -1     1
#>   4     0   362
#>   5     1  1611
#>   6     2  2530
#>   7     3  2412
#>   8     4  1783
#>   9     5  1198
#>  10     6   933
#>  # … with 15 more rows

Reread the data with the missing values specified:

nlsy <- read_csv(here::here("data", "raw", "nlsy.csv"), skip = 1, 
                 na = c("-998", "-5", "-4", "-3", "-2", "-1"),
                 col_names = c(
  "glasses", "eyesight", "sleep_wkdy", "sleep_wknd",
  "id", "nsibs", "samp", "race_eth", "sex", "region",
  "income", "res_1980", "res_2002", "age_bir"
))

Check again to see if anything’s off:

summary(nlsy)
#>      glasses         eyesight       sleep_wkdy       sleep_wknd    
#>   Min.   :0.000   Min.   :1.000   Min.   : 0.000   Min.   : 0.000  
#>   1st Qu.:0.000   1st Qu.:1.000   1st Qu.: 6.000   1st Qu.: 6.000  
#>   Median :0.000   Median :2.000   Median : 7.000   Median : 7.000  
#>   Mean   :0.461   Mean   :2.064   Mean   : 6.567   Mean   : 7.163  
#>   3rd Qu.:1.000   3rd Qu.:3.000   3rd Qu.: 8.000   3rd Qu.: 8.000  
#>   Max.   :1.000   Max.   :5.000   Max.   :19.000   Max.   :24.000  
#>   NA's   :4236    NA's   :4242    NA's   :4860     NA's   :4867    
#>         id            nsibs             samp           race_eth    
#>   Min.   :    1   Min.   : 0.000   Min.   : 1.000   Min.   :1.000  
#>   1st Qu.: 3172   1st Qu.: 2.000   1st Qu.: 5.000   1st Qu.:2.000  
#>   Median : 6344   Median : 3.000   Median : 9.000   Median :3.000  
#>   Mean   : 6344   Mean   : 3.854   Mean   : 8.174   Mean   :2.434  
#>   3rd Qu.: 9515   3rd Qu.: 5.000   3rd Qu.:12.000   3rd Qu.:3.000  
#>   Max.   :12686   Max.   :29.000   Max.   :20.000   Max.   :3.000  
#>                   NA's   :18                                       
#>        sex            region          income         res_1980     
#>   Min.   :1.000   Min.   :1.000   Min.   :    0   Min.   : 1.000  
#>   1st Qu.:1.000   1st Qu.:2.000   1st Qu.: 5900   1st Qu.: 6.000  
#>   Median :1.000   Median :3.000   Median :10700   Median :11.000  
#>   Mean   :1.495   Mean   :2.547   Mean   :14707   Mean   : 8.896  
#>   3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:20000   3rd Qu.:11.000  
#>   Max.   :2.000   Max.   :4.000   Max.   :75001   Max.   :16.000  
#>                   NA's   :239     NA's   :2491    NA's   :8186    
#>      res_2002        age_bir     
#>   Min.   : 1.00   Min.   :10.00  
#>   1st Qu.:11.00   1st Qu.:20.00  
#>   Median :11.00   Median :23.00  
#>   Mean   :11.28   Mean   :24.38  
#>   3rd Qu.:11.00   3rd Qu.:28.00  
#>   Max.   :19.00   Max.   :52.00  
#>   NA's   :4962    NA's   :6743
ggplot(nlsy) +
  geom_histogram(aes(age_bir))

count(nlsy, nsibs)
#>  # A tibble: 23 x 2
#>     nsibs     n
#>     <dbl> <int>
#>   1     0   362
#>   2     1  1611
#>   3     2  2530
#>   4     3  2412
#>   5     4  1783
#>   6     5  1198
#>   7     6   933
#>   8     7   634
#>   9     8   421
#>  10     9   289
#>  # … with 13 more rows

There are certainly some outliers we might want to investigate, but nothing else coded as missing.

(Obviously, these missing values are noted in the data dictionary. You shouldn’t have to guess what the missing values are when you’re working with real data – ask someone who knows, instead!)

Now imagine you’re interested in how the number of siblings relates to one’s age when they have their first child. Create a dataset to study this question:

  • Assume that if the number of siblings is missing, they have 0
  • Create a variable that is 1 if someone has kids, and 0 otherwise
  • Create a dataset containing id, the sibling/child variables of interest, and income.
  • Subset the data to exclude people who are missing income.
for_analysis <- nlsy %>%
  mutate(
    nsibs = case_when( # or ifelse(is.na(nsibs), 0, nsibs)
      is.na(nsibs) ~ 0,
      TRUE ~ nsibs
    ),
    has_kids = case_when( # or ifelse(!is.na(age_bir), 1, 0)
      !is.na(age_bir) ~ 1,
      TRUE ~ 0
    )
  ) %>%
  select(id, nsibs, has_kids, age_bir, income) %>%
  filter(!is.na(income))

6.3 Questions?!?!