6 Analyze your data

Congratulations! You’ve made it to the end of the course. I hope this is just the beginning of your journey with R!

Download the materials for this last set of exercises here.

6.1 File paths

Try to knit this document by pressing the “Knit” button or ctrl/cmd + shift + k.

You’ll find that it doesn’t work because the file paths don’t point to the files. Use the here package to revise the code so that the code runs and the document knits.

nlsy <- read_csv("nlsy.csv")
# this function adds a static image
read_rds("table1.rds") %>%
  kable() # this function prints nicer tables in R Markdown

6.2 Missing data

The nlsy.csv file includes all 12,687 participants of the NLSY-79. Read in the data once without specifying the values that indicate missingness. Explore the data and find them all. Then read in the data again, using the na = argument in read_csv() to read them in as NA’s.

# skip = 1 means to skip the first row, which were the original col names
nlsy <- read_csv("nlsy.csv", skip = 1, col_names = c(
  "glasses", "eyesight", "sleep_wkdy", "sleep_wknd",
  "id", "nsibs", "samp", "race_eth", "sex", "region",
  "income", "res_1980", "res_2002", "age_bir"

Now imagine you’re interested in how the number of siblings relates to one’s age when they have their first child. Create a dataset to study this question:

  • Assume that if the number of siblings is missing, they have 0
  • Create a variable that is 1 if someone has kids, and 0 otherwise
  • Create a dataset containing id, the sibling/child variables of interest, and income.
  • Subset the data to exclude people who are missing income.

6.3 Challenge

I’ve given instructions for a project using the NLSY data. You can use the full dataset from the exercises this week (we were using a complete-case dataset earlier in the course) OR if you want, you can also use the NLS Investigator to download other variables that align with your interests!

If you have data that you’re working with and you’d like to practice with that data instead, go for it! Instead of the variables mentioned below, use whatever variables from your dataset you want. For example, change categorical variables that are currently numeric to factors as needed in your data. Then restrict your sample based on whatever conditions make sense to you for a project you’re working on (or want to work on), and run a regression with whichever variables interest you!

Prepare your project

  • File -> New Project -> New Directory -> New Project
    • Watch this video on R Projects if you’re having trouble or want to know more!
  • Name it something like “NLSY” and put it in an appropriate folder on your computer
  • Within that folder, make new folders as follows:
 ├─ NLSY.Rproj
 ├─ data/
 │   ├── raw/
 │   └── processed/
 ├─ code/
 └── results/
     ├── tables/
     └── figures/

Prepare the data

  • Copy and paste nlsy.csv into data/raw.
  • Create a new file and save it as clean_data.R. (If you’d prefer to work with R Markdown files, and write text in between your code chunks as you’ve seen me do with the labs and exercises, go for it! I’ve added some helpful R Markdown links to the resources page.)
  • In that file, read in the NLSY data and load any packages you need. Make sure you replace any missing values with NA. Hint: there are extra missing values in the age_bir variable. Also, the variable names might be useful:
colnames_nlsy <- c(
  "glasses", "eyesight", "sleep_wkdy", "sleep_wknd",
  "id", "nsibs", "samp", "race_eth", "sex", "region",
  "income", "res_1980", "res_2002", "age_bir"
  • Add factor labels to eyesight, sex, race_eth, region, as in earlier slides. Select those variables plus income, id, nsibs, age_bir, and the sleep variables. Then restrict to complete cases and people with incomes < $30,000. Make a variable for the log of income (replace with NA if income <= 0).
  • Also in that file, save your new dataset as a .rds file to the data/processed folder.

Do some exploratory analysis

  • Create a file called create_figure.R. In this file, read in the cleaned dataset. Load any packages you need. Then make a ggplot figure of your choosing to show something about the distribution of the data. Save it to the results/figures folder as a .png file using the ggsave() function.
  • Create a file called table_1.R. In this file, read in the cleaned dataset and use the tableone package to create a table 1 with the variables of your choosing. Modify the following code to save it as a .csv file. Open it in Excel/Numbers/Google Sheets/etc. to make sure it worked.
tab1 <- CreateTableOne(...) %>% print() %>% as_tibble(rownames = "id")
write_csv(tab1, ...)

Do some regression analysis

  • In another file called lin_reg.R, read in the data and run the following linear regression: lm(log_inc ~ age_bir + sex + race_eth + nsibs, data = nlsy). Modify the CI function to produce a table of results for a linear regression. Add an argument digits =, with a default of 2, to allow you to choose the number of digits you’d like. Save it in a separate file called functions.R. Use source() to read in the function at the beginning of your script.
  • Save a table of your results as a .csv file. Make the names of the coefficients nice!
  • Using the results, use ggplot to make a figure. Use geom_point() for the point estimates and geom_errorbar() for the confidence intervals. It will look something like this:
ggplot(data) + 
  geom_point(aes(x = , y = )) + 
  geom_errorbar(aes(x = , ymin = , ymax = ))
  • Save that figure as a .pdf using ggsave(). You may want to play around with the height = and width = arguments to make it look like you want.