Exercises
6 Analyze your data
Congratulations! You’ve made it to the end of the course. I hope this is just the beginning of your journey with R!
Download the materials for this last set of exercises here.
6.1 File paths
Try to knit this document by pressing the “Knit” button or ctrl
/cmd
+ shift
+ k
.
You’ll find that it doesn’t work because the file paths don’t point to the files. Use the here
package to revise the code so that the code runs and the document knits.
library(tidyverse)
library(knitr)
nlsy <- read_csv("nlsy.csv")
# this function adds a static image
include_graphics("figure1.jpg")
read_rds("table1.rds") %>%
kable() # this function prints nicer tables in R Markdown
6.2 Missing data
The nlsy.csv
file includes all 12,687 participants of the NLSY-79. Read in the data once without specifying the values that indicate missingness. Explore the data and find them all. Then read in the data again, using the na =
argument in read_csv()
to read them in as NA’s.
# skip = 1 means to skip the first row, which were the original col names
nlsy <- read_csv("nlsy.csv", skip = 1, col_names = c(
"glasses", "eyesight", "sleep_wkdy", "sleep_wknd",
"id", "nsibs", "samp", "race_eth", "sex", "region",
"income", "res_1980", "res_2002", "age_bir"
))
Now imagine you’re interested in how the number of siblings relates to one’s age when they have their first child. Create a dataset to study this question:
- Assume that if the number of siblings is missing, they have 0
- Create a variable that is 1 if someone has kids, and 0 otherwise
- Create a dataset containing id, the sibling/child variables of interest, and income.
- Subset the data to exclude people who are missing income.
6.3 Challenge
I’ve given instructions for a project using the NLSY data. You can use the full dataset from the exercises this week (we were using a complete-case dataset earlier in the course) OR if you want, you can also use the NLS Investigator to download other variables that align with your interests!
If you have data that you’re working with and you’d like to practice with that data instead, go for it! Instead of the variables mentioned below, use whatever variables from your dataset you want. For example, change categorical variables that are currently numeric to factors as needed in your data. Then restrict your sample based on whatever conditions make sense to you for a project you’re working on (or want to work on), and run a regression with whichever variables interest you!
Prepare your project
- File -> New Project -> New Directory -> New Project
- Watch this video on R Projects if you’re having trouble or want to know more!
- Name it something like “NLSY” and put it in an appropriate folder on your computer
- Within that folder, make new folders as follows:
NLSY/
├─ NLSY.Rproj
├─ data/
│ ├── raw/
│ └── processed/
├─ code/
└── results/
├── tables/
└── figures/
Prepare the data
- Copy and paste
nlsy.csv
intodata/raw
. - Create a new file and save it as
clean_data.R
. (If you’d prefer to work with R Markdown files, and write text in between your code chunks as you’ve seen me do with the labs and exercises, go for it! I’ve added some helpful R Markdown links to the resources page.) - In that file, read in the NLSY data and load any packages you need. Make sure you replace any missing values with
NA
. Hint: there are extra missing values in theage_bir
variable. Also, the variable names might be useful:
colnames_nlsy <- c(
"glasses", "eyesight", "sleep_wkdy", "sleep_wknd",
"id", "nsibs", "samp", "race_eth", "sex", "region",
"income", "res_1980", "res_2002", "age_bir"
)
- Add factor labels to
eyesight
,sex
,race_eth
,region
, as in earlier slides. Select those variables plusincome
,id
,nsibs
,age_bir
, and the sleep variables. Then restrict to complete cases and people with incomes < $30,000. Make a variable for the log of income (replace withNA
if income <= 0). - Also in that file, save your new dataset as a
.rds
file to thedata/processed
folder.
Do some exploratory analysis
- Create a file called
create_figure.R
. In this file, read in the cleaned dataset. Load any packages you need. Then make aggplot
figure of your choosing to show something about the distribution of the data. Save it to theresults/figures
folder as a.png
file using theggsave()
function. - Create a file called
table_1.R
. In this file, read in the cleaned dataset and use thetableone
package to create a table 1 with the variables of your choosing. Modify the following code to save it as a.csv
file. Open it in Excel/Numbers/Google Sheets/etc. to make sure it worked.
tab1 <- CreateTableOne(...) %>% print() %>% as_tibble(rownames = "id")
write_csv(tab1, ...)
Do some regression analysis
- In another file called
lin_reg.R
, read in the data and run the following linear regression:lm(log_inc ~ age_bir + sex + race_eth + nsibs, data = nlsy)
. Modify the CI function to produce a table of results for a linear regression. Add an argumentdigits =
, with a default of 2, to allow you to choose the number of digits you’d like. Save it in a separate file calledfunctions.R
. Usesource()
to read in the function at the beginning of your script. - Save a table of your results as a
.csv
file. Make the names of the coefficients nice! - Using the results, use
ggplot
to make a figure. Usegeom_point()
for the point estimates andgeom_errorbar()
for the confidence intervals. It will look something like this:
ggplot(data) +
geom_point(aes(x = , y = )) +
geom_errorbar(aes(x = , ymin = , ymax = ))
- Save that figure as a
.pdf
usingggsave()
. You may want to play around with theheight =
andwidth =
arguments to make it look like you want.