Lab
1 The basics
You can either download the lab as an RMarkdown file here, or copy and paste the code as we go into a .R script. Either way, save it into the 01-week folder where you completed the exercises!
1.2 This week’s exercises
# create a vector of numeric values
vals <- c(1, 645, 329)
vals
# run these lines of code one at a time and compare what each does
# what happens in your environment window? what about your console?
new_vals
c(13, 7245, 23, 49.32)
new_vals <- c(13, 7245, 23, 49.32)
new_vals
# create and view different types of vectors
chars <- c("dog", "cat", "rhino")
chars
logs <- c(TRUE, FALSE, FALSE)
logs
# create a matrix
mat <- matrix(c(234, 7456, 12, 654, 183, 753), nrow = 2)
mat
# pull out rows
mat[2, ]
- Extract 645 from vals using square brackets.
- Extract "rhino" from chars using square brackets.
- You saw how to extract the second row of mat. Figure out how to extract the second column.
- Extract 183 from mat using square brackets.
- Figure out how to get the following errors:
  - incorrect number of dimensions
  - subscript out of bounds
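As a hint for these exercises, here is a minimal sketch of square-bracket indexing on made-up objects. The names `ages` and `grid` are hypothetical and not part of the lab. The point is only that a vector takes a single index, while a matrix takes a row index and a column index separated by a comma.

```r
# hypothetical objects, used only to illustrate indexing
ages <- c(31, 47, 25)
grid <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns, filled column by column

# a vector has one dimension, so it takes a single index
ages[3]

# a matrix has two dimensions: [row, column]
grid[1, ]   # the whole first row
grid[1, 3]  # the single value in row 1, column 3
```

Leaving one side of the comma blank means "everything along that dimension", which is why `grid[1, ]` returns a full row.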
1.3 Data in R

We’re using some data from the National Longitudinal Survey of Youth 1979, a cohort of American young adults aged 14-22 at enrollment in 1979. They continue to be followed to this day, and there is a wealth of publicly available data online. I’ve downloaded the answers to a survey question about whether respondents wear glasses, a scale about their eyesight with glasses, their (NLSY-assigned 😒) race/ethnicity, their sex, their family’s income in 1979, and their age at the birth of their first child.
Reading in data
I’ve saved the dataset as a csv file. We can read this into R using the read_csv() function, which is loaded with the tidyverse. For now we’ll load it from the internet. We’ll talk about other options for reading in data later in the course!
library(tidyverse)
nlsy <- read_csv("https://intro-to-R-2020.louisahsmith.com/data/nlsy_cc.csv")
We can explore the data with a number of functions that we apply to either the whole dataset, or to a single variable in the dataset. Here are a couple of ways we can look at the whole dataset:
nlsy
#> # A tibble: 1,205 x 14
#>    glasses eyesight sleep_wkdy sleep_wknd    id nsibs  samp race_eth   sex region income
#>      <dbl>    <dbl>      <dbl>      <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl>  <dbl>  <dbl>
#>  1       0        1          5          7     3     3     5        3     2      1  22390
#>  2       1        2          6          7     6     1     1        3     1      1  35000
#>  3       0        2          7          9     8     7     6        3     2      1   7227
#>  4       1        3          6          7    16     3     5        3     2      1  48000
#>  5       0        3         10         10    18     2     1        3     1      3   4510
#>  6       1        2          7          8    20     2     5        3     2      1  50000
#>  7       0        1          8          8    27     1     5        3     2      1  20000
#>  8       1        1          8          8    49     6     5        3     2      1  23900
#>  9       1        2          7          8    57     1     5        3     2      1  23289
#> 10       0        1          8          8    67     1     1        3     1      1  35000
#> # … with 1,195 more rows, and 3 more variables: res_1980 <dbl>, res_2002 <dbl>, age_bir <dbl>
glimpse(nlsy)
#> Rows: 1,205
#> Columns: 14
#> $ glasses <dbl> 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0…
#> $ eyesight <dbl> 1, 2, 2, 3, 3, 2, 1, 1, 2, 1, 3, 5, 1, 1, 1, 1, 3, 2, 3, 3, 4, 2, 2, 5, 1…
#> $ sleep_wkdy <dbl> 5, 6, 7, 6, 10, 7, 8, 8, 7, 8, 8, 7, 7, 7, 8, 7, 7, 8, 8, 8, 7, 6, 8, 7, …
#> $ sleep_wknd <dbl> 7, 7, 9, 7, 10, 8, 8, 8, 8, 8, 8, 7, 8, 7, 8, 7, 4, 8, 8, 9, 7, 10, 8, 7,…
#> $ id <dbl> 3, 6, 8, 16, 18, 20, 27, 49, 57, 67, 86, 96, 97, 98, 117, 137, 172, 179, …
#> $ nsibs <dbl> 3, 1, 7, 3, 2, 2, 1, 6, 1, 1, 7, 2, 7, 2, 2, 4, 9, 2, 2, 2, 4, 2, 4, 4, 2…
#> $ samp <dbl> 5, 1, 6, 5, 1, 5, 5, 5, 5, 1, 7, 6, 5, 6, 1, 5, 6, 5, 5, 5, 8, 1, 7, 5, 5…
#> $ race_eth <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 2, 3, 3…
#> $ sex <dbl> 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2…
#> $ region <dbl> 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1…
#> $ income <dbl> 22390, 35000, 7227, 48000, 4510, 50000, 20000, 23900, 23289, 35000, 1688,…
#> $ res_1980 <dbl> 11, 3, 11, 11, 11, 3, 11, 11, 11, 3, 11, 11, 11, 11, 6, 3, 11, 11, 3, 11,…
#> $ res_2002 <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 19, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ age_bir <dbl> 19, 30, 17, 31, 19, 30, 27, 24, 21, 36, 17, 19, 29, 30, 26, 26, 35, 22, 3…
summary(nlsy)
#> glasses eyesight sleep_wkdy sleep_wknd id
#> Min. :0.0000 Min. :1.00 Min. : 0.000 Min. : 0.000 Min. : 3
#> 1st Qu.:0.0000 1st Qu.:1.00 1st Qu.: 6.000 1st Qu.: 6.000 1st Qu.: 2317
#> Median :1.0000 Median :2.00 Median : 7.000 Median : 7.000 Median : 4744
#> Mean :0.5178 Mean :1.99 Mean : 6.643 Mean : 7.267 Mean : 5229
#> 3rd Qu.:1.0000 3rd Qu.:3.00 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 7937
#> Max. :1.0000 Max. :5.00 Max. :13.000 Max. :14.000 Max. :12667
#> nsibs samp race_eth sex region
#> Min. : 0.000 Min. : 1.000 Min. :1.000 Min. :1.000 Min. :1.000
#> 1st Qu.: 2.000 1st Qu.: 4.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000
#> Median : 3.000 Median : 5.000 Median :3.000 Median :2.000 Median :3.000
#> Mean : 3.937 Mean : 7.002 Mean :2.395 Mean :1.584 Mean :2.593
#> 3rd Qu.: 5.000 3rd Qu.:11.000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:3.000
#> Max. :16.000 Max. :20.000 Max. :3.000 Max. :2.000 Max. :4.000
#> income res_1980 res_2002 age_bir
#> Min. : 0 Min. : 1.00 Min. : 5.00 Min. :13.00
#> 1st Qu.: 6000 1st Qu.:11.00 1st Qu.:11.00 1st Qu.:19.00
#> Median :11155 Median :11.00 Median :11.00 Median :22.00
#> Mean :15289 Mean : 9.14 Mean :11.05 Mean :23.45
#> 3rd Qu.:20000 3rd Qu.:11.00 3rd Qu.:11.00 3rd Qu.:27.00
#> Max. :75001 Max. :16.00 Max. :19.00 Max. :52.00
# within the RStudio browser
View(nlsy)
In many functions in R, we refer to specific variables using dollar-sign notation. So to access the id variable in the nlsy dataset we’d type nlsy$id, and all of the id numbers would print to the console. Don’t do this though, or 1000+ numbers will print out! Instead, we might look at the first or last few with head() or tail():
head(nlsy$id)
#> [1] 3 6 8 16 18 20
tail(nlsy$sleep_wknd)
#> [1] 12 8 12 5 7 5
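Both `head()` and `tail()` take an optional `n` argument that sets how many values to show. Here is a small sketch on a made-up data frame (`toy` and its columns are hypothetical, chosen so the example runs without downloading anything):

```r
# hypothetical data frame standing in for a larger dataset
toy <- data.frame(id = 1:20, hours = rep(c(6, 7, 8, 9), times = 5))

# dollar-sign notation pulls out one column as a vector
head(toy$id)            # first six values by default
head(toy$id, n = 3)     # only the first three
tail(toy$hours, n = 2)  # only the last two
```

The same calls should work on nlsy, for example `head(nlsy$income, n = 3)`.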
We can use the summary() function on a single variable.
summary(nlsy$sleep_wkdy)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.000 6.000 7.000 6.643 8.000 13.000
Many of the most basic functions in R are pretty straightforward:
table(nlsy$region)
#>
#> 1 2 3 4
#> 206 333 411 255
mean(nlsy$age_bir)
#> [1] 23.44813
We can find out more information from the documentation:
help(cor)
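If you only need a reminder of a function's arguments, you may not need the full help page. This is a minimal sketch using base R functions (the name `cor_args` is made up for illustration):

```r
# in an interactive session, ?cor is shorthand for help(cor),
# and ??correlation searches the installed help pages by keyword

# args() prints a function's arguments and defaults in the console
args(cor)

# names(formals()) returns just the argument names as a character vector
cor_args <- names(formals(cor))
cor_args
```

The help page is still the place to find out what each argument actually does.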
And if you’re not sure what you’re looking for, there’s a ton of info elsewhere.
1.4 Group challenge exercises
- How many people are in the NLSY? How many variables are in this dataset? What are two ways you can answer these questions using tools we’ve discussed?
- Can you find one or more R functions we haven’t discussed that answer the first question? Feel free to Google! See how many ways you and your group can come up with!
- What’s the standard deviation of the number of hours of sleep on weekends?
- What’s the Spearman correlation between hours of sleep on weekends and weekdays in this data?
- Try to read in the data from an Excel file (it should be possible even if you don’t have Excel on your computer!). It’s in a tab called data, but there’s a header as well. (It might help to open it up in whatever spreadsheet program you have.) You’ll have to load the readxl package (you already installed it with tidyverse, but it doesn’t load automatically), and probably read some of the documentation: https://readxl.tidyverse.org.
# first, use this script to download the data to your current working directory
download.file("https://intro-to-R-2020.louisahsmith.com/data/nlsy_cc.xlsx",
              destfile = file.path(getwd(), "nlsy_cc.xlsx"),
              mode = "wb") # binary mode keeps the .xlsx file intact on Windows
# this will be the path argument you'll need
path <- "nlsy_cc.xlsx"
# the variables also still have the NLSY-assigned names, so you'll need these
col_names <- c("glasses", "eyesight", "sleep_wkdy", "sleep_wknd", "id", "nsibs",
               "samp", "race_eth", "sex", "region", "income", "res_1980",
               "res_2002", "age_bir")