Lab
1 The basics
You can either download the lab as an RMarkdown file here, or copy and paste the code as we go into a .R script. Either way, save it into the 01-week folder where you completed the exercises!
1.2 This week’s exercises
# create a vector of numeric values
vals <- c(1, 645, 329)
vals
# run these lines of code one at a time and compare what each does
# what happens in your environment window? what about your console?
new_vals
c(13, 7245, 23, 49.32)
new_vals <- c(13, 7245, 23, 49.32)
new_vals
# create and view different types of vectors
chars <- c("dog", "cat", "rhino")
chars
logs <- c(TRUE, FALSE, FALSE)
logs
# create a matrix
mat <- matrix(c(234, 7456, 12, 654, 183, 753), nrow = 2)
mat
# pull out rows
mat[2, ]
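It may help to see why mat[2, ] returns the values it does: by default, matrix() fills column by column, so with nrow = 2 the first two values become the first column. A minimal sketch on a smaller, made-up matrix (toy is a hypothetical name, not one of the lab objects):

```r
# toy is a hypothetical example matrix, not part of the lab
toy <- matrix(1:6, nrow = 2)
toy
# values go down each column first,
# so row 1 holds 1, 3, 5 and row 2 holds 2, 4, 6

# leaving the column position empty returns the whole row
toy[2, ]
#> [1] 2 4 6
```

By the same logic, the second row of mat above is 7456, 654, 753.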
- Extract 645 from vals using square brackets.
- Extract "rhino" from chars using square brackets.
- You saw how to extract the second row of mat. Figure out how to extract the second column.
- Extract 183 from mat using square brackets.
- Figure out how to get the following errors:
  - incorrect number of dimensions
  - subscript out of bounds
1.3 Data in R
We’re using some data from the National Longitudinal Survey of Youth 1979, a cohort of American young adults aged 14-22 at enrollment in 1979. They continue to be followed to this day, and there is a wealth of publicly available data online. I’ve downloaded the answers to a survey question about whether respondents wear glasses, a scale about their eyesight with glasses, their (NLSY-assigned 😒) race/ethnicity, their sex, their family’s income in 1979, and their age at the birth of their first child.
Reading in data
I’ve saved the dataset as a csv file. We can read this into R using the read_csv() function, which is loaded with the tidyverse. For now we’ll load it from the internet. We’ll talk about other options for reading in data later in the course!
library(tidyverse)
nlsy <- read_csv("https://intro-to-R-2020.louisahsmith.com/data/nlsy_cc.csv")
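To see what read_csv() does without relying on the course website, here is a small sketch that writes a made-up two-row csv to a temporary file and reads it back. The file contents and the names tmp and toy_csv are invented for illustration; only read_csv() itself comes from the lab.

```r
library(readr)  # read_csv() lives in readr, which library(tidyverse) also loads

# write a tiny, made-up csv to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,income", "1,20000", "2,35000"), tmp)

# read_csv() takes a file path or a URL and returns a tibble,
# guessing each column's type from the values it finds
toy_csv <- read_csv(tmp)
toy_csv
```

The first line of the file becomes the column names, which is why the nlsy variables already have readable names after reading in the csv.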
We can explore the data with a number of functions that we apply to either the whole dataset, or to a single variable in the dataset. Here are a couple of ways we can look at the whole dataset:
nlsy
#> # A tibble: 1,205 x 14
#> glasses eyesight sleep_wkdy sleep_wknd id nsibs samp race_eth sex region income
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1 5 7 3 3 5 3 2 1 22390
#> 2 1 2 6 7 6 1 1 3 1 1 35000
#> 3 0 2 7 9 8 7 6 3 2 1 7227
#> 4 1 3 6 7 16 3 5 3 2 1 48000
#> 5 0 3 10 10 18 2 1 3 1 3 4510
#> 6 1 2 7 8 20 2 5 3 2 1 50000
#> 7 0 1 8 8 27 1 5 3 2 1 20000
#> 8 1 1 8 8 49 6 5 3 2 1 23900
#> 9 1 2 7 8 57 1 5 3 2 1 23289
#> 10 0 1 8 8 67 1 1 3 1 1 35000
#> # … with 1,195 more rows, and 3 more variables: res_1980 <dbl>, res_2002 <dbl>, age_bir <dbl>
glimpse(nlsy)
#> Rows: 1,205
#> Columns: 14
#> $ glasses <dbl> 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0…
#> $ eyesight <dbl> 1, 2, 2, 3, 3, 2, 1, 1, 2, 1, 3, 5, 1, 1, 1, 1, 3, 2, 3, 3, 4, 2, 2, 5, 1…
#> $ sleep_wkdy <dbl> 5, 6, 7, 6, 10, 7, 8, 8, 7, 8, 8, 7, 7, 7, 8, 7, 7, 8, 8, 8, 7, 6, 8, 7, …
#> $ sleep_wknd <dbl> 7, 7, 9, 7, 10, 8, 8, 8, 8, 8, 8, 7, 8, 7, 8, 7, 4, 8, 8, 9, 7, 10, 8, 7,…
#> $ id <dbl> 3, 6, 8, 16, 18, 20, 27, 49, 57, 67, 86, 96, 97, 98, 117, 137, 172, 179, …
#> $ nsibs <dbl> 3, 1, 7, 3, 2, 2, 1, 6, 1, 1, 7, 2, 7, 2, 2, 4, 9, 2, 2, 2, 4, 2, 4, 4, 2…
#> $ samp <dbl> 5, 1, 6, 5, 1, 5, 5, 5, 5, 1, 7, 6, 5, 6, 1, 5, 6, 5, 5, 5, 8, 1, 7, 5, 5…
#> $ race_eth <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 2, 3, 3…
#> $ sex <dbl> 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2…
#> $ region <dbl> 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1…
#> $ income <dbl> 22390, 35000, 7227, 48000, 4510, 50000, 20000, 23900, 23289, 35000, 1688,…
#> $ res_1980 <dbl> 11, 3, 11, 11, 11, 3, 11, 11, 11, 3, 11, 11, 11, 11, 6, 3, 11, 11, 3, 11,…
#> $ res_2002 <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 19, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ age_bir <dbl> 19, 30, 17, 31, 19, 30, 27, 24, 21, 36, 17, 19, 29, 30, 26, 26, 35, 22, 3…
summary(nlsy)
#> glasses eyesight sleep_wkdy sleep_wknd id
#> Min. :0.0000 Min. :1.00 Min. : 0.000 Min. : 0.000 Min. : 3
#> 1st Qu.:0.0000 1st Qu.:1.00 1st Qu.: 6.000 1st Qu.: 6.000 1st Qu.: 2317
#> Median :1.0000 Median :2.00 Median : 7.000 Median : 7.000 Median : 4744
#> Mean :0.5178 Mean :1.99 Mean : 6.643 Mean : 7.267 Mean : 5229
#> 3rd Qu.:1.0000 3rd Qu.:3.00 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 7937
#> Max. :1.0000 Max. :5.00 Max. :13.000 Max. :14.000 Max. :12667
#> nsibs samp race_eth sex region
#> Min. : 0.000 Min. : 1.000 Min. :1.000 Min. :1.000 Min. :1.000
#> 1st Qu.: 2.000 1st Qu.: 4.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000
#> Median : 3.000 Median : 5.000 Median :3.000 Median :2.000 Median :3.000
#> Mean : 3.937 Mean : 7.002 Mean :2.395 Mean :1.584 Mean :2.593
#> 3rd Qu.: 5.000 3rd Qu.:11.000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:3.000
#> Max. :16.000 Max. :20.000 Max. :3.000 Max. :2.000 Max. :4.000
#> income res_1980 res_2002 age_bir
#> Min. : 0 Min. : 1.00 Min. : 5.00 Min. :13.00
#> 1st Qu.: 6000 1st Qu.:11.00 1st Qu.:11.00 1st Qu.:19.00
#> Median :11155 Median :11.00 Median :11.00 Median :22.00
#> Mean :15289 Mean : 9.14 Mean :11.05 Mean :23.45
#> 3rd Qu.:20000 3rd Qu.:11.00 3rd Qu.:11.00 3rd Qu.:27.00
#> Max. :75001 Max. :16.00 Max. :19.00 Max. :52.00
# within the RStudio browser
View(nlsy)
In many functions in R, we refer to specific variables using dollar-sign notation. So to access the id variable in the nlsy dataset we’d type nlsy$id and all of the id numbers would print to the console. Don’t do this though, or 1000+ numbers will print out! Instead, we might look at the first or last few with head() or tail():
head(nlsy$id)
#> [1] 3 6 8 16 18 20
tail(nlsy$sleep_wknd)
#> [1] 12 8 12 5 7 5
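The same pattern works on any data frame. As a sketch, using a small made-up data frame (toy_df and its values are hypothetical, not NLSY data), $ pulls out one column as a vector, and head() and tail() accept an optional n argument for how many values to show:

```r
# toy_df is a made-up data frame standing in for nlsy
toy_df <- data.frame(
  id = c(3, 6, 8, 16, 18, 20, 27, 49),
  sleep_wknd = c(7, 7, 9, 7, 10, 8, 8, 8)
)

# dollar-sign notation returns the column as a plain vector
toy_df$id

# head() and tail() show 6 values by default; n changes that
head(toy_df$id, n = 3)
#> [1] 3 6 8
tail(toy_df$sleep_wknd, n = 2)
#> [1] 8 8
```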
We can use the summary() function on a single variable.
summary(nlsy$sleep_wkdy)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.000 6.000 7.000 6.643 8.000 13.000
Many of the most basic functions in R are pretty straightforward:
table(nlsy$region)
#>
#> 1 2 3 4
#> 206 333 411 255
mean(nlsy$age_bir)
#> [1] 23.44813
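One handy consequence, sketched here on made-up values: table() counts how often each distinct value appears, and the mean of a variable coded 0/1 is the proportion of 1s. That is one way to read the Mean of 0.5178 for glasses in the summary() output earlier, assuming 1 means the respondent wears glasses (the coding is not stated in this lab).

```r
# made-up 0/1 values, not NLSY data
toy_glasses <- c(0, 1, 1, 0, 1)

# counts of each distinct value
table(toy_glasses)

# 3 of the 5 values are 1, so the mean is the proportion of 1s
mean(toy_glasses)
#> [1] 0.6
```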
We can find out more information from the documentation:
help(cor)
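There are a few equivalent ways to reach the same help pages, shown here as a quick sketch. These are all base R; the search term is only an example.

```r
# ? is shorthand for help()
?cor

# quoting the topic also works, and is needed for operators like "$"
help("cor")
help("$")

# ?? searches the installed documentation when you don't know the function name
??correlation
```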
And if you’re not sure what you’re looking for, there’s a ton of info elsewhere.
1.4 Group challenge exercises
- How many people are in the NLSY? How many variables are in this dataset? What are two ways you can answer these questions using tools we’ve discussed?
- Can you find one or more R functions we haven’t discussed that answer the first question? Feel free to Google! See how many ways you and your group can come up with!
- What’s the standard deviation of the number of hours of sleep on weekends?
- What’s the Spearman correlation between hours of sleep on weekends and weekdays in this data?
- Try to read in the data from an Excel file (it should be possible even if you don’t have Excel on your computer!). It’s in a tab called data, but there’s a header as well. (It might help to open it up in whatever spreadsheet program you have.) You’ll have to load the readxl package (you already installed it with tidyverse, but it doesn’t load automatically), and probably read some of the documentation: https://readxl.tidyverse.org.
# first, use this script to download the data to your current working directory
download.file("https://intro-to-R-2020.louisahsmith.com/data/nlsy_cc.xlsx",
              destfile = file.path(getwd(), "nlsy_cc.xlsx"),
              mode = "wb") # binary mode, so the .xlsx file isn't corrupted on Windows
# this will be the path argument you'll need
path <- "nlsy_cc.xlsx"
# the variables also still have the NLSY-assigned names, so you'll need these
col_names <- c("glasses", "eyesight", "sleep_wkdy", "sleep_wknd", "id", "nsibs",
               "samp", "race_eth", "sex", "region", "income", "res_1980",
               "res_2002", "age_bir")