Lab

1 The basics

You can either download the lab as an RMarkdown file here, or copy and paste the code as we go into a .R script. Either way, save it into the 01-week folder where you completed the exercises!

1.2 This week’s exercises

# create a vector of numeric values
vals <- c(1, 645, 329)
vals

# run these lines of code one at a time and compare what each does
# what happens in your environment window? what about your console?
new_vals
c(13, 7245, 23, 49.32)
new_vals <- c(13, 7245, 23, 49.32)
new_vals

# create and view different types of vectors
chars <- c("dog", "cat", "rhino")
chars
logs <- c(TRUE, FALSE, FALSE)
logs

# create a matrix
mat <- matrix(c(234, 7456, 12, 654, 183, 753), nrow = 2)
mat

# pull out rows
mat[2, ]
  1. Extract 645 from vals using square brackets.

  2. Extract "rhino" from chars using square brackets.

  3. You saw how to extract the second row of mat. Figure out how to extract the second column.

  4. Extract 183 from mat using square brackets.

  5. Figure out how to get each of the following errors: "incorrect number of dimensions" and "subscript out of bounds".
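
If you get stuck on the bracket syntax, here is a minimal sketch using a made-up vector and matrix (`x` and `m` are illustrative names, not objects from the lab). A vector takes a single index; a matrix takes a row and a column separated by a comma.

```r
# hypothetical objects, made up for illustration
x <- c(10, 20, 30)
m <- matrix(1:6, nrow = 2)  # filled column by column: rows are (1, 3, 5) and (2, 4, 6)

x[3]      # third element of the vector
m[1, ]    # first row (leave the column index blank)
m[, 3]    # third column (leave the row index blank)
m[1, 2]   # the element in row 1, column 2
```

The same pattern carries over to vals, chars, and mat.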

1.3 Data in R

We’re using some data from the National Longitudinal Survey of Youth 1979, a cohort of American young adults aged 14-22 at enrollment in 1979. They continue to be followed to this day, and there is a wealth of publicly available data online. I’ve downloaded the answers to a survey question about whether respondents wear glasses, a scale about their eyesight with glasses, their (NLSY-assigned 😒) race/ethnicity, their sex, their family’s income in 1979, and their age at the birth of their first child.

Reading in data

I’ve saved the dataset as a csv file. We can read this into R using the read_csv() function, which is loaded with the tidyverse. For now we’ll load it from the internet. We’ll talk about other options for reading in data later in the course!

library(tidyverse)
nlsy <- read_csv("https://intro-to-R-2020.louisahsmith.com/data/nlsy_cc.csv")
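
To get a feel for what read_csv() does with a file, here is a small sketch that reads a made-up csv typed directly into the script. The `toy` data and its column names are invented for illustration, and wrapping the text in I() assumes a recent version of readr, where that tells read_csv() to treat the text as literal data rather than a file path.

```r
library(readr)  # loaded automatically with the tidyverse

# made-up data: the first line holds the column names, each later line is one row
toy <- read_csv(I("id,glasses\n1,0\n2,1\n"))
toy
```

The result is a tibble with one column per comma-separated field, which is the same kind of object as nlsy.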

We can explore the data with a number of functions that we apply to either the whole dataset, or to a single variable in the dataset. Here are a couple of ways we can look at the whole dataset:

nlsy
#>  # A tibble: 1,205 x 14
#>     glasses eyesight sleep_wkdy sleep_wknd    id nsibs  samp race_eth   sex region income
#>       <dbl>    <dbl>      <dbl>      <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl>  <dbl>  <dbl>
#>   1       0        1          5          7     3     3     5        3     2      1  22390
#>   2       1        2          6          7     6     1     1        3     1      1  35000
#>   3       0        2          7          9     8     7     6        3     2      1   7227
#>   4       1        3          6          7    16     3     5        3     2      1  48000
#>   5       0        3         10         10    18     2     1        3     1      3   4510
#>   6       1        2          7          8    20     2     5        3     2      1  50000
#>   7       0        1          8          8    27     1     5        3     2      1  20000
#>   8       1        1          8          8    49     6     5        3     2      1  23900
#>   9       1        2          7          8    57     1     5        3     2      1  23289
#>  10       0        1          8          8    67     1     1        3     1      1  35000
#>  # … with 1,195 more rows, and 3 more variables: res_1980 <dbl>, res_2002 <dbl>, age_bir <dbl>
glimpse(nlsy)
#>  Rows: 1,205
#>  Columns: 14
#>  $ glasses    <dbl> 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0…
#>  $ eyesight   <dbl> 1, 2, 2, 3, 3, 2, 1, 1, 2, 1, 3, 5, 1, 1, 1, 1, 3, 2, 3, 3, 4, 2, 2, 5, 1…
#>  $ sleep_wkdy <dbl> 5, 6, 7, 6, 10, 7, 8, 8, 7, 8, 8, 7, 7, 7, 8, 7, 7, 8, 8, 8, 7, 6, 8, 7, …
#>  $ sleep_wknd <dbl> 7, 7, 9, 7, 10, 8, 8, 8, 8, 8, 8, 7, 8, 7, 8, 7, 4, 8, 8, 9, 7, 10, 8, 7,…
#>  $ id         <dbl> 3, 6, 8, 16, 18, 20, 27, 49, 57, 67, 86, 96, 97, 98, 117, 137, 172, 179, …
#>  $ nsibs      <dbl> 3, 1, 7, 3, 2, 2, 1, 6, 1, 1, 7, 2, 7, 2, 2, 4, 9, 2, 2, 2, 4, 2, 4, 4, 2…
#>  $ samp       <dbl> 5, 1, 6, 5, 1, 5, 5, 5, 5, 1, 7, 6, 5, 6, 1, 5, 6, 5, 5, 5, 8, 1, 7, 5, 5…
#>  $ race_eth   <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 2, 3, 3…
#>  $ sex        <dbl> 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2…
#>  $ region     <dbl> 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1…
#>  $ income     <dbl> 22390, 35000, 7227, 48000, 4510, 50000, 20000, 23900, 23289, 35000, 1688,…
#>  $ res_1980   <dbl> 11, 3, 11, 11, 11, 3, 11, 11, 11, 3, 11, 11, 11, 11, 6, 3, 11, 11, 3, 11,…
#>  $ res_2002   <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 19, 11, 11, 11, 11, 11, 11, 11, 1…
#>  $ age_bir    <dbl> 19, 30, 17, 31, 19, 30, 27, 24, 21, 36, 17, 19, 29, 30, 26, 26, 35, 22, 3…
summary(nlsy)
#>      glasses          eyesight      sleep_wkdy       sleep_wknd           id       
#>   Min.   :0.0000   Min.   :1.00   Min.   : 0.000   Min.   : 0.000   Min.   :    3  
#>   1st Qu.:0.0000   1st Qu.:1.00   1st Qu.: 6.000   1st Qu.: 6.000   1st Qu.: 2317  
#>   Median :1.0000   Median :2.00   Median : 7.000   Median : 7.000   Median : 4744  
#>   Mean   :0.5178   Mean   :1.99   Mean   : 6.643   Mean   : 7.267   Mean   : 5229  
#>   3rd Qu.:1.0000   3rd Qu.:3.00   3rd Qu.: 8.000   3rd Qu.: 8.000   3rd Qu.: 7937  
#>   Max.   :1.0000   Max.   :5.00   Max.   :13.000   Max.   :14.000   Max.   :12667  
#>       nsibs             samp           race_eth          sex            region     
#>   Min.   : 0.000   Min.   : 1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
#>   1st Qu.: 2.000   1st Qu.: 4.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:2.000  
#>   Median : 3.000   Median : 5.000   Median :3.000   Median :2.000   Median :3.000  
#>   Mean   : 3.937   Mean   : 7.002   Mean   :2.395   Mean   :1.584   Mean   :2.593  
#>   3rd Qu.: 5.000   3rd Qu.:11.000   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:3.000  
#>   Max.   :16.000   Max.   :20.000   Max.   :3.000   Max.   :2.000   Max.   :4.000  
#>       income         res_1980        res_2002        age_bir     
#>   Min.   :    0   Min.   : 1.00   Min.   : 5.00   Min.   :13.00  
#>   1st Qu.: 6000   1st Qu.:11.00   1st Qu.:11.00   1st Qu.:19.00  
#>   Median :11155   Median :11.00   Median :11.00   Median :22.00  
#>   Mean   :15289   Mean   : 9.14   Mean   :11.05   Mean   :23.45  
#>   3rd Qu.:20000   3rd Qu.:11.00   3rd Qu.:11.00   3rd Qu.:27.00  
#>   Max.   :75001   Max.   :16.00   Max.   :19.00   Max.   :52.00
# open the dataset in RStudio's data viewer
View(nlsy)

In many functions in R, we refer to specific variables using dollar-sign notation. So to access the id variable in the nlsy dataset we’d type nlsy$id and all of the id numbers would print to the console. Don’t do this though, or 1000+ numbers will print out! Instead, we might look at the first or last few with head() or tail().

head(nlsy$id)
#>  [1]  3  6  8 16 18 20
tail(nlsy$sleep_wknd)
#>  [1] 12  8 12  5  7  5
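
By default head() and tail() show six values, but both take an n argument if you want more or fewer. A quick sketch on a made-up data frame (`small` and its columns are invented for illustration):

```r
# a made-up data frame standing in for a real dataset
small <- data.frame(id = 1:10, hours = c(7, 8, 6, 9, 7, 5, 8, 8, 6, 7))

head(small$id, n = 3)     # the first three ids
tail(small$hours, n = 2)  # the last two values of hours
head(small, n = 2)        # head() also works on a whole data frame
```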

We can use the summary() function on a single variable.

summary(nlsy$sleep_wkdy)
#>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    0.000   6.000   7.000   6.643   8.000  13.000

Many of the most basic functions in R are pretty straightforward:

table(nlsy$region)
#>  
#>    1   2   3   4 
#>  206 333 411 255
mean(nlsy$age_bir)
#>  [1] 23.44813
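
One thing to watch for with these functions: if a variable has missing values (NA), mean() returns NA unless you ask it to drop them, and table() quietly leaves them out. The mean above came back as a number, so age_bir has no missing values in this file; the vector below is made up purely to illustrate.

```r
ages <- c(19, 30, NA, 31)  # made-up values with one missing

mean(ages)                    # NA, because one value is missing
mean(ages, na.rm = TRUE)      # drops the NA first, then averages the other three
table(ages, useNA = "ifany")  # includes a count of NAs if there are any
```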

We can find out more information from the documentation:

help(cor)
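
A few other base R ways to get at the documentation, as a quick sketch:

```r
?cor              # shorthand for help(cor)
example(mean)     # runs the examples from the bottom of the help page for mean()
apropos("mean")   # lists available functions whose names contain "mean"
```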

And if you’re not sure what you’re looking for, there’s a ton of info elsewhere.

1.4 Group challenge exercises

  1. How many people are in the NLSY? How many variables are in this dataset? What are two ways you can answer these questions using tools we’ve discussed?
  2. Can you find an R function(s) we haven’t discussed that answers question 1? Feel free to Google! See how many ways you and your group can come up with!
  3. What’s the standard deviation of the number of hours of sleep on weekends?
  4. What’s the Spearman correlation between hours of sleep on weekends and weekdays in this data?
  5. Try to read in the data from an Excel file (it should be possible even if you don’t have Excel on your computer!). It’s in a tab called data, but there’s a header as well. (It might help to open it up in whatever spreadsheet program you have.) You’ll have to load the readxl package (you already installed it with tidyverse, but it doesn’t load automatically), and probably read some of the documentation: https://readxl.tidyverse.org.
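# first, use this script to download the data to your current working directory
# (mode = "wb" writes the file as binary, which keeps the Excel file from being corrupted on Windows)
download.file("https://intro-to-R-2020.louisahsmith.com/data/nlsy_cc.xlsx",
              destfile = file.path(getwd(), "nlsy_cc.xlsx"), mode = "wb")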
# this will be the path argument you'll need
path <- "nlsy_cc.xlsx"
# the variables in the Excel file still have their NLSY-assigned names, so you'll need these
col_names <- c("glasses", "eyesight", "sleep_wkdy", "sleep_wknd", "id", "nsibs", 
               "samp", "race_eth", "sex", "region", "income", "res_1980",
               "res_2002", "age_bir")
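
If you want to warm up with readxl before tackling the NLSY file, here is a small sketch that uses an example workbook that ships with the package, not the lab data. The names `example_path` and `cars` are made up for illustration. The arguments you'll need for the header and the column names aren't shown here; they are described in the read_excel() documentation.

```r
library(readxl)

# an example workbook that comes with readxl (not the NLSY file)
example_path <- readxl_example("datasets.xlsx")

excel_sheets(example_path)                          # names of the tabs in the workbook
cars <- read_excel(example_path, sheet = "mtcars")  # read one tab by name
head(cars)
```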