Then you’ll need to load the tidyverse package and read in the data like before. I included it in this week’s materials too, so we don’t have to worry about reading it in from a different directory:
library(tidyverse)
nlsy <- read_csv("nlsy_cc.csv")
mutate()
Using the NLSY data and mutate(), make a standardized (centered at the mean and divided by the standard deviation) version of income.
# replace the ... with your code
nlsy <- mutate(nlsy, income_stand = ...)
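As a reminder of what standardizing means, here is a minimal sketch on made-up data. The tibble `toy` and its column `score` are hypothetical and are not part of the NLSY file; the point is only the shape of the calculation inside mutate().

```r
library(dplyr)

# Hypothetical toy data; `score` is not a variable in the NLSY file
toy <- tibble(score = c(2, 4, 6))

# Subtract the mean, then divide by the standard deviation
toy <- mutate(toy, score_stand = (score - mean(score)) / sd(score))

toy$score_stand
```

A quick check on any standardized variable is that its mean is (very close to) 0 and its standard deviation is 1.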
Do the same thing, but using income on the log scale. Look at this variable using summary(). Can you figure out what happened? (Hint: look at your log(income) variable.)
nlsy <- mutate(nlsy, log_income_stand = ...)
summary(nlsy$log_income_stand)
case_when()
Redo the previous question, but if you are not able to calculate log(income) for an observation, replace it with a missing value (using case_when()). This time, when you standardize log(income), you’ll have to use na.rm = TRUE to remove missing values both when you take the mean and when you take the standard deviation.
nlsy <- mutate(nlsy,
  log_income_stand = case_when(
    ...
  ))
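The general pattern of setting some values to missing with case_when(), and then summarizing with na.rm = TRUE, might look like the sketch below. The data are invented for illustration: `toy`, `temp`, and the placeholder code -999 are all assumptions, not anything from the NLSY data.

```r
library(dplyr)

# Hypothetical toy data; -999 is a made-up placeholder for "not recorded"
toy <- tibble(temp = c(-999, 12, 18))

toy <- mutate(toy,
  temp_clean = case_when(
    temp == -999 ~ NA_real_,  # condition met: set to missing
    TRUE ~ temp               # otherwise keep the original value
  ))

mean(toy$temp_clean)                # NA, because one value is missing
mean(toy$temp_clean, na.rm = TRUE)  # 15
```

Note the use of NA_real_ rather than plain NA, so that every branch of case_when() returns a numeric value.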
forcats package
Turn the eyesight variable into a factor variable. The numbers 1-5 correspond to “excellent”, “very good”, “good”, “fair”, and “poor.” Make sure that the categories are in an appropriate order.
nlsy <- mutate(nlsy, eyesight_fact = ...)
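As a small illustration of how numeric codes map to ordered factor labels, here is a sketch using base R's factor() on invented data. The vector `size` and its labels are hypothetical; forcats functions can then be applied to a factor built this way.

```r
# Hypothetical toy data: the codes 1-3 stand for sizes
size <- c(3, 1, 2, 3)

size_fact <- factor(size,
                    levels = c(1, 2, 3),
                    labels = c("small", "medium", "large"))

levels(size_fact)  # the order follows `levels`, not the alphabet
table(size_fact)   # counts are reported in that same order
```

The labels are matched to the levels by position, so the two vectors need to be listed in the same order.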
Use two different methods to combine the worst two categories of eyesight into one category.
Make a new categorical income variable with at least 3 levels (you can choose the cutoffs). Make a bar graph with this new variable where the bars are in the correct order from low to high.
select()
Create mean-centered versions of “age_bir”, “nsibs”, “income”, and the two sleep variables. Use the same ending (e.g., "_cent") for all of them. Then make a new dataset of just the centered variables using select() and a helper.
nlsy <- mutate(nlsy,
  age_bir_cent = ...,
  nsibs_cent = ...,
  ...)
nlsy_new <- select(nlsy, ...)
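To show how a shared suffix pays off, the sketch below centers two invented columns and then picks them up with the ends_with() helper. The tibble `toy` and its columns are hypothetical, and ends_with() is only one of several select() helpers that could work here.

```r
library(dplyr)

# Hypothetical toy data, unrelated to the NLSY variables
toy <- tibble(id = 1:3,
              height = c(150, 160, 170),
              weight = c(50, 60, 70))

# Give every centered variable the same ending
toy <- mutate(toy,
  height_cent = height - mean(height),
  weight_cent = weight - mean(weight))

# The helper selects every column whose name ends in "_cent"
toy_cent <- select(toy, ends_with("_cent"))
names(toy_cent)
```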
You may have added a lot of variables to the original dataset by now. Create a dataset called nlsy_orig that contains only the variables we started off with, using the vector of names we originally used to name the columns and the all_of() helper. I’ll start you off with the variable names.
colnames_orig <- c("glasses", "eyesight", "sleep_wkdy", "sleep_wknd",
                   "id", "nsibs", "samp", "race_eth", "sex", "region",
                   "income", "res_1980", "res_2002", "age_bir")
Look at help(rename). Using the examples there as a guide, rename “age_bir” to “age_1st_birth” without making a new column.
filter()
Create a dataset with all the people who get over 7 hours of sleep on both weekends and weekdays, or who have an income greater than or equal to 20,000 and less than or equal to 50,000.
nlsy_or <- filter(nlsy, ...)
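When "and" and "or" conditions are mixed in one filter(), parentheses make the grouping explicit. The sketch below uses invented data (the tibble `toy` and its columns `x`, `y`, and `z` are hypothetical) to keep rows that satisfy both of two conditions, or else a third.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(x = c(1, 5, 9),
              y = c(1, 5, 1),
              z = c("a", "b", "b"))

# Keep rows where x and y are both above 3, OR where z is "a"
toy_or <- filter(toy, (x > 3 & y > 3) | z == "a")

nrow(toy_or)  # 2: the third row fails both parts of the condition
```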
Create a dataset that consists only of the missing values in slp_cat_wkdy. Check how many rows it has (there should be 3!).
Look up the between() function in help. Figure out how to use it to redo the first question in this section, where you chose people whose income is between 20,000 and 50,000. Check to make sure you get the same number of rows.
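For reference, dplyr's between() includes both endpoints, so it is equivalent to pairing >= with <=. A minimal sketch on invented data (the tibble `toy` and column `x` are hypothetical) shows the two approaches returning the same rows:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(x = c(5, 10, 15, 20, 25))

# Two comparisons joined with &
with_ops <- filter(toy, x >= 10 & x <= 20)

# The same range expressed with between(); both endpoints are included
with_between <- filter(toy, between(x, 10, 20))

nrow(with_ops) == nrow(with_between)
```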