3 Selecting, filtering, and mutating


Download the materials for this week’s exercises here. Once you’ve unzipped (and extracted, if you’re on Windows) the file and moved the folder wherever you want, open up a new RStudio session using the 03-week.Rproj file. You should have nothing in your environment. If you’re having trouble with this, look back at the first week’s exercises for guidance.

Then you’ll need to load the tidyverse package and read in the data like before. I included it in this week’s materials too, so we don’t have to worry about reading it in from a different directory:

nlsy <- read_csv("nlsy_cc.csv")

3.1 Making new variables with mutate()

Standardized income

Using the NLSY data and mutate(), make a standardized (centered at the mean, and divided by the standard deviation) version of income.

# replace the ... with your code
nlsy <- mutate(nlsy, income_stand = ...)

Standardized log(income)

Do the same thing, but using income on the log scale. Look at this variable using summary(). Can you figure out what happened? (Hint: look at your log(income) variable.)

nlsy <- mutate(nlsy, log_income_stand = ...)


Redo the previous question, but if you are not able to calculate log(income) for an observation, replace it with a missing value (using case_when()). This time, when you standardize log(income), you’ll have to use na.rm = TRUE to remove missing values both when you take the mean and the standard deviation.

nlsy <- mutate(nlsy,
               log_income_stand = case_when(

3.2 Factors in R using the forcats package

Recode a factor

Turn the eyesight variable into a factor variable. The numbers 1-5 correspond to “excellent”, “very good”, “good”, “fair”, and “poor.” Make sure that categories are in an appropriate order.

nlsy <- mutate(nlsy, eyesight_fact = ...)

Combining factor levels

Use two different methods to combine the worst two categories of eyesight into one category.

Relevel a factor

Make a new categorical income variable with at least 3 levels (you can choose the cutoffs). Make a bar graph with this new variable where the bars are in the correct order from low to high.

3.3 Selecting variables using select()

Select centered variables

Create mean-centered versions of “age_bir”, “nsibs”, “income”, and the two sleep variables. Use the same ending (e.g., "_cent") for all of them. Then make a new dataset of just the centered variables using select() and a helper.

nlsy <- mutate(nlsy,
               age_bir_cent = ...,
               nsibs_cent = ...,
nlsy_new <- select(nlsy, ...)

Go back to the beginning

You may have added a lot of variables to the original dataset by now. Create a dataset called nlsy_orig that contains only the variables we started off with, using the vector of names we originally used to name the columns and the all_of() helper. I’ll start you off with the variable names.

colnames_orig <- c("glasses", "eyesight", "sleep_wkdy", "sleep_wknd",
                   "id", "nsibs", "samp", "race_eth", "sex", "region", 
                   "income", "res_1980", "res_2002", "age_bir")

Rename variables

Look at help(rename). Looking at the examples to help, rename “age_bir” to “age_1st_birth” without making a new column.

3.4 Subset your data with filter()

“Or” conditions

Create a dataset with all the observations that get over 7 hours of sleep on both weekends and weekdays or who have an income greater than/equal to 20,000 and less than/equal to 50,000.

nlsy_or <- filter(nlsy, ...)

Missing values

Create a dataset that consists only of the missing values in slp_cat_wkdy. Check how many rows it has (there should be 3!).

Greater than/less than

Look up the between() function in help. Figure out how to use this to answer the first question in this section, when choosing people whose income is between 20,000 and 50,000. Check to make sure you get the same number of rows.