Let's put it
ALL
together
Put everything you've learned into action, and more!
my-project/
├─ my-project.Rproj
├─ README
├─ data/
│  ├── raw/
│  └── processed/
├─ code/
├─ results/
│  ├── tables/
│  ├── figures/
│  └── output/
└─ docs/
An .Rproj file is mostly just a placeholder. It remembers various options, and makes it easy to open a new RStudio session that starts up in the correct working directory. You never need to edit it directly.
A README file can just be a text file that includes notes for yourself or future users.
I like to have a folder for raw data -- which I never touch -- and separate folders for the datasets I create along the way.
The here package

source(here::here("code", "functions.R"))

dat <- read_csv(here::here("data", "raw", "data.csv"))

p <- ggplot(dat) +
  geom_point(aes(x, y))

ggsave(plot = p,
       filename = here::here("results", "figures", "fig.pdf"))
The here package lets you refer to files without worrying too much about relative paths.
Construct file paths with reference to the top directory holding your .Rproj file.
here::here("data", "raw", "data.csv")
For me, that becomes "/Users/louisahsmith/Google Drive/Teaching/R course/materials/data/raw/data.csv".
But if I send you this file, it will become whatever file path you need it to be.
The here package

here::here()

is equivalent to

library(here)
here()
I just prefer to write out the package name whenever I need it, but you can load the package for your entire session if you want.
Note that you can refer to any function without loading the whole package this way:
tableone::CreateTableOne()
is the same as
library(tableone)
CreateTableOne()
The source() function

Will run code from another file.
# run the code in script.R, assuming it's in my current working directory
source("script.R")

# run the code in my-project/code/functions.R from wherever I am in my-project
source(here::here("code", "functions.R"))
All the objects will be created, packages loaded, etc. as if you had run the code directly from the console.
The source() function

Can even run code directly from a URL.
source("https://raw.githubusercontent.com/louisahsmith/intro-to-R-2020/master/static/exercises/01-week/01-todo.R")
## Error in eval(ei, envir): object 'new_vals' not found
Remember the first week when I had you generating errors on purpose?
Reading code from another file can make it a bit harder to debug.
But it's nice when you have functions, etc. that you use a lot and want to include them at the start of every script.
You could also begin your scripts by reading in your data via a data-cleaning file with source().
Each of the functions below has different arguments that will allow you to read in specific columns only, skip rows, give the variables names, etc. There are also better options out there if your dataset is really big -- look into the data.table or vroom packages (see the sketch after the code below) -- and other packages if you have other types of data.
library(tidyverse)
dat <- read_csv("data.csv")
dat <- read_table("data.dat")
dat <- read_rds("data.rds")
dat <- readxl::read_excel("data.xlsx")
dat <- haven::read_sas("data.sas7bdat")
dat <- haven::read_stata("data.dta")
dat <- googlesheets4::read_sheet("sheet-id")
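If speed matters, here's a minimal sketch of those two faster readers (assuming the same data.csv; both are the packages' standard reading functions):

# data.table's fread() is a very fast reader that guesses column types
dat_dt <- data.table::fread("data.csv")

# vroom indexes the file and reads values lazily, so huge files open quickly
dat_v <- vroom::vroom("data.csv")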
But once you've cleaned your data and created your dataset, you probably just want to save a copy so you don't need to rerun all your data-cleaning steps every time you want to use it.
You can basically do the opposite of most of the read functions: write.
The one I usually use, if I'm creating data for myself, is write_rds(). It creates an R object you can read in with read_rds(), so you can guarantee nothing will change in between writing and reading.
If I'm sharing data, I usually use write_csv().
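A minimal sketch of that workflow (assuming a cleaned tibble named dat and the folder structure above):

# an exact snapshot of the R object: types, factors, and attributes survive
write_rds(dat, here::here("data", "processed", "dat.rds"))
dat_again <- read_rds(here::here("data", "processed", "dat.rds"))

# a plain-text copy that collaborators can open anywhere
write_csv(dat, here::here("data", "processed", "dat.csv"))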
Note: these are the tidyverse versions of the functions, which have better defaults, are more consistent, and are just more likely to do what you want. The "base R" versions are: read.csv(), write.csv(), readRDS(), and saveRDS().
So my process might look like this:
code/clean_data.R
# read in the raw dataset my collaborators gave me
dat <- read_csv(here::here("data", "raw", "dataset.csv"))

# do whatever cleaning/subsetting I need to
newdat <- dat %>%
  mutate(new_var = var * 2) %>%
  filter(age >= 40) %>%
  select(age, new_var)

# save as an R object for later analysis
write_rds(newdat, here::here("data", "processed", "over_40.rds"))
code/main_analysis.R
# read in the clean data
dat <- read_rds(here::here("data", "processed", "over_40.rds"))

# run a linear regression model
model <- lm(new_var ~ age, data = dat)

# save the model for making tables and figures later
write_rds(model, here::here("results", "output", "linear_mod.rds"))
code/make_tables.R
# read in the clean data and the model, make some tables
dat <- read_rds(here::here("data", "processed", "over_40.rds"))
mod <- read_rds(here::here("results", "output", "linear_mod.rds"))

# make a table 1
tab1 <- tableone::CreateTableOne(..., data = dat)

# make a table of analysis results
tab2 <- broom::tidy(mod)
Depending on my needs, I might read these tables into an RMarkdown document, save them to a .csv file, etc.
Now on to making figures, etc....
Your turn...
Exercises 6.1: Change the file paths so the document knits.
NA for missing values

R will answer NA to any logical statement involving NA:

3 < NA
## [1] NA
mean(c(1, 2, NA))
## [1] NA
mean(c(1, 2, NA), na.rm = TRUE)
## [1] 1.5
NA functions

Certain functions deal with missing values explicitly.
vals <- c(1, 2, NA)
is.na(vals)
## [1] FALSE FALSE TRUE
anyNA(vals)
## [1] TRUE
na.omit(vals)
## [1] 1 2
NAs with na_if()
You might read in data that has special values to indicate missingness.
In NLSY, -1 = Refused, -2 = Don't know, -3 = Invalid missing, -4 = Valid missing, -5 = Non-interview
nlsy[1, c("id", "glasses", "age_bir")]
## # A tibble: 1 x 3
##      id glasses age_bir
##   <dbl>   <dbl>   <dbl>
## 1     1      -4      -5
nlsy_na <- nlsy %>%
  na_if(-1) %>%
  na_if(-2) %>%
  na_if(-3) %>%
  na_if(-4) %>%
  na_if(-5)

nlsy_na[1, c("id", "glasses", "age_bir")]
## # A tibble: 1 x 3
##      id glasses age_bir
##   <dbl>   <dbl>   <dbl>
## 1     1      NA      NA
This is obviously a bit annoying if you have a lot of values that indicate missingness. In that case, you may want to look into the naniar package.
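For example, a minimal sketch with naniar (replace_with_na_all() applies one rule across every column):

# replace all five NLSY missing codes everywhere in one step
nlsy_na <- naniar::replace_with_na_all(nlsy, condition = ~ .x %in% c(-1, -2, -3, -4, -5))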
na_if()

The na_if() strategy is generally the most useful if you're determining NAs over the course of your analysis, or if you have different NA values for different variables.
# we decide that person 2 is a mistake...
nlsy_bad <- nlsy %>%
  mutate(id = na_if(id, 2))

nlsy_bad[1:3, c("id", "glasses", "age_bir")]
## # A tibble: 3 x 3
##      id glasses age_bir
##   <dbl>   <dbl>   <dbl>
## 1     1      -4      -5
## 2    NA       0      34
## 3     3       0      19
Or, if you know a priori which values indicate missingness (e.g., "."), you can specify that when reading in the data.
nlsy <- read_csv(here::here("data", "nlsy.csv"),
                 col_names = colnames_nlsy,
                 skip = 1,
                 na = c("-1", "-2", "-3", "-4", "-5"))
That way you don't have to first create factor levels for the missing codes (e.g., with fct_recode()), then change the values to NA.

Sometimes you may just want to get rid of all the rows with missing values.
nrow(nlsy)
## [1] 12686
nlsy_cc <- nlsy %>%
  filter(complete.cases(nlsy))

nrow(nlsy_cc)
## [1] 1436
nlsy2 <- nlsy %>%
  na.omit()

nrow(nlsy2)
## [1] 1436
Don't do this without good reason!
Your turn...
Exercises 6.2: Create and exclude observations based on missing values.
# load packages
library(tidyverse)

# must install with install.packages() if you haven't already
library(broom)    # for making pretty model output
library(splines)  # for adding splines

# read in data
nlsy_clean <- read_rds(here::here("data", "nlsy_clean.rds"))
Model formulas will automatically make indicator variables for factors, with the first level as the reference. An intercept will be included unless suppressed with y ~ -1 + x (a quick sketch below).
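For instance, a minimal sketch of the two parameterizations (using the nlsy_clean variables from below):

# default: an intercept plus indicators for all but the first level of race_eth
lm(log_inc ~ race_eth, data = nlsy_clean)

# intercept suppressed: one indicator for every level of race_eth instead
lm(log_inc ~ -1 + race_eth, data = nlsy_clean)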
# linear regression (OLS)
mod_lin1 <- lm(log_inc ~ age_bir + sex + race_eth, data = nlsy_clean)

# another way to do linear regression (GLM)
mod_lin2 <- glm(log_inc ~ age_bir + sex + race_eth,
                family = gaussian(link = "identity"),
                data = nlsy_clean)

# logistic regression
mod_log <- glm(glasses ~ eyesight + sex + race_eth,
               family = binomial(link = "logit"),
               data = nlsy_clean)

# poisson regression
mod_pois <- glm(nsibs ~ sleep_wkdy + sleep_wknd,
                family = poisson(link = "log"),
                data = nlsy_clean)
You can use the survival package for time-to-event models.
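For example, a minimal sketch of a Cox proportional hazards model (not course code; it uses the lung data that ships with survival):

library(survival)

# Surv() combines follow-up time and event status into a single outcome
cox_mod <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(cox_mod)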
Create interactions with * (the main terms will automatically be included too).

Create polynomial terms with, e.g., I(x^2).

Create splines with the splines package and the ns() function (or other packages/functions).
mod_big <- glm(log_inc ~ sex * age_bir + nsibs + I(nsibs^2) +
                 ns(sleep_wkdy, knots = 3),
               family = gaussian(link = "identity"),
               data = nlsy_clean)
As in the tidyverse packages, you don't need to quote the variable names or use data$variable notation in model formulas.

summary(mod_log)
## 
## Call:
## glm(formula = glasses ~ eyesight + sex + race_eth, family = binomial(link = "logit"), 
##     data = nlsy_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4275  -1.0825  -0.8343   1.2221   1.6261  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -0.51260    0.06317  -8.114 4.89e-16 ***
## eyesightVery Good               -0.07920    0.05359  -1.478   0.1394    
## eyesightGood                    -0.07146    0.06188  -1.155   0.2481    
## eyesightFair                    -0.21488    0.09105  -2.360   0.0183 *  
## eyesightPoor                     0.10558    0.18152   0.582   0.5608    
## sexFemale                        0.69281    0.04493  15.420  < 2e-16 ***
## race_ethNon-Hispanic Black      -0.28460    0.06493  -4.383 1.17e-05 ***
## race_ethNon-Black, Non-Hispanic  0.28528    0.05945   4.799 1.60e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11655  on 8443  degrees of freedom
## Residual deviance: 11281  on 8436  degrees of freedom
##   (4242 observations deleted due to missingness)
## AIC: 11297
## 
## Number of Fisher Scoring iterations: 4
Or use the tidy() function from the broom package, which nicely summarizes all sorts of models.
# from the broom package
tidy(mod_log)
## # A tibble: 8 x 5
##   term                            estimate std.error statistic  p.value
##   <chr>                              <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                      -0.513     0.0632    -8.11  4.89e-16
## 2 eyesightVery Good                -0.0792    0.0536    -1.48  1.39e- 1
## 3 eyesightGood                     -0.0715    0.0619    -1.15  2.48e- 1
## 4 eyesightFair                     -0.215     0.0911    -2.36  1.83e- 2
## 5 eyesightPoor                      0.106     0.182      0.582 5.61e- 1
## 6 sexFemale                         0.693     0.0449    15.4   1.21e-53
## 7 race_ethNon-Hispanic Black       -0.285     0.0649    -4.38  1.17e- 5
## 8 race_ethNon-Black, Non-Hispanic   0.285     0.0594     4.80  1.60e- 6
coef(mod_log)
##                     (Intercept)               eyesightVery Good 
##                     -0.51260479                     -0.07920222 
##                    eyesightGood                    eyesightFair 
##                     -0.07145996                     -0.21487546 
##                    eyesightPoor                       sexFemale 
##                      0.10557518                      0.69280825 
##      race_ethNon-Hispanic Black race_ethNon-Black, Non-Hispanic 
##                     -0.28460369                      0.28527712
coef(mod_log)[6]
## sexFemale 
## 0.6928082
tidy(mod_log) %>% slice(6) %>% pull(estimate)
## [1] 0.6928082
Reminder: if you have a model β₀ + β₁x₁ + …, coefficient 6 is really β₅!
coef(mod_log)["sexFemale"]
## sexFemale 
## 0.6928082
tidy(mod_log) %>% filter(term == "sexFemale") %>% pull(estimate)
## [1] 0.6928082
Since it's just a tibble/dataframe, you can create new columns!
# 95% confidence interval
res_mod_log <- mod_log %>%
  tidy() %>%
  mutate(lci = estimate - 1.96 * std.error,
         uci = estimate + 1.96 * std.error)

res_mod_log
## # A tibble: 8 x 7
##   term                            estimate std.error statistic  p.value     lci     uci
##   <chr>                              <dbl>     <dbl>     <dbl>    <dbl>   <dbl>   <dbl>
## 1 (Intercept)                      -0.513     0.0632    -8.11  4.89e-16 -0.636  -0.389 
## 2 eyesightVery Good                -0.0792    0.0536    -1.48  1.39e- 1 -0.184   0.0258
## 3 eyesightGood                     -0.0715    0.0619    -1.15  2.48e- 1 -0.193   0.0498
## 4 eyesightFair                     -0.215     0.0911    -2.36  1.83e- 2 -0.393  -0.0364
## 5 eyesightPoor                      0.106     0.182      0.582 5.61e- 1 -0.250   0.461 
## 6 sexFemale                         0.693     0.0449    15.4   1.21e-53  0.605   0.781 
## 7 race_ethNon-Hispanic Black       -0.285     0.0649    -4.38  1.17e- 5 -0.412  -0.157 
## 8 race_ethNon-Black, Non-Hispanic   0.285     0.0594     4.80  1.60e- 6  0.169   0.402
We could also clean up the term variable, perhaps with fct_recode().
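For instance, a minimal sketch (the labels here are hypothetical; fct_recode() takes new_name = "old_name" pairs):

res_mod_log %>%
  mutate(term = fct_recode(term,
    "Female (vs. male)" = "sexFemale",
    "Fair eyesight (vs. excellent)" = "eyesightFair"
  ))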
Since these are results from a logistic regression, we'll probably want to exponentiate the coefficients and their CIs.
res_mod_log <- res_mod_log %>%
  select(term, estimate, lci, uci) %>%
  filter(term != "(Intercept)") %>%
  # exponentiate all three columns at once!
  mutate(across(c(estimate, lci, uci), exp))

res_mod_log
## # A tibble: 7 x 4
##   term                            estimate   lci   uci
##   <chr>                              <dbl> <dbl> <dbl>
## 1 eyesightVery Good                  0.924 0.832 1.03 
## 2 eyesightGood                       0.931 0.825 1.05 
## 3 eyesightFair                       0.807 0.675 0.964
## 4 eyesightPoor                       1.11  0.779 1.59 
## 5 sexFemale                          2.00  1.83  2.18 
## 6 race_ethNon-Hispanic Black         0.752 0.662 0.854
## 7 race_ethNon-Black, Non-Hispanic    1.33  1.18  1.49
str_glue()
Now we want to combine the lower and upper CI limits.
res_mod_log %>%
  select(term, estimate, lci, uci) %>%
  filter(term != "(Intercept)") %>%
  mutate(ci = str_glue("({lci}, {uci})"))
## # A tibble: 7 x 5
##   term                            estimate   lci   uci ci                                    
##   <chr>                              <dbl> <dbl> <dbl> <glue>                                
## 1 eyesightVery Good                  0.924 0.832 1.03  (0.831734362089001, 1.02617441582346) 
## 2 eyesightGood                       0.931 0.825 1.05  (0.824698936937584, 1.05107868439189) 
## 3 eyesightFair                       0.807 0.675 0.964 (0.674797666860532, 0.96424628337116) 
## 4 eyesightPoor                       1.11  0.779 1.59  (0.778639992096712, 1.58622477434497) 
## 5 sexFemale                          2.00  1.83  2.18  (1.8307856398006, 2.18337382956281)   
## 6 race_ethNon-Hispanic Black         0.752 0.662 0.854 (0.662416384318316, 0.854408001346314)
## 7 race_ethNon-Black, Non-Hispanic    1.33  1.18  1.49  (1.18383962963298, 1.49449918933821)
We can paste text and R code (within {}) together with str_glue(). Everything goes in quotation marks.
You can use the round() function within the curly braces too!

str_glue()
You can paste any R expression you want evaluated in the curly braces.
You can break up chunks of your string to make it easier to read in your code.
str_glue(
  "The intercept from the regression is ",
  "{round(coef(lm(income ~ sex, data = nlsy_clean))[1])} and a random ",
  "number that I generated is {round(rnorm(1, 0, 1), 3)}."
)
## The intercept from the regression is 14880 and a random number that I generated is -1.469.
More functions are available in the glue package. For example, you could write a nice list of the regions in the data like this:
glue::glue_collapse(levels(nlsy_clean$region), sep = ", ", last = ", and ")
## Northeast, North central, South, and West
We want to take these values and print "OR (95% CI LCI, UCI)" for each one. Let's make a function to put together everything we've done so far!
ci_func <- function(estimate, lci, uci) {
  OR <- round(exp(estimate), 2)
  lci <- round(exp(lci), 2)
  uci <- round(exp(uci), 2)
  to_print <- str_glue("{OR} (95% CI {lci}, {uci})")
  return(to_print)
}
Let's test on some made-up values:
ci_func(.2523421, -.142433, .851234)
## 1.29 (95% CI 0.87, 2.34)
new_mod <- glm(glasses ~ eyesight + sex,
               family = binomial(link = "logit"),
               data = nlsy_clean)

tidy(new_mod) %>%
  filter(term != "(Intercept)") %>%  # we don't care about this term
  mutate(lci = estimate - 1.96 * std.error,
         uci = estimate + 1.96 * std.error,
         OR = ci_func(estimate, lci, uci),
         p.value = scales::pvalue(p.value)) %>%  # for formatting p-values
  select(term, OR, p.value)
## # A tibble: 5 x 3
##   term              OR                       p.value
##   <chr>             <glue>                   <chr>  
## 1 eyesightVery Good 0.9 (95% CI 0.81, 1)     0.042  
## 2 eyesightGood      0.88 (95% CI 0.78, 0.99) 0.041  
## 3 eyesightFair      0.74 (95% CI 0.62, 0.88) <0.001 
## 4 eyesightPoor      1.01 (95% CI 0.71, 1.44) 0.959  
## 5 sexFemale         1.99 (95% CI 1.82, 2.17) <0.001
If you want 2 decimal places no matter what, use something like format(round(x, digits = 2), nsmall = 2).
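For example, round() alone drops trailing zeros, while format() pads them back:

round(1.5, digits = 2)
## [1] 1.5

format(round(1.5, digits = 2), nsmall = 2)
## [1] "1.50"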
The tidy() function could have actually done the exponentiating and confidence interval calculating for us. See help(tidy.glm). But it's much more fun to do it ourselves 😉
tidy(new_mod, conf.int = TRUE, exponentiate = TRUE)
## # A tibble: 6 x 7
##   term              estimate std.error statistic  p.value conf.low conf.high
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)          0.655    0.0430   -9.85   6.59e-23    0.602     0.712
## 2 eyesightVery Good    0.898    0.0531   -2.04   4.18e- 2    0.809     0.996
## 3 eyesightGood         0.882    0.0612   -2.05   4.06e- 2    0.782     0.995
## 4 eyesightFair         0.738    0.0902   -3.37   7.58e- 4    0.618     0.880
## 5 eyesightPoor         1.01     0.181     0.0512 9.59e- 1    0.708     1.44 
## 6 sexFemale            1.99     0.0446   15.4    8.85e-54    1.82      2.17
If you're wondering why these confidence intervals differ slightly from the last slide, these are likelihood-based and ours were Wald confidence intervals. Don't worry if you don't know what this means!
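If you want to compare them yourself, here's a minimal sketch (for glm objects, confint() profiles the likelihood, while confint.default() uses the Wald formula):

# likelihood-based intervals, like tidy(conf.int = TRUE)
exp(confint(new_mod))

# Wald intervals, i.e., estimate ± 1.96 * std.error, like our hand calculation
exp(confint.default(new_mod))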
You can do it!
NLSY/
├─ NLSY.Rproj
├─ data/
│  ├── raw/
│  └── processed/
├─ code/
└── results/
    ├── tables/
    └── figures/
1. Put the raw NLSY data in data/raw.

2. In a new file called clean_data.R, read in the raw data and keep income, id, and 2 others of your choosing. Then restrict to complete cases and people with incomes < $30,000. Make a variable for the log of income (replace with NA if income <= 0). Save the cleaned dataset as an .rds file to the data/processed folder.

3. Make a new file called create_figure.R. In this file, read in the cleaned dataset. Load any packages you need. Then make a ggplot figure of your choosing to show something about the distribution of the data. Save it to the results/figures folder as a .png file using the ggsave() function.

4. Make another file called table_1.R. In this file, read in the cleaned dataset and use the tableone package to create a table 1 with the variables of your choosing. Modify the following code to save it as a .csv file. Open it in Excel/Numbers/Google Sheets/etc. to make sure it worked.

tab1 <- CreateTableOne(...) %>%
  print() %>%
  as_tibble(rownames = "id")

write_csv(tab1, ...)

5. In a file called lin_reg.R, read in the data and run the following linear regression: lm(log_inc ~ sex + race_eth + {other variables of your choosing}, data = nlsy). Modify the CI function to produce a table of results for a linear regression. Add an argument digits =, with a default of 2, to allow you to choose the number of digits you'd like. Save it in a separate file called functions.R. Use source() to read in the function at the beginning of your script.

6. Save the table of results as a .csv file. Make the names of the coefficients nice!

7. Use ggplot to make a figure. Use geom_point() for the point estimates and geom_errorbar() for the confidence intervals. It will look something like this:

ggplot(data) +
  geom_point(aes(x = , y = )) +
  geom_errorbar(aes(x = , ymin = , ymax = ))

8. Save the figure as a .pdf using ggsave(). You may want to play around with the height = and width = arguments to make it look like you want.

Some other packages to explore:

lubridate: Work with dates and times really easily. (https://lubridate.tidyverse.org)
janitor: Helps clean variable names, etc. (http://sfirke.github.io/janitor/)
furrr: Speed up your code with parallel processing. (https://davisvaughan.github.io/furrr/)
shiny: Make interactive apps. I made http://selection-bias.louisahsmith.com in shiny. (http://shiny.rstudio.com)
drake: Pipeline for analysis. (https://docs.ropensci.org/drake)
rvest: Scrape data from websites. (https://rvest.tidyverse.org)
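A quick taste of the first two (a sketch based on each package's basic documented usage):

# lubridate: parse a date and do date arithmetic
lubridate::ymd("2020-10-28") + lubridate::weeks(2)

# janitor: convert messy column names to clean snake_case
janitor::clean_names(tibble::tibble("Age at Birth" = 29))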
Your turn...
Exercises 6.3: Work through the challenge!