1 Overview

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

These all combine naturally with group_by() which allows you to perform any operation “by group”. You can learn more about them in vignette("dplyr"). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette("two-table").

dplyr is designed to abstract over how the data is stored. That means as well as working with local data frames, you can also work with remote database tables, using exactly the same R code. Install the dbplyr package then read vignette("databases", package = "dbplyr").

2 Lesson

  1. Data Manipulation with dplyr

2.1 Variable recoding with dplyr

  • recode and recode_factor: Replace numeric values based on their position, and character values by their name;
  • if_else: Replace values based on a logical vector;
  • case_when: Vectorise multiple if and else if statements.

Recoding, when to use which function: - one-to-one, many-to-one: recode and recode_factor

Download the NHTS 2009 data file for the demos here (Right click & select Save As… to the directory of your RStudio project).

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
## ✔ tibble  1.4.2          ✔ dplyr   0.7.4.9000
## ✔ tidyr   0.8.0          ✔ stringr 1.3.0     
## ✔ readr   1.1.1          ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::vars()   masks ggplot2::vars()
# load NHTS2009 travel diaries subset
dd <- read_csv("NHTS2009_dd.csv")
## Parsed with column specification:
## cols(
##   HOUSEID = col_integer(),
##   PERSONID = col_character(),
##   HHSIZE = col_integer(),
##   HH_RACE = col_character(),
##   HHFAMINC = col_character(),
##   URBRUR = col_character(),
##   TRIPPURP = col_character(),
##   TRPTRANS = col_character(),
##   TRVLMIN = col_integer(),
##   TRPMILES = col_double()
## )
# recode race (HH_RACE column) according to data dictionary: http://nhts.ornl.gov/tables09/CodebookPage.aspx?id=951
dd %>% mutate(hh_race_str=recode(HH_RACE, 
                                 "01"="White",
                                 "02"="African American, Black",
                                 "03"="Asian Only",
                                 "04"="American Indian, Alaskan Native",
                                 "05"="Native Hawaiian, other Pacific",
                                 "06"="Multiracial",
                                 "07"="Hispanic/Mexican",
                                 "97"="Other specify",
                                 .default = as.character(NA) # any unspecified values would be assgined NA
                                 )) %>% 
  select(HH_RACE, hh_race_str)
## # A tibble: 304 x 2
##    HH_RACE hh_race_str
##    <chr>   <chr>      
##  1 01      White      
##  2 01      White      
##  3 01      White      
##  4 01      White      
##  5 01      White      
##  6 01      White      
##  7 01      White      
##  8 01      White      
##  9 01      White      
## 10 01      White      
## # ... with 294 more rows
  • a logical condition: if_else
# code driving & non-driving based on travel modes (TRPTRANS column) data dictionary: http://nhts.ornl.gov/tables09/CodebookPage.aspx?id=1084
dd %>% mutate(driving=ifelse(TRPTRANS %in% c("01", "02", "03", "04", "05", "06", "07"), 1, 0),
              driving=ifelse(TRPTRANS %in% c("-1", "-7", "-8", "-9"), NA, driving) # retain missing values as NA
             ) %>% 
  select(TRPTRANS, driving)
## # A tibble: 304 x 2
##    TRPTRANS driving
##    <chr>      <dbl>
##  1 03            1.
##  2 03            1.
##  3 03            1.
##  4 03            1.
##  5 03            1.
##  6 03            1.
##  7 03            1.
##  8 03            1.
##  9 03            1.
## 10 03            1.
## # ... with 294 more rows
  • multiple logical conditions: case_when
# code driving & non-driving based on travel modes (TRPTRANS column) data dictionary: http://nhts.ornl.gov/tables09/CodebookPage.aspx?id=1084 use case_when
dd %>% mutate(driving=case_when(
  TRPTRANS %in% c("01", "02", "03", "04", "05", "06", "07") ~ 1, 
  TRPTRANS %in% c("-1", "-7", "-8", "-9") ~ as.double(NA), # retain missing values as NA
  TRUE ~ 0)) %>% 
  select(TRPTRANS, driving)
## # A tibble: 304 x 2
##    TRPTRANS driving
##    <chr>      <dbl>
##  1 03            1.
##  2 03            1.
##  3 03            1.
##  4 03            1.
##  5 03            1.
##  6 03            1.
##  7 03            1.
##  8 03            1.
##  9 03            1.
## 10 03            1.
## # ... with 294 more rows
# reclassify households into low, med, high income based on HHFAMINC column data dictionary: http://nhts.ornl.gov/tables09/CodebookPage.aspx?id=949 with brackets [0, 30000, 6000]
dd <- dd %>% mutate(income_cat=case_when(
  HHFAMINC %in% c("01", "02", "03", "04", "05", "06") ~ "low income",
  HHFAMINC %in% c("07", "08", "09", "10", "11", "12") ~ "med income",
  HHFAMINC %in% c("13", "14", "15", "16", "17", "18") ~ "high income",
  TRUE ~ as.character(NA) # retain missing values as NA
  ))

# verify recodeing results with group_by & tally
dd %>% group_by(HHFAMINC, income_cat) %>% 
  tally()
## # A tibble: 13 x 3
## # Groups:   HHFAMINC [?]
##    HHFAMINC income_cat      n
##    <chr>    <chr>       <int>
##  1 01       low income      4
##  2 02       low income      2
##  3 03       low income     12
##  4 04       low income      2
##  5 06       low income     18
##  6 07       med income      6
##  7 08       med income     10
##  8 12       med income      7
##  9 14       high income    20
## 10 16       high income    38
## 11 17       high income    64
## 12 18       high income   115
## 13 -7       <NA>            6