dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
mutate()
adds new variables that are functions of existing variablesselect()
picks variables based on their names.filter()
picks cases based on their values.summarise()
reduces multiple values down to a single summary.arrange()
changes the ordering of the rows.These all combine naturally with group_by()
which allows you to perform any operation “by group”. You can learn more about them in vignette("dplyr")
. As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette("two-table")
.
dplyr is designed to abstract over how the data is stored. That means as well as working with local data frames, you can also work with remote database tables, using exactly the same R code. Install the dbplyr package then read vignette("databases", package = "dbplyr")
.
recode
and recode_factor
: Replace numeric values based on their position, and character values by their name;if_else
: Replace values based on a logical vector;case_when
: Vectorise multiple if and else if statements.Recoding, when to use which function: - one-to-one, many-to-one: recode
and recode_factor
Download the NHTS 2009 data file for the demos here (Right click & select Save As… to the directory of your RStudio project).
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4.9000
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::vars() masks ggplot2::vars()
# load NHTS2009 travel diaries subset
dd <- read_csv("NHTS2009_dd.csv")
## Parsed with column specification:
## cols(
## HOUSEID = col_integer(),
## PERSONID = col_character(),
## HHSIZE = col_integer(),
## HH_RACE = col_character(),
## HHFAMINC = col_character(),
## URBRUR = col_character(),
## TRIPPURP = col_character(),
## TRPTRANS = col_character(),
## TRVLMIN = col_integer(),
## TRPMILES = col_double()
## )
# recode race (HH_RACE column) according to data dictionary: http://nhts.ornl.gov/tables09/CodebookPage.aspx?id=951
dd %>% mutate(hh_race_str=recode(HH_RACE,
"01"="White",
"02"="African American, Black",
"03"="Asian Only",
"04"="American Indian, Alaskan Native",
"05"="Native Hawaiian, other Pacific",
"06"="Multiracial",
"07"="Hispanic/Mexican",
"97"="Other specify",
.default = as.character(NA) # any unspecified values would be assgined NA
)) %>%
select(HH_RACE, hh_race_str)
## # A tibble: 304 x 2
## HH_RACE hh_race_str
## <chr> <chr>
## 1 01 White
## 2 01 White
## 3 01 White
## 4 01 White
## 5 01 White
## 6 01 White
## 7 01 White
## 8 01 White
## 9 01 White
## 10 01 White
## # ... with 294 more rows
if_else
# code driving & non-driving based on travel modes (TRPTRANS column) data dictionary: http://nhts.ornl.gov/tables09/CodebookPage.aspx?id=1084
dd %>% mutate(driving=ifelse(TRPTRANS %in% c("01", "02", "03", "04", "05", "06", "07"), 1, 0),
driving=ifelse(TRPTRANS %in% c("-1", "-7", "-8", "-9"), NA, driving) # retain missing values as NA
) %>%
select(TRPTRANS, driving)
## # A tibble: 304 x 2
## TRPTRANS driving
## <chr> <dbl>
## 1 03 1.
## 2 03 1.
## 3 03 1.
## 4 03 1.
## 5 03 1.
## 6 03 1.
## 7 03 1.
## 8 03 1.
## 9 03 1.
## 10 03 1.
## # ... with 294 more rows
case_when
# code driving & non-driving based on travel modes (TRPTRANS column) data dictionary: http://nhts.ornl.gov/tables09/CodebookPage.aspx?id=1084 use case_when
dd %>% mutate(driving=case_when(
TRPTRANS %in% c("01", "02", "03", "04", "05", "06", "07") ~ 1,
TRPTRANS %in% c("-1", "-7", "-8", "-9") ~ as.double(NA), # retain missing values as NA
TRUE ~ 0)) %>%
select(TRPTRANS, driving)
## # A tibble: 304 x 2
## TRPTRANS driving
## <chr> <dbl>
## 1 03 1.
## 2 03 1.
## 3 03 1.
## 4 03 1.
## 5 03 1.
## 6 03 1.
## 7 03 1.
## 8 03 1.
## 9 03 1.
## 10 03 1.
## # ... with 294 more rows
# reclassify households into low, med, high income based on HHFAMINC column data dictionary: http://nhts.ornl.gov/tables09/CodebookPage.aspx?id=949 with brackets [0, 30000, 6000]
dd <- dd %>% mutate(income_cat=case_when(
HHFAMINC %in% c("01", "02", "03", "04", "05", "06") ~ "low income",
HHFAMINC %in% c("07", "08", "09", "10", "11", "12") ~ "med income",
HHFAMINC %in% c("13", "14", "15", "16", "17", "18") ~ "high income",
TRUE ~ as.character(NA) # retain missing values as NA
))
# verify recodeing results with group_by & tally
dd %>% group_by(HHFAMINC, income_cat) %>%
tally()
## # A tibble: 13 x 3
## # Groups: HHFAMINC [?]
## HHFAMINC income_cat n
## <chr> <chr> <int>
## 1 01 low income 4
## 2 02 low income 2
## 3 03 low income 12
## 4 04 low income 2
## 5 06 low income 18
## 6 07 med income 6
## 7 08 med income 10
## 8 12 med income 7
## 9 14 high income 20
## 10 16 high income 38
## 11 17 high income 64
## 12 18 high income 115
## 13 -7 <NA> 6