First, either create a new folder for your working directory or use a directory you’ve already made. Download the eb_pums dataset for the lab.
Data File
Load the eb_pums dataset using the readRDS() function.
eb_pums <- readRDS("eb_pums_2000.rds")
We will recode some of the values in pums dataset so that we can run our tests later on.
library(dplyr)
#Note: People of any race can also be Hispanic (Hispanic is not a race category),
#so when you recode Hispanics as ???4???, they will no longer be coded in the ???White,??? ???Black,??? or ???Asian??? categories.
eb_pums <- eb_pums %>% mutate(
race=as.character(race),
hispan=as.character(hispan),
racehisp=case_when(race=="Chinese" ~ "Asian",
race=="Japanese" ~ "Asian",
race=="Other Asian or Pacific" ~ "Asian",
race=="White" ~ "White",
race=="Black" ~ "Black",
race=="American Indian or Alaskan" ~ "American Indian or Alaskan",
hispan != "Not Hispanic" ~ "Hispanic",
TRUE~"Other"))
# recode modes (https://usa.ipums.org/usa-action/variables/TRANWORK#codes_section)
eb_pums <- eb_pums %>%
mutate(
tranwork = as.integer(tranwork),
mode = case_when(
tranwork %in% c(10, 20) ~ "driving",
tranwork %in% c(31, 32, 33, 34, 35, 36) ~ "transit",
tranwork %in% c(40, 50) ~ "bike/walk",
tranwork %in% c(60, 70) ~ "other",
TRUE ~ as.character(NA)
)
)
Crosstabulate means of travel to work by race/ethnicity
Cross tabs are one of the most basic and commonly used means to analyze categorical variables. We will crosstab two categorical variables and produce an analytical table using percentages. 1. First let’s simply get a table of mode. We will use the table() function.
table(eb_pums$mode)
##
## bike/walk driving other transit
## 1993 44456 2647 5512
Quick question: What does NA mean in our data? Quicker answer: The “universe” for this variable is all workers, and NA refers to all non-workers.
Now we shall add our second categorical variable and finish our crosstabulation. We will crosstabe “mode” with “racehisp” (your recoded Race/Hispanic variable). We shall make the race/ethnicity categories our column headers and travel model our row labels. The general mode for getting a crosstabs is of the form:
xtabs(~var1 + var2, data = df)
Limit our crosstabs to workers only
We can use the subset() function to isolate particular values of our variables. In this case, we are only interested in workers. (we will save the table object as xtabs.TR)
xtabs.TR <- xtabs(~ racehisp + mode, data = eb_pums,
subset = !is.na(mode), drop.unused.levels = TRUE)
xtabs.TR
## mode
## racehisp bike/walk driving other transit
## American Indian or Alaskan 17 279 16 33
## Asian 333 7686 341 1019
## Black 142 4227 198 935
## Hispanic 212 3812 145 468
## Other 91 1593 100 222
## White 1198 26859 1847 2835
Add column percentages
We have a nice proper crosstab but frequency counts are often not what we want. We are often interested in the different proportions of our categorical variables. So, to answer this we will calculate the percentages of worker mode of transit by race/ethnicity.
prop.table(xtabs(~ racehisp + mode, data = eb_pums,
subset = !is.na(mode), drop.unused.levels = TRUE), 1)
## mode
## racehisp bike/walk driving other transit
## American Indian or Alaskan 0.04927536 0.80869565 0.04637681 0.09565217
## Asian 0.03550485 0.81949035 0.03635782 0.10864698
## Black 0.02580880 0.76826609 0.03598691 0.16993820
## Hispanic 0.04571922 0.82208324 0.03127022 0.10092732
## Other 0.04536391 0.79411765 0.04985045 0.11066800
## White 0.03659244 0.82039769 0.05641590 0.08659397
#Or if you saved the xtabs results in our object this also works
prop.table(xtabs.TR, 1)
## mode
## racehisp bike/walk driving other transit
## American Indian or Alaskan 0.04927536 0.80869565 0.04637681 0.09565217
## Asian 0.03550485 0.81949035 0.03635782 0.10864698
## Black 0.02580880 0.76826609 0.03598691 0.16993820
## Hispanic 0.04571922 0.82208324 0.03127022 0.10092732
## Other 0.04536391 0.79411765 0.04985045 0.11066800
## White 0.03659244 0.82039769 0.05641590 0.08659397
A crosstab with column percentages can be quite useful for our analyses. When you present this sort of table in a report be sure to include total “N” or frequency counts for each cell.
Question: Now that we have some results, which racial groups are most likely to use a car, truck, or van to get to work?
We will use the chi-square test in order to determine whether the difference between two groups is significant. We can use two approaches for getting a chi-square test for two groups using either the summary() function or the chisq.tes() function.
#using our xtabs.TR object
summary(xtabs.TR)
## Call: xtabs(formula = ~racehisp + mode, data = eb_pums, subset = !is.na(mode),
## drop.unused.levels = TRUE)
## Number of cases in table: 54608
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 502.5, df = 15, p-value = 1.656e-97
chisq.test(xtabs(~racehisp + mode, data = eb_pums, subset = !is.na(mode), drop.unused.levels = TRUE))
##
## Pearson's Chi-squared test
##
## data: xtabs(~racehisp + mode, data = eb_pums, subset = !is.na(mode), drop.unused.levels = TRUE)
## X-squared = 502.52, df = 15, p-value < 2.2e-16
chisq.test(xtabs.TR)
##
## Pearson's Chi-squared test
##
## data: xtabs.TR
## X-squared = 502.52, df = 15, p-value < 2.2e-16
# OR directly use the original variables
chisq.test(eb_pums$racehis, eb_pums$mode)
##
## Pearson's Chi-squared test
##
## data: eb_pums$racehis and eb_pums$mode
## X-squared = 502.52, df = 15, p-value < 2.2e-16
Are the differences in modes to work between racial and ethnic groups statistically significant?