First, either create a new folder for your working directory or use a directory you’ve already made. Download the eb_pums dataset for the lab.

Data File

eb_pums_2000.RDS

Load the eb_pums dataset using the readRDS() function.

eb_pums <- readRDS("eb_pums_2000.rds")

1 Recoding Variables (Review)

We will recode some of the values in pums dataset so that we can run our tests later on.

library(dplyr)

#Note: People of any race can also be Hispanic (Hispanic is not a race category), 
#so when you recode Hispanics as ???4???, they will no longer be coded in the ???White,??? ???Black,??? or ???Asian??? categories. 
eb_pums <- eb_pums %>% mutate(
  race=as.character(race), 
  hispan=as.character(hispan),
  racehisp=case_when(race=="Chinese" ~ "Asian", 
                     race=="Japanese" ~ "Asian", 
                     race=="Other Asian or Pacific" ~ "Asian",
                     race=="White" ~ "White", 
                     race=="Black" ~ "Black", 
                     race=="American Indian or Alaskan" ~ "American Indian or Alaskan", 
                     hispan != "Not Hispanic" ~ "Hispanic",
                     TRUE~"Other"))

# recode modes (https://usa.ipums.org/usa-action/variables/TRANWORK#codes_section)
eb_pums <- eb_pums %>%
  mutate(
    tranwork = as.integer(tranwork),
    mode = case_when(
      tranwork %in% c(10, 20) ~ "driving",
      tranwork %in% c(31, 32, 33, 34, 35, 36) ~ "transit",
      tranwork %in% c(40, 50) ~ "bike/walk",
      tranwork %in% c(60, 70) ~ "other",
      TRUE ~ as.character(NA)
    )
  )

2 Run Cross-tablulations

Crosstabulate means of travel to work by race/ethnicity

Cross tabs are one of the most basic and commonly used means to analyze categorical variables. We will crosstab two categorical variables and produce an analytical table using percentages. 1. First let’s simply get a table of mode. We will use the table() function.

 table(eb_pums$mode)

## 
## bike/walk   driving     other   transit 
##      1993     44456      2647      5512

Quick question: What does NA mean in our data? Quicker answer: The “universe” for this variable is all workers, and NA refers to all non-workers.

Now we shall add our second categorical variable and finish our crosstabulation. We will crosstabe “mode” with “racehisp” (your recoded Race/Hispanic variable). We shall make the race/ethnicity categories our column headers and travel model our row labels. The general mode for getting a crosstabs is of the form:

xtabs(~var1 + var2, data = df)

Limit our crosstabs to workers only

We can use the subset() function to isolate particular values of our variables. In this case, we are only interested in workers. (we will save the table object as xtabs.TR)

xtabs.TR <- xtabs(~ racehisp + mode, data = eb_pums, 
                  subset = !is.na(mode), drop.unused.levels = TRUE)

xtabs.TR

##                             mode
## racehisp                     bike/walk driving other transit
##   American Indian or Alaskan        17     279    16      33
##   Asian                            333    7686   341    1019
##   Black                            142    4227   198     935
##   Hispanic                         212    3812   145     468
##   Other                             91    1593   100     222
##   White                           1198   26859  1847    2835

Add column percentages

We have a nice proper crosstab but frequency counts are often not what we want. We are often interested in the different proportions of our categorical variables. So, to answer this we will calculate the percentages of worker mode of transit by race/ethnicity.

prop.table(xtabs(~ racehisp + mode, data = eb_pums, 
                 subset = !is.na(mode), drop.unused.levels = TRUE), 1)

##                             mode
## racehisp                      bike/walk    driving      other    transit
##   American Indian or Alaskan 0.04927536 0.80869565 0.04637681 0.09565217
##   Asian                      0.03550485 0.81949035 0.03635782 0.10864698
##   Black                      0.02580880 0.76826609 0.03598691 0.16993820
##   Hispanic                   0.04571922 0.82208324 0.03127022 0.10092732
##   Other                      0.04536391 0.79411765 0.04985045 0.11066800
##   White                      0.03659244 0.82039769 0.05641590 0.08659397

#Or if you saved the xtabs results in our object this also works
prop.table(xtabs.TR, 1)

##                             mode
## racehisp                      bike/walk    driving      other    transit
##   American Indian or Alaskan 0.04927536 0.80869565 0.04637681 0.09565217
##   Asian                      0.03550485 0.81949035 0.03635782 0.10864698
##   Black                      0.02580880 0.76826609 0.03598691 0.16993820
##   Hispanic                   0.04571922 0.82208324 0.03127022 0.10092732
##   Other                      0.04536391 0.79411765 0.04985045 0.11066800
##   White                      0.03659244 0.82039769 0.05641590 0.08659397

A crosstab with column percentages can be quite useful for our analyses. When you present this sort of table in a report be sure to include total “N” or frequency counts for each cell.

Question: Now that we have some results, which racial groups are most likely to use a car, truck, or van to get to work?

3 Run the Chi-Square

We will use the chi-square test in order to determine whether the difference between two groups is significant. We can use two approaches for getting a chi-square test for two groups using either the summary() function or the chisq.tes() function.

#using our xtabs.TR object

summary(xtabs.TR)

## Call: xtabs(formula = ~racehisp + mode, data = eb_pums, subset = !is.na(mode), 
##     drop.unused.levels = TRUE)
## Number of cases in table: 54608 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 502.5, df = 15, p-value = 1.656e-97

chisq.test(xtabs(~racehisp + mode, data = eb_pums, subset = !is.na(mode), drop.unused.levels = TRUE))

## 
##  Pearson's Chi-squared test
## 
## data:  xtabs(~racehisp + mode, data = eb_pums, subset = !is.na(mode),     drop.unused.levels = TRUE)
## X-squared = 502.52, df = 15, p-value < 2.2e-16

chisq.test(xtabs.TR)

## 
##  Pearson's Chi-squared test
## 
## data:  xtabs.TR
## X-squared = 502.52, df = 15, p-value < 2.2e-16

# OR directly use the original variables
chisq.test(eb_pums$racehis, eb_pums$mode)

## 
##  Pearson's Chi-squared test
## 
## data:  eb_pums$racehis and eb_pums$mode
## X-squared = 502.52, df = 15, p-value < 2.2e-16

Are the differences in modes to work between racial and ethnic groups statistically significant?

USP 634 Lab 7: Working with Categorical Data, Crosstabulation and Chi-Square Test

Jamaal Green and Liming Wang

1 Recoding Variables (Review)

2 Run Cross-tablulations

3 Run the Chi-Square