1 Learning Objectives
2 Open and prepare data in R
3 Review of Basic Techniques for Analyzing Variables by Type
4 Measures of Association
5 Regression
- 5.1 Beautify model estimation output
- 5.2 Beta Weights

1 Learning Objectives

Review basics for analyzing data with R
Review basics for determining which analytical technique to use with which variable types
Learn commands for lambda, Gamma, and Pearson’s r measures of association
Learn to do regression analysis in R
Interpret regression outputs
Standardized regression coefficients (beta weights)
Download (to your RStudio project directory) and source the recode_vars.R script to recode your eb_pums dataset

2 Open and prepare data in R

Load the eb_pums data and recode as we’ve done before in prior labs (if feeling adventurous try to use the source() function to run an independent recode script).

#remember to set your working director
eb_pums <- readRDS("eb_pums_2000.rds")

source("recode_vars.R")

3 Review of Basic Techniques for Analyzing Variables by Type

Some basic procedures for analyzing variables by type

Find basic info about dataset
- use the summary() function in order to get a set of basic descriptive statistics
- use the str() function to see the underlying structure of your dataset including names of variables, data type etc
Look at data case by case
- we could simply type eb_pums into our console and have R print out the entire dataset but this is ridiculous for larger datasets. So we will pick out particular cases.
Subset a data set
- We can limit our the number of cases viewed by column
  
  eb_pums %>% select(age, sex, racehisp)
- We can look at the first and last few rows of our data by using the head() and tail() functions
- We could also look at our data using conditional statements
  
  eb_pums %>% filter(sex == 2) %>% select(age, racehisp)
  
  pums.sub <- eb_pums %>% filter(racehisp %in% c("Hispanic", "Black", "Asian"))

3.1 Analyze categorical (factor) variables

Frequency distributions and bar graphs are two general categorical analysis/visualization approaches
Contingency tables: the xtabs or table functions are commands used to analyze the relationship between two categorical variables.
If we want to look at the columns percentages we would use the prop.table function
Quick question: what is the appropriate significance test when comparing categorical variables?

3.2 Analyze numeric variables

use summary to get basic descriptive stats
histograms are a major tool in looking at the distribution of numeric variables
Association (using the cov function) and scatterplots are our primary tools for examining the relationship between two or more numeric variables

3.3 Combine categorical (factor) and numeric variables

when examining categorical and continuous variables we have to group by our categorical variables. We have multiple ways to perform group operations ranging from tapply functions to the doBy package. We’ve also covered the dplyr group_by function: pums %>% group_by(racehisp) %>% summarize(avginc = mean(increc),medinc = median(increc))

3.4 Ordinal variables

Ordinal variables fall in-between categorical and numeric variables. This particularly matters for how we treat their values in terms of analysis and visualization.

For example, we have a survey response with a five point Likert scale (1-5). How should we examine this? Since the variable has only 5 values, the xtabs command may be appropriate as you can get frequency counts and percentages using the prop.table function. But we could also use the summary command if we were interested in say the mean responses.

4 Measures of Association

4.1 Racial/Ethnic identity and mode of work

First, what are the types of variables we’re discussing here? What sort of analysis would be appropriate?

Examine the data to prepare for the test
- Make a table for the race/ethnicity variable racehisp and for the mode-to-work variable mode
- Which variable should be considered the independent variable? That variable should be the row variable for our contingency table.

xtabs(~mode+racehisp, data = eb_pums, drop.unused.levels= T) + Save xtabs results to a new object

mode.race <- xtabs(~mode+racehisp, data = eb_pums, drop.unused.levels= T) + Now find the percentages using prop.table

Based on these results what is the association between race and mode-to-work?
Now test the significance and strength of the relationship between racehisp and mode-to-work. [for this use a chi-square test]

chisq.test(mode.race)

Is the relationship statisitcally significant?
How strong is the relationship? Which test statistic tells you?

chisq.test(mode.race)

## 
##  Pearson's Chi-squared test
## 
## data:  mode.race
## X-squared = 502.52, df = 15, p-value < 2.2e-16

Calculate lambda to gauge the strength of the relationship
- Install and load the Visualizing Categorical Data Package, then enter assocstats(mode.race)

install.packages("vcd", repos='http://cran.us.r-project.org' )
install.packages("vcdExtra", repos = 'http://cran.us.r-project.org' )
library(vcd)
library(vcdExtra)

assocstats(mode.race)

##                     X^2 df P(> X^2)
## Likelihood Ratio 468.48 15        0
## Pearson          502.52 15        0
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.095 
## Cramer's V        : 0.055

+ What do these results indicate about the strength of the relationship?

4.2 Age of Housing and Household Income of Residents

Recode household income into 3 categories:

eb_pums$hinc4 <- cut(eb_pums$increc, breaks=4, labels=c("1st Quantile",
"2nd Quantile", "3rd Quantile", "4th Quantile"))

Next, lets assess the relationship between the ago of housing (by decade) and household income of the residents (by quintiles).

What types of variables are these– age of housing (by decade) and household income of residents (by quintiles)? What sort of analysis is appropriate?
Examine/prepare data for your test. Create a contingency table for housing age and household income and calculate their percentages. Then calculate the gamma.

builtyr.hinc <- xtabs(~eb_pums$builtyr2 + eb_pums$hinc4)

GKgamma(builtyr.hinc)

## gamma        : -0.174 
## std. error   : 0.01 
## CI           : -0.193 -0.155

4.3 Household income and commuting time

Now, consider the association between household income and commuting time. What are these two variables? What is an appropriate measure of association?

Create a scatterplot for household income (increc) and commuting time (trantime)
Caluclate the association between the two variables

cor(eb_pums %>% select(increc, trantime), use="complete", method="pearson")

##              increc   trantime
## increc   1.00000000 0.08849303
## trantime 0.08849303 1.00000000

5 Regression

Research Questions: For this exercise we shall investigate the factors influencing commuting time (trantime) in the eb_pums dataset. Our dependent (response) variable is commuting time (trantime), while our independent (predictor, explanatory) variables are: increc and racehisp.
Explore and Plot the Data
1. Let’s create scatterplots for trantime v. increc

#install.packages("ggplot2", repos = "http://cran.r-project.org")
library(ggplot2)

#increc vs trantime
ggplot(eb_pums, aes(x = increc, y = trantime)) + geom_point()

2. Now, let's create boxplots by groups for trantime and racehisp

#racehisp v trantime
ggplot(eb_pums, aes(x = racehisp, y = trantime)) + geom_boxplot()

Building out our model
Start with a univariate regression between trantime and income:

#We will use the lm() function for our linear modeling

RegMod1 <- lm(trantime ~ increc, data = eb_pums)
summary(RegMod1)

## 
## Call:
## lm(formula = trantime ~ increc, data = eb_pums)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.733 -13.654 -12.418   7.235  86.905 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.209e+01  9.516e-02  126.99   <2e-16 ***
## increc      2.414e-05  8.043e-07   30.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.53 on 114174 degrees of freedom
##   (3268 observations deleted due to missingness)
## Multiple R-squared:  0.007831,   Adjusted R-squared:  0.007822 
## F-statistic: 901.2 on 1 and 114174 DF,  p-value: < 2.2e-16

What do our results tell us? Now run this again using the racehisp variable?

Now let’s try our model out using the categorical variables railtype and region

RegMod2 <- lm(trantime ~ racehisp, data = eb_pums)
summary(RegMod2)

## 
## Call:
## lm(formula = trantime ~ racehisp, data = eb_pums)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.549 -14.549 -11.701   7.386  87.299 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       14.1971     0.7724  18.381  < 2e-16 ***
## racehispAsian      0.2314     0.7868   0.294  0.76866    
## racehispBlack     -1.5832     0.7934  -1.995  0.04599 *  
## racehispHispanic  -2.4964     0.7965  -3.134  0.00172 ** 
## racehispOther     -2.0892     0.8294  -2.519  0.01177 *  
## racehispWhite      0.3518     0.7769   0.453  0.65070    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.45 on 117438 degrees of freedom
## Multiple R-squared:  0.002425,   Adjusted R-squared:  0.002382 
## F-statistic: 57.09 on 5 and 117438 DF,  p-value: < 2.2e-16

Recall that this is a so-called “fixed effects” dummy variable. How should we interpret the coefficient for our categorical variable?

You can try to estimate a parsimonious model with the highest adjusted \(R^2\) by adding independent variables to your model

finalmod <- lm(trantime ~ increc + racehisp, data = eb_pums)
summary(finalmod)

## 
## Call:
## lm(formula = trantime ~ increc + racehisp, data = eb_pums)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.063 -13.837 -11.751   7.582  88.560 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.278e+01  7.910e-01  16.152  < 2e-16 ***
## increc            2.280e-05  8.187e-07  27.844  < 2e-16 ***
## racehispAsian    -9.446e-02  8.036e-01  -0.118  0.90643    
## racehispBlack    -9.502e-01  8.107e-01  -1.172  0.24121    
## racehispHispanic -2.376e+00  8.132e-01  -2.922  0.00348 ** 
## racehispOther    -2.405e+00  8.466e-01  -2.841  0.00449 ** 
## racehispWhite    -1.692e-01  7.938e-01  -0.213  0.83125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.52 on 114169 degrees of freedom
##   (3268 observations deleted due to missingness)
## Multiple R-squared:  0.00916,    Adjusted R-squared:  0.009108 
## F-statistic: 175.9 on 6 and 114169 DF,  p-value: < 2.2e-16

See if you can get an interaction effect between a dummy variable and a continuous variable (or another dummy variable).

Model Diagnostics - use the plot() function on your model and look at the diagnostic plots. What do they tell us about our model?
Finally, conduct a “reasonableness” test against initial theories/postulates and derive the policy implications.
Estimating regressions can answer many questions for us, such as:
1. What is the direction of the relationship between our dependent variable and our independent variables.
2. Are the partial slopes statistically significant? Check the t-scores and probability for each coefficient to find out
3. You can get a confidence interval for each slope by using: confint(finalmod)
4. Finally, we can determine how well are independent variables predict the values of our dependent variables. What does our \(R^2\) tell us? Is it statistically significant? Check the F statistics and its probability

5.1 Beautify model estimation output

You can use R package such as stargazer and texreg to beautify your model estimation results

install.packages("texreg", repos = "http://cran.r-project.org")
library(texreg)
screenreg(finalmod)

=============================== Model 1
——————————- (Intercept) 12.78 (0.79)
increc 0.00 (0.00)
racehispAsian -0.09
(0.80)
racehispBlack -0.95
(0.81)
racehispHispanic -2.38 ** (0.81)
racehispOther -2.41 ** (0.85)
racehispWhite -0.17
(0.79)
——————————- R^2 0.01
Adj. R^2 0.01
Num. obs. 114176
RMSE 21.52
=============================== *** p < 0.001, ** p < 0.01, * p < 0.05

# or
htmlreg(finalmod)

Statistical models
	Model 1
(Intercept)	12.78^***
	(0.79)
increc	0.00^***
	(0.00)
racehispAsian	-0.09
	(0.80)
racehispBlack	-0.95
	(0.81)
racehispHispanic	-2.38^**
	(0.81)
racehispOther	-2.41^**
	(0.85)
racehispWhite	-0.17
	(0.79)
R²	0.01
Adj. R²	0.01
Num. obs.	114176
RMSE	21.52
p < 0.001, p < 0.01, p < 0.05

5.2 Beta Weights

You may wonder about the relative influence of each partial slope. To determine this, you need to convert the output to a standardized regression equation, with standardized partial slopes (aka beta weights). To do this we will use the lm.beta package.

install.packages("lm.beta", repos = "http://cran.r-project.org")
library(lm.beta)

summary(lmbetamod <- lm.beta(finalmod))

## 
## Call:
## lm(formula = trantime ~ increc + racehisp, data = eb_pums)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.063 -13.837 -11.751   7.582  88.560 
## 
## Coefficients:
##                    Estimate Standardized Std. Error t value Pr(>|t|)    
## (Intercept)       1.278e+01    0.000e+00  7.910e-01  16.152  < 2e-16 ***
## increc            2.280e-05    8.355e-02  8.187e-07  27.844  < 2e-16 ***
## racehispAsian    -9.446e-02   -1.657e-03  8.036e-01  -0.118  0.90643    
## racehispBlack    -9.502e-01   -1.409e-02  8.107e-01  -1.172  0.24121    
## racehispHispanic -2.376e+00   -3.358e-02  8.132e-01  -2.922  0.00348 ** 
## racehispOther    -2.405e+00   -2.258e-02  8.466e-01  -2.841  0.00449 ** 
## racehispWhite    -1.692e-01   -3.889e-03  7.938e-01  -0.213  0.83125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.52 on 114169 degrees of freedom
##   (3268 observations deleted due to missingness)
## Multiple R-squared:  0.00916,    Adjusted R-squared:  0.009108 
## F-statistic: 175.9 on 6 and 114169 DF,  p-value: < 2.2e-16

install.packages("stargazer", repos = "http://cran.r-project.org")
library(stargazer)
stargazer(finalmod, lmbetamod, column.labels=c("base", "standardized beta"), 
          coef=list(finalmod$coefficients, lmbetamod$standardized.coefficients), type="html")


	Dependent variable:

	trantime
	base	standardized beta
	(1)	(2)

increc	0.00002^***	0.084^***
	(0.00000)	(0.00000)

racehispAsian	-0.094	-0.002
	(0.804)	(0.804)

racehispBlack	-0.950	-0.014
	(0.811)	(0.811)

racehispHispanic	-2.376^***	-0.034
	(0.813)	(0.813)

racehispOther	-2.405^***	-0.023
	(0.847)	(0.847)

racehispWhite	-0.169	-0.004
	(0.794)	(0.794)

Constant	12.776^***	0.000
	(0.791)	(0.791)


Observations	114,176	114,176
R²	0.009	0.009
Adjusted R²	0.009	0.009
Residual Std. Error (df = 114169)	21.518	21.518
F Statistic (df = 6; 114169)	175.908^***	175.908^***

Note:	p<0.1; p<0.05; p<0.01

Alternatively, if all your variables are numeric, you can first scale the variables and the re-run the regression.

Assuming a causal relationship, which independent variable exerts the most influence on trantime?

USP 634 Lab 9 - Association and Linear Regression

Jamaal Green and Liming Wang