Data File - californiatod.csv
#install.packages("readr")
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
californiatod <- read_csv("californiatod.csv")
## Parsed with column specification:
## cols(
## name = col_character(),
## region = col_character(),
## transit = col_double(),
## density = col_double(),
## houseval = col_double(),
## railtype = col_character()
## )
We will calculate z-scores using the z-score formula. To do so we will use R much as we would use a spreadsheet, creating new variables as we go and then applying the formula to the variables we have created along the way.
The z-score for each individual observation of a variable is computed the same way: the value for that observation minus a constant (the variable's mean), divided by another constant (the variable's standard deviation).
This is the z-score formula we just described above:
\[z = \frac{x - \bar{x}}{s}\]
For our variable "density", which lists the density for each TOD site, we can create a new variable in R called "z_density" that contains the z-score of the density variable. We would use this command:
californiatod$z_density <- (californiatod$density - mean(californiatod$density))/sd(californiatod$density)
californiatod %>% print()
## # A tibble: 26 x 7
## name region transit density houseval railtype z_density
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Archstone Barr… Bay Ar… 0.400 3.36 192593. Heavy or co… -0.566
## 2 Archstone Miss… SD 0.108 2.18 203055. Light rail -0.901
## 3 Atherton Place Bay Ar… 0.500 5.21 217672. Heavy or co… -0.0425
## 4 Bellamar Apart… LA 0.318 10.1 117256. Light rail 1.33
## 5 Coggins Square Bay Ar… 0.528 4.68 270555. Heavy or co… -0.193
## 6 Del Norte Place Bay Ar… 0.643 4.91 229336. Heavy or co… -0.127
## 7 Iron Horse Lof… Bay Ar… 0.606 4.68 270207. Heavy or co… -0.193
## 8 Mercado Apartm… SD 0. 3.21 121786. Light rail -0.609
## 9 Mission Wells Bay Ar… 0.147 3.96 323497. Heavy or co… -0.396
## 10 Northpark Apar… Bay Ar… 0.174 3.27 667661. Heavy or co… -0.592
## # ... with 16 more rows
Remember that mean(Dataset$varname) and sd(Dataset$varname) give you the mean and standard deviation of a variable (varname).
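As a quick sanity check (an extra step, not required by the lab), a standardized variable should have a mean of approximately 0 and a standard deviation of 1:
# verify the new z-score variable: mean should be ~0, sd should be ~1
mean(californiatod$z_density)
sd(californiatod$z_density)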
To get a probability under a certain (e.g. normal) distribution, we use the distribution function with the prefix p. For example, for a normal distribution, we use pnorm(x, mean, sd), which gives us the probability of a value up to x (the area to the left of x).
# shaded area on the left
pnorm(-2)
## [1] 0.02275013
# shaded area on the right
(1 - pnorm(-1))
## [1] 0.8413447
# put together
pnorm(-2) + (1 - pnorm(-1))
## [1] 0.8640949
## equivalent to
pnorm(50, mean=100, sd=25) + (1 - pnorm(75, mean=100, sd=25))
## [1] 0.8640949
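If we were willing to assume that a variable such as density is roughly normal (a strong assumption with only 26 sites, so treat this purely as a sketch), the same idea applies using its sample mean and standard deviation:
# hypothetical sketch: probability of a density value below 3, assuming density is roughly normal
pnorm(3, mean=mean(californiatod$density), sd=sd(californiatod$density))
# equivalently, convert 3 to a z-score first and use the standard normal
pnorm((3 - mean(californiatod$density))/sd(californiatod$density))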
qnorm function
To get the cutoff corresponding to a probability under a certain (e.g. normal) distribution, we use the distribution function with the prefix q (for the quantile function). For example, for a normal distribution, we use qnorm(p, mean, sd), which gives us the cutoff value below which the probability is p.
## what is the cutoff for p=0.03 in a standard normal distribution (mean=0, sd=1)
qnorm(0.03)
## [1] -1.880794
## what is the cutoff for p=0.03 in a normal distribution with mean=100, sd=25
qnorm(0.03, mean=100, sd=25)
## [1] 52.98016
## what is the cutoff with p=0.97 to the left (i.e., 0.03 to the right) in a standard normal distribution
qnorm(0.97)
## [1] 1.880794
# OR
qnorm(0.03, lower.tail=FALSE)
## [1] 1.880794
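qnorm() also accepts a vector of probabilities, which is handy for two-sided cutoffs. A small sketch (not part of the original lab):
# cutoffs that enclose the middle 95% of a standard normal
qnorm(c(0.025, 0.975))
# the same cutoffs for a normal distribution with mean=100, sd=25
qnorm(c(0.025, 0.975), mean=100, sd=25)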
This lab introduces the beginning of our R workflow: the basics of importing data, manipulating variables, and saving our results in various formats for later use.
Data File
Packages
Load .RDS file
.RDS files save individual objects from one's global environment at the time of saving. We do not need to cover all the intricacies of differing environments (a discussion of the advantages of .RDS files versus the slightly more common .RDA files can be found here), but know that .RDS is a data format built for R: you do not need any additional packages to load such files, and the syntax is straightforward. Because they are built for R, they also load much faster than other forms of tabular data, since R does not have to work out differing column types or other related issues.
Note: I also loaded the pander package; it is used for display purposes in this lab.
#remember to begin by opening your RStudio project and save the data file to your project directory
if(!require(pacman)){install.packages("pacman"); library(pacman)}
p_load(dplyr, pander)
eb_pums <- readRDS("eb_pums_2000.rds")
#Note: you do not need to have the full file path if you have properly set up your RStudio project, you just need the filename surrounded by quotation marks
Now that we have our data loaded up, let’s move onto the basics of creating and deleting data.
R supports vectors and matrices. You can create a scalar, a vector, and a matrix with the following commands:
s <- 5 + 1
s
## [1] 6
This creates a scalar (a vector of length one) from the result of 5 + 1. Now type "s" and hit return to see the answer.
v1 <- c(2,3,4)
v1
## [1] 2 3 4
This creates a vector of 2,3,4. Now type “v1” and hit return in order to see your vector.
v2 <- 1:10
v2
## [1] 1 2 3 4 5 6 7 8 9 10
This creates a vector of all integers between 1 and 10. Now type “v2” in order to view your vector.
m <- matrix(c(1,2,3,11,12,13), nrow = 2, ncol = 3, byrow = T)
m
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 11 12 13
This creates a matrix of the above values. Now type “m” in order to see it. If you wish to learn more about the differences between matrices and data frames please see here.
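You can pull individual elements or whole rows out of a matrix with square-bracket indexing. A brief sketch using the matrix m we just created:
# element in row 2, column 3 of m
m[2, 3]
# the entire first row, and the dimensions of the matrix
m[1, ]
dim(m)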
Constructing Data Frames
Say you have collected some data with a few variables and you want to input it in R. We can use the data.frame() function:
df <- data.frame(x = c(1,2,3,4,5),
y = c(5,7,4,8,0),
sex = c(1,2,2,1,2))
df
## x y sex
## 1 1 5 1
## 2 2 7 2
## 3 3 4 2
## 4 4 8 1
## 5 5 0 2
Type "df" to print your data frame to the console. *Note: Data frames are the "traditional" rectangular data sets we are used to seeing, and they are the bread-and-butter data objects we will use in this class.*
View Column or Row of Data Frame
View a variable (column) in a data frame with the "$" operator, e.g.
df$x
## [1] 1 2 3 4 5
You can access rows of a data frame with this syntax: df[1:2, ] (this returns the first two rows of your data frame, not counting the row of column names).
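You can also combine row and column selection in one set of brackets. A short sketch using the df we created above:
# first two rows, but only the x and sex columns
df[1:2, c("x", "sex")]
# all rows where sex equals 2
df[df$sex == 2, ]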
Using Operators
In R you can use all the common mathematical and logical operators to create new variables. The most common operators are <-, +, -, *, /, ^, >, <, >=, and <=.
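As a brief illustration (using the vectors v1 and v2 from above), these operators work element-by-element on whole vectors:
# arithmetic on a vector: multiplies every element of v1 by 2
v1 * 2
# comparison on a vector: returns TRUE/FALSE for each element of v2
v2 >= 5
# arithmetic expressions can be combined with parentheses and ^
(5 + 1)^2 / 4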
Creating Variables
We have covered creating new variables using the basic [name of variable] <-. But how do we make a new variable for a data frame?
It is a straightforward process where we use the “$” operator and <- together, like so:
df$new_variable <- expression
Here "df" is our data frame and "new_variable" is the variable we want to define. However, with the dplyr package, my personal favorite is to use the dplyr syntax for data frame manipulation:
df <- df %>% mutate(new_variable=expression)
We will be using dplyr for today's lab. Every step in this process has a base R equivalent that does not require dplyr.
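For example, with the small practice data frame from above (xy_sum is just a hypothetical variable name used for illustration):
# dplyr: create a new column that is the sum of x and y
df <- df %>% mutate(xy_sum = x + y)
# base R equivalent
df$xy_sum <- df$x + df$y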
A quick look at what is in eb_pums before we go ahead:
eb_pums %>% head
## year hhwt hhtype statefip metarea metaread puma pumaty00 pumaland
## 1 2000 19 <NA> 6 736 7361 2101 32 66840144
## 2 2000 18 <NA> 6 736 7361 2101 32 66840144
## 3 2000 18 <NA> 6 736 7361 2101 32 66840144
## 4 2000 18 <NA> 6 736 7361 2101 32 66840144
## 5 2000 18 <NA> 6 736 7361 2101 32 66840144
## 6 2000 18 <NA> 6 736 7361 2101 32 66840144
## pumaarea gq ownershp mortgage commuse acreprop acrehous mortamt1
## 1 127634167 1 2 0 0 0 0 0
## 2 127634167 1 1 3 0 0 0 1200
## 3 127634167 1 1 3 0 0 0 1200
## 4 127634167 1 1 1 1 2 1 0
## 5 127634167 1 1 1 1 2 1 0
## 6 127634167 1 1 1 1 2 1 0
## proptx99 rent rentgrs condo hhincome valueh lingisol vacancy kitchen
## 1 0 650 708 0 65000 999999 1 0 4
## 2 41 0 0 1 94000 225000 1 0 4
## 3 41 0 0 1 94000 225000 1 0 4
## 4 9 0 0 1 46500 27500 1 0 4
## 5 9 0 0 1 46500 27500 1 0 4
## 6 9 0 0 1 46500 27500 1 0 4
## rooms builtyr unitsstr bedrooms vehicles nfams perwt relate age sex
## 1 3 8 6 2 1 1 21 1 29 2
## 2 5 9 5 3 3 2 18 1 56 2
## 3 5 9 5 3 3 2 14 11 57 1
## 4 3 4 1 3 3 1 24 1 41 2
## 5 3 4 1 3 3 1 23 2 42 1
## 6 3 4 1 3 3 1 37 3 19 1
## marst race bpl ancestr1 citizen yrimmig yrsusa1
## 1 6 White 6 416 0 0 0
## 2 4 White 49 82 0 0 0
## 3 4 Three or more major race 22 900 0 0 0
## 4 1 Other race 210 226 3 974 26
## 5 1 Other race 200 210 3 974 26
## 6 6 White 6 210 0 0 0
## language speakeng hispan school educ00 gradeatt schltype empstat
## 1 29 4 Not Hispanic 1 12 0 1 1
## 2 1 3 Not Hispanic 1 16 0 1 1
## 3 1 3 Not Hispanic 1 12 0 1 1
## 4 12 5 Other 1 8 0 1 3
## 5 12 6 Mexico 1 2 0 1 3
## 6 12 4 Mexico 2 8 5 2 3
## labforce occ classwkr indnaics wkswork1 uhrswork workedyr inctot
## 1 2 263 Self-employed 5414 52 40 2 65000
## 2 2 176 Works for wages 3254 52 40 2 62000
## 3 2 605 Works for wages 712 52 40 2 32000
## 4 1 364 Works for wages 6212 48 30 2 12900
## 5 1 896 Works for wages 531 48 40 2 30000
## 6 1 402 Works for wages 722Z 20 25 2 3600
## incwage incearn poverty migrate5 migplac5 movedin pwtype pwtype00
## 1 43000 65000 501 2 6 2 4 32
## 2 55000 55000 501 2 6 2 5 33
## 3 32000 32000 369 2 6 0 5 34
## 4 12900 12900 266 1 0 5 0 0
## 5 30000 30000 266 1 0 0 0 0
## 6 3600 3600 266 1 0 0 0 0
## pwpuma00 tranwork carpool riders trantime departs
## 1 2100 70 0 0 0 0
## 2 2400 10 1 1 20 802
## 3 2700 10 1 1 45 532
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
# head shows the first 6 rows of a data frame (or other R object)
eb_pums <- eb_pums %>% mutate(inctot1k=inctot/1000)
# which is equivalent to base R command
eb_pums$inctot1k <- eb_pums$inctot/1000
eb_pums <- eb_pums %>% mutate(age65plus=age>65)
# which is equivalent to base R command
eb_pums$age65plus <- eb_pums$age > 65
eb_pums <- eb_pums %>% mutate(povstat=ifelse(inctot < 12000, 1,0))
# which is equivalent to base R command
eb_pums$povstat <- ifelse(eb_pums$inctot < 12000, 1,0)
eb_pums <- eb_pums %>% mutate(povfactor=factor(povstat,
levels = c(1,0),
labels =c("below poverty line", "above poverty line")))
# which is equivalent to base R command
eb_pums$povfactor <- factor(eb_pums$povstat,
levels = c(1,0),
labels =c("below poverty line", "above poverty line"))
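A quick way to check a recode like this (an extra verification step, not required) is a one-way table of the new factor:
# tabulate the new factor; useNA="ifany" also shows any missing values
table(eb_pums$povfactor, useNA="ifany")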
#variable_to_delete is a placeholder for the name of the column you want to drop
eb_pums <- eb_pums %>% select(-variable_to_delete)
# which is equivalent to base R command
eb_pums$variable_to_delete <- NULL
You can also use the df <- NULL command to delete a data frame's contents, though the name will still be present in your environment.
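If you want the name gone from your environment as well, rm() removes the object entirely. A small sketch (removing the practice data frame df, which we no longer need):
# remove an object from the global environment
rm(df)
# confirm it is gone
exists("df")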
First, we will create a new race/ethnicity variable. Remember that Census data include two race/ethnicity variables: race and Hispanic origin. We will use these categories to create the following categories for our population: White non-Hispanic, Black non-Hispanic, Asian non-Hispanic, Hispanic, and Other non-Hispanic.
First we will make a quick two-way contingency table to take a look at our two variables.
xtabs(~race + hispan, data = eb_pums)
## hispan
## race Not Hispanic Mexico Puerto Rico Cuban Other
## White 56451 6032 261 95 2270
## Black 13632 192 52 15 87
## American Indian or Alaskan 496 216 1 0 58
## Chinese 6962 5 1 0 15
## Japanese 1046 9 2 0 7
## Other Asian or Pacific 12176 75 11 1 81
## Other race 349 7354 282 18 2430
## Two major races 4313 1157 118 16 635
## Three or more major race 370 84 9 4 56
To create the new variable, we must first understand the way the variables for race and ethnicity are organized and coded.
#one-way table frequency tabulation
table(eb_pums$race)
##
## White Black
## 65109 13978
## American Indian or Alaskan Chinese
## 771 6983
## Japanese Other Asian or Pacific
## 1064 12344
## Other race Two major races
## 10433 6239
## Three or more major race
## 523
table(eb_pums$hispan)
##
## Not Hispanic Mexico Puerto Rico Cuban Other
## 95795 15124 737 149 5639
#to view the underlying codes
table(as.integer(eb_pums$race))
##
## 1 2 3 4 5 6 7 8 9
## 65109 13978 771 6983 1064 12344 10433 6239 523
table(as.integer(eb_pums$hispan))
##
## 1 2 3 4 5
## 95795 15124 737 149 5639
Remember that factors (the name for categorical variables in R) can be ordered or unordered. As such, they have numeric values attached to them that R uses as identifiers, so "White" is coded as 1, and so on.
We will make what is called a nested conditional statement, which is a long way of saying we are going to evaluate a series of conditions in order to recode our variable. Here we use dplyr's case_when(), a cleaner alternative to a long chain of nested ifelse() statements; the first condition that evaluates to TRUE determines the new value.
Note: You can use the label, such as "White" or "Black", or the underlying integer value of the factor, such as 1 for "White" or 2 for "Black".
#we will create a new variable "racehisp" using dplyr's mutate function
eb_pums <- eb_pums %>% mutate(
race=as.character(race),
hispan=as.character(hispan),
racehisp=case_when(
race=="Chinese" ~ "Asian",
race=="Japanese" ~ "Asian",
race=="Other Asian or Pacific" ~ "Asian",
race=="White" ~ "White",
race=="Black" ~ "Black",
race=="American Indian or Alaskan" ~ "American Indian or Alaskan",
hispan != "Not Hispanic" ~ "Hispanic",
TRUE~"Other"))
#check your work
pander(table(eb_pums$racehisp, eb_pums$race))
 | American Indian or Alaskan | Black | Chinese |
---|---|---|---|
American Indian or Alaskan | 771 | 0 | 0 |
Asian | 0 | 0 | 6983 |
Black | 0 | 13978 | 0 |
Hispanic | 0 | 0 | 0 |
Other | 0 | 0 | 0 |
White | 0 | 0 | 0 |
 | Japanese | Other Asian or Pacific |
---|---|---|
American Indian or Alaskan | 0 | 0 |
Asian | 1064 | 12344 |
Black | 0 | 0 |
Hispanic | 0 | 0 |
Other | 0 | 0 |
White | 0 | 0 |
 | Other race | Three or more major race |
---|---|---|
American Indian or Alaskan | 0 | 0 |
Asian | 0 | 0 |
Black | 0 | 0 |
Hispanic | 10084 | 153 |
Other | 349 | 370 |
White | 0 | 0 |
 | Two major races | White |
---|---|---|
American Indian or Alaskan | 0 | 0 |
Asian | 0 | 0 |
Black | 0 | 0 |
Hispanic | 1926 | 0 |
Other | 4313 | 0 |
White | 0 | 65109 |
pander(xtabs(~race + hispan, data = eb_pums))
 | Cuban | Mexico | Not Hispanic | Other |
---|---|---|---|---|
American Indian or Alaskan | 0 | 216 | 496 | 58 |
Black | 15 | 192 | 13632 | 87 |
Chinese | 0 | 5 | 6962 | 15 |
Japanese | 0 | 9 | 1046 | 7 |
Other Asian or Pacific | 1 | 75 | 12176 | 81 |
Other race | 18 | 7354 | 349 | 2430 |
Three or more major race | 4 | 84 | 370 | 56 |
Two major races | 16 | 1157 | 4313 | 635 |
White | 95 | 6032 | 56451 | 2270 |
 | Puerto Rico |
---|---|
American Indian or Alaskan | 1 |
Black | 52 |
Chinese | 1 |
Japanese | 2 |
Other Asian or Pacific | 11 |
Other race | 282 |
Three or more major race | 9 |
Two major races | 118 |
White | 261 |
tab_racehisp <- table(eb_pums$racehisp)
#check work again
pander(tab_racehisp)
American Indian or Alaskan | Asian | Black | Hispanic | Other | White |
---|---|---|---|---|---|
771 | 20391 | 13978 | 12163 | 5032 | 65109 |
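If you would rather see shares than raw counts, prop.table() converts a table to proportions. A brief sketch (the rounding is just for display):
# proportions of the sample in each race/ethnicity group
pander(round(prop.table(tab_racehisp), 3))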
Recoding a continuous variable
Now we will recode a continuous variable: personal income ("inctot"). We will recode "inctot" into an ordinal categorical variable (a leveled factor variable, in R parlance) by grouping categories of income.
Look at the Data
Let's start by looking more closely at the original variable's coding and distribution by using the summary() and hist() functions.
summary(eb_pums$inctot)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -12000 10000 35000 243120 120100 999999
hist(eb_pums$inctot)
Notice that the top value is repeated (999999). That code indicates "N/A" (see the technical documentation at [www.ipums.org](http://www.ipums.org)). There are also negative values, indicating income losses.
Clean the Data
We will clean these data by eliminating outliers and problematic cases. We will do this by creating a recode of the original variable (Note: NEVER write over your original data).
#first we'll copy over our original variable to a
#new column
eb_pums$increc <- eb_pums$inctot
#Set Outliers to "NA". Values less than zero or
#999999 to NA
eb_pums <- eb_pums %>% mutate(increc=ifelse(inctot < 0 | inctot >= 999999, NA, inctot))
#OR base R equivalent
eb_pums$increc[eb_pums$inctot < 0 | eb_pums$inctot >= 999999] <- NA
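Before recoding, it is worth checking how many cases the cleaning step set to missing (an extra check, not in the original lab):
# how many observations are now NA, and what the cleaned distribution looks like
sum(is.na(eb_pums$increc))
summary(eb_pums$increc)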
We will now recode the data into our new categories, in this case, quintiles. To do so, we will use the quantile() function to find the quintile breakpoints and the cut() function to divide income into those brackets.
#We will create a new variable called "inc5" made up of 5 quintiles of our cleaned-up income variable
quantile_brackets <- quantile(eb_pums$increc, probs = seq(0, 1, 0.2), na.rm=T)
eb_pums <- eb_pums %>%
mutate(inc5=cut(increc, breaks=quantile_brackets))
#check that it worked
summary(eb_pums$inctot)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -12000 10000 35000 243120 120100 999999
summary(eb_pums$inc5)
## (0,3.6e+03] (3.6e+03,1.51e+04] (1.51e+04,3.12e+04]
## 5909 18388 18421
## (3.12e+04,5.5e+04] (5.5e+04,7.22e+05] NA's
## 18877 17974 37875
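Note the large number of NA's in inc5: besides the cases we set to NA when cleaning, cut() builds left-open intervals by default, so incomes of exactly 0 (the lowest breakpoint) fall outside the bottom bracket. If you wanted those zero incomes kept in the lowest quintile, one option is the include.lowest argument; inc5_all below is a hypothetical variable name for this alternative coding:
# keep values equal to the lowest breakpoint (income of 0) in the first bracket
eb_pums <- eb_pums %>%
  mutate(inc5_all=cut(increc, breaks=quantile_brackets, include.lowest=TRUE))
summary(eb_pums$inc5_all)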
#We will run a new two-way table with our two new variables "racehisp" and "inc5"
pander(xtabs(~inc5 + racehisp, data = eb_pums))
 | American Indian or Alaskan | Asian | Black |
---|---|---|---|
(0,3.6e+03] | 46 | 1180 | 840 |
(3.6e+03,1.51e+04] | 145 | 3249 | 2667 |
(1.51e+04,3.12e+04] | 126 | 3043 | 2323 |
(3.12e+04,5.5e+04] | 119 | 2993 | 2112 |
(5.5e+04,7.22e+05] | 73 | 2732 | 1155 |
 | Hispanic | Other | White |
---|---|---|---|
(0,3.6e+03] | 537 | 256 | 3050 |
(3.6e+03,1.51e+04] | 2041 | 718 | 9568 |
(1.51e+04,3.12e+04] | 2050 | 700 | 10179 |
(3.12e+04,5.5e+04] | 1349 | 597 | 11707 |
(5.5e+04,7.22e+05] | 501 | 497 | 13016 |
Try comparing the results using these two methods: one using the categorical variable racehisp with an ordinal variable (our quintile recode of income, inc5), and the other using racehisp with the original interval-ratio version of income (increc). What do they indicate?
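One way to start that comparison with the interval-ratio version (just a sketch; adapt it to whichever summary statistics you care about) is to summarize the cleaned income variable by group with dplyr:
# mean and median cleaned income by race/ethnicity group
eb_pums %>%
  group_by(racehisp) %>%
  summarise(mean_inc=mean(increc, na.rm=TRUE),
            median_inc=median(increc, na.rm=TRUE))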
Save your data frame
saveRDS(eb_pums, file="eb_pums_v2.rds")
Save R Output/Scripts
In RStudio, click File > Save to save your scripts so you can reference them later on.
Save Graphs
Navigate to the "Plots" pane in RStudio, where you can save individual graphs in various formats, such as PNG or PDF, and modify their size.
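If you prefer to save a plot from code rather than through the Plots pane, base R's graphics devices work as well. A short sketch (the file name here is just an example):
# open a PNG device, draw the plot, then close the device to write the file
png("inctot_hist.png", width=800, height=600)
hist(eb_pums$inctot)
dev.off()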