1 Working with Normal distribution

Data File - californiatod.csv

#install.packages("readr")
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
californiatod <- read_csv("californiatod.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   region = col_character(),
##   transit = col_double(),
##   density = col_double(),
##   houseval = col_double(),
##   railtype = col_character()
## )

1.1 Calculate z-scores

We will calculate z-scores using the formula. To do so, we will use R much as we would use a spreadsheet: creating new variables as we go, and then applying the formula to the variables we have created along the way.

The z-score for each individual observation of a variable is computed in the same fashion: the value of that observation minus a constant (the variable’s mean), divided by another constant (the variable’s standard deviation).

This is the z-score formula we just described above:

\[z = \frac{x - \bar{x}}{s}\]

For our variable “density”, which lists the density of each TOD site, we can create a new variable in R called “z_density” containing the z-score of the density variable, using this formula:

californiatod$z_density <- (californiatod$density - mean(californiatod$density))/sd(californiatod$density)

californiatod %>% print()
## # A tibble: 26 x 7
##    name            region  transit density houseval railtype     z_density
##    <chr>           <chr>     <dbl>   <dbl>    <dbl> <chr>            <dbl>
##  1 Archstone Barr… Bay Ar…   0.400    3.36  192593. Heavy or co…   -0.566 
##  2 Archstone Miss… SD        0.108    2.18  203055. Light rail     -0.901 
##  3 Atherton Place  Bay Ar…   0.500    5.21  217672. Heavy or co…   -0.0425
##  4 Bellamar Apart… LA        0.318   10.1   117256. Light rail      1.33  
##  5 Coggins Square  Bay Ar…   0.528    4.68  270555. Heavy or co…   -0.193 
##  6 Del Norte Place Bay Ar…   0.643    4.91  229336. Heavy or co…   -0.127 
##  7 Iron Horse Lof… Bay Ar…   0.606    4.68  270207. Heavy or co…   -0.193 
##  8 Mercado Apartm… SD        0.       3.21  121786. Light rail     -0.609 
##  9 Mission Wells   Bay Ar…   0.147    3.96  323497. Heavy or co…   -0.396 
## 10 Northpark Apar… Bay Ar…   0.174    3.27  667661. Heavy or co…   -0.592 
## # ... with 16 more rows

Remember that mean(Dataset$varname) and sd(Dataset$varname) give you the mean and standard deviation of a variable (varname).
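Base R also provides scale(), which standardizes a vector in one step. A minimal sketch on a toy vector (not our lab data):

```r
x <- c(2, 4, 6, 8)

# scale() subtracts the mean and divides by the standard deviation;
# it returns a one-column matrix, so as.numeric() flattens it back to a vector
z <- as.numeric(scale(x))

# identical to applying the z-score formula by hand
z_manual <- (x - mean(x)) / sd(x)
all.equal(z, z_manual)  # TRUE
```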

1.2 Calculate the Shaded Area with the pnorm function

To get the probability under a certain (e.g. normal) distribution, we use the distribution function with the prefix p. For example, for a normal distribution, we use pnorm(x, mean, sd), which gives us the probability of a value up to x.

[Figure: The Normal Distribution]

# shaded area on the left
pnorm(-2)
## [1] 0.02275013
# shaded area on the right
(1 - pnorm(-1))
## [1] 0.8413447
# put together
pnorm(-2) + (1 - pnorm(-1))
## [1] 0.8640949
## equivalent to

pnorm(50, mean=100, sd=25) + (1 - pnorm(75, mean=100, sd=25))
## [1] 0.8640949

1.3 Find cutoff given probability (shaded area) with the qnorm function

To get the cutoff corresponding to a probability following a certain (e.g. normal) distribution, we use the distribution function with the prefix q (for quantile function). For example, for a normal distribution, we use qnorm(p, mean, sd), which gives us the cutoff for probability p.

## what is the cutoff for p=0.03 in a standard normal distribution (mean=0, sd=1)
qnorm(0.03)
## [1] -1.880794
## what is the cutoff for p=0.03 in a normal distribution with mean=100, sd=25
qnorm(0.03, mean=100, sd=25)
## [1] 52.98016
## what is the cutoff with p=0.03 in the right tail (i.e., p=0.97 to the left)
qnorm(0.97)
## [1] 1.880794
# OR
qnorm(0.03, lower.tail=FALSE)
## [1] 1.880794
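As a sanity check, pnorm() and qnorm() are inverses of each other, and any normal quantile is just the standard-normal quantile rescaled by the mean and standard deviation:

```r
# qnorm() undoes pnorm(): recover a cutoff from its probability
qnorm(pnorm(-2))    # -2
pnorm(qnorm(0.03))  # 0.03

# rescaling: qnorm(p, mean, sd) equals mean + sd * qnorm(p)
all.equal(qnorm(0.03, mean = 100, sd = 25), 100 + 25 * qnorm(0.03))  # TRUE
```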

2 Data Wrangling with R

This lab introduces the beginning of our R workflow: the basics of importing data, manipulating variables, and saving our results in various formats for later use.

Data File

Packages

Load .RDS file. .RDS files save individual objects from one’s global environment at the time of saving. We do not need to cover all of the intricacies of differing environments (a treatment of the advantages of RDS files versus the slightly more common RDA files can be found here), but do know that .RDS is a data format built for R: you do not need any additional packages to load these files, and the syntax is straightforward. Because they are built for R, they also load much faster than other forms of tabular data, since R does not have to work out differing column types or other related issues.

Note: I also loaded the pander package. This is for display purposes for the lab

#remember to begin by opening your RStudio project and save the data file to your project directory

if(!require(pacman)){install.packages("pacman"); library(pacman)}
p_load(dplyr, pander)

eb_pums <- readRDS("eb_pums_2000.rds")

#Note: you do not need to have the full file path if you have properly set up your RStudio project, you just need the filename surrounded by quotation marks

Now that we have our data loaded up, let’s move onto the basics of creating and deleting data.
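To see the .RDS round trip in miniature, here is a sketch with a throwaway data frame and a temporary file (the path is just for illustration): saveRDS() writes any single R object, and readRDS() restores it exactly, column types included.

```r
tmp <- tempfile(fileext = ".rds")

obj <- data.frame(name = c("a", "b"), value = c(1.5, 2.5))
saveRDS(obj, file = tmp)   # write the object to disk

obj2 <- readRDS(tmp)       # read it back
identical(obj, obj2)       # TRUE -- nothing lost, no column-type guessing
```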

2.1 Creating and Deleting Variables in R

R supports scalars, vectors, and matrices. You can create them with the following commands:

s <- 5 + 1
s
## [1] 6

This creates a length-one vector (a scalar) from the result of 5 + 1. Now type “s” and hit return in order to see the answer.

v1 <- c(2,3,4)
v1
## [1] 2 3 4

This creates a vector of 2,3,4. Now type “v1” and hit return in order to see your vector.

v2 <- 1:10
v2
##  [1]  1  2  3  4  5  6  7  8  9 10

This creates a vector of all integers between 1 and 10. Now type “v2” in order to view your vector.

m <- matrix(c(1,2,3,11,12,13), nrow = 2, ncol = 3, byrow = T)
m
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]   11   12   13

This creates a matrix of the above values. Now type “m” in order to see it. If you wish to learn more about the differences between matrices and data frames please see here.

Constructing Data Frames

Say you have collected some data with a few variables and you want to enter it into R. We can use the data.frame() function:

df <- data.frame(x = c(1,2,3,4,5), 
                 y = c(5,7,4,8,0), 
                 sex = c(1,2,2,1,2))
df
##   x y sex
## 1 1 5   1
## 2 2 7   2
## 3 3 4   2
## 4 4 8   1
## 5 5 0   2

Type “df” in order to print out your data frame to your console. *Note: Data frames are the more “traditional” rectangular data sets we are used to seeing and are the bread-and-butter data objects we will use in this class.*

2.2 Manipulate variables and values of a data frame

View Column or Row of Data Frame

View a variable column in a data frame with the “$” operator, e.g.

df$x
## [1] 1 2 3 4 5

You can access rows of a data frame with this syntax: df[1:2,] (this will return the first two rows of your data frame not including your headers)
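A minimal sketch of this bracket indexing, re-creating the df built above:

```r
df <- data.frame(x = c(1,2,3,4,5),
                 y = c(5,7,4,8,0),
                 sex = c(1,2,2,1,2))

df[1:2, ]        # first two rows, all columns
df[, "y"]        # a single column by name; same values as df$y
df[df$x > 3, ]   # only the rows where x is greater than 3
```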

Using Operators

In R you can use all the common mathematical and logical operators to create new variables.

The most common operators are <-, +, -, *, /, ^, >, <, >=, <=

  • <- is the operator for assigning a value to the variable on the left
  • == means “equals” in conditional statements (a conditional statement uses an if expression, such as “if such and such a condition is true”)
  • != means “does not equal”
  • & means “and”
  • | means “or”
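A few of these operators in action on simple values:

```r
a <- 5

a == 5               # TRUE  ("equals")
a != 5               # FALSE ("does not equal")
(a > 3) & (a < 10)   # TRUE  (both conditions hold)
(a < 0) | (a > 4)    # TRUE  (at least one condition holds)
a^2 + a / 5          # 26
```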

Creating Variables

We have covered creating new variables using the basic [name of variable] <-. But how do we make a new variable for a data frame?

It is a straightforward process where we use the “$” operator and <- together, like so:

df$new_variable <- expression

Here “df” is our data frame and “new_variable” is the variable we want to define. However, with the dplyr package, my personal favorite is to use the dplyr syntax for data frame manipulation:

df <- df %>% mutate(new_variable=expression)

We will be using dplyr for today’s lab. Every step also has a base R equivalent that does not require dplyr.
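As a concrete sketch of the two equivalent styles, using a toy data frame rather than our lab data (the column names here are made up for illustration):

```r
library(dplyr)

toy <- data.frame(x = c(1, 2, 3))

# dplyr: mutate() creates a new column from existing ones
toy <- toy %>% mutate(x_doubled = x * 2)

# base R equivalent of the same operation
toy$x_doubled_base <- toy$x * 2

toy$x_doubled   # 2 4 6
```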

A quick look of what is in eb_pums before we go ahead:

eb_pums %>% head
##   year hhwt hhtype statefip metarea metaread puma pumaty00 pumaland
## 1 2000   19   <NA>        6     736     7361 2101       32 66840144
## 2 2000   18   <NA>        6     736     7361 2101       32 66840144
## 3 2000   18   <NA>        6     736     7361 2101       32 66840144
## 4 2000   18   <NA>        6     736     7361 2101       32 66840144
## 5 2000   18   <NA>        6     736     7361 2101       32 66840144
## 6 2000   18   <NA>        6     736     7361 2101       32 66840144
##    pumaarea gq ownershp mortgage commuse acreprop acrehous mortamt1
## 1 127634167  1        2        0       0        0        0        0
## 2 127634167  1        1        3       0        0        0     1200
## 3 127634167  1        1        3       0        0        0     1200
## 4 127634167  1        1        1       1        2        1        0
## 5 127634167  1        1        1       1        2        1        0
## 6 127634167  1        1        1       1        2        1        0
##   proptx99 rent rentgrs condo hhincome valueh lingisol vacancy kitchen
## 1        0  650     708     0    65000 999999        1       0       4
## 2       41    0       0     1    94000 225000        1       0       4
## 3       41    0       0     1    94000 225000        1       0       4
## 4        9    0       0     1    46500  27500        1       0       4
## 5        9    0       0     1    46500  27500        1       0       4
## 6        9    0       0     1    46500  27500        1       0       4
##   rooms builtyr unitsstr bedrooms vehicles nfams perwt relate age sex
## 1     3       8        6        2        1     1    21      1  29   2
## 2     5       9        5        3        3     2    18      1  56   2
## 3     5       9        5        3        3     2    14     11  57   1
## 4     3       4        1        3        3     1    24      1  41   2
## 5     3       4        1        3        3     1    23      2  42   1
## 6     3       4        1        3        3     1    37      3  19   1
##   marst                     race bpl ancestr1 citizen yrimmig yrsusa1
## 1     6                    White   6      416       0       0       0
## 2     4                    White  49       82       0       0       0
## 3     4 Three or more major race  22      900       0       0       0
## 4     1               Other race 210      226       3     974      26
## 5     1               Other race 200      210       3     974      26
## 6     6                    White   6      210       0       0       0
##   language speakeng       hispan school educ00 gradeatt schltype empstat
## 1       29        4 Not Hispanic      1     12        0        1       1
## 2        1        3 Not Hispanic      1     16        0        1       1
## 3        1        3 Not Hispanic      1     12        0        1       1
## 4       12        5        Other      1      8        0        1       3
## 5       12        6       Mexico      1      2        0        1       3
## 6       12        4       Mexico      2      8        5        2       3
##   labforce occ        classwkr indnaics wkswork1 uhrswork workedyr inctot
## 1        2 263   Self-employed 5414           52       40        2  65000
## 2        2 176 Works for wages 3254           52       40        2  62000
## 3        2 605 Works for wages 712            52       40        2  32000
## 4        1 364 Works for wages 6212           48       30        2  12900
## 5        1 896 Works for wages 531            48       40        2  30000
## 6        1 402 Works for wages 722Z           20       25        2   3600
##   incwage incearn poverty migrate5 migplac5 movedin pwtype pwtype00
## 1   43000   65000     501        2        6       2      4       32
## 2   55000   55000     501        2        6       2      5       33
## 3   32000   32000     369        2        6       0      5       34
## 4   12900   12900     266        1        0       5      0        0
## 5   30000   30000     266        1        0       0      0        0
## 6    3600    3600     266        1        0       0      0        0
##   pwpuma00 tranwork carpool riders trantime departs
## 1     2100       70       0      0        0       0
## 2     2400       10       1      1       20     802
## 3     2700       10       1      1       45     532
## 4        0        0       0      0        0       0
## 5        0        0       0      0        0       0
## 6        0        0       0      0        0       0
# head shows the first 6 rows of a data frame (or other R object)
  • Mathematical operations on variables: We can create variables based on values of existing variables. For example, if you have a variable for total personal income “inctot”, create a new variable with income in thousands.
eb_pums <- eb_pums %>% mutate(inctot1k=inctot/1000)

# which is equivalent to base R command
eb_pums$inctot1k <- eb_pums$inctot/1000
  • True/False from Existing Variables: Create a new variable “age65plus” in a new column at the end of the table, indicating whether a person is older than 65 (using True/False)
eb_pums <- eb_pums %>% mutate(age65plus=age>65)

# which is equivalent to base R command
eb_pums$age65plus <- eb_pums$age > 65
  • Create new variables by recoding existing ones: We often need to recode existing variables into new ones for ease of analysis, visualization, or reporting. R has multiple ways and packages that can assist, but we will first cover the most basic approach to recoding. Assume we want to create a poverty status variable called “povstat” which takes the value 1 when income (inctot) is below the hypothetical poverty line of $12,000, or 0 when income is at or above the line
eb_pums <- eb_pums %>% mutate(povstat=ifelse(inctot < 12000, 1,0))

# which is equivalent to base R command
eb_pums$povstat <- ifelse(eb_pums$inctot < 12000, 1,0)
  • Factors and levels: Remember that R calls categorical variables “factors” and that factors can be nominal or ordinal (meaning they have an order). Say we want to create a factor variable from our new “povstat” variable where 1 = “below the poverty line” and 0 = “above poverty line”
eb_pums <- eb_pums %>% mutate(povfactor=factor(povstat, 
                                               levels = c(1,0), 
                                               labels =c("below poverty line", "above poverty line")))

# which is equivalent to base R command
eb_pums$povfactor <- factor(eb_pums$povstat, 
                            levels = c(1,0), 
                            labels =c("below poverty line", "above poverty line"))
  • Dropping columns: In order to remove a variable you can just type

eb_pums <- eb_pums %>% select(-variable_to_delete)

which is equivalent to the base R command eb_pums$variable_to_delete <- NULL

You can also use df <- NULL to empty a data frame, though the name will still be present in your environment; use rm(df) to remove the object entirely.
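Putting the add/drop/remove operations together on a throwaway data frame (the names here are just for illustration):

```r
df2 <- data.frame(a = 1:3)

df2$b <- df2$a * 10   # add a column
df2$b <- NULL         # drop that column again
names(df2)            # "a"

rm(df2)               # remove the whole object from the environment
exists("df2")         # FALSE
```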

2.3 Cleaning and Creating New Variables

First, we will create a new race/ethnicity variable. Remember that Census data include two race/ethnicity variables: race and Hispanic origin. We will use these to create the following categories for our population: White non-Hispanic, Black non-Hispanic, Asian non-Hispanic, Hispanic, and Other non-Hispanic.

First we will make a quick two-way contingency table to take a look at our two variables.

xtabs(~race + hispan, data = eb_pums)
##                             hispan
## race                         Not Hispanic Mexico Puerto Rico Cuban Other
##   White                             56451   6032         261    95  2270
##   Black                             13632    192          52    15    87
##   American Indian or Alaskan          496    216           1     0    58
##   Chinese                            6962      5           1     0    15
##   Japanese                           1046      9           2     0     7
##   Other Asian or Pacific            12176     75          11     1    81
##   Other race                          349   7354         282    18  2430
##   Two major races                    4313   1157         118    16   635
##   Three or more major race            370     84           9     4    56

To create the new variable, we must first understand the way the variables for race and ethnicity are organized and coded.

#one-way table frequency tabulation
table(eb_pums$race)
## 
##                      White                      Black 
##                      65109                      13978 
## American Indian or Alaskan                    Chinese 
##                        771                       6983 
##                   Japanese     Other Asian or Pacific 
##                       1064                      12344 
##                 Other race            Two major races 
##                      10433                       6239 
##   Three or more major race 
##                        523
table(eb_pums$hispan)
## 
## Not Hispanic       Mexico  Puerto Rico        Cuban        Other 
##        95795        15124          737          149         5639
#to view the underlying codes 
table(as.integer(eb_pums$race))
## 
##     1     2     3     4     5     6     7     8     9 
## 65109 13978   771  6983  1064 12344 10433  6239   523
table(as.integer(eb_pums$hispan))
## 
##     1     2     3     4     5 
## 95795 15124   737   149  5639

Remember that factors (the name for categorical variables in R) can be ordered or unordered. As such, they have numeric values attached to them that R uses as identifiers, so “White” is coded as “1” and so on.

We will make what is called a nested conditional statement. This is a long statement for saying we are going to make a series of ifelse() statements to recode our variable.

Note: You can use the label, such as “White” or “Black” or the underlying integer value of the factor such as 1 for “White” or 2 for “Black”

#we will create a new variable "racehisp" using dplyr's mutate function
eb_pums <- eb_pums %>% mutate(
  race=as.character(race), 
  hispan=as.character(hispan),
  racehisp=case_when(
                     race=="Chinese" ~ "Asian", 
                     race=="Japanese" ~ "Asian", 
                     race=="Other Asian or Pacific" ~ "Asian",
                     race=="White" ~ "White", 
                     race=="Black" ~ "Black", 
                     race=="American Indian or Alaskan" ~ "American Indian or Alaskan", 
                     hispan != "Not Hispanic" ~ "Hispanic",
                     TRUE~"Other"))

#check your work
pander(table(eb_pums$racehisp, eb_pums$race))
                            AmInd/AK  Black  Chinese  Japanese  OthAsn/PI  Other race  3+ races  2 races  White
American Indian or Alaskan       771      0        0         0          0           0         0        0      0
Asian                              0      0     6983      1064      12344           0         0        0      0
Black                              0  13978        0         0          0           0         0        0      0
Hispanic                           0      0        0         0          0       10084       153     1926      0
Other                              0      0        0         0          0         349       370     4313      0
White                              0      0        0         0          0           0         0        0  65109
(race columns abbreviated: AmInd/AK = American Indian or Alaskan, OthAsn/PI = Other Asian or Pacific,
3+ races = Three or more major races, 2 races = Two major races)
pander(xtabs(~race + hispan, data = eb_pums))
                            Cuban  Mexico  Not Hispanic  Other  Puerto Rico
American Indian or Alaskan      0     216           496     58            1
Black                          15     192         13632     87           52
Chinese                         0       5          6962     15            1
Japanese                        0       9          1046      7            2
Other Asian or Pacific          1      75         12176     81           11
Other race                     18    7354           349   2430          282
Three or more major race        4      84           370     56            9
Two major races                16    1157          4313    635          118
White                          95    6032         56451   2270          261
tab_racehisp <- table(eb_pums$racehisp)

#check work again
pander(tab_racehisp)
American Indian or Alaskan Asian Black Hispanic Other White
771 20391 13978 12163 5032 65109

Recoding a continuous variable

Now we will recode a continuous variable–personal income (“inctot”). We will recode “inctot” to an ordinal categorical variable (or a leveled factor variable in R parlance) by grouping categories of income.

Look at the Data

Let’s start by looking more closely at the original variable’s coding and distribution using the summary() and hist() functions.

summary(eb_pums$inctot)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -12000   10000   35000  243120  120100  999999
hist(eb_pums$inctot)

Notice that the top values are repeated (999999). That code indicates “N/A” (see technical documentation at www.ipums.org). There are also negative values indicating income losses.

Clean the Data

We will clean this data by eliminating outliers and problematic cases. We will do this by creating a recode of the original variable (Note: NEVER write over your original data).

#first we'll copy over our original variable to a
#new column
eb_pums$increc <- eb_pums$inctot
#Set Outliers to "NA". Values less than zero or
#999999 to NA
eb_pums <- eb_pums %>% mutate(increc=ifelse(inctot < 0 | inctot >= 999999, NA, inctot))

#OR base R equivalent
eb_pums$increc[eb_pums$inctot < 0 | eb_pums$inctot >= 999999] <- NA

We will now recode the data into our new categories, in this case, quintiles. To do so, we use the quantile() function to compute the quintile cutoffs and base R’s cut() function to divide income into those brackets.

#We will create a new variable called "inc5" made up of 5 quantiles
#of our cleaned-up income variable
quantile_brackets <- quantile(eb_pums$increc, probs = seq(0, 1, 0.2), na.rm=T)
eb_pums <- eb_pums %>% 
  mutate(inc5=cut(increc, breaks=quantile_brackets))

#check that it worked
summary(eb_pums$inctot)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -12000   10000   35000  243120  120100  999999
summary(eb_pums$inc5)
##         (0,3.6e+03]  (3.6e+03,1.51e+04] (1.51e+04,3.12e+04] 
##                5909               18388               18421 
##  (3.12e+04,5.5e+04]  (5.5e+04,7.22e+05]                NA's 
##               18877               17974               37875
#We will run a new two way table with our two new #variables "racehisp" and "inc5"

pander(xtabs(~inc5 + racehisp, data = eb_pums))
                     American Indian or Alaskan  Asian  Black  Hispanic  Other  White
(0,3.6e+03]                                  46   1180    840       537    256   3050
(3.6e+03,1.51e+04]                          145   3249   2667      2041    718   9568
(1.51e+04,3.12e+04]                         126   3043   2323      2050    700  10179
(3.12e+04,5.5e+04]                          119   2993   2112      1349    597  11707
(5.5e+04,7.22e+05]                           73   2732   1155       501    497  13016

Try comparing the results using these two methods (one using the categorical variable racehisp with an ordinal variable – our quintile recode of income – and the other using racehisp with the cleaned interval-ratio version of income – increc). What do they indicate?
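One way to start that comparison is to compute mean income within each race/ethnicity group. A sketch on a hypothetical mini data set standing in for eb_pums (the values are invented):

```r
library(dplyr)

# hypothetical stand-in for eb_pums with just the two columns we need
toy <- data.frame(
  racehisp = c("White", "White", "Black", "Black", "Hispanic"),
  increc   = c(50000, 70000, 40000, NA, 30000)
)

toy %>%
  group_by(racehisp) %>%
  summarise(mean_inc = mean(increc, na.rm = TRUE))
# Black 40000, Hispanic 30000, White 60000
```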

2.4 Saving Your Work

Save your data frame

saveRDS(eb_pums, file="eb_pums_v2.rds")

Save R Output/Scripts

In RStudio click File –> Save in order to save your scripts so you can reference them later on.

Save Graphs

Navigate to your “Plots” pane in RStudio and you can save individual graphs in varied formats such as a png or pdf file and modify its size.