Install the R Software. Go to http://cran.rstudio.com/ and click Download R for Windows (or the macOS link if you’re using a Mac), then click “install R for the first time” and download R. When the download finishes, double-click “R-3.4.4-win.exe” (or the appropriate installation file for your OS), select Run, and continue through the installation wizard.
Install RStudio. There are many ways to interface with R, from the bare-bones base R software to RStudio or R Commander (a GUI wrapper for base R). We will be using RStudio for our labs. To install RStudio, visit http://www.rstudio.com/products/rstudio/download and download RStudio Desktop for your operating system. If you are in the lab, RStudio is not installed, and the software manager is not working, you can download the zip/tar version of RStudio for Windows, open the zip file, and run RStudio without installing it. If you must do this, I recommend downloading the zip/tar file to your H: drive, where it can stay.
Download the californiatod.csv file and save it in your personal folder or C: drive. Open up RStudio, click Tools –> Import Dataset –> From Local File, navigate to your folder, and click the csv file. Make sure the Heading option is marked as “Yes” and press Ok.
Importing californiatod through the RStudio graphical user interface is equivalent to these R commands:
#install.packages("readr")
library(readr)
californiatod <- read_csv("californiatod.csv")
## Parsed with column specification:
## cols(
## name = col_character(),
## region = col_character(),
## transit = col_double(),
## density = col_double(),
## houseval = col_double(),
## railtype = col_character()
## )
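If readr is not installed, base R's read.csv() performs a comparable import. The sketch below is self-contained: it writes a two-row stand-in file to a temporary location (the rows and the object name tod_sample are made up for illustration and are not taken from californiatod.csv), reads it back, and inspects the result.

```r
# Made-up stand-in for californiatod.csv; the values are illustrative only
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "name,region,transit,density,houseval,railtype",
  "Station A,LA,0.20,4.5,250000,Light rail",
  "Station B,SD,0.10,3.0,200000,Heavy or commuter rail"
), csv_path)

# header = TRUE plays the same role as marking Heading as "Yes" in the import dialog
tod_sample <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

str(tod_sample)    # column names and types
head(tod_sample)   # first few rows
```

Checking str() right after an import is a quick way to confirm that numeric columns were not read in as text.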
Environment Tab. You can see your data by clicking the Environment tab in the upper right corner of your RStudio window and then clicking the californiatod table.
Create Quick Summary. In your console type:
summary(californiatod)
## name region transit density
## Length:26 Length:26 Min. :0.0000 Min. : 1.720
## Class :character Class :character 1st Qu.:0.1162 1st Qu.: 3.225
## Mode :character Mode :character Median :0.1883 Median : 4.335
## Mean :0.2529 Mean : 5.360
## 3rd Qu.:0.3833 3rd Qu.: 5.135
## Max. :0.6429 Max. :14.850
## houseval railtype
## Min. : 89046 Length:26
## 1st Qu.:195318 Class :character
## Median :258209 Mode :character
## Mean :282155
## 3rd Qu.:314860
## Max. :779792
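As you can see, the output from summary is not very helpful for summary statistics: character columns show only their length and class, and numeric columns show no standard deviation. A package called skimr enhances R’s capability of producing summary statistics: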
# First install the skimr package
if (!require(skimr))
install.packages("skimr", type="binary")
## Loading required package: skimr
# load the skimr package (telling R we're going to use it)
library(skimr)
# use it to create summary stats for the californiatod data set
skim(californiatod)
## Skim summary statistics
## n obs: 26
## n variables: 6
##
## Variable type: character
## variable missing complete n min max empty n_unique
## name 0 26 26 12 37 0 26
## railtype 0 26 26 10 22 0 2
## region 0 26 26 2 10 0 4
##
## Variable type: numeric
## variable missing complete n mean sd p0 p25
## density 0 26 26 5.36 3.53 1.72 3.23
## houseval 0 26 26 282154.69 163092.19 89045.74 2e+05
## transit 0 26 26 0.25 0.19 0 0.12
## p50 p75 p100 hist
## 4.34 5.13 14.85 ▇▇▂▁▁▂▁▂
## 258209.17 314859.91 779791.98 ▅▇▇▂▁▁▁▁
## 0.19 0.38 0.64 ▅▇▆▂▂▁▃▂
Look at the output from the dataset summary. We can see that “name”, “region”, and “railtype” are nominal variables (frequency count only), and “transit”, “density”, and “houseval” are interval-ratio data (summary statistics).
Create a frequency table (for categorical variables). We will create a frequency table for “railtype” using the table() function in R. This function can be used to get basic frequencies for one or two variables (one for rows and one for columns). Note that the $ operator after the name of the data frame tells R which column of the data frame to reference.
table(californiatod$railtype)
##
## Heavy or commuter rail Light rail
## 17 9
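Counts are often easier to compare as proportions. As a minimal sketch, the block below rebuilds the two railtype counts shown above (17 and 9) so that it runs on its own, then converts them to shares with prop.table(). With the real data you would pass table(californiatod$railtype) instead of the rebuilt vector.

```r
# Rebuild the railtype counts shown above (17 and 9) so the example runs on its own
railtype <- rep(c("Heavy or commuter rail", "Light rail"), times = c(17, 9))

counts <- table(railtype)
shares <- prop.table(counts)   # each count divided by the total (26)

round(shares, 3)
```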
The methods for analyzing your variables depend on the variable type.
Create a Frequency Table. Frequency tables help us to understand categorical variables. To create one use the table() function.
Create a Cross-Tabulation Table. To see how rail type breaks down by region we will create a cross-tabulation table for the two categorical variables (railtype and region). Which region has the most light rail TOD sites in this dataset?
table(californiatod$railtype, californiatod$region)
##
## Bay Area LA Sacramento SD
## Heavy or commuter rail 12 4 0 1
## Light rail 1 2 2 4
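Row and column totals make a cross-tabulation easier to read. The sketch below copies the counts from the table above into a standalone object (the name tod_xtab is only for illustration), adds totals with addmargins(), and computes within-region shares with prop.table().

```r
# Counts copied from the cross-tabulation above
tod_xtab <- as.table(matrix(
  c(12, 4, 0, 1,
     1, 2, 2, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    railtype = c("Heavy or commuter rail", "Light rail"),
    region   = c("Bay Area", "LA", "Sacramento", "SD")
  )
))

addmargins(tod_xtab)                         # adds row and column totals
round(prop.table(tod_xtab, margin = 2), 2)   # share of each rail type within a region
```

With the real data, the same two calls can be applied directly to table(californiatod$railtype, californiatod$region).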
To produce summary statistics for a single continuous variable we can use the summary() function or, as shown here, the skim() function from skimr. To calculate the standard deviation on its own we will use the sd() function.
skim(californiatod$density)
## Skim summary statistics
##
## Variable type: numeric
## variable missing complete n mean sd p0 p25 p50 p75
## californiatod$density 0 26 26 5.36 3.53 1.72 3.23 4.34 5.13
## p100 hist
## 14.85 ▇▇▂▁▁▂▁▂
sd(californiatod$density)
## [1] 3.531352
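To see what sd() computes, here is a minimal sketch that reproduces it by hand with the sample standard deviation formula (sum of squared deviations divided by n - 1). The vector x holds hypothetical values, not values from californiatod.

```r
# Hypothetical values, not taken from californiatod
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))  # sample formula: divide by n - 1

manual_sd
sd(x)   # same value as manual_sd
```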
As we saw above, we can use the skimr::skim function for better summary statistics.
Compare the mean density to the minimum and maximum value for transit usage. Is the mean closer to the minimum or the maximum? How does it compare to the median? What does this tell us?
skim(californiatod)
## Skim summary statistics
## n obs: 26
## n variables: 6
##
## Variable type: character
## variable missing complete n min max empty n_unique
## name 0 26 26 12 37 0 26
## railtype 0 26 26 10 22 0 2
## region 0 26 26 2 10 0 4
##
## Variable type: numeric
## variable missing complete n mean sd p0 p25
## density 0 26 26 5.36 3.53 1.72 3.23
## houseval 0 26 26 282154.69 163092.19 89045.74 2e+05
## transit 0 26 26 0.25 0.19 0 0.12
## p50 p75 p100 hist
## 4.34 5.13 14.85 ▇▇▂▁▁▂▁▂
## 258209.17 314859.91 779791.98 ▅▇▇▂▁▁▁▁
## 0.19 0.38 0.64 ▅▇▆▂▂▁▃▂
Let’s explore further. This time we will include skewness and kurtosis in our summary stats (we will use the psych package for this).
# the psych package provides more powerful summary functions that
# report more summary values than just the mean and median
#install.packages("psych")
# just delete the # before install.packages() to install psych
library(psych)
describe(californiatod)
## vars n mean sd median trimmed mad
## name* 1 26 NaN NA NA NaN NA
## region* 2 26 NaN NA NA NaN NA
## transit 3 26 0.25 0.19 0.19 0.24 0.19
## density 4 26 5.36 3.53 4.34 4.81 1.62
## houseval 5 26 282154.69 163092.19 258209.17 258284.88 96930.33
## railtype* 6 26 NaN NA NA NaN NA
## min max range skew kurtosis se
## name* Inf -Inf -Inf NA NA NA
## region* Inf -Inf -Inf NA NA NA
## transit 0.00 0.64 0.64 0.54 -0.96 0.04
## density 1.72 14.85 13.13 1.53 1.38 0.69
## houseval 89045.74 779791.98 690746.24 1.57 2.08 31985.01
## railtype* Inf -Inf -Inf NA NA NA
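To make the skew and kurtosis columns less of a black box, the sketch below computes simple moment-based versions in base R. The vector x is hypothetical, and packages such as psych may apply small-sample adjustments, so their values can differ slightly from these.

```r
# Hypothetical right-skewed values, not taken from californiatod
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 12)

dev <- x - mean(x)
m2 <- mean(dev^2)   # second central moment
m3 <- mean(dev^3)   # third central moment
m4 <- mean(dev^4)   # fourth central moment

skewness <- m3 / m2^1.5            # positive means a longer right tail
excess_kurtosis <- m4 / m2^2 - 3   # 0 for a normal distribution

skewness
excess_kurtosis
```

A positive skew, like the values reported above for density and houseval, means a few large values pull the mean above the median.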
Create a two-way cross-tabulation. Run a two-way cross-tabulation for each of the following questions:
Then we will measure the two-way cross tab of these two categorical variables with the means of a continuous third variable. This sounds complex, but we shall approach it in a straightforward way using the dplyr package.
# in order to get these cross tabs with the additional continuous
# variable we will use the table() function, then dplyr
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
table(californiatod$region, californiatod$railtype)
##
## Heavy or commuter rail Light rail
## Bay Area 12 1
## LA 4 2
## Sacramento 0 2
## SD 1 4
californiatod %>%
  group_by(region, railtype) %>%
  summarise(avg_transit = mean(transit))
## # A tibble: 7 x 3
## # Groups: region [?]
## region railtype avg_transit
## <chr> <chr> <dbl>
## 1 Bay Area Heavy or commuter rail 0.373
## 2 Bay Area Light rail 0.
## 3 LA Heavy or commuter rail 0.209
## 4 LA Light rail 0.159
## 5 Sacramento Light rail 0.190
## 6 SD Heavy or commuter rail 0.154
## 7 SD Light rail 0.102
The dplyr package is designed to perform table operations on data frames for data manipulation and some summarization. The preceding syntax told R to take our californiatod data frame, group it by the region and railtype variables, and then calculate the mean transit usage within each group. This is similar to the kind of pivot table operations one can do in Excel.
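The same grouped mean can also be computed without dplyr. The sketch below uses base R's aggregate() on a small hypothetical data frame (tod_sample and its rows are made up to mimic the structure of californiatod), so it runs without the CSV file.

```r
# Hypothetical rows that mimic the structure of californiatod
tod_sample <- data.frame(
  region   = c("LA", "LA", "LA", "SD", "SD"),
  railtype = c("Light rail", "Light rail", "Heavy or commuter rail",
               "Light rail", "Light rail"),
  transit  = c(0.10, 0.20, 0.30, 0.05, 0.15)
)

# Mean of transit for every region/railtype combination that occurs in the data
avg_transit <- aggregate(transit ~ region + railtype, data = tod_sample, FUN = mean)
avg_transit
```

Like the dplyr version, aggregate() returns one row per combination that actually appears, which is why the output above has 7 rows rather than 8.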
As with all things in R there are multiple ways to get to the same spot. We will make a histogram in order to look at the distribution of transit usage frequency using base R commands and then using the ggplot2 package.
hist(californiatod$transit, main="Histogram of Transit Usage", xlab = "Transit Usage")
#install.packages("ggplot2")
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
p1 <- ggplot(californiatod, aes(x = transit))
p1 + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
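The message above means ggplot2 fell back on its default of 30 bins, which is a lot for a data set of 26 sites. As a sketch of one way to pick the bin width yourself, the block below uses hypothetical transit values and a bin width of 0.1 (an illustrative choice, not a recommendation from this lab). The ggplot2 part runs only if the package is installed.

```r
# Hypothetical transit shares between 0 and 0.7, not taken from californiatod
transit <- c(0.02, 0.08, 0.11, 0.12, 0.15, 0.19, 0.21,
             0.25, 0.33, 0.38, 0.41, 0.55, 0.64)

# Base R: fixed-width bins of 0.1; plot = FALSE returns the counts instead of drawing
bins <- hist(transit, breaks = seq(0, 0.7, by = 0.1), plot = FALSE)
bins$counts

# ggplot2: the same bin width via the binwidth argument
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(data.frame(transit = transit), ggplot2::aes(x = transit)) +
    ggplot2::geom_histogram(binwidth = 0.1, boundary = 0)
  print(p)
}
```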
Look at the histogram for density. What does this indicate about whether more TOD sites are located in high or low-density areas?
A scatterplot will show us the relationship between two continuous variables. Again we will use a base R and ggplot approach.
#comparing transit usage and houseval
plot(californiatod$transit, californiatod$houseval)
#we will install the "scales" package so we can label
# our Y axis in dollars
#install.packages("scales")
library(scales)
##
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
##
## alpha, rescale
## The following object is masked from 'package:readr':
##
## col_factor
p2 <- ggplot(californiatod, aes(x=transit, y = houseval))
p2 + geom_point() + scale_y_continuous(labels = scales::dollar)
Script Window
If you click the button with the little green + sign in the upper left corner of your screen you will get a series of options. Click “R Script” and this will open a new R script file. R script files are simply text files that hold our R code. You can type commands into them and press Run at the top of the window to run the current line or the highlighted lines (the Source button runs the entire script). With the cursor on a line you just typed, pressing “CTRL+Enter” makes RStudio execute that line. You can also highlight multiple lines of code and press “CTRL+Enter” to run the selection.
Area under the curve (without z-score lookup)
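If you’re trying to find out what proportion of the sample had incomes between $25,000 and $30,000 (assuming a normal distribution of income), and you know the mean income is $20,000 and the standard deviation is $10,000, you can compute the proportion using the pnorm function. pnorm returns the proportion of the distribution that falls below a given value, so the proportion between two values is the difference between two pnorm results. The command below gives the first piece, the proportion with incomes below $30,000.
Type the command into the script window and hit “Run”.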
pnorm(30000, mean = 20000, sd= 10000)
## [1] 0.8413447
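The value 0.8413447 is the proportion below $30,000. To get the proportion between $25,000 and $30,000, subtract the proportion below $25,000. A minimal sketch using the same mean and standard deviation (the variable names are only for illustration):

```r
mean_income <- 20000
sd_income   <- 10000

below_30k <- pnorm(30000, mean = mean_income, sd = sd_income)  # 0.8413447, as above
below_25k <- pnorm(25000, mean = mean_income, sd = sd_income)

# Proportion of incomes between $25,000 and $30,000
between_25k_30k <- below_30k - below_25k
between_25k_30k
```

The result is about 0.15, so roughly 15% of the sample would fall in that income range under the normality assumption.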