library(tidyverse)
#set your working directory
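# e.g. setwd("~/my/data/folder") (a hypothetical path)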
californiatod <- read_csv("californiatod.csv")
Recall that we superimposed a normal curve on top of a histogram to get a rough idea of whether a variable is normally distributed.
First, we will create a histogram for density and superimpose a curve on top of it.
library(scales)
# show percentages on the y-axis
hist.percent <- ggplot(californiatod, aes(x=density)) +
  geom_histogram(aes(y=..density..)) +
  scale_y_continuous(labels=percent) + labs(y="")
hist.percent
# with normal curve superimposed
hist.percent + stat_function(fun=dnorm, args=list(mean=mean(californiatod$density),
sd=sd(californiatod$density)))
There are better ways to do this. One way to determine whether a variable is normally distributed is to use a QQ plot. You can use the geom_qq function to create a quantile-quantile plot evaluating the fit of sample data to a normal distribution. More generally, geom_qq(distribution = ...) accepts the quantile function of any theoretical distribution (the default is qnorm), and its dparams argument takes a list of parameters for that distribution.
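For example, a minimal sketch of comparing density to an exponential distribution instead (the rate value here is purely an illustrative choice):
ggplot(californiatod, aes(sample=density)) +
  geom_qq(distribution = qexp, dparams = list(rate = 1)) +
  stat_qq_line(distribution = qexp, dparams = list(rate = 1))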
Let’s create a qqplot for the density and house value variables:
ggplot(californiatod, aes(sample=density)) +
  geom_qq() + stat_qq_line() + ylab("Density")
ggplot(californiatod, aes(sample=houseval)) +
  geom_qq() + stat_qq_line() + ylab("House value")
Which of these two variables is more likely to be normally distributed?
ggplot2 is a grammar of graphics, a coherent system for describing and building graphs. You can control the coordinate system with the coord_* functions, apply a complete theme such as theme_bw(), and set axis labels with labs(x=..., y=...). You add layers to your graph with +. See how we plot our histogram and add the normal curve on top of it:
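# (..count..)/sum(..count..) rescales bar heights to the proportion of observations in each bin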
hist.percent <- ggplot(californiatod, aes(x=density)) +
geom_histogram(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels=percent) + labs(y="")
hist.percent + stat_function(fun=dnorm, args=list(mean=mean(californiatod$density),
sd=sd(californiatod$density)))
We will now compute a one-sample hypothesis test by hand. We will test whether the mean housing value in the TOD dataset is statistically significantly different from the mean of all housing on the market (assumed to be 250000).
The formula we will use for computing the test statistic is:
\[t = \frac{\bar{x} - \mu}{s/\sqrt{n-1}}\]
xbar <- mean(californiatod$houseval)
mu <- 250000
s <- sd(californiatod$houseval)
n <- nrow(californiatod)
(sample_mean - population_mean)/(sample_sd/(square_root(n-1)))
Replace each term with the correct corresponding numerical value. Note that to designate a square root you can use sqrt().
tstar <- (xbar - mu)/(s/sqrt(n-1))
tstar
## [1] 0.9857827
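# two-tailed p-value: twice the lower-tail area beyond -|t*|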
p <- 2*pt(-abs(tstar), df = n-1)
p
## [1] 0.3336849
Question: What is the null hypothesis for this test? Should you reject the null hypothesis based on the p-value?
t.test(californiatod$houseval, alternative = 'two.sided', mu = 250000, conf.level = .95)
##
## One Sample t-test
##
## data: californiatod$houseval
## t = 1.0053, df = 25, p-value = 0.3244
## alternative hypothesis: true mean is not equal to 250000
## 95 percent confidence interval:
## 216280.3 348029.1
## sample estimates:
## mean of x
## 282154.7
Compare the results to our hand calculation. The values differ slightly because t.test() uses \(s/\sqrt{n}\) for the standard error rather than \(s/\sqrt{n-1}\).
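Recomputing with the \(\sqrt{n}\) denominator reproduces the t statistic reported by t.test() above:
# R's standard-error denominator is sqrt(n), not sqrt(n-1)
(xbar - mu)/(s/sqrt(n))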
The formula R uses for computing the two-sample (Welch) test statistic is:
\[t = \frac{\bar{x}_{1} - \bar{x}_{2}}{\sqrt{\frac{s^2_{1}}{n_{1}} + \frac{s^2_{2}}{n_{2}}}}\]
xbar1 <- with(californiatod, mean(houseval[railtype == "Light rail"]))
xbar2 <- with(californiatod, mean(houseval[railtype == "Heavy or commuter rail"]))
s1 <- with(californiatod, sd(houseval[railtype == "Light rail"]))
s2 <- with(californiatod, sd(houseval[railtype == "Heavy or commuter rail"]))
n1 <- with(californiatod, length(houseval[railtype == "Light rail"]))
n2 <- with(californiatod, length(houseval[railtype == "Heavy or commuter rail"]))
Replace each term with the correct corresponding numerical value. Note that to designate a square root use the sqrt() function.
tstar <- (xbar1 - xbar2)/ sqrt(s1^2/n1 + s2^2/n2)
tstar
## [1] -3.408729
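# conservative by-hand p-value: use the smaller group's size for the df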
p <- 2*pt(-abs(tstar), df = min(n1, n2) - 1)
p
## [1] 0.009240479
Strictly speaking, using df = min(n1, n2) - 1 is only a conservative approximation for our small sample sizes; R's t.test() instead computes the Welch-Satterthwaite degrees of freedom. For now this approximation will suffice, and you can use this same equation for calculating the p-value in Assignment 2.
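For reference, a sketch of the Welch-Satterthwaite approximation that t.test() reports as its df (compare with df = 23.927 in the output below):
# Welch-Satterthwaite degrees of freedom
df_welch <- (s1^2/n1 + s2^2/n2)^2 /
  ((s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1))
df_welch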
The general syntax for the t-test in R is
t.test(var1 ~ var2, data)
In this syntax, var1 is a continuous variable and var2 is a categorical variable.
Execute the test by placing the proper variable names into the generic command (in this case, houseval and railtype):
t.test(houseval ~ railtype, data = californiatod)
##
## Welch Two Sample t-test
##
## data: houseval by railtype
## t = 3.4087, df = 23.927, p-value = 0.002314
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 65647.09 267225.93
## sample estimates:
## mean in group Heavy or commuter rail mean in group Light rail
## 339767.3 173330.8
Can we reject the null hypothesis given these results?
The group-means t-test (like many statistical tests you will learn) produces a p-value: the probability of observing a difference in sample means at least as large as ours if the null hypothesis were true. For a two-tailed test, the typical null hypothesis is that the population means are the same for two different groups. Rejecting the null is the same as concluding that the population means are different.
The null hypothesis is \(H_{0}: diff = 0\); in other words, there is no difference in mean house value by railtype in the population from which the sample is drawn. The alternative hypothesis is that the true difference in means is not equal to 0.
R accepts arguments for alternative hypotheses and confidence levels. The alternative argument can be 'two.sided' for a two-tailed hypothesis (mean1 ≠ mean2), as well as either of two one-tailed hypotheses ('greater' for mean1 > mean2, and 'less' for mean1 < mean2).
By default, t.test() runs a two-tailed test, i.e. the alternative hypothesis is that the difference in means is not equal to zero. Put another way, the alternative hypothesis is that there is a difference in average house value between railtypes. We can switch to a one-sided test by changing the 'alternative' argument; an example follows.
#Default t.test()
t.test(houseval ~ railtype, data = californiatod, alternative = 'two.sided', conf.level = .95)
##
## Welch Two Sample t-test
##
## data: houseval by railtype
## t = 3.4087, df = 23.927, p-value = 0.002314
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 65647.09 267225.93
## sample estimates:
## mean in group Heavy or commuter rail mean in group Light rail
## 339767.3 173330.8
#one sided t.test()
t.test(houseval ~ railtype, data = californiatod, alternative = 'greater', conf.level = .95)
##
## Welch Two Sample t-test
##
## data: houseval by railtype
## t = 3.4087, df = 23.927, p-value = 0.001157
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 82889.75 Inf
## sample estimates:
## mean in group Heavy or commuter rail mean in group Light rail
## 339767.3 173330.8
To get the one-sided p-value from the t distribution by hand, drop the doubling and use the same conservative degrees of freedom as before:
pt(-abs(tstar), df = min(n1, n2) - 1)
The two-tailed test has a p-value of .002 (the one-sided test, .001), and thus we would reject the null hypothesis that the difference between our group means is 0.
Note that R also allows you to compute a paired t-test, appropriate in situations where you have "matched pairs" data available in separate variables. For example, in the Krizek paper we read this week, he tests whether the means of variables differ significantly pre- and post-move (the treatment). You can do a paired t-test in R!
The general form of the syntax is:
t.test(Dataset$var1, Dataset$var2, alternative = 'two.sided', conf.level = .95, paired = TRUE)
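As a minimal sketch with made-up values (pre and post are hypothetical vectors, not variables in californiatod):
# hypothetical pre/post measurements for five matched observations
pre  <- c(10, 12, 9, 14, 11)
post <- c(12, 13, 10, 15, 13)
t.test(pre, post, alternative = 'two.sided', conf.level = .95, paired = TRUE)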
This is the type of t-test you want to conduct for Question 4 in Assignment 2.