---
title: "Is an automatic or manual transmission better for MPG?"
output: pdf_document
fontsize: 10pt
geometry: margin=0.5in
---
# Executive summary
Using linear regression models, this report examines whether manual or automatic transmission cars have higher fuel efficiency (mpg), and how large the difference is. The mpg of manual transmission cars is significantly higher, by 7.2 miles per gallon on average. This result is qualified by controlling for weight: the difference between manual and automatic transmission cars is smaller for heavier cars.
# Questions
This report tries to answer two questions, based on the `mtcars` dataset (n=32):

1. Is an automatic or manual transmission better for MPG?
2. How large is the MPG difference between automatic and manual transmissions?
```{r echo=FALSE}
library(datasets)
data(mtcars)
# Create factors for the categorical variables
cars <- mtcars
cars$am <- factor(mtcars$am, levels=c(0,1), labels=c("automatic", "manual"))
cars$cyl <- factor(mtcars$cyl)
# In mtcars, vs = 0 is a V-shaped engine and vs = 1 a straight engine
cars$vs <- factor(mtcars$vs, levels=c(0,1), labels=c("v", "s"))
cars$gear <- factor(mtcars$gear)
cars$carb <- factor(mtcars$carb)
```
# Exploratory analysis
All variables correlate significantly with `mpg`; the absolute correlations range from 0.42 to 0.87 (results omitted for brevity).
```{r echo=FALSE}
# Pearson correlation of mpg with every other variable, plus the test p-values
cors <- cor(mtcars$mpg, mtcars[, -1])
p.values <- sapply(mtcars[, -1], function(x) cor.test(x, mtcars$mpg)$p.value)
#results <- rbind(cors, p.values)
#rownames(results) <- c("corr", "p-value")
#results
```
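The quoted range can be reproduced with a few lines. This is only a sketch; the names `mpg.cors` and `cor.range` are illustrative and not part of the original analysis.

```{r}
data(mtcars)
# Correlation of mpg with each of the ten other columns
mpg.cors <- sapply(mtcars[, -1], function(x) cor(x, mtcars$mpg))
# Smallest and largest correlation in absolute value
cor.range <- range(abs(mpg.cors))
round(cor.range, 2)
```

Taking absolute values matters here because several variables (for example `wt` and `cyl`) are negatively correlated with `mpg`.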
To find variables that should be adjusted for, I used chi-square tests to check the relationship between the categorical variables and `am`. The distributions of `cyl` and `gear` differ depending on `am` (results omitted for brevity).
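```{r echo=FALSE}
# Chi-square test of independence between each categorical variable and am
result <- sapply( c("cyl", "vs", "gear", "carb"), function(cat){ summary( table(cars[,cat], cars$am) )$p.value } )
```
```{r echo=FALSE}
# Welch t-test p-value of each numerical variable between the two am groups
results <- sapply( c("disp", "hp", "drat", "wt", "qsec"), function(var){ t.test( cars[,var] ~ cars$am)$p.value } )
```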
For the numerical variables: `disp`, `drat`, and `wt` differ depending on `am` (based on t-tests, results omitted for brevity).
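As an illustration of the two kinds of test, the sketch below runs one chi-square test and one t-test on the raw `mtcars` columns. The names `cyl.test` and `wt.test` are hypothetical, and `chisq.test()` is used here as a stand-in for the `summary(table(...))` call in the report; both compute a Pearson chi-square test of independence.

```{r}
data(mtcars)
# Categorical example: are cylinder counts distributed equally across transmissions?
# Some cells are small, so chisq.test() warns that the approximation may be poor.
cyl.test <- suppressWarnings(chisq.test(table(mtcars$cyl, mtcars$am)))
# Numerical example: Welch two-sample t-test of weight by transmission type
wt.test <- t.test(wt ~ am, data = mtcars)
c(cyl = cyl.test$p.value, wt = wt.test$p.value)
```

With only 32 cars, the chi-square p-values should be read as rough guidance rather than exact results.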
Figure 1 shows the relationship between `mpg` and `am` for the different values of `cyl` and `gear`.
Figures 2 to 4 show the relationships between `mpg` and `am` depending on `disp`, `drat`, and `wt`.
# Regression results: manual transmission cars have better mpg than automatic transmission cars
The boxplots (see figure 1) suggest that manual transmission cars have higher mpg values than automatic transmission cars. A simple linear regression of `mpg` on `am` confirms this:
```{r echo=FALSE}
fit <- lm( mpg ~ am, data=cars)
simple.fit <- fit
summary(fit)[4]
```
The mean mpg for the **manual transmission cars is 7.2 miles per gallon higher than for the automatic transmission cars**, a significant difference (p=0.0003).
With 95% confidence, the difference between manual and automatic transmission cars is between 3.64 and 10.85 miles per gallon.
```{r echo=FALSE}
# 95% confidence intervals: estimate +/- t quantile * standard error
sumCoef <- summary(fit)$coefficients
t.crit <- qt(.975, df = fit$df.residual)
est.automatic <- sumCoef[1,1] + c(-1, 1) * t.crit * sumCoef[1, 2]
est.diff <- sumCoef[2,1] + c(-1, 1) * t.crit * sumCoef[2, 2]
coef.tab <- rbind(est.automatic, est.diff)
colnames(coef.tab) <- c("lower bound", "upper bound")
rownames(coef.tab) <- c("mean automatic", "manual - automatic")
coef.tab
```
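Because `am` is the only predictor, the regression slope is simply the difference between the two group means, and the interval above is the ordinary t-based interval that `confint()` returns. The following sketch checks both; `cars.ex`, `fit.ex`, `group.means` and `mean.diff` are illustrative names, chosen so that they do not overwrite objects used elsewhere in this report.

```{r}
data(mtcars)
# Hypothetical copy of the data with am recoded as a factor, as in the report
cars.ex <- transform(mtcars, am = factor(am, levels = c(0, 1),
                                         labels = c("automatic", "manual")))
fit.ex <- lm(mpg ~ am, data = cars.ex)
# Group means of mpg per transmission type
group.means <- tapply(cars.ex$mpg, cars.ex$am, mean)
# The coefficient for "manual" equals the difference between the two group means
mean.diff <- unname(group.means["manual"] - group.means["automatic"])
all.equal(unname(coef(fit.ex)["ammanual"]), mean.diff)
# confint() gives the t-based 95% interval for the same coefficient
ci.ex <- confint(fit.ex, "ammanual", level = 0.95)
ci.ex
```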
The residuals of the simple regression seem to be random and normally distributed: see figure 5.
No data point has an overly strong influence:
```{r}
range(hatvalues(fit))
```
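With a single two-level factor as the only predictor, the hat value of each car is one over the size of its transmission group, which is why the leverages can take only two values. A minimal sketch of that check, using the illustrative names `fit.ex`, `group.sizes` and `expected.hat`:

```{r}
data(mtcars)
fit.ex <- lm(mpg ~ factor(am), data = mtcars)
# Number of cars per transmission group (0 = automatic, 1 = manual)
group.sizes <- table(mtcars$am)
# Expected leverage of each car: 1 / size of its own group
expected.hat <- 1 / as.numeric(group.sizes[as.character(mtcars$am)])
all.equal(unname(hatvalues(fit.ex)), expected.hat)
```

Since neither group is very small, no single car can dominate the fit of this simple model.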
Car weight is related to both transmission and mpg. Looking at the three plots relating `disp`, `drat`, and `wt` to `mpg` (see figures 2 to 4), the difference between automatic and manual transmission seems greatest in the case of weight. So let's adjust for weight:
```{r echo=FALSE}
fit.wt <- lm(mpg ~ am + wt, data=cars)
summary(fit.wt)[4]
```
This model explains much more variance than the model that includes transmission alone (75% as opposed to 36%).
Adjusting for weight, the difference between the transmission types disappears: the `am` coefficient is no longer significant.
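The explained-variance figures and the vanishing transmission effect can be checked as follows. This is a sketch with illustrative object names (`cars.ex`, `ex.am`, `ex.am.wt`, `r2.ex`, `p.am.ex`), kept separate from the models fitted in the report.

```{r}
data(mtcars)
cars.ex <- transform(mtcars, am = factor(am, levels = c(0, 1),
                                         labels = c("automatic", "manual")))
ex.am <- lm(mpg ~ am, data = cars.ex)         # transmission only
ex.am.wt <- lm(mpg ~ am + wt, data = cars.ex) # adjusted for weight
# Share of variance explained (R squared) by each model
r2.ex <- c(am = summary(ex.am)$r.squared,
           am.wt = summary(ex.am.wt)$r.squared)
round(r2.ex, 2)
# p-value of the transmission coefficient once weight is in the model
p.am.ex <- summary(ex.am.wt)$coefficients["ammanual", "Pr(>|t|)"]
p.am.ex
```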
Let's include the interaction term:
```{r echo=FALSE}
fit.am.wt <- lm(mpg ~ am*wt, data=cars)
summary(fit.am.wt)[4]
```
Now we see that there is in fact a difference between automatic and manual transmission: the mpg of both manual and automatic transmission cars drops the heavier the car, but the mpg drop is much steeper for manual than for automatic transmission cars.
Concretely:

* If the weight of an automatic transmission car goes up by 1000 lbs, the expected mpg drops by roughly 3.8 miles per gallon.
* If the weight of a manual transmission car goes up by 1000 lbs, the expected mpg drops by roughly 9.1 miles per gallon, more than twice as fast.
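The two slopes follow directly from the coefficients of the interaction model: the `wt` coefficient is the slope for automatic cars, and adding the interaction coefficient gives the slope for manual cars. The sketch below derives both, plus the weight at which the two fitted lines cross. All object names (`cars.ex`, `fit.ex`, `b`, `slope.automatic`, `slope.manual`, `crossover`) are illustrative.

```{r}
data(mtcars)
cars.ex <- transform(mtcars, am = factor(am, levels = c(0, 1),
                                         labels = c("automatic", "manual")))
fit.ex <- lm(mpg ~ am * wt, data = cars.ex)
b <- coef(fit.ex)
# Change in expected mpg per additional 1000 lbs, per transmission type
slope.automatic <- b[["wt"]]
slope.manual <- b[["wt"]] + b[["ammanual:wt"]]
# Weight (in 1000 lbs) at which both fitted lines predict the same mpg
crossover <- -b[["ammanual"]] / b[["ammanual:wt"]]
c(automatic = slope.automatic, manual = slope.manual, crossover = crossover)
```

Below the crossover weight the model predicts higher mpg for manual cars; above it the predicted advantage reverses, although few manual cars in the data are that heavy.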
This model with the interaction explains 83% of the variance (as opposed to 75% for the model without the interaction term).
Based on the relationships established in the exploratory section, `cyl` is related to both `mpg` and transmission. Adding cylinders to the interaction model (`mpg ~ am*wt + cyl`) makes a significant but small difference: it adds only 5% explained variance.
This small additional value of including cylinders is confirmed by an anova between the models.
```{r echo=FALSE}
# Nested models, each adding one term to the previous one
fit1 <- lm(mpg ~ am, data=cars)
fit2 <- update(fit1, . ~ . + wt)
fit3 <- update(fit2, . ~ . + am:wt)
fit4 <- update(fit3, . ~ . + cyl)
an <- anova(fit1, fit2, fit3, fit4)
```
# Conclusions
Using linear regression models, we tried to answer whether manual or automatic transmission cars have higher mpg, and how large the difference is.
A single-variable linear regression shows that, with 95% confidence, the mpg of manual transmission cars is between 3.6 and 10.8 miles per gallon higher.
This result is qualified by controlling for weight: the difference between manual and automatic transmission cars is smaller for heavier cars.
## Limitations
Not included in this report are the influences of the other variables in the dataset. Some exploratory results suggested that they would not greatly change the relationship between transmission and mpg, but there is room for further research.
Applying regression to discrete variables like the number of cylinders may strain the model's normality assumption (which strictly concerns the residuals rather than the predictors), so the results are somewhat questionable. Still, the visual analysis (looking at the plots) corroborates the regression results.
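One way to probe the normality concern more formally is a Shapiro-Wilk test on the residuals of the largest model. This test is not part of the original analysis; it is offered only as a sketch, with the illustrative names `cars.ex`, `fit.ex` and `sw`.

```{r}
data(mtcars)
cars.ex <- transform(mtcars,
                     am = factor(am, levels = c(0, 1),
                                 labels = c("automatic", "manual")),
                     cyl = factor(cyl))
fit.ex <- lm(mpg ~ am * wt + cyl, data = cars.ex)
# Shapiro-Wilk test: a small p-value would indicate non-normal residuals
sw <- shapiro.test(residuals(fit.ex))
sw$p.value
```

With 32 observations the test has limited power, so it complements the residual plots rather than replacing them.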
# Appendix: Figures
Figure 1: Relationship between `mpg` and `am`, depending on `cyl` and `gear`:
```{r echo=FALSE}
par(mfrow=c(1,2))
# One colour per transmission type, used for both the points and the legends
am.cols <- c(automatic='red', manual='green')
plot(cars$cyl, cars$mpg, ylab="mpg", xlab="cylinders")
points(jitter(as.numeric(cars$cyl), factor=0.5), cars$mpg, col=am.cols[as.character(cars$am)], pch=16)
legend(x="topright", legend=names(am.cols), pch=16, col=am.cols, cex=0.8, inset=0.03)
plot(cars$gear, cars$mpg, ylab="mpg", xlab="gears")
points(jitter(as.numeric(cars$gear), factor=0.5), cars$mpg, col=am.cols[as.character(cars$am)], pch=16)
legend(x="topleft", legend=names(am.cols), pch=16, col=am.cols, cex=0.8, inset=0.03)
```
Figure 2: Relationship between `mpg` and `disp` separated by `am`:
```{r echo=FALSE, fig.height=4, fig.width=8}
library(ggplot2)
par(mfrow=c(1, 1))
# Scatter plot with one least-squares line per transmission type
p1 <- ggplot(cars, aes(x=disp, y=mpg, color=am)) +
  geom_point() +
  geom_smooth(method="lm", formula=y~x) +
  labs(title="Regression of mpg on disp", x="disp", y="mpg") +
  theme_bw() + theme(legend.position = c(0.8, 0.8))
print(p1)
```
Figure 3: Relationship between `mpg` and `drat` separated by `am`:
```{r echo=FALSE, fig.height=4, fig.width=8}
# Scatter plot with one least-squares line per transmission type
p2 <- ggplot(cars, aes(x=drat, y=mpg, color=am)) +
  geom_point() +
  geom_smooth(method="lm", formula=y~x) +
  labs(title="Regression of mpg on drat", x="drat", y="mpg") +
  theme_bw() + theme(legend.position = c(0.2, 0.8))
print(p2)
```
Figure 4: Relationship between `mpg` and `wt` separated by `am`:
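```{r echo=FALSE, fig.height=4, fig.width=8}
# Scatter plot with one least-squares line per transmission type
p3 <- ggplot(cars, aes(x=wt, y=mpg, color=am)) +
  geom_point() +
  geom_smooth(method="lm", formula=y~x) +
  labs(title="Regression of mpg on weight", x="wt", y="mpg") +
  theme_bw() + theme(legend.position = c(0.8, 0.8))
# Print the weight plot (the original printed p1 again by mistake)
print(p3)
```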
Figure 5: Residual plot and normal Q-Q plot for the model `mpg ~ am`:
```{r echo=FALSE}
par(mfrow=c(1,2))
plot(simple.fit, which=1)
plot(simple.fit, which=2)
```