---
title: "Case Study 3 Code Sup"
author: "Elliot Pickens & Dean Gladish"
date: "May 20, 2018"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, message = F, warning = F}
library(Sleuth3)
library(dplyr)
library(ggformula)
library(pander)
library(knitr)
library(stargazer)
library(car)
library(gridExtra)
library(broom)
library(ggthemes)
library(MASS)
library(leaps)
library(GGally)
library(effects)
```
After loading the necessary libraries, we first need to take a look at the data:
```{r, message = F, warning = F}
nes <- read.csv("http://aloy.rbind.io/data/NES.csv")
head(nes)
summary(nes)
sum(is.na(nes))
```
After inspecting the data and getting a sense of what our data set looks like (which features are numeric, which are binary, etc.), we began to select our model.
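To confirm the variable types before modeling, one quick check (a minimal sketch; `str(nes)` would work equally well) is to report each column's class:
```{r, message = F, warning = F}
# Class of every column: tells us which predictors are numeric
# and which are categorical before we build the model.
sapply(nes, class)
```
The following explains our derivation of a model: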
```{r, message = F, warning = F}
# The following code generates our baseline model for dem.
glm.base <- glm(dem ~ gender + region + union + income + educ + year + race + age, data = nes, family = binomial)
summary(glm.base)
# The intercept-only (null) model serves as the lower bound of the
# search space for stepwise selection.
glm.basic <- glm(dem ~ 1, data = nes, family = binomial)
```
After fitting a couple of initial models, we used BIC to perform stepwise selection of a final model. We started by searching over the untransformed variables alone, then checked whether adding squared or interaction terms would produce a better model; none of these more complex terms ended up in the final model. Interestingly, region was not included in our final model.
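Note that `stepAIC()` performs BIC-based selection when its penalty argument is set to `k = log(n)`: for a model with $p$ estimated parameters fit to $n$ observations,

$$\mathrm{BIC} = -2\log L + p \log n,$$

whereas the default `k = 2` recovers AIC.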
```{r, message = F, warning = F}
# Stepwise search in both directions, starting from the null model;
# k = log(n) makes stepAIC() use the BIC penalty rather than AIC.
stpFwd <- stepAIC(glm.basic,
                  scope = list(lower = ~1,
                               upper = ~ year + region + union + income + educ + gender + race + age),
                  direction = "both", k = log(nrow(nes)))
summary(stpFwd)
# The same search, starting instead from the full main-effects model.
stpBk <- stepAIC(glm.base,
                 scope = list(lower = ~1,
                              upper = ~ year + region + union + income + educ + gender + race + age),
                 direction = "both", k = log(nrow(nes)))
summary(stpBk)
# Squared terms must be written I(x^2); I(x)^2 reduces to x in formula
# algebra. Only the quantitative predictors are squared, since squaring a
# binary or categorical variable adds nothing (this assumes year, income,
# educ, and age are coded numerically). Note that glm() takes no k
# argument; the BIC penalty belongs to stepAIC().
glm.square <- glm(dem ~ year + region + union + income + educ + gender + race + age +
                    I(year^2) + I(income^2) + I(educ^2) + I(age^2),
                  data = nes, family = binomial)
glm.inter <- glm(dem ~ year + region + union + income + educ + gender + race + age +
                   age * year + age * region + age * union + age * income +
                   age * educ + age * gender + age * race,
                 data = nes, family = binomial)
summary(glm.inter)
stp.inter <- stepAIC(glm.inter,
                     scope = list(lower = ~1,
                                  upper = ~ year + region + union + income + educ + gender + race + age +
                                    age * year + age * region + age * union + age * income +
                                    age * educ + age * gender + age * race),
                     direction = "both", k = log(nrow(nes)))
# Refit with only the interactions the search retained, then search again.
glm.inter <- glm(dem ~ year + region + union + income + educ + gender + race + age +
                   age * union + age * income + age * educ + age * race,
                 data = nes, family = binomial)
stp.inter <- stepAIC(glm.inter,
                     scope = list(lower = ~1,
                                  upper = ~ year + region + union + income + educ + gender + race + age +
                                    age * year + age * region + age * union + age * income +
                                    age * educ + age * gender + age * race),
                     direction = "both", k = log(nrow(nes)))
summary(stp.inter)
stp.inter.fwd <- stepAIC(glm.basic,
                         scope = list(lower = ~1,
                                      upper = ~ year + region + union + income + educ + gender + race + age +
                                        age * year + age * region + age * union + age * income +
                                        age * educ + age * gender + age * race),
                         direction = "both", k = log(nrow(nes)))
summary(stp.inter.fwd)
```
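As a cross-check on the stepwise searches, we can also compare the candidate fits by BIC directly (a quick sanity check, assuming all of the models above fit without error):
```{r, message = F, warning = F}
# Lower BIC is better; this should agree with the stepwise results.
BIC(glm.base, stpFwd, stpBk, stp.inter, stp.inter.fwd)
```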
```{r, message = F, warning = F}
# To investigate the association between each predictor and party
# preference, we plot the fitted effect of every term in the final model.
plot(allEffects(stp.inter.fwd), rows = 2, cols = 3, type = "link",
ylab = "Log(Odds of Democratic Party Support)")
```
We can see from the plots that males, white respondents, older respondents, and higher-income respondents have lower odds of supporting the Democratic Party, while union members have higher odds.
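These effect plots are on the log-odds (link) scale. The probability plots that follow convert a linear predictor value $\eta$ to the response scale via the inverse logit,

$$\Pr(\text{dem} = 1) = \frac{e^{\eta}}{1 + e^{\eta}},$$

which is what `type = "response"` requests.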
For further analysis of the probability that any given individual supports the Democrats, we can use the following code:
```{r, message = F, warning = F}
plot(Effect(c("gender", "union"), stp.inter.fwd), multiline = TRUE, type = "response", ylab = "Probability(Democrat)")
plot(Effect(c("income", "gender"), stp.inter.fwd), multiline = TRUE, type = "response", ylab = "Probability(Democrat)")
plot(Effect(c("race", "gender"), stp.inter.fwd), multiline = TRUE, type = "response", ylab = "Probability(Democrat)")
plot(Effect(c("race", "age"), stp.inter.fwd), multiline = TRUE, type = "response", ylab = "Probability(Democrat)")
```
These graphs clarify the connections between gender, union membership, income, age, race, and support for the Democratic Party.
They show that, according to our model, having a higher income, being white, being male, being older, and not belonging to a union each increase the chance that a person is not a Democrat, while a person with the opposite attributes is likely a Democrat.
Now we need to assess the significance of these effects, irrespective of year, using Wald tests on the individual coefficients.
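For a single coefficient, the Wald statistic compares the estimate to its standard error,

$$W = \left( \frac{\hat{\beta}_j}{\widehat{\mathrm{SE}}(\hat{\beta}_j)} \right)^2 \sim \chi^2_1 \quad \text{under } H_0\colon \beta_j = 0,$$

which is exactly what the loop below computes for each coefficient: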
```{r, message = F, warning = F}
for (i in 2:8) {
  # Wald chi-square statistic for the i-th coefficient: (estimate / SE)^2.
  coefficient <- coef(stp.inter.fwd)[i]
  standardError <- sqrt(vcov(stp.inter.fwd)[i, i])
  waldStat <- (coefficient / standardError)^2
  # Upper-tail chi-square(1) p-value for H0: beta_i = 0.
  print(1 - pchisq(waldStat, df = 1))
}
```
Based on these p-values, we can reject the null hypothesis that a coefficient is zero wherever its p-value is small. In particular, unionyes and gendermale have clearly significant effects at the $\alpha = 0.05$ level.
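The same conclusions can be read directly from the coefficient table, since the Wald statistic above is just the square of the reported z value (a quick cross-check of the loop):
```{r, message = F, warning = F}
# Squaring the "z value" column reproduces the Wald statistics above.
summary(stp.inter.fwd)$coefficients
```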
If we also want to test whether the predictors beyond union are necessary, a drop-in-deviance test comparing a union-only model to the full main-effects model
```{r, message = F, warning = F}
union_only <- glm(dem ~ union, family = binomial, data = nes)
# Drop-in-deviance test: union-only model vs. full main-effects model.
anova(union_only, glm.base, test = "Chisq")
```
shows that the remaining predictors are necessary: we reject the null hypothesis that their coefficients are all zero.
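The statistic behind this test is the drop in deviance between the reduced and full models,

$$G = D_{\text{reduced}} - D_{\text{full}} \sim \chi^2_{\Delta \mathrm{df}},$$

where $\Delta \mathrm{df}$ is the number of additional parameters in the full model.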
```{r, message = F, warning = F}
# Index plot of the working residuals from the final model.
plot(stp.inter.fwd$residuals)
```
The residual plot shows no gross systematic pattern, suggesting that our model generally fits the data.
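Because the response is binary, raw residual plots can be hard to read; a binned residual plot, which averages residuals within bins of fitted values, is often more informative. A minimal sketch, assuming the arm package is available (it is not loaded above, so the chunk is left unevaluated):
```{r, message = F, warning = F, eval = FALSE}
library(arm)
# Average response-scale residuals within bins of fitted probabilities;
# most points should fall within the +/- 2 SE bands if the model fits.
binnedplot(fitted(stp.inter.fwd),
           residuals(stp.inter.fwd, type = "response"))
```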
```{r, message = F, warning = F}
# Leverage (hat values) and Cook's distance, to flag influential points.
infIndexPlot(stp.inter.fwd, vars = c("hat", "Cook"))
# Sequential drop-in-deviance tests for each term in the model.
anova(stp.inter.fwd, test = "Chisq")
# Normal Q-Q plot of the working residuals.
qqnorm(stp.inter.fwd$residuals)
```
The p-values indicate that all of the variables retained in the model are significant.
Although the Q-Q plot suggests the residuals are not normally distributed, this is expected for a logistic regression, where residuals need not be normal; the loess smoothers on the deviance residual plots below indicate that the model fits the data adequately:
```{r, message = F, warning = F}
residualPlots(stp.inter.fwd, tests = false, type = "response")
residualPlots(stp.inter.fwd, tests = false, type = "pearson")
residualPlots(stp.inter.fwd, tests = false, type = "deviance")
```