---
title: "Case Study 3 Code Sup"
author: "Elliot Pickens & Dean Gladish"
date: "May 20, 2018"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, message = F, warning = F}
library(Sleuth3)
library(dplyr)
library(ggformula)
library(pander)
library(knitr)
library(stargazer)
library(car)
library(gridExtra)
library(broom)
library(ggthemes)
library(MASS)
library(leaps)
library(GGally)
library(effects)
```
After loading the necessary libraries, we first need to take a look at the data:
```{r, message = F, warning = F}
nes <- read.csv("http://aloy.rbind.io/data/NES.csv")
head(nes)
summary(nes)
sum(is.na(nes))
```
After inspecting the data and getting a sense of what our data set looks like (which features are numeric, which are binary, etc.), we began to select our model.
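To confirm the variable types before modeling, one quick check (a minimal sketch; `str(nes)` would work equally well) is to report each column's class:
```{r, message = F, warning = F}
# Class of every column: tells us which predictors are numeric
# and which are categorical before we build the model.
sapply(nes, class)
```
The following explains our derivation of a model: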
```{r, message = F, warning = F}
# The following code generates our baseline model for dem.
glm.base <- glm(dem ~ gender + region + union + income + educ + year + race + age, data = nes, family = binomial)
summary(glm.base)
# The intercept-only (null) model serves as the lower bound of the
# search space for stepwise selection.
glm.basic <- glm(dem ~ 1, data = nes, family = binomial)
```
After fitting a couple of initial models, we used BIC to perform stepwise selection of a final model. We started by searching over the untransformed variables alone, then checked whether adding squared or interaction terms would produce a better model; none of these more complex terms ended up in the final model. Interestingly, region was not included in our final model.
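Note that `stepAIC()` performs BIC-based selection when its penalty argument is set to `k = log(n)`: for a model with $p$ estimated parameters fit to $n$ observations,

$$\mathrm{BIC} = -2\log L + p \log n,$$

whereas the default `k = 2` recovers AIC.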
```{r, message = F, warning = F}
# Stepwise search in both directions, starting from the null model;
# k = log(n) makes stepAIC() use the BIC penalty rather than AIC.
stpFwd <- stepAIC(glm.basic,
                  scope = list(lower = ~1,
                               upper = ~ year + region + union + income + educ + gender + race + age),
                  direction = "both", k = log(nrow(nes)))
summary(stpFwd)
# The same search, starting instead from the full main-effects model.
stpBk <- stepAIC(glm.base,
                 scope = list(lower = ~1,
                              upper = ~ year + region + union + income + educ + gender + race + age),
                 direction = "both", k = log(nrow(nes)))
summary(stpBk)
# Squared terms must be written I(x^2); I(x)^2 reduces to x in formula
# algebra. Only the quantitative predictors are squared, since squaring a
# binary or categorical variable adds nothing (this assumes year, income,
# educ, and age are coded numerically). Note that glm() takes no k
# argument; the BIC penalty belongs to stepAIC().
glm.square <- glm(dem ~ year + region + union + income + educ + gender + race + age +
                    I(year^2) + I(income^2) + I(educ^2) + I(age^2),
                  data = nes, family = binomial)
glm.inter <- glm(dem ~ year + region + union + income + educ + gender + race + age +
                   age * year + age * region + age * union + age * income +
                   age * educ + age * gender + age * race,
                 data = nes, family = binomial)
summary(glm.inter)
stp.inter <- stepAIC(glm.inter,
                     scope = list(lower = ~1,
                                  upper = ~ year + region + union + income + educ + gender + race + age +
                                    age * year + age * region + age * union + age * income +
                                    age * educ + age * gender + age * race),
                     direction = "both", k = log(nrow(nes)))
# Refit with only the interactions the search retained, then search again.
glm.inter <- glm(dem ~ year + region + union + income + educ + gender + race + age +
                   age * union + age * income + age * educ + age * race,
                 data = nes, family = binomial)
stp.inter <- stepAIC(glm.inter,
                     scope = list(lower = ~1,
                                  upper = ~ year + region + union + income + educ + gender + race + age +
                                    age * year + age * region + age * union + age * income +
                                    age * educ + age * gender + age * race),
                     direction = "both", k = log(nrow(nes)))
summary(stp.inter)
stp.inter.fwd <- stepAIC(glm.basic,
                         scope = list(lower = ~1,
                                      upper = ~ year + region + union + income + educ + gender + race + age +
                                        age * year + age * region + age * union + age * income +
                                        age * educ + age * gender + age * race),
                         direction = "both", k = log(nrow(nes)))
summary(stp.inter.fwd)
```
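As a cross-check on the stepwise searches, we can also compare the candidate fits by BIC directly (a quick sanity check, assuming all of the models above fit without error):
```{r, message = F, warning = F}
# Lower BIC is better; this should agree with the stepwise results.
BIC(glm.base, stpFwd, stpBk, stp.inter, stp.inter.fwd)
```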
```{r, message = F, warning = F}
# To investigate the association between each predictor and party
# preference, we plot the fitted effect of every term in the final model.
plot(allEffects(stp.inter.fwd), rows = 2, cols = 3, type = "link",
ylab = "Log(Odds of Democratic Party Support)")
```
We can see from the plots that males, white respondents, older respondents, and higher-income respondents have lower odds of supporting the Democratic Party, while union members have higher odds.
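These effect plots are on the log-odds (link) scale. The probability plots that follow convert a linear predictor value $\eta$ to the response scale via the inverse logit,

$$\Pr(\text{dem} = 1) = \frac{e^{\eta}}{1 + e^{\eta}},$$

which is what `type = "response"` requests.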
For further analysis of the probability that any given individual supports the Democrats, we can use the following code:
```{r, message = F, warning = F}
plot(Effect(c("gender", "union"), stp.inter.fwd), multiline = TRUE, type = "response", ylab = "Probability(Democrat)")
plot(Effect(c("income", "gender"), stp.inter.fwd), multiline = TRUE, type = "response", ylab = "Probability(Democrat)")
plot(Effect(c("race", "gender"), stp.inter.fwd), multiline = TRUE, type = "response", ylab = "Probability(Democrat)")
plot(Effect(c("race", "age"), stp.inter.fwd), multiline = TRUE, type = "response", ylab = "Probability(Democrat)")
```
These graphs clarify the connections between gender, union membership, income, age, race, and support for the Democratic Party.
They show that, according to our model, having a higher income, being white, being male, being older, and not belonging to a union each increase the chance that a person is not a Democrat, while a person with the opposite attributes is likely a Democrat.
Now we need to assess the significance of these effects, irrespective of year, using Wald tests on the individual coefficients.
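For a single coefficient, the Wald statistic compares the estimate to its standard error,

$$W = \left( \frac{\hat{\beta}_j}{\widehat{\mathrm{SE}}(\hat{\beta}_j)} \right)^2 \sim \chi^2_1 \quad \text{under } H_0\colon \beta_j = 0,$$

which is exactly what the loop below computes for each coefficient: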
```{r, message = F, warning = F}
for (i in 2:8) {
  # Wald chi-square statistic for the i-th coefficient: (estimate / SE)^2.
  coefficient <- coef(stp.inter.fwd)[i]
  standardError <- sqrt(vcov(stp.inter.fwd)[i, i])
  waldStat <- (coefficient / standardError)^2
  # Upper-tail chi-square(1) p-value for H0: beta_i = 0.
  print(1 - pchisq(waldStat, df = 1))
}
```
Based on these p-values, we can reject the null hypothesis that a coefficient is zero wherever its p-value is small. In particular, unionyes and gendermale have clearly significant effects at the $\alpha = 0.05$ level.
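The same conclusions can be read directly from the coefficient table, since the Wald statistic above is just the square of the reported z value (a quick cross-check of the loop):
```{r, message = F, warning = F}
# Squaring the "z value" column reproduces the Wald statistics above.
summary(stp.inter.fwd)$coefficients
```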
If we also want to test whether the predictors beyond union are necessary, a drop-in-deviance test comparing a union-only model to the full main-effects model
```{r, message = F, warning = F}
union_only <- glm(dem ~ union, family = binomial, data = nes)
# Drop-in-deviance test: union-only model vs. full main-effects model.
anova(union_only, glm.base, test = "Chisq")
```
shows that the remaining predictors are necessary: we reject the null hypothesis that their coefficients are all zero.
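The statistic behind this test is the drop in deviance between the reduced and full models,

$$G = D_{\text{reduced}} - D_{\text{full}} \sim \chi^2_{\Delta \mathrm{df}},$$

where $\Delta \mathrm{df}$ is the number of additional parameters in the full model.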
```{r, message = F, warning = F}
# Index plot of the working residuals from the final model.
plot(stp.inter.fwd$residuals)
```
The residual plot shows no gross systematic pattern, suggesting that our model generally fits the data.
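Because the response is binary, raw residual plots can be hard to read; a binned residual plot, which averages residuals within bins of fitted values, is often more informative. A minimal sketch, assuming the arm package is available (it is not loaded above, so the chunk is left unevaluated):
```{r, message = F, warning = F, eval = FALSE}
library(arm)
# Average response-scale residuals within bins of fitted probabilities;
# most points should fall within the +/- 2 SE bands if the model fits.
binnedplot(fitted(stp.inter.fwd),
           residuals(stp.inter.fwd, type = "response"))
```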
```{r, message = F, warning = F}
# Leverage (hat values) and Cook's distance, to flag influential points.
infIndexPlot(stp.inter.fwd, vars = c("hat", "Cook"))
# Sequential drop-in-deviance tests for each term in the model.
anova(stp.inter.fwd, test = "Chisq")
# Normal Q-Q plot of the working residuals.
qqnorm(stp.inter.fwd$residuals)
```
The p-values indicate that all of the variables retained in the model are significant.
Although the Q-Q plot suggests the residuals are not normally distributed, this is expected for a logistic regression, where residuals need not be normal; the loess smoothers on the deviance residual plots below indicate that the model fits the data adequately:
```{r, message = F, warning = F}
residualPlots(stp.inter.fwd, tests = false, type = "response")
residualPlots(stp.inter.fwd, tests = false, type = "pearson")
residualPlots(stp.inter.fwd, tests = false, type = "deviance")
```