Commit

Merge pull request #1 from natelangholz/master
Give me the files!
jjghockey authored May 11, 2018
2 parents 5fd94d3 + aa78807 commit bfbcfb0
Showing 21 changed files with 1,532 additions and 7 deletions.
Binary file modified .DS_Store
Binary file not shown.
3 changes: 3 additions & 0 deletions .gitignore
@@ -2,3 +2,6 @@
.Rhistory
.RData
.Ruserdata

/final-project/data_prep.R
/final-project/nyt-comments/
4 changes: 1 addition & 3 deletions README.md
@@ -14,10 +14,8 @@ Problem set due dates will be announced when each problem set is distributed.

Other important deadlines and dates during the term are:

* (Tbd) Exploratory data analysis
* (Tbd) Submit brief (e.g. one-page) progress report
* 6/4/2018 Final model submissions
* 6/6/2018 Final slide sumbmissions
* 6/6/2018 Final slide submissions
* 6/7/2018 Student presentations (Week 10)

## Resources
Binary file added final-project/.DS_Store
Binary file not shown.
13 changes: 13 additions & 0 deletions final-project/README.md
@@ -0,0 +1,13 @@
# Final Project: Kaggle Competition

[Final Project Kaggle page](https://www.kaggle.com/c/mas-412-final-project) is now live.

[https://www.kaggle.com/c/mas-412-final-project](https://www.kaggle.com/c/mas-412-final-project)

## Project Description

The MAS 412 final project is this Kaggle competition: predict the number of upvotes a New York Times article comment will receive. This estimate can serve as a gauge of public opinion. The response variable is `recommendations` in the training/testing `comments` file.
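As a minimal baseline sketch (the file names below and the use of all remaining columns as predictors are assumptions; only the `recommendations` response is named in the description):

```r
# Hedged baseline: a negative binomial regression on the count response.
# "comments_train.csv" / "comments_test.csv" are placeholder file names;
# substitute the actual training/testing `comments` files from Kaggle.
library(MASS)

train <- read.csv("comments_train.csv")
test  <- read.csv("comments_test.csv")

# `recommendations` is the response variable named in the description
fit  <- glm.nb(recommendations ~ ., data = train)
pred <- predict(fit, newdata = test, type = "response")
```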

Submissions for the final project will be open until June 4, 2018 at 11:59 pm.

There is also a presentation component: presentations of 10 minutes each during the final class period.
Binary file added week-2/problem-set-1/.DS_Store
Binary file not shown.
291 changes: 291 additions & 0 deletions week-2/problem-set-1/hw1_possible_sols.Rmd

Large diffs are not rendered by default.

905 changes: 905 additions & 0 deletions week-2/problem-set-1/hw1_possible_sols.html

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions week-4/british-doctors-overdispersion-example.R
@@ -54,3 +54,6 @@ points(aids$Year,aids$Cases,pch=19, cex=0.7)
visreg(fit.nb, scale="response", ylim=c(0, 350), partial=FALSE, ylab="Cases", main="Negative binomial")
points(aids$Year,aids$Cases,pch=19, cex=0.7)


fit <- glm.nb(Cases~Year+I(Year^2), data=aids)

6 changes: 2 additions & 4 deletions week-4/doctor-visits-hurdle-example.R
@@ -35,14 +35,11 @@ summary(fit.hurdle)

sum(predict(fit.hurdle, type = "prob")[,1])

visits1 <- ifelse(nmes$visits >0,1,0)

fit <- glm(visits ~ ., family = 'binomial', data = nmes2)

# First 5 expected counts
predict(fit.hurdle, type = "response")[1:5]


# ratio of non-zero probabilities (1 - type = 'prob' 0 prediction)
predict(fit.hurdle, type = "zero")[1:5]

@@ -61,9 +58,10 @@ hist(round(predict(fit.hurdle, type = "prob"))*predict(fit.hurdle, type = "respo
library(countreg)
rootogram(fit1, max = 80)
rootogram(fit.nb, max = 80)

rootogram(fit.hurdle, max = 80) # fit up to count 80



fit.hurdle.nb <- hurdle(visits ~ ., data = nmes, dist = "negbin")

AIC(fit1)
Binary file added week-5/.DS_Store
Binary file not shown.
21 changes: 21 additions & 0 deletions week-5/README.md
@@ -0,0 +1,21 @@
# Week 5

Predictive modeling and a discussion of data pre-processing steps.

[Slides](https://github.com/natelangholz/stat412-advancedregression/blob/master/week-5/slides-week-5.pdf)


## Readings

Some additional readings. The first is an approach we want to stay away from a bit; the other two are incredibly interesting.

[We can do better...](https://www.smithsonianmag.com/innovation/can-computer-model-predict-first-round-this-years-march-madness-180968461/)

[Statistical Modeling: The Two Cultures - Leo Breiman](https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726)

[The data that transformed AI research - and possibly the world](https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/)


## Problem Set

Problem set 2 is due Monday, April 14, 2018 at 10 pm.
Binary file added week-5/problem-set-2/.DS_Store
Binary file not shown.
File renamed without changes.
71 changes: 71 additions & 0 deletions week-5/problem-set-2/problem-set-2.Rmd
@@ -0,0 +1,71 @@
---
output:
html_document:
highlight: pygments
---



### Risky Behavior
The data `risky_behaviors.dta` is from a randomized experiment that targeted couples at high risk of HIV infection. Counseling sessions were provided to the treatment group regarding practices that could reduce their likelihood of contracting HIV. Couples were randomized either to a control group, a group in which just the woman participated, or a group in which both members of the couple participated. The response variable to be examined after three months was “number of unprotected sex acts.”

```{r}
library(foreign)
rb <- read.dta("http://www.stat.columbia.edu/~gelman/arm/examples/risky.behavior/risky_behaviors.dta", convert.factors=TRUE)
```


### 1
**Estimate**: Model this outcome as a function of treatment assignment using a Poisson regression. Does the model fit well? Is there evidence of overdispersion?
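A possible starting point is sketched below; the variable names (`fupacts` for the follow-up count of unprotected sex acts, `women_alone` and `couples` for the treatment arms) follow the ARM version of this data set and should be checked against `names(rb)`:

```{r, eval = FALSE}
# Poisson fit of the outcome on treatment assignment alone
fit.pois <- glm(fupacts ~ women_alone + couples, family = poisson, data = rb)
summary(fit.pois)

# Crude overdispersion check: Pearson chi-square over residual df.
# A ratio much larger than 1 suggests overdispersion.
sum(residuals(fit.pois, type = "pearson")^2) / df.residual(fit.pois)
```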

### 2
**Estimate Extension**: Extend the model to include pre-treatment measures of the outcome and the additional pre-treatment variables included in the dataset. Does the model fit well? Is there evidence of overdispersion?

### 3
**Overdispersion**: Fit an overdispersed (quasi-)Poisson model. Fit a negative binomial model. Compare these models to the previous two you have fit. Finally, what do you conclude regarding the effectiveness of the intervention?
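A hedged sketch of the two fits (variable names again follow the ARM version of this data set; check with `names(rb)`):

```{r, eval = FALSE}
# Quasi-Poisson and negative binomial fits on the same predictors
library(MASS)

form   <- fupacts ~ women_alone + couples + bupacts + bs_hiv + sex
fit.qp <- glm(form, family = quasipoisson, data = rb)
fit.nb <- glm.nb(form, data = rb)

summary(fit.qp)  # dispersion parameter well above 1 indicates overdispersion
summary(fit.nb)  # theta estimates the extra-Poisson variation
```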

### 4
**Hurdle Model?**: Fit a hurdle model to these data. This is a classic data set for Poisson regression and overdispersion...I'm honestly curious whether the hurdle model makes sense and improves on any of the previous models you have built. Also compare rootograms for all models.
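A hedged sketch mirroring the week-4 doctor-visits example (variable names follow the ARM version of this data set):

```{r, eval = FALSE}
library(pscl)      # hurdle()
library(countreg)  # rootogram(); installed from R-Forge

fit.hurdle <- hurdle(fupacts ~ women_alone + couples + bupacts + bs_hiv + sex,
                     data = rb, dist = "negbin")
summary(fit.hurdle)

# Compare against rootograms of your Poisson and negative binomial fits
rootogram(fit.hurdle, max = 80)
```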


### 5
**Assumptions**: These data include responses from both men and women from the participating couples. Does this give you any concern?


* * *

### Pulling Punches

The two `.Rdata` files under week 4 are an abbreviated set of punch profiles from a boxing system that measures the acceleration of boxers during live fights. The `profiles` list in the first file contains the individual punch profiles; each list item is a three-column data frame of time (in ms, centered on the punch event), acceleration in x (forward-back, in g's), and acceleration in y (side-to-side, in g's). Also attached are some other fields that are of less importance and contribute to this being a somewhat messy data set.

```{r two, eval = FALSE}
load(file = 'week-4/punch_profiles.Rdata')
load(file = 'week-4/punch_types.Rdata')
```

There are 2135 labeled punch profiles, each with a labeled punch type. Use the `punch_types` data frame as ground truth for punch type (labeled 1-6), in addition to the boxer's stance (orthodox or southpaw) and punching hand (right or left). The punch types are below.

```{r}
###### PUNCH TYPES
#1 - Cross
#2 - Hook
#3 - Jab
#4 - Upper Cut
#5 - Overhand (shouldn't be any of these)
#6 - Unknown (shouldn't be any of these)
```


### 6
**Features**: Create at least 10 new features from the punch profiles. They can be combinations of x and y acceleration or individually from either. Explain how these features have been constructed.
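One hedged way to start, assuming each profile is a data frame with columns `time`, `x`, and `y` as described above (the column names themselves are an assumption):

```{r, eval = FALSE}
# Summary features from a single punch profile; returns a named vector
punch_features <- function(p) {
  res <- sqrt(p$x^2 + p$y^2)                # resultant acceleration magnitude
  c(max_x    = max(p$x),       min_x    = min(p$x),
    max_y    = max(p$y),       min_y    = min(p$y),
    peak_res = max(res),
    t_peak   = p$time[which.max(res)],      # timing of the peak (ms)
    auc_x    = sum(abs(p$x)),  auc_y    = sum(abs(p$y)),
    ratio_xy = max(abs(p$x)) / max(abs(p$y)),
    sd_x     = sd(p$x),        sd_y     = sd(p$y))
}

# features <- t(sapply(profiles, punch_features))  # apply to the loaded list
```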

### 7
**Multinomial Model**: Fit a multinomial model to estimate each of the punch types. Which punch types are the most difficult to separate?
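A hedged sketch using `nnet::multinom`, assuming a per-punch feature matrix `features` (as asked for in question 6) and labels in `punch_types$type` (the column name `type` is an assumption):

```{r, eval = FALSE}
library(nnet)

dat <- data.frame(type = factor(punch_types$type), features)
fit.multi <- multinom(type ~ ., data = dat)

# Off-diagonal mass in the confusion matrix shows which types are confused
table(predicted = predict(fit.multi), actual = dat$type)
```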

### 8
**Logistic Regression**: Consider bucketing the punches into two groups (straights and hooks). Are you able to improve accuracy in any way?
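A hedged sketch, treating crosses and jabs (types 1 and 3) as straights and type 2 as hooks; `features` is an assumed per-punch feature matrix and the column name `punch_types$type` is an assumption:

```{r, eval = FALSE}
# Bucket punches: crosses and jabs (types 1, 3) vs hooks (type 2)
dat <- data.frame(type = punch_types$type, features)
dat <- dat[dat$type %in% c(1, 2, 3), ]
dat$straight <- as.integer(dat$type %in% c(1, 3))

fit.bin <- glm(straight ~ . - type, family = binomial, data = dat)

# In-sample accuracy at a 0.5 threshold
mean((predict(fit.bin, type = "response") > 0.5) == dat$straight)
```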





207 changes: 207 additions & 0 deletions week-5/problem-set-2/problem-set-2.html

Large diffs are not rendered by default.

Binary file modified week-5/slides-week-5.key
Binary file not shown.
Binary file added week-5/slides-week-5.pdf
Binary file not shown.
Binary file added week-6/.DS_Store
Binary file not shown.
15 changes: 15 additions & 0 deletions week-6/README.md
@@ -0,0 +1,15 @@
# Week 6

Thinking about overfitting and tree models...maybe

[Slides](https://github.com/natelangholz/stat412-advancedregression/blob/master/week-6/slides-week-6.pdf)

Some additional readings.

[An Empirical Comparison of Supervised Learning Algorithms](https://www.cs.cornell.edu/%7Ecaruana/ctp/ct.papers/caruana.icml06.pdf)

[Interesting visual outlining tree models](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)

Perhaps the best-known predictive modeling exercise is the $1 million Netflix Prize. It is widely known that Netflix never used the winning model because of the engineering costs of its computational complexity, and Netflix also had to settle a multi-million-dollar lawsuit over privacy invasion.

[Netflix never used its $1 Million Algorithm due to engineering costs](https://www.wired.com/2012/04/netflix-prize-costs/)

[The Netflix Prize](https://www.thrillist.com/entertainment/nation/the-netflix-prize)
Binary file added week-6/slides-week-6.pdf
Binary file not shown.
