Commit

Merge pull request #1 from natelangholz/master
Give me the files!
jjghockey authored May 11, 2018
2 parents 5fd94d3 + aa78807 commit bfbcfb0
Showing 21 changed files with 1,532 additions and 7 deletions.
Binary file modified .DS_Store
Binary file not shown.
3 changes: 3 additions & 0 deletions .gitignore
@@ -2,3 +2,6 @@
.Rhistory
.RData
.Ruserdata

/final-project/data_prep.R
/final-project/nyt-comments/
4 changes: 1 addition & 3 deletions README.md
@@ -14,10 +14,8 @@ Problem set due dates will be announced when each problem set is distributed.

Other important deadlines and dates during the term are:

* (Tbd) Exploratory data analysis
* (Tbd) Submit brief (e.g. one-page) progress report
* 6/4/2018 Final model submissions
* 6/6/2018 Final slide sumbmissions
* 6/6/2018 Final slide submissions
* 6/7/2018 Student presentations (Week 10)

## Resources
Binary file added final-project/.DS_Store
Binary file not shown.
13 changes: 13 additions & 0 deletions final-project/README.md
@@ -0,0 +1,13 @@
# Final Project: Kaggle Competition

[Final Project Kaggle page](https://www.kaggle.com/c/mas-412-final-project) is now live.

[https://www.kaggle.com/c/mas-412-final-project](https://www.kaggle.com/c/mas-412-final-project)

## Project Description

The MAS 412 final project is this Kaggle competition: predict the number of upvotes a New York Times article comment will receive. This estimate can serve as a gauge of public opinion. The response variable is `recommendations` in the training/testing `comments` file.
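As a minimal baseline sketch (the file names below and the use of all remaining columns as predictors are assumptions; only the `recommendations` response is named in the description):

```r
# Hedged baseline: a negative binomial regression on the count response.
# "comments_train.csv" / "comments_test.csv" are placeholder file names;
# substitute the actual training/testing `comments` files from Kaggle.
library(MASS)

train <- read.csv("comments_train.csv")
test  <- read.csv("comments_test.csv")

# `recommendations` is the response variable named in the description
fit  <- glm.nb(recommendations ~ ., data = train)
pred <- predict(fit, newdata = test, type = "response")
```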

Submissions for the final project will be open until June 4, 2018 at 11:59 pm.

There is also a presentation component: presentations of 10 minutes each during the final class period.
Binary file added week-2/problem-set-1/.DS_Store
Binary file not shown.
291 changes: 291 additions & 0 deletions week-2/problem-set-1/hw1_possible_sols.Rmd

Large diffs are not rendered by default.

905 changes: 905 additions & 0 deletions week-2/problem-set-1/hw1_possible_sols.html

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions week-4/british-doctors-overdispersion-example.R
@@ -54,3 +54,6 @@ points(aids$Year,aids$Cases,pch=19, cex=0.7)
visreg(fit.nb, scale="response", ylim=c(0, 350), partial=FALSE, ylab="Cases", main="Negative binomial")
points(aids$Year,aids$Cases,pch=19, cex=0.7)


fit <- glm.nb(Cases~Year+I(Year^2), data=aids)

6 changes: 2 additions & 4 deletions week-4/doctor-visits-hurdle-example.R
@@ -35,14 +35,11 @@ summary(fit.hurdle)

sum(predict(fit.hurdle, type = "prob")[,1])

visits1 <- ifelse(nmes$visits >0,1,0)

fit <- glm(visits ~ ., family = 'binomial', data = nmes2)

# First 5 expected counts
predict(fit.hurdle, type = "response")[1:5]


# ratio of non-zero probabilities (1 - type = 'prob' 0 prediction)
predict(fit.hurdle, type = "zero")[1:5]

@@ -61,9 +58,10 @@ hist(round(predict(fit.hurdle, type = "prob"))*predict(fit.hurdle, type = "respo
library(countreg)
rootogram(fit1, max = 80)
rootogram(fit.nb, max = 80)

rootogram(fit.hurdle, max = 80) # fit up to count 80



fit.hurdle.nb <- hurdle(visits ~ ., data = nmes, dist = "negbin")

AIC(fit1)
Binary file added week-5/.DS_Store
Binary file not shown.
21 changes: 21 additions & 0 deletions week-5/README.md
@@ -0,0 +1,21 @@
# Week 5

Predictive modeling and a discussion of data pre-processing steps.

[Slides](https://github.com/natelangholz/stat412-advancedregression/blob/master/week-5/slides-week-5.pdf)


## Readings

Some additional readings. The first is an approach we want to stay away from a bit; the other two are incredibly interesting.

[We can do better...](https://www.smithsonianmag.com/innovation/can-computer-model-predict-first-round-this-years-march-madness-180968461/)

[Statistical Modeling: The Two Cultures - Leo Breiman](https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726)

[The data that transformed AI research - and possibly the world](https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/)


## Problem Set

Problem set 2 is due Monday, April 14, 2018 at 10 pm.
Binary file added week-5/problem-set-2/.DS_Store
Binary file not shown.
File renamed without changes.
71 changes: 71 additions & 0 deletions week-5/problem-set-2/problem-set-2.Rmd
@@ -0,0 +1,71 @@
---
output:
html_document:
highlight: pygments
---



### Risky Behavior
The data `risky_behaviors.dta` is from a randomized experiment that targeted couples at high risk of HIV infection. Counseling sessions were provided to the treatment group regarding practices that could reduce their likelihood of contracting HIV. Couples were randomized either to a control group, a group in which just the woman participated, or a group in which both members of the couple participated. The response variable to be examined after three months was “number of unprotected sex acts.”

```{r}
library(foreign)
rb <- read.dta("http://www.stat.columbia.edu/~gelman/arm/examples/risky.behavior/risky_behaviors.dta", convert.factors=TRUE)
```


### 1
**Estimate**: Model this outcome as a function of treatment assignment using a Poisson regression. Does the model fit well? Is there evidence of overdispersion?
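A possible starting point is sketched below; the variable names (`fupacts` for the follow-up count of unprotected sex acts, `women_alone` and `couples` for the treatment arms) follow the ARM version of this data set and should be checked against `names(rb)`:

```{r, eval = FALSE}
# Poisson fit of the outcome on treatment assignment alone
fit.pois <- glm(fupacts ~ women_alone + couples, family = poisson, data = rb)
summary(fit.pois)

# Crude overdispersion check: Pearson chi-square over residual df.
# A ratio much larger than 1 suggests overdispersion.
sum(residuals(fit.pois, type = "pearson")^2) / df.residual(fit.pois)
```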

### 2
**Estimate Extension**: Extend the model to include pre-treatment measures of the outcome and the additional pre-treatment variables included in the dataset. Does the model fit well? Is there evidence of overdispersion?

### 3
**Overdispersion**: Fit an overdispersed (quasi-)Poisson model. Fit a negative binomial model. Compare these models to the previous two you have fit. Finally, what do you conclude regarding the effectiveness of the intervention?
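A hedged sketch of the two fits (variable names again follow the ARM version of this data set; check with `names(rb)`):

```{r, eval = FALSE}
# Quasi-Poisson and negative binomial fits on the same predictors
library(MASS)

form   <- fupacts ~ women_alone + couples + bupacts + bs_hiv + sex
fit.qp <- glm(form, family = quasipoisson, data = rb)
fit.nb <- glm.nb(form, data = rb)

summary(fit.qp)  # dispersion parameter well above 1 indicates overdispersion
summary(fit.nb)  # theta estimates the extra-Poisson variation
```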

### 4
**Hurdle Model?**: Fit a hurdle model to these data. This is a classic data set for Poisson regression and overdispersion...I'm honestly curious whether the hurdle model makes sense and improves on any of the previous models you have built. Also compare rootograms for all models.
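A hedged sketch mirroring the week-4 doctor-visits example (variable names follow the ARM version of this data set):

```{r, eval = FALSE}
library(pscl)      # hurdle()
library(countreg)  # rootogram(); installed from R-Forge

fit.hurdle <- hurdle(fupacts ~ women_alone + couples + bupacts + bs_hiv + sex,
                     data = rb, dist = "negbin")
summary(fit.hurdle)

# Compare against rootograms of your Poisson and negative binomial fits
rootogram(fit.hurdle, max = 80)
```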


### 5
**Assumptions**: These data include responses from both men and women from the participating couples. Does this give you any concern?


* * *

### Pulling Punches

The two `.Rdata` files under week 4 are an abbreviated set of punch profiles from a boxing system that measures the acceleration of boxers during live fights. The `profiles` list in the first file contains the individual punch profiles; each list item is a three-column data frame of time (in ms, centered on the punch event), acceleration in x (forward-back, in g's), and acceleration in y (side-to-side, in g's). Also attached are some other fields that are of less importance and contribute to this being a somewhat messy data set.

```{r two, eval = FALSE}
load(file = 'week-4/punch_profiles.Rdata')
load(file = 'week-4/punch_types.Rdata')
```

There are 2135 labeled punch profiles, each with a labeled punch type. Use the `punch_types` data frame as ground truth for punch type (labeled 1-6), in addition to the boxer's stance (orthodox or southpaw) and punching hand (right or left). The punch types are below.

```{r}
###### PUNCH TYPES
#1 - Cross
#2 - Hook
#3 - Jab
#4 - Upper Cut
#5 - Overhand (shouldn't be any of these)
#6 - Unknown (shouldn't be any of these)
```


### 6
**Features**: Create at least 10 new features from the punch profiles. They can be combinations of x and y acceleration or individually from either. Explain how these features have been constructed.
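One hedged way to start, assuming each profile is a data frame with columns `time`, `x`, and `y` as described above (the column names themselves are an assumption):

```{r, eval = FALSE}
# Summary features from a single punch profile; returns a named vector
punch_features <- function(p) {
  res <- sqrt(p$x^2 + p$y^2)                # resultant acceleration magnitude
  c(max_x    = max(p$x),       min_x    = min(p$x),
    max_y    = max(p$y),       min_y    = min(p$y),
    peak_res = max(res),
    t_peak   = p$time[which.max(res)],      # timing of the peak (ms)
    auc_x    = sum(abs(p$x)),  auc_y    = sum(abs(p$y)),
    ratio_xy = max(abs(p$x)) / max(abs(p$y)),
    sd_x     = sd(p$x),        sd_y     = sd(p$y))
}

# features <- t(sapply(profiles, punch_features))  # apply to the loaded list
```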

### 7
**Multinomial Model**: Fit a multinomial model to estimate each of the punch types. Which punch types are the most difficult to separate?
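A hedged sketch using `nnet::multinom`, assuming a per-punch feature matrix `features` (as asked for in question 6) and labels in `punch_types$type` (the column name `type` is an assumption):

```{r, eval = FALSE}
library(nnet)

dat <- data.frame(type = factor(punch_types$type), features)
fit.multi <- multinom(type ~ ., data = dat)

# Off-diagonal mass in the confusion matrix shows which types are confused
table(predicted = predict(fit.multi), actual = dat$type)
```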

### 8
**Logistic Regression**: Consider bucketing the punches into two groups (straights and hooks). Are you able to improve accuracy in any way?
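A hedged sketch, treating crosses and jabs (types 1 and 3) as straights and type 2 as hooks; `features` is an assumed per-punch feature matrix and the column name `punch_types$type` is an assumption:

```{r, eval = FALSE}
# Bucket punches: crosses and jabs (types 1, 3) vs hooks (type 2)
dat <- data.frame(type = punch_types$type, features)
dat <- dat[dat$type %in% c(1, 2, 3), ]
dat$straight <- as.integer(dat$type %in% c(1, 3))

fit.bin <- glm(straight ~ . - type, family = binomial, data = dat)

# In-sample accuracy at a 0.5 threshold
mean((predict(fit.bin, type = "response") > 0.5) == dat$straight)
```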





207 changes: 207 additions & 0 deletions week-5/problem-set-2/problem-set-2.html

Large diffs are not rendered by default.

Binary file modified week-5/slides-week-5.key
Binary file not shown.
Binary file added week-5/slides-week-5.pdf
Binary file not shown.
Binary file added week-6/.DS_Store
Binary file not shown.
15 changes: 15 additions & 0 deletions week-6/README.md
@@ -0,0 +1,15 @@
# Week 6

Thinking about overfitting and tree models...maybe

[Slides](https://github.com/natelangholz/stat412-advancedregression/blob/master/week-6/slides-week-6.pdf)

Some additional readings.

[An Empirical Comparison of Supervised Learning Algorithms](https://www.cs.cornell.edu/%7Ecaruana/ctp/ct.papers/caruana.icml06.pdf)

[Interesting visual outlining tree models](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)

Perhaps the best-known predictive modeling exercise is the $1 million Netflix Prize. It is widely known that Netflix never used the winning model because of the engineering costs of its computational complexity, and Netflix also had to settle a multi-million-dollar lawsuit over privacy invasion.

[Netflix never used its $1 Million Algorithm due to engineering costs](https://www.wired.com/2012/04/netflix-prize-costs/)

[The Netflix Prize](https://www.thrillist.com/entertainment/nation/the-netflix-prize)
Binary file added week-6/slides-week-6.pdf
Binary file not shown.
