Commit
Merge pull request #1 from natelangholz/master
Give me the files!
Showing 21 changed files with 1,532 additions and 7 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,3 +2,6 @@ | |
.Rhistory | ||
.RData | ||
.Ruserdata | ||
|
||
/final-project/data_prep.R | ||
/final-project/nyt-comments/ |
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# Final Project: Kaggle Competition | ||
|
||
[Final Project Kaggle page](https://www.kaggle.com/c/mas-412-final-project) is now live. | ||
|
||
[https://www.kaggle.com/c/mas-412-final-project](https://www.kaggle.com/c/mas-412-final-project) | ||
|
||
## Project Description | ||
|
||
The MAS 412 final project is this Kaggle competition to predict the number of upvotes a New York Times article comment will receive. These predictions can serve as a gauge of public opinion. The response variable is `recommendations` in the training/testing `comments` file. | ||
|
||
Submissions for the final project will be open until June 4, 2018 at 11:59 pm. | ||
|
||
There is also a presentation component, which will take place during the final class period, with 10 minutes per presentation. |
Binary file not shown.
Large diffs are not rendered by default.
Large diffs are not rendered by default.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
# Week 5 | ||
|
||
Predictive modeling and talking about data pre-processing steps. | ||
|
||
[Slides](https://github.com/natelangholz/stat412-advancedregression/blob/master/week-5/slides-week-5.pdf) | ||
|
||
|
||
## Readings | ||
|
||
Some additional readings. The first is something we want to stay away from a bit, and the other two are incredibly interesting. | ||
|
||
[We can do better...](https://www.smithsonianmag.com/innovation/can-computer-model-predict-first-round-this-years-march-madness-180968461/) | ||
|
||
[Statistical Modeling: The Two Cultures - Leo Breiman](https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726) | ||
|
||
[The data that transformed AI research - and possibly the world](https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/) | ||
|
||
|
||
## Problem Set | ||
|
||
Problem set 2 is due Monday April 14, 2018 at 10 pm. |
Binary file not shown.
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
--- | ||
output: | ||
html_document: | ||
highlight: pygments | ||
--- | ||
|
||
|
||
|
||
### Risky Behavior | ||
The data `risky_behaviors.dta` is from a randomized experiment that targeted couples at high risk of HIV infection. Counseling sessions were provided to the treatment group regarding practices that could reduce their likelihood of contracting HIV. Couples were randomized either to a control group, a group in which just the woman participated, or a group in which both members of the couple participated. The response variable to be examined after three months was “number of unprotected sex acts.” | ||
|
||
```{r} | ||
library(foreign) | ||
rb <- read.dta("http://www.stat.columbia.edu/~gelman/arm/examples/risky.behavior/risky_behaviors.dta", convert.factors=TRUE) | ||
``` | ||
|
||
|
||
### 1 | ||
**Estimate**: Model this outcome as a function of treatment assignment using a Poisson regression. Does the model fit well? Is there evidence of overdispersion? | ||
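
For reference, one common way to check for overdispersion (a sketch, not necessarily the approach the assignment intends) is the Pearson dispersion estimate. The symbols are introduced here for illustration: $y_i$ is the observed count, $\hat{\mu}_i$ the fitted Poisson mean, $n$ the number of observations, and $p$ the number of estimated coefficients.

$$
\hat{\phi} = \frac{1}{n - p} \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}
$$

Under a well-fitting Poisson model, $\hat{\phi}$ should be close to 1; values well above 1 suggest overdispersion.
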
|
||
### 2 | ||
**Estimate Extension**: Extend the model to include pre-treatment measures of the outcome and the additional pre-treatment variables included in the dataset. Does the model fit well? Is there evidence of overdispersion? | ||
|
||
### 3 | ||
**Overdispersion**: Fit an overdispersed (quasi-)Poisson model. Fit a negative binomial model. Compare these models to the previous two you have fit. Finally, what do you conclude regarding the effectiveness of the intervention? | ||
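
As a reminder of how these models differ, the three variance functions can be written side by side. The notation is an assumption for illustration: $\mu_i$ is the conditional mean, $\phi$ the quasi-Poisson dispersion parameter, and $\theta$ the negative binomial size parameter (in the common "NB2" parameterization).

$$
\begin{aligned}
\text{Poisson:} \quad & \operatorname{Var}(y_i) = \mu_i \\
\text{Quasi-Poisson:} \quad & \operatorname{Var}(y_i) = \phi \, \mu_i \\
\text{Negative binomial:} \quad & \operatorname{Var}(y_i) = \mu_i + \frac{\mu_i^2}{\theta}
\end{aligned}
$$

The quasi-Poisson variance grows linearly in the mean while the negative binomial variance grows quadratically, which is one reason the two can give different standard errors on the same data.
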
|
||
### 4 | ||
**Hurdle Model?**: Fit a hurdle model to these data. This is a classic data set for Poisson regression and overdispersion... I'm honestly curious whether the hurdle model makes sense and improves on any of the previous models you have built. Also compare rootograms for all of the models. | ||
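
For orientation, a hurdle model with a Poisson count component can be sketched as follows. The notation is assumed for illustration: $\pi_i$ is the probability of crossing the hurdle (a nonzero count) and $\mu_i$ is the mean of the untruncated Poisson.

$$
P(Y_i = y) =
\begin{cases}
1 - \pi_i, & y = 0 \\[4pt]
\pi_i \, \dfrac{e^{-\mu_i} \mu_i^{y} / y!}{1 - e^{-\mu_i}}, & y = 1, 2, \ldots
\end{cases}
$$

The zero part and the positive-count part are modeled separately, so the hurdle model is mainly helpful when the zeros appear to come from a different process than the positive counts.
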
|
||
|
||
### 5 | ||
**Assumptions**: These data include responses from both men and women from the participating couples. Does this give you any concern? | ||
|
||
|
||
* * * | ||
|
||
### Pulling Punches | ||
|
||
The two `.Rdata` files under week 4 contain an abbreviated set of punch profiles from a boxing system that measures the acceleration of boxers during live fights. The `profiles` list from the first file below holds the individual punch profiles; each list item is a 3-column data frame of time (in ms around the middle of the punch event), acceleration in x (forward-back, in g's), and acceleration in y (side-to-side, in g's). Some other, less important fields are also attached and contribute to this being a somewhat messy data set. | ||
|
||
```{r two, eval = FALSE} | ||
load(file = 'week-4/punch_profiles.Rdata') | ||
load(file = 'week-4/punch_types.Rdata') | ||
``` | ||
|
||
There are 2135 labeled punch profiles, each with a labeled punch type. Use the `punch_types` data frame as ground truth for punch type (labeled 1-6), the boxer's stance (orthodox or southpaw), and the punching hand (right or left). The punch types are listed below. | ||
|
||
```{r} | ||
###### PUNCH TYPES | ||
#1 - Cross | ||
#2 - Hook | ||
#3 - Jab | ||
#4 - Upper Cut | ||
#5 - Overhand (shouldn't be any of these) | ||
#6 - Unknown (shouldn't be any of these) | ||
``` | ||
|
||
|
||
### 6 | ||
**Features**: Create at least 10 new features from the punch profiles. They can be combinations of x and y acceleration or individually from either. Explain how these features have been constructed. | ||
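
As a starting point only, here is a minimal sketch of how per-punch summary features might be computed in base R. The helper `punch_features` and the column names `time`, `x`, and `y` are assumptions made for illustration (check `names(profiles[[1]])` for the real names), and the profile below is simulated rather than taken from the course data.

```r
# Hypothetical helper: summarise one punch profile (a 3-column data frame).
# Column names `time`, `x`, and `y` are assumed, not taken from the real files.
punch_features <- function(p) {
  mag <- sqrt(p$x^2 + p$y^2)  # combined acceleration magnitude
  c(
    peak_x     = max(p$x),                  # largest forward acceleration
    min_x      = min(p$x),                  # largest backward acceleration
    range_y    = diff(range(p$y)),          # side-to-side spread
    mean_x     = mean(p$x),
    sd_x       = sd(p$x),
    sd_y       = sd(p$y),
    peak_mag   = max(mag),                  # peak combined acceleration
    t_peak_mag = p$time[which.max(mag)],    # time (ms) at which the peak occurs
    # trapezoidal area under the x-acceleration curve
    area_x     = sum(diff(p$time) * (head(p$x, -1) + tail(p$x, -1)) / 2),
    cor_xy     = cor(p$x, p$y)              # linear association between axes
  )
}

# Simulated stand-in for a single profile (not real punch data)
set.seed(412)
fake_profile <- data.frame(
  time = seq(-50, 50, by = 1),
  x    = rnorm(101),
  y    = rnorm(101)
)
feats <- punch_features(fake_profile)
print(round(feats, 3))
```

If every element of `profiles` has this shape, something like `t(sapply(profiles, punch_features))` would give one row of features per punch; with the messy extra fields mentioned above, some cleaning may be needed first.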
|
||
### 7 | ||
**Multinomial Model**: Fit a multinomial model to estimate each of the punch types. Which of the punch types have the most difficulty in being separated? | ||
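
For reference, the usual multinomial logit form is sketched below. The notation is assumed for illustration: $x_i$ is the feature vector for punch $i$, $\beta_k$ the coefficient vector for punch type $k$, and $K$ the number of punch types actually present, with one class fixed as the reference ($\beta_1 = 0$).

$$
P(Y_i = k \mid x_i) = \frac{\exp(x_i^\top \beta_k)}{\sum_{j=1}^{K} \exp(x_i^\top \beta_j)}, \qquad k = 1, \ldots, K
$$

A confusion matrix of predicted against labeled punch types is a natural way to see which pairs of types are hardest to separate.
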
|
||
### 8 | ||
**Logistic Regression**: Consider bucketing the punches into two groups (straights and hooks). Are you able to improve accuracy in any way? | ||
|
||
|
||
|
||
|
||
|
Large diffs are not rendered by default.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
Thinking about overfitting and tree models...maybe | ||
|
||
[Slides](https://github.com/natelangholz/stat412-advancedregression/blob/master/week-6/slides-week-6.pdf) | ||
|
||
Some additional readings. | ||
|
||
[An Empirical Comparison of Supervised Learning Algorithms](https://www.cs.cornell.edu/%7Ecaruana/ctp/ct.papers/caruana.icml06.pdf) | ||
|
||
[Interesting visual outlining tree models](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/) | ||
|
||
Maybe the best-known predictive modeling exercise is the $1 million Netflix Prize. Netflix famously never used the winning model because of the engineering costs of its computational complexity, and it also had to settle a multi-million dollar lawsuit for privacy invasion. | ||
|
||
[Netflix never used its $1 Million Algorithm due to engineering costs](https://www.wired.com/2012/04/netflix-prize-costs/) | ||
|
||
[The Netflix Prize](https://www.thrillist.com/entertainment/nation/the-netflix-prize) |
Binary file not shown.