---
title: 'Notebook2'
author: 'Alek Popovic, Firas Sada'
date: '2024-04-21'
output: html_document
---
# Predicting Top 100 Status from WAR (Wins Above Replacement)
This notebook applies several regression techniques to predict a baseball player's Top 100 ranking from the Wins Above Replacement (WAR) metric.
## Setup Libraries and Seed
Load the necessary libraries for modeling and set a seed to ensure reproducibility.
```{r}
library(caret)
library(glmnet)
library(rpart)
library(randomForest)
set.seed(1122)
```
## Data Loading and Initial Processing
Load the processed dataset and preview the data.
```{r}
df <- read.csv('dataset/processed_data.csv')
head(df)
```
## Data Preparation
Restrict the dataset to the Top100Status and WinsAboveReplacement columns, drop players not in the Top 100 (Top100Status of 'No'), convert the remaining rank values to integers for use as a regression target, and shuffle the rows.
```{r}
df <- df[, c('Top100Status', 'WinsAboveReplacement')]
df <- df[df$Top100Status != 'No', ]
df$Top100Status <- as.integer(df$Top100Status)
df <- df[sample(nrow(df)), ]
head(df)
```
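A quick sanity check on the prepared target can catch filtering or conversion mistakes early. This sketch assumes Top100Status now holds integer ranks; it summarizes the column and confirms the conversion produced no missing values:

```{r}
# Sanity check: after filtering, ranks should be integers with no NAs
summary(df$Top100Status)
stopifnot(!any(is.na(df$Top100Status)))
```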
## Train-Test Split
Create an 80-20 train-test split to evaluate the performance of the models accurately.
```{r}
index <- createDataPartition(df$Top100Status, p=0.8, list=FALSE)
df_train <- df[index, ]
df_test <- df[-index, ]
```
## Model Training and Evaluation Function
Define a function to fit models using either Linear Regression or Regularized Regression methods (Ridge, Lasso) and calculate the Mean Absolute Error (MAE).
```{r}
fit_and_evaluate <- function(model_formula, df_train, df_test, alpha_value = NULL) {
  if (is.null(alpha_value)) {
    # Fit ordinary linear regression
    model <- lm(model_formula, data = df_train)
    predictions <- predict(model, newdata = df_test)
  } else {
    # Fit ridge (alpha = 0) or lasso (alpha = 1) regression; glmnet
    # requires a numeric design matrix rather than a formula
    model_matrix <- model.matrix(Top100Status ~ ., data = df_train)
    y <- df_train$Top100Status
    model <- glmnet(model_matrix, y, alpha = alpha_value)
    # Find the lambda that minimizes the cross-validation error
    cv_model <- cv.glmnet(model_matrix, y, alpha = alpha_value)
    lambda_optimal <- cv_model$lambda.min
    # Predict on the test set at the selected lambda; glmnet returns a
    # matrix, so flatten it to a vector
    test_matrix <- model.matrix(Top100Status ~ ., data = df_test)
    predictions <- as.vector(predict(model, newx = test_matrix, s = lambda_optimal))
  }
  # Calculate and return the Mean Absolute Error
  mae <- mean(abs(predictions - df_test$Top100Status))
  return(mae)
}
```
## Model Performance Evaluation
Evaluate the Mean Absolute Error (MAE) for Linear Regression, Ridge Regression, and Lasso Regression.
```{r}
# Linear Regression
linear_mae <- fit_and_evaluate(Top100Status ~ ., df_train, df_test)
cat('Linear Regression MAE:', linear_mae, '\n')
# Ridge Regression (alpha = 0 for ridge)
ridge_mae <- fit_and_evaluate(Top100Status ~ ., df_train, df_test, alpha_value = 0)
cat('Ridge Regression MAE:', ridge_mae, '\n')
# Lasso Regression (alpha = 1 for lasso)
lasso_mae <- fit_and_evaluate(Top100Status ~ ., df_train, df_test, alpha_value = 1)
cat('Lasso Regression MAE:', lasso_mae, '\n')
```
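Ridge and lasso are the two endpoints of glmnet's elastic-net family: intermediate values of alpha blend the two penalties. As an illustrative sketch (the alpha = 0.5 value is an arbitrary choice, not part of the original analysis), the same helper can evaluate an elastic-net fit:

```{r}
# Elastic net: alpha = 0.5 mixes the ridge and lasso penalties equally
enet_mae <- fit_and_evaluate(Top100Status ~ ., df_train, df_test, alpha_value = 0.5)
cat('Elastic Net Regression MAE:', enet_mae, '\n')
```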
## Decision Tree Model
Train a decision tree on the training data and evaluate its performance on the test set.
```{r}
tree <- rpart(Top100Status ~ ., data = df_train, method='anova')
# Predict on the test set
predictions <- predict(tree, newdata=df_test)
# Calculate the Mean Absolute Error (MAE)
mae <- mean(abs(predictions - df_test$Top100Status))
cat('Decision Tree Regression MAE:', mae, '\n')
```
## Random Forest Model
Implement a Random Forest model and compute the Mean Absolute Error (MAE) on the test data.
```{r}
forest <- randomForest(Top100Status ~ ., data = df_train, ntree=5)
# Predict on the test set
predictions <- predict(forest, newdata=df_test)
# Calculate the Mean Absolute Error (MAE)
mae <- mean(abs(predictions - df_test$Top100Status))
cat('Random Forest Regression MAE:', mae, '\n')
```
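With only 5 trees, the forest's predictions can be noisy; the randomForest package defaults to ntree = 500. As a hedged comparison (the larger tree count here is the package default, not a tuned value), refitting with more trees shows whether the small ensemble was limiting accuracy:

```{r}
# Refit with the package default of 500 trees for comparison
forest_500 <- randomForest(Top100Status ~ ., data = df_train, ntree = 500)
predictions_500 <- predict(forest_500, newdata = df_test)
cat('Random Forest (500 trees) MAE:',
    mean(abs(predictions_500 - df_test$Top100Status)), '\n')
```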