---
title: 'Notebook1'
author: 'Alek Popovic, Firas Sada'
date: '2024-04-21'
output: html_document
---
# Predicting WAR (Wins Above Replacement) from Player Stats
This notebook outlines various statistical methods to predict the Wins Above Replacement (WAR) statistic for baseball players based on their performance statistics.
## Setup Libraries and Seed
Load the necessary libraries and set a seed for reproducibility of results.
```{r}
library(caret)
library(glmnet)
library(rpart)
library(randomForest)
set.seed(1122)
```
## Data Loading
Load the processed dataset and display the first few rows.
```{r}
df <- read.csv('dataset/processed_data.csv')
head(df)
```
## Data Preprocessing
Remove the 'Top100Status' column and shuffle the dataset to randomize the row order.
```{r}
# Remove the 'Top100Status' column
df <- df[, !colnames(df) %in% 'Top100Status']
# Shuffle the rows of 'df'
df <- df[sample(nrow(df)), ]
head(df)
```
## Train-Test Split
Create an 80-20 train-test split for model training and evaluation.
```{r}
# Create an 80-20 train-test split
index <- createDataPartition(df$WinsAboveReplacement, p=0.8, list=FALSE)
df_train <- df[index, ]
df_test <- df[-index, ]
```
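As a quick sanity check on the partition above, the two subsets should cover the full data set and the training fraction should sit near 0.8 (`createDataPartition` rounds to whole rows, so it will not be exact):

```r
# Verify the 80-20 split produced by createDataPartition
cat('Train rows:', nrow(df_train), '\n')
cat('Test rows: ', nrow(df_test), '\n')
cat('Train fraction:', round(nrow(df_train) / nrow(df), 3), '\n')
```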
## Model Training and Evaluation
Define a function to fit models and calculate Mean Absolute Error (MAE).
```{r}
fit_and_evaluate <- function(model_formula, df_train, df_test, alpha_value = NULL) {
  if (is.null(alpha_value)) {
    # Fit linear regression
    model <- lm(model_formula, data = df_train)
    predictions <- predict(model, newdata = df_test)
  } else {
    # Fit ridge (alpha = 0) or lasso (alpha = 1) regression.
    # Drop the intercept column from model.matrix; glmnet fits its own intercept.
    model_matrix <- model.matrix(WinsAboveReplacement ~ ., data = df_train)[, -1]
    y <- df_train$WinsAboveReplacement
    # Find the lambda that minimizes the cross-validation error
    cv_model <- cv.glmnet(model_matrix, y, alpha = alpha_value)
    lambda_optimal <- cv_model$lambda.min
    # Predict on the test set at the optimal lambda (glmnet returns a
    # one-column matrix, so convert to a plain vector)
    test_matrix <- model.matrix(WinsAboveReplacement ~ ., data = df_test)[, -1]
    predictions <- as.vector(predict(cv_model, newx = test_matrix, s = lambda_optimal))
  }
  # Calculate and return the Mean Absolute Error
  mae <- mean(abs(predictions - df_test$WinsAboveReplacement))
  return(mae)
}
```
## Model Performance
Fit linear, ridge, and lasso regression and compare their Mean Absolute Error on the held-out test set.
```{r}
# Linear Regression
linear_mae <- fit_and_evaluate(WinsAboveReplacement ~ ., df_train, df_test)
cat('Linear Regression MAE:', linear_mae, '\n')
# Ridge Regression (alpha = 0 for ridge)
ridge_mae <- fit_and_evaluate(WinsAboveReplacement ~ ., df_train, df_test, alpha_value = 0)
cat('Ridge Regression MAE:', ridge_mae, '\n')
# Lasso Regression (alpha = 1 for lasso)
lasso_mae <- fit_and_evaluate(WinsAboveReplacement ~ ., df_train, df_test, alpha_value = 1)
cat('Lasso Regression MAE:', lasso_mae, '\n')
```
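To make the three MAE values easier to compare at a glance, they can be collected into a single table sorted from best to worst (this reuses the `linear_mae`, `ridge_mae`, and `lasso_mae` objects computed above):

```r
# Gather the regression results into one data frame, best model first
results <- data.frame(
  Model = c('Linear', 'Ridge', 'Lasso'),
  MAE   = c(linear_mae, ridge_mae, lasso_mae)
)
results[order(results$MAE), ]
```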
## Decision Tree Model
Train a decision tree and evaluate its performance.
```{r}
tree <- rpart(WinsAboveReplacement ~ ., data = df_train, method='anova')
# Predict on the test set
predictions <- predict(tree, newdata=df_test)
# Calculate the Mean Absolute Error (MAE)
mae <- mean(abs(predictions - df_test$WinsAboveReplacement))
cat('Decision Tree Regression MAE:', mae, '\n')
```
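`rpart` already cross-validates internally while growing the tree, so a common refinement (not part of the original evaluation above) is to prune back to the complexity parameter with the lowest cross-validated error and re-check the test MAE:

```r
# Prune at the cp value with the lowest cross-validated error (xerror)
cp_optimal <- tree$cptable[which.min(tree$cptable[, 'xerror']), 'CP']
pruned <- prune(tree, cp = cp_optimal)
# Re-evaluate the pruned tree on the test set
pruned_predictions <- predict(pruned, newdata = df_test)
pruned_mae <- mean(abs(pruned_predictions - df_test$WinsAboveReplacement))
cat('Pruned Decision Tree MAE:', pruned_mae, '\n')
```

Pruning rarely hurts test error and guards against the full tree overfitting the training split.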
## Random Forest Model
Train a random forest model with a small number of trees and assess its performance.
```{r}
forest <- randomForest(WinsAboveReplacement ~ ., data = df_train, ntree=5)
# Predict on the test set
predictions <- predict(forest, newdata=df_test)
# Calculate the Mean Absolute Error (MAE)
mae <- mean(abs(predictions - df_test$WinsAboveReplacement))
cat('Random Forest Regression MAE:', mae, '\n')
```
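Five trees is a very small forest; predictions stabilize as `ntree` grows, and the `randomForest` package defaults to 500. One possible next step, sketched here, is to refit with the default tree count and `importance = TRUE`, then inspect which player statistics drive the WAR prediction:

```r
# Refit with more trees and per-predictor importance tracking
forest_full <- randomForest(WinsAboveReplacement ~ ., data = df_train,
                            ntree = 500, importance = TRUE)
# MAE on the test set for the larger forest
full_predictions <- predict(forest_full, newdata = df_test)
cat('Random Forest (500 trees) MAE:',
    mean(abs(full_predictions - df_test$WinsAboveReplacement)), '\n')
# Rank predictors by their contribution to accuracy and node purity
varImpPlot(forest_full)
```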