---
title: 'Notebook2'
author: 'Alek Popovic, Firas Sada'
date: '2024-04-21'
output: html_document
---
# Predicting Top 100 Status from WAR (Wins Above Replacement)
This notebook applies several regression techniques to predict a baseball player's Top 100 ranking from the Wins Above Replacement (WAR) metric.
## Setup Libraries and Seed
Load the necessary libraries for modeling and set a seed to ensure reproducibility.
```{r}
library(caret)
library(glmnet)
library(rpart)
library(randomForest)
set.seed(1122)
```
## Data Loading and Initial Processing
Load the processed dataset and preview the data.
```{r}
df <- read.csv('dataset/processed_data.csv')
head(df)
```
## Data Preparation
Restrict the dataset to the Top100Status and WinsAboveReplacement columns, drop players not in the Top 100 (Top100Status of 'No'), convert the remaining rank values to integers for use as a regression target, and shuffle the rows.
```{r}
df <- df[, c('Top100Status', 'WinsAboveReplacement')]
df <- df[df$Top100Status != 'No', ]
df$Top100Status <- as.integer(df$Top100Status)
df <- df[sample(nrow(df)), ]
head(df)
```
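A quick sanity check on the prepared target can catch filtering or conversion mistakes early. This sketch assumes Top100Status now holds integer ranks; it summarizes the column and confirms the conversion produced no missing values:

```{r}
# Sanity check: after filtering, ranks should be integers with no NAs
summary(df$Top100Status)
stopifnot(!any(is.na(df$Top100Status)))
```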
## Train-Test Split
Create an 80-20 train-test split to evaluate the performance of the models accurately.
```{r}
index <- createDataPartition(df$Top100Status, p=0.8, list=FALSE)
df_train <- df[index, ]
df_test <- df[-index, ]
```
## Model Training and Evaluation Function
Define a function to fit models using either Linear Regression or Regularized Regression methods (Ridge, Lasso) and calculate the Mean Absolute Error (MAE).
```{r}
fit_and_evaluate <- function(model_formula, df_train, df_test, alpha_value = NULL) {
  if (is.null(alpha_value)) {
    # Fit ordinary linear regression
    model <- lm(model_formula, data = df_train)
    predictions <- predict(model, newdata = df_test)
  } else {
    # Fit ridge (alpha = 0) or lasso (alpha = 1) regression; glmnet
    # requires a numeric design matrix rather than a formula
    model_matrix <- model.matrix(Top100Status ~ ., data = df_train)
    y <- df_train$Top100Status
    model <- glmnet(model_matrix, y, alpha = alpha_value)
    # Find the lambda that minimizes the cross-validation error
    cv_model <- cv.glmnet(model_matrix, y, alpha = alpha_value)
    lambda_optimal <- cv_model$lambda.min
    # Predict on the test set at the selected lambda; glmnet returns a
    # matrix, so flatten it to a vector
    test_matrix <- model.matrix(Top100Status ~ ., data = df_test)
    predictions <- as.vector(predict(model, newx = test_matrix, s = lambda_optimal))
  }
  # Calculate and return the Mean Absolute Error
  mae <- mean(abs(predictions - df_test$Top100Status))
  return(mae)
}
```
## Model Performance Evaluation
Evaluate the Mean Absolute Error (MAE) for Linear Regression, Ridge Regression, and Lasso Regression.
```{r}
# Linear Regression
linear_mae <- fit_and_evaluate(Top100Status ~ ., df_train, df_test)
cat('Linear Regression MAE:', linear_mae, '\n')
# Ridge Regression (alpha = 0 for ridge)
ridge_mae <- fit_and_evaluate(Top100Status ~ ., df_train, df_test, alpha_value = 0)
cat('Ridge Regression MAE:', ridge_mae, '\n')
# Lasso Regression (alpha = 1 for lasso)
lasso_mae <- fit_and_evaluate(Top100Status ~ ., df_train, df_test, alpha_value = 1)
cat('Lasso Regression MAE:', lasso_mae, '\n')
```
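Ridge and lasso are the two endpoints of glmnet's elastic-net family: intermediate values of alpha blend the two penalties. As an illustrative sketch (the alpha = 0.5 value is an arbitrary choice, not part of the original analysis), the same helper can evaluate an elastic-net fit:

```{r}
# Elastic net: alpha = 0.5 mixes the ridge and lasso penalties equally
enet_mae <- fit_and_evaluate(Top100Status ~ ., df_train, df_test, alpha_value = 0.5)
cat('Elastic Net Regression MAE:', enet_mae, '\n')
```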
## Decision Tree Model
Train a decision tree on the training data and evaluate its performance on the test set.
```{r}
tree <- rpart(Top100Status ~ ., data = df_train, method='anova')
# Predict on the test set
predictions <- predict(tree, newdata=df_test)
# Calculate the Mean Absolute Error (MAE)
mae <- mean(abs(predictions - df_test$Top100Status))
cat('Decision Tree Regression MAE:', mae, '\n')
```
## Random Forest Model
Implement a Random Forest model and compute the Mean Absolute Error (MAE) on the test data.
```{r}
forest <- randomForest(Top100Status ~ ., data = df_train, ntree=5)
# Predict on the test set
predictions <- predict(forest, newdata=df_test)
# Calculate the Mean Absolute Error (MAE)
mae <- mean(abs(predictions - df_test$Top100Status))
cat('Random Forest Regression MAE:', mae, '\n')
```
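With only 5 trees, the forest's predictions can be noisy; the randomForest package defaults to ntree = 500. As a hedged comparison (the larger tree count here is the package default, not a tuned value), refitting with more trees shows whether the small ensemble was limiting accuracy:

```{r}
# Refit with the package default of 500 trees for comparison
forest_500 <- randomForest(Top100Status ~ ., data = df_train, ntree = 500)
predictions_500 <- predict(forest_500, newdata = df_test)
cat('Random Forest (500 trees) MAE:',
    mean(abs(predictions_500 - df_test$Top100Status)), '\n')
```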