This project analyzes and identifies draft biases in the Major League Baseball (MLB) Amateur Draft. The draft has been held since 1965 and has in the past run as long as 100 rounds (today's version has 20).
Beyond identifying these biases, we sought to determine whether a player's level of success in MLB can be predicted. Our measure of success was FanGraphs Wins Above Replacement (fWAR). Our predictors consisted entirely of player demographics and physical characteristics.
- Peter D. DePaul III
  - Data Collection
  - Data Cleaning
  - EDA and Visualizations
  - Model Creation
  - Final Report
- Anish Ravilla
  - Final Report
- Robin Lee
  - EDA and Visualizations
  - Final Report
- Alan Wong
- Kevin Kim
- Hongye Zhang
To find our dictionary of variables, click below:
Our data collection was performed using baseballr (R) and pybaseball (Python). These processes can be found below:
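The snippet below is not the project's collection script. It is only a minimal sketch of how draft results could be pulled with pybaseball, assuming its `amateur_draft(year, draft_round)` function is the relevant entry point. The helper names `fetch_round` and `collect_rounds` are hypothetical and do not come from this repository.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def fetch_round(year: int, draft_round: int) -> Any:
    """Fetch one draft round with pybaseball (requires network access).

    Assumption: amateur_draft is the pybaseball function of interest;
    the project's own scripts may use different calls.
    """
    # Imported lazily so this module loads even without pybaseball installed.
    from pybaseball import amateur_draft

    return amateur_draft(year, draft_round)


def collect_rounds(
    years: Iterable[int],
    rounds: Iterable[int],
    fetch: Callable[[int, int], Any] = fetch_round,
) -> Dict[Tuple[int, int], Any]:
    """Call `fetch` once per (year, round) pair and key the results by that pair."""
    rounds = list(rounds)  # allow reuse across years even if given a generator
    results: Dict[Tuple[int, int], Any] = {}
    for year in years:
        for draft_round in rounds:
            results[(year, draft_round)] = fetch(year, draft_round)
    return results
```

Calling `collect_rounds([2019], [1, 2])` would request the first two rounds of the 2019 draft. Passing a different `fetch` callable makes it possible to try the loop without touching the network.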
Our data cleaning was performed in R using several packages (primarily those in tidyverse).
The raw data we collected and the data files we used to build our models are all stored in the file linked below:
To read the report on our findings, click the link below:
bookdown
- Used to generate the report using the bookdown syntax
ggplot2
- Used to create the visualizations and EDA in the report
gridExtra
corrplot
data.table
- Used to reduce the memory footprint of our data objects and cut processing time
tidyverse
- Used for the data cleaning process
tidymodels
- Used to create the boosted decision tree prediction model
xgboost
- Used as the engine for the boosted decision tree prediction model
doParallel
- Used for parallel processing while tuning the model hyperparameters
vip
- Used to create the variable importance plot for the model
caret
- Used for the training process of the model
Boruta
- Used to confirm feature selection importance
kableExtra
- Used to create LaTeX-formatted tables within the report
maps
mapdata
mapproj
reshape2