The goal of this project was to use machine learning to predict the outcomes of sumo wrestling bouts in Japanese Sumo Grand Tournaments.
Most of the project focused on data collection and exploration. On the prediction side, a simple logistic regression model using a few key features predicts bout outcomes in a grand tournament with about 61% accuracy. That is better than random chance, but not enough of a margin to beat the betting markets. There is still plenty of room to improve the predictions.
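As a rough illustration of what "a simple logistic regression model" on a few features looks like, here is a minimal, dependency-free sketch. It is not the code in machine_learning.py: the two features (`rank_diff`, `h2h_diff`) are hypothetical stand-ins, the bouts are synthetic, and the accuracy it prints says nothing about the 61% figure above.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(X, w, b):
    # Predict a win (1) when the modelled win probability is at least 0.5.
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
        for xi in X
    ]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def make_bouts(n):
    """Generate synthetic bouts; both features are hypothetical."""
    X, y = [], []
    for _ in range(n):
        rank_diff = random.gauss(0, 1)  # assumed: standardized rank difference
        h2h_diff = random.gauss(0, 1)   # assumed: standardized head-to-head record
        p_win = sigmoid(0.8 * rank_diff + 0.5 * h2h_diff)
        X.append([rank_diff, h2h_diff])
        y.append(1 if random.random() < p_win else 0)
    return X, y


random.seed(0)
X_train, y_train = make_bouts(800)
X_test, y_test = make_bouts(400)

w, b = fit_logistic(X_train, y_train)
test_acc = accuracy(y_test, predict(X_test, w, b))
print(f"weights={w}, bias={b:.3f}, held-out accuracy={test_acc:.2f}")
```

Evaluating on bouts the model has not seen, as in the last lines, is what makes an accuracy figure comparable to random chance.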
data/ : contains data collected from an online, public database using the Beautiful Soup library, stored as pickled pandas DataFrames.
plots/ : contains Seaborn visualizations of the data, saved as PNG files.
tourneys/ : contains daily tourney head-to-head lineups for March (Haru) Basho 2017.
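Since the files in data/ are pickled pandas DataFrames, they can be read back with `pandas.read_pickle`. The sketch below shows the round trip on a throwaway table; the file name and column names are hypothetical, not the ones used in this repository.

```python
import os
import tempfile

import pandas as pd

# Stand-in for a scraped table; the columns are hypothetical.
rikishi = pd.DataFrame(
    {"shikona": ["Rikishi A", "Rikishi B"], "height_cm": [185.0, 192.0]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rikishi_example.pkl")  # hypothetical file name
    rikishi.to_pickle(path)          # how a DataFrame ends up as a pickle file
    loaded = pd.read_pickle(path)    # how to load it again for analysis

print(loaded.head())
```

Pickle files can execute code when loaded, so only unpickle files from a source you trust.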
Machine Learning Files:
- machine_learning.py : Script for doing various machine learning tasks, including model evaluation & predicting outcomes of new bouts.
- ml_fxns.py : Module with helper functions for various machine learning tasks. Could include data pre-processing tasks, prediction tasks, etc.
Data Scraping Files:
- rikishi_scrape.py : Module with functions used for scraping data with Beautiful Soup. Functions are used when scraping data from multiple HTML pages/sumo wrestlers.
- scrape_multiple_h2h.py : Script to scrape head-to-head data for multiple sumo wrestlers.
- scrape_multiple_rikishi.py : Script to scrape basic profile data for multiple sumo wrestlers.
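To give a feel for the scraping step, the sketch below parses an HTML table with Beautiful Soup and turns it into a DataFrame. It is only an illustration: the inline HTML, the `bouts` class name, and the `parse_bout_table` helper are invented for this example and do not reflect the real pages or the functions in rikishi_scrape.py.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline HTML stands in for a downloaded page; the table layout is invented.
html = """
<table class="bouts">
  <tr><th>Basho</th><th>Opponent</th><th>Result</th></tr>
  <tr><td>2017.03</td><td>Rikishi B</td><td>win</td></tr>
  <tr><td>2017.01</td><td>Rikishi B</td><td>loss</td></tr>
</table>
"""


def parse_bout_table(page_html):
    """Hypothetical helper: extract one table into a DataFrame."""
    soup = BeautifulSoup(page_html, "html.parser")
    table = soup.find("table", class_="bouts")
    rows = table.find_all("tr")
    # The first row holds the column names, the rest hold the data.
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(records, columns=header)


bouts = parse_bout_table(html)
print(bouts)
```

Scraping several wrestlers then amounts to calling a helper like this once per page and concatenating the resulting DataFrames.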
Data Preparation Files:
- data_extraction.py : Module with helper functions for processing information extracted from HTML tags using scraping libraries (e.g. Beautiful Soup).
- database_ops.py : Module with helper functions to perform various operations with DataFrames.
- feature_generation.py : Script to generate a DataFrame containing feature data and labels.
- filter_duplicates.py : Script to filter out the duplicate rows in the raw head-to-head DataFrame generated by the feature generation script.
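A minimal sketch of duplicate filtering on a head-to-head table, using pandas. The column names are hypothetical, and the second step rests on an assumption that is not confirmed by this README: that the same bout can appear twice, once from each wrestler's point of view. It does not reproduce the logic in filter_duplicates.py.

```python
import pandas as pd

# Hypothetical raw head-to-head rows; column names are illustrative.
raw = pd.DataFrame(
    {
        "basho": ["2017.03", "2017.03", "2017.03", "2017.01"],
        "day": [1, 1, 1, 5],
        "rikishi": ["A", "A", "B", "A"],
        "opponent": ["B", "B", "A", "C"],
        "rikishi_won": [1, 1, 0, 1],
    }
)

# Step 1: drop rows that are exact copies of an earlier row.
exact = raw.drop_duplicates().reset_index(drop=True)

# Step 2 (assumption): A-vs-B and B-vs-A on the same day are one bout.
# Build an order-independent key from the two wrestler names.
pair = exact.apply(
    lambda r: "|".join(sorted([r["rikishi"], r["opponent"]])), axis=1
)
mirrored = exact.assign(pair=pair).duplicated(subset=["basho", "day", "pair"])
unique_bouts = exact[~mirrored].reset_index(drop=True)

print(unique_bouts)
```

Keeping only one row per bout matters for modelling, because mirrored rows would otherwise let the same bout appear in both the training and the evaluation data.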
ml_playground.ipynb : notebook for playing around with various machine learning tasks.
testing_playground.ipynb : notebook for testing miscellaneous pieces of code.
visualizations.ipynb : notebook for generating Seaborn visualizations of scraped data.