The aim of this project is to predict the probability that a subject receives the H1N1 and seasonal flu vaccines, based on the provided data. The project was built for the data science competition: https://www.drivendata.org/competitions/66/flu-shot-learning/.
In this readme, I explain the steps I took to achieve my results in the competition:
- AUROC = 0.8442
- Top 14% of participants at the time of writing
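For readers unfamiliar with the metric, here is a minimal sketch of how a single AUROC figure can be obtained for two targets. It assumes the reported score is the mean of the per-target AUROCs (an assumption, not something stated above), and the arrays are toy values invented for illustration, not competition data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth (made up): column 0 stands in for the H1N1 target,
# column 1 for the seasonal flu target.
y_true = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])

# Toy predicted probabilities (made up) for the same two targets.
y_prob = np.array([[0.1, 0.8], [0.7, 0.3], [0.3, 0.2], [0.9, 0.4]])

# AUROC of each target on its own.
auc_h1n1 = roc_auc_score(y_true[:, 0], y_prob[:, 0])
auc_seasonal = roc_auc_score(y_true[:, 1], y_prob[:, 1])

# Macro average over both targets: the unweighted mean of the two values above.
mean_auc = roc_auc_score(y_true, y_prob, average="macro")

print(auc_h1n1, auc_seasonal, mean_auc)
```

Because AUROC depends only on how the predictions are ranked, the submitted probabilities do not need to be calibrated to score well.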
Sub-folders:
- input_data - raw data from competition;
- interim_data - preprocessed data to be used in modelling;
- output_data - model predictions to submit;
- models - pickled models to be imported by the notebooks.
Main folder:
- All the notebooks (.ipynb) and respective scripts (.py) for the project;
- requirements.txt - project dependencies;
- EDA - exploratory data analysis;
- PREPROCESSING - cleans the data and prepares it for modelling;
- MODEL_SELECTION - performs cross-validation of models to select the best one;
- TUNING - tunes the selected models' hyperparameters to improve the score;
- GENERAL - joins all the steps and performs predictions.
To solve the problem, I applied the following data science workflow:
- Explore the data in EDA - gain insight into the main aspects of the data, such as distributions, trends and predictors;
- Clean the data in PREPROCESSING - apply that insight to preprocess the data and get it ready for model consumption;
- Cross-validate models in MODEL_SELECTION - select the models I want to use; these models serve as a basis for testing different preprocessing assumptions and eventually become part of the final model;
- Tune some of the models with Optuna in TUNING;
- Bring everything together and make predictions in GENERAL;
- Iterate through every step, applying different preprocessing assumptions and model-building techniques to optimize the model for the AUROC metric.
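The model selection step above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`), the two candidate models are placeholders chosen for the example, and the names `candidates`, `scores` and `best_name` are hypothetical. It is not the code from the MODEL_SELECTION notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the preprocessed features and one binary target.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Placeholder candidates; the real project may compare different models.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Mean cross-validated AUROC per candidate.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

# Keep the candidate with the highest mean AUROC.
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

Reusing the same `cv` object for every candidate means all models are scored on identical splits, so the comparison is not skewed by a lucky fold. A tuning step could reuse the same cross-validated score as its objective.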