Team : machine-learners
Team members:
- Madeleine HUEBER
- Adrian MARTINEZ LOPEZ
- Duru BEKTAS
This project is part of the course CS-433 Machine Learning at EPFL. The goal of the project is to predict whether someone has a cardiovascular disease based on their answers to a medical questionnaire. The dataset comes from the Behavioral Risk Factor Surveillance System (BRFSS) and contains, for each person, the answers to 330 questions. The target variable is binary and indicates whether the person has a cardiovascular disease. We first preprocessed the data to clean it and handle missing values, then performed feature selection to reduce the number of features, and finally trained a logistic regression model to predict the target variable.
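To illustrate the final step, here is a minimal sketch of training a logistic regression classifier with full-batch gradient descent in NumPy. It is an illustration on toy data, not our actual implementation (which lives in `src/model.py` and `implementations.py`):

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping scores to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(y, x, gamma=0.1, max_iters=1000):
    """Minimize the logistic log-loss with gradient descent.

    y: binary labels in {0, 1}, shape (n,)
    x: feature matrix, shape (n, d)
    Returns the learned weight vector, shape (d,).
    """
    n = x.shape[0]
    w = np.zeros(x.shape[1])
    for _ in range(max_iters):
        grad = x.T @ (sigmoid(x @ w) - y) / n  # gradient of the log-loss
        w -= gamma * grad
    return w

# Toy example: an intercept column plus one feature that separates the classes.
x = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = logistic_regression_gd(y, x)
preds = (sigmoid(x @ w) >= 0.5).astype(int)  # predictions on the training points: [0 0 1 1]
```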
More details about the project can be found in the project report `project1_report.pdf`.
- `data/`: Directory containing datasets.
  - `processed_x_test.npz`: Processed test dataset in compressed NumPy format.
  - `processed_x_train.npz`: Processed train dataset in compressed NumPy format.
  - `processed_y_train.npz`: Processed train labels in compressed NumPy format.
  - `test_dataset.npz`: Test dataset in compressed NumPy format.
  - `train_dataset.npz`: Training dataset in compressed NumPy format.
  - `train_targets.npz`: Labels for the training dataset in compressed NumPy format.
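All of these files can be read with `numpy.load`. A small sketch; the key name `x` below is illustrative, as the actual archives may use different keys, which you can list via the `.files` attribute:

```python
import numpy as np

# Save and reload a small array the same way the compressed datasets are stored.
x_train = np.arange(12, dtype=np.float64).reshape(4, 3)
np.savez_compressed("example_dataset.npz", x=x_train)

with np.load("example_dataset.npz") as archive:
    keys = archive.files  # names of the arrays stored in the archive
    loaded = archive["x"]
```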
- `dataset_text_analysis/`: Directory containing auxiliary code, not used in the run script, related to the text processing of the dataset. These files are not needed to run the project; we provide them to give visibility into the data processing steps.
  - `codebook15_llcp.txt`: Dataset report describing every feature and its possible value labels.
  - `dataset_text_analysis.py`: Script that reads the dataset report and outputs a dictionary mapping, for every feature, abnormal values to their replacements (e.g. `77 -> NaN`). This dictionary is stored as a JSON file called `var_abnormal_values.json`.
  - `var_abnormal_values.json`: Example of the JSON file generated by the script. This JSON is then manually corrected to remove mistakes and stored as a variable (`config.ABNORMAL_FEATURE_VALUES`) in the `src/config.py` file.
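To illustrate how such a mapping is applied, here is a hedged sketch; the feature names and sentinel codes below are illustrative, not the exact contents of `config.ABNORMAL_FEATURE_VALUES`:

```python
import numpy as np

# Hypothetical excerpt of the abnormal-value mapping: for each feature,
# sentinel codes (e.g. 77 = "don't know", 99 = "refused") map to NaN.
abnormal_values = {
    "GENHLTH": {7.0: np.nan, 9.0: np.nan},
    "PHYSHLTH": {77.0: np.nan, 99.0: np.nan},
}

def replace_abnormal(column, mapping):
    """Return a float copy of `column` with each sentinel code replaced."""
    out = column.astype(float).copy()
    for bad, replacement in mapping.items():
        out[out == bad] = replacement
    return out

physhlth = np.array([3.0, 77.0, 10.0, 99.0])
cleaned = replace_abnormal(physhlth, abnormal_values["PHYSHLTH"])
# cleaned -> [3., nan, 10., nan]
```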
- `src/`: Main directory for source code modules.
  - `config.py`: Configuration file containing paths, settings, and helper data structures for the project.
  - `data_cleaning.py`: Module for cleaning data and handling outliers or missing values.
  - `data_preprocessing.py`: Functions for preprocessing data before training, such as normalization or encoding.
  - `evaluation.py`: Module to evaluate model predictions using various metrics.
  - `feature_engineering.py`: Contains functions to create and transform features.
  - `feature_type_detection.py`: Detects data types of features and helps with automated preprocessing.
  - `helpers.py`: Utility functions for tasks like saving files, managing logs, etc.
  - `model.py`: Contains functions for training, validating, and predicting with the model.
Project Root:
- `.gitattributes` and `.gitignore`: Git configuration files for version control, specifying files to include or exclude.
- `README.md`: Project documentation with instructions on setup, usage, and structure.
- `implementations.py`: Functions from part 1 of the project.
- `run.py`: Main script to execute the full pipeline, from data loading and preprocessing to model training and evaluation. If the processed dataset is already in the `data` folder, the preprocessing step is skipped; otherwise, the original dataset is loaded and preprocessed. Both datasets are present in the `data` folder.
- `hyperparameter_selection.py`: Script to perform hyperparameter selection with 5-fold cross-validation.
- `project1_report.pdf`: The project report explaining our process.
To use our model on this project, you will first need to clone the repository:

```
git clone https://github.com/CS-433/ml-project-1-machine-learners/
```
The code requires the following packages to be installed:
- numpy
- matplotlib
To run our model, run the following command in the terminal:

```
python run.py --seed 42 --gamma 0.1 --max_iters 1000 --lambda_ 0 --undersampling_ratio 0.2
```

It will preprocess the data, train the model, and output the predictions as a CSV file. The predictions will be saved as `submission.csv` in the root folder.
You can specify the following arguments when running the model:

| Argument | Description | Type | Default |
|---|---|---|---|
| `--seed` | Seed for deterministic results | `int` | `42` |
| `--gamma` | Learning rate for training | `float` | `0.1` |
| `--max_iters` | Maximum number of iterations | `int` | `1000` |
| `--lambda_` | Regularization parameter | `float` | `0` |
| `--undersampling_ratio` | Undersampling ratio to balance the classes | `float` | `0.2` |
To explore the hyperparameters, you can run the following command in the terminal:

```
python hyperparameter_selection.py
```

It will test different values for the hyperparameters listed above and output which value of each hyperparameter gives the best F1 score using 5-fold cross-validation.
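For reference, here is a minimal sketch of the two building blocks of this selection, fold splitting and the F1 score. It is an illustration, not the exact code in `hyperparameter_selection.py`:

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 score for binary labels in {0, 1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if 2 * tp + fp + fn == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

def k_fold_indices(n, k=5, seed=42):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

# Each fold serves once as the validation set: a model is trained on the
# remaining folds and scored with f1_score on the held-out fold.
folds = k_fold_indices(100, k=5)
```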