Darvy Teav, Tracey Geneau, Shelly Girdhar Sakkerwal, Melissa Wegrzyn
https://www.kaggle.com/datasets/colewelkins/cardiovascular-disease/data
- Data Folder
  - cardio_data_processed.csv is to be used with Cardiovascular_model.ipynb and Cardiovascular_model_auto_optimization.ipynb
  - clean_cardio_data.csv is to be used with random_forest.ipynb and XGBoost.ipynb
- Image
  - All 3 images are used in this report.
- Purpose:
The purpose of our model is to predict whether an individual is at risk of cardiovascular disease.
- Data Processing:
Our target variable is cardio.
- Feature variables:
The columns we kept were gender, cholesterol, glu, smoke, alc, active, cardio, age_years, bmi, Elevated, Hypertension Stage 1, Hypertension Stage 2, and Normal.
- Removed Columns:
We removed age, ap_hi, ap_lo, id, cardio, bp_category, height, and weight.
- Cleaning up the data:
After removing redundancies and irrelevant data, we first used a box-and-whisker (IQR) plot to find all the outliers and removed them from our data set, which brought the raw data down from 68,205 rows to 62,505 rows. We then encoded and one-hot encoded all nominal data.
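Below is a minimal sketch of these cleaning steps, assuming the raw Kaggle column names (ap_hi, ap_lo, bp_category) and assuming age_years and bmi were the continuous columns screened for outliers:

```python
import pandas as pd

df = pd.read_csv("cardio_data_processed.csv")

# Drop redundant / irrelevant columns (see "Removed Columns" above).
df = df.drop(columns=["age", "ap_hi", "ap_lo", "id", "height", "weight"],
             errors="ignore")

# Box-and-whisker (IQR) rule: keep rows within 1.5 * IQR of the quartiles.
for col in ["age_years", "bmi"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# One-hot encode the nominal blood-pressure category into the Normal /
# Elevated / Hypertension Stage 1 / Hypertension Stage 2 indicator columns.
df = pd.get_dummies(df, columns=["bp_category"], prefix="", prefix_sep="")

df.to_csv("clean_cardio_data.csv", index=False)
```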
- Model: Random Forest
To run the model, we used clean_cardio_data.csv. This csv file was created while we were working with the Keras optimization tool. After splitting the data set into train and test sets, we scaled the features and trained the model. Our predictions gave us an accuracy of 68.70%. The test data was split quite evenly between cardiovascular and non-cardiovascular disease. Recall for predicting cardiovascular disease is 63%; we would prefer a higher percentage. In terms of feature importance, Hypertension Stage 2 and age are the key drivers in the model's decisions.
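A minimal sketch of this pipeline, with illustrative hyperparameters (the exact settings we used may differ):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("clean_cardio_data.csv")
X, y = df.drop(columns=["cardio"]), df["cardio"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)

# Scale features, fitting the scaler on the training split only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)

# Accuracy, precision, and recall per class.
print(classification_report(y_test, rf.predict(X_test_scaled)))

# Feature importances: Hypertension Stage 2 and age ranked highest for us.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```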
- Model: XGBoost
To run the model, we again used clean_cardio_data.csv. Just like our Random Forest model, it gave nearly the same accuracy, 68.59%. The confusion matrix and classification report produced very similar results, which isn't too surprising considering how similarly the two models make decisions. We also compared the training accuracy with the validation accuracy to check for possible overfitting, but as it turns out the model generalizes as it should.
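A comparable sketch for XGBoost, including the train-versus-validation accuracy comparison used as the overfitting check (parameters are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("clean_cardio_data.csv")
X, y = df.drop(columns=["cardio"]), df["cardio"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, random_state=42, stratify=y)

model = XGBClassifier(n_estimators=100, eval_metric="logloss")
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
print(f"train accuracy: {train_acc:.4f}, validation accuracy: {val_acc:.4f}")
# A large gap between the two would suggest overfitting; in our runs they
# stayed close, matching the ~68.6% reported above.
```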
- Model: Keras Hypertuner and Multilayer Perceptrons
We first used the raw csv file (cardio_data_processed.csv) and cleaned up the data before running both notebooks. We ran the optimizer and the actual neural network in two different notebooks because the optimizer would otherwise create extra layers when we wanted to test the deep learning model on its own. We ran the optimizer a couple of times, and it would output a different result each time. Once we saw that the best model was over 70% accuracy, we stopped the search and got a summary of the best model. We then used that combination of hyperparameters to run the actual model. The model we chose was as follows:
Model: "sequential"
=================================================================
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 3)                 39
dense_1 (Dense)              (None, 5)                 20
dense_2 (Dense)              (None, 9)                 54
dense_3 (Dense)              (None, 1)                 10
=================================================================
Total params: 123 (492.00 Byte)
Trainable params: 123 (492.00 Byte)
Non-trainable params: 0 (0.00 Byte)
Activation function: tanh
Epoch: 7
=================================================================
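As a sketch, the chosen architecture can be reconstructed from the summary above: the 39 parameters in the first layer imply 12 input features (12 × 3 + 3 = 39). The sigmoid output activation and adam optimizer are assumptions for a binary target; the tuner reported tanh for the hidden layers:

```python
import tensorflow as tf

# Architecture taken from the model summary above; 12 input features
# are inferred from the first layer's parameter count.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(12,)),
    tf.keras.layers.Dense(3, activation="tanh"),
    tf.keras.layers.Dense(5, activation="tanh"),
    tf.keras.layers.Dense(9, activation="tanh"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # assumed binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# Trained for the 7 epochs the tuner search settled on:
# model.fit(X_train_scaled, y_train, epochs=7, validation_split=0.2)
```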
The model was able to give us an accuracy of 70.08%. We also compared the training and validation accuracy to check for overfitting, and as it turns out there wasn't any. At first, the confusion matrix had zeros for false negatives and true negatives, and the classification report likewise showed zeros when it came to predicting at-risk for cardiovascular disease. We decided to look at the ROC curve and calculate AUC to see whether the model's accuracy is genuinely around 70%. It turns out the AUC is 0.729, which is higher than the training accuracy. In addition, we fed in a new patient's information to see if the model could predict whether someone is at risk of cardiovascular disease, and it was able to predict that the patient has cardiovascular disease. Recall was 0.65 for cardiovascular disease and 0.76 for patients without cardiovascular disease.
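A sketch of that ROC/AUC check (`model` and the scaled test split are assumed to come from the training notebook):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

y_prob = model.predict(X_test_scaled).ravel()   # sigmoid probabilities
fpr, tpr, _ = roc_curve(y_test, y_prob)
print("AUC:", roc_auc_score(y_test, y_prob))    # reported above as 0.729

plt.plot(fpr, tpr, label="MLP")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")  # diagonal baseline
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```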
- Chosen model: Multilayer Perceptrons
We decided to go with the Multilayer Perceptron since it gave us the highest training accuracy and the highest recall, which is the metric we are most concerned about: we want as few false negatives as possible. XGBoost would have been the second option since it ran the fastest of all three. Even though the Multilayer Perceptron takes longer to run and train, there is always room for improvement. Before cleaning up the data, we trained our model to see what accuracy it would achieve, and it also reached about 70% accuracy. We did find that adding more levels/bins of age to the data improved the accuracy slightly, but not significantly.
- Recommendations
We could collect additional data on factors that also contribute to cardiovascular disease, such as hours of sleep. For the current features, we could add more granular categories: rather than just a yes/no for exercise, we could record hours of exercise per week. BMI is just one of the standard ratios used to evaluate an individual's health, but we also know it isn't always accurate; instead of BMI, we could use a skinfold (fat pinch) test to incorporate a body-fat index. For blood pressure, we could use a different type of measurement, such as Mean Arterial Pressure, instead of the typical categories of Normal, Hypertension, etc. Every time we receive a reasonable amount of new data, we can add it to the data set and retrain, which could help increase accuracy and recall.
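As an illustration of the Mean Arterial Pressure idea, using the standard approximation MAP ≈ DBP + (SBP − DBP) / 3; the ap_hi/ap_lo column names follow the raw dataset and are an assumption:

```python
import pandas as pd

df = pd.read_csv("cardio_data_processed.csv")
# MAP from systolic (ap_hi) and diastolic (ap_lo) readings; this continuous
# feature could replace the one-hot blood-pressure categories.
df["map"] = df["ap_lo"] + (df["ap_hi"] - df["ap_lo"]) / 3
```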