This is the final project for the Data Science Nanodegree. The accompanying post can be found here.
This repository addresses the question: given a dataset containing user activity in Sparkify, a music streaming service, is it possible to predict which users will churn?
Due to limited computational resources, only a subset of the total data was used, though the procedures described here should remain valid for the full dataset.
Exploratory data analysis was performed in terms of gender, popular artists, account level, and pages clicked.
Churn was then defined as those users who visited the Cancellation Confirmation page. With this churn definition, some initial plots were made in terms of gender, level, and lifetime in the service.
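As an illustration, here is a minimal sketch of this churn labelling in PySpark, assuming the raw event log is loaded as a dataframe `df` with `userId` and `page` columns (the exact notebook code may differ):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Flag individual cancellation events
flagged = df.withColumn(
    "churn_event",
    F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0),
)

# A user is labelled as churned if any of their events is a cancellation
user_window = Window.partitionBy("userId")
labelled = flagged.withColumn("churn", F.max("churn_event").over(user_window))
```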
Afterwards, the original dataset was cleaned and transformed into useful features:
- lifetime: time in the service before cancelling
- total_songs: total songs listened
- num_thumb_up: number of thumbs up given in the service
- num_thumb_down: number of thumbs down given in the service
- add_to_playlist: total number of songs added to playlists
- add_friend: number of friends added
- avg_songs_played: average number of songs listened per session
- gender: gender of the user
This resulted in a dataframe with only 192 complete rows, one per unique account.
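A minimal sketch of how such per-user features can be aggregated with PySpark follows; the `ts` column name and the page labels are assumptions based on the standard Sparkify log, not taken from the notebook:

```python
from pyspark.sql import functions as F

features = labelled.groupBy("userId").agg(
    # Lifetime in days, assuming `ts` holds millisecond timestamps
    ((F.max("ts") - F.min("ts")) / (1000 * 60 * 60 * 24)).alias("lifetime"),
    F.sum(F.when(F.col("page") == "NextSong", 1).otherwise(0)).alias("total_songs"),
    F.sum(F.when(F.col("page") == "Thumbs Up", 1).otherwise(0)).alias("num_thumb_up"),
    F.sum(F.when(F.col("page") == "Thumbs Down", 1).otherwise(0)).alias("num_thumb_down"),
    F.max("churn").alias("label"),
).dropna()  # drop incomplete rows, mirroring the cleaning step described above
```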
A pipeline was then introduced to perform feature scaling (using a standard scaler) followed by model fitting with cross-validation over an empty parameter grid. In this step, the metrics used to assess a model's fit were accuracy and F1 score, the latter because of the imbalance in the output classes (churned vs. not churned); a sketch of this pipeline follows the model list below.
The different models tested with their default settings are as follows:
- Logistic Regression
- Gradient Boosted Trees
- Random Forest
- Decision Tree Classifier
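A minimal sketch of such a pipeline with PySpark ML, here using Logistic Regression as a stand-in estimator (the feature column names and split are assumptions; the empty grid means cross-validation only scores the default model):

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

feature_cols = ["lifetime", "total_songs", "num_thumb_up", "num_thumb_down",
                "add_to_playlist", "add_friend", "avg_songs_played", "gender"]

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=feature_cols, outputCol="raw_features"),
    StandardScaler(inputCol="raw_features", outputCol="features"),
    LogisticRegression(labelCol="label", featuresCol="features"),
])

# Empty grid: CrossValidator is used purely to estimate the default model's fit
cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=ParamGridBuilder().build(),
    evaluator=MulticlassClassificationEvaluator(labelCol="label", metricName="f1"),
    numFolds=3,
)

train, test = features.randomSplit([0.8, 0.2], seed=42)
cv_model = cv.fit(train)
```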
The best-performing model was then selected and further tuned; the conclusions are based on that model.
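For the tuning step, a sketch of a parameter grid for the Decision Tree Classifier (the specific grid values are illustrative, not necessarily those used in the notebook):

```python
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.tuning import ParamGridBuilder

dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")

# Illustrative grid; deeper trees overfit quickly on ~192 rows
param_grid = (
    ParamGridBuilder()
    .addGrid(dt.maxDepth, [3, 5, 10])
    .addGrid(dt.impurity, ["gini", "entropy"])
    .build()
)
```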
Most of the trained models achieved moderate success in predicting churn behaviour, beating the baseline model.
Given the processed features, the best model to tune further was the Decision Tree Classifier, with an accuracy of 0.769 and an F1 score of 0.734 on the test set. Using F1 helped avoid the pitfall of relying on accuracy alone to evaluate an imbalanced dataset.
Further tuning led to overfitting on the training set, which dropped test-set performance to an accuracy of 0.718 and an F1 score of 0.675, most likely due to the small size of the data.
In terms of feature importance, the three most relevant features in the default Decision Tree Classifier are, in order, 'lifetime', 'add_friend', and 'total_songs', though this might change if the full dataset is used.
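Feature importances can be read directly off the fitted tree; a sketch, assuming `dt_model` is the fitted `DecisionTreeClassificationModel` and `feature_cols` is the list fed to the `VectorAssembler`:

```python
# featureImportances is a SparseVector aligned with the assembled columns
scores = dt_model.featureImportances.toArray()
for name, score in sorted(zip(feature_cols, scores), key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```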
There are several aspects of this analysis that could be expanded and that could dramatically change the outcome:
- Full dataset: the sample used is less than 1% of the total dataset (12 GB), so the conclusions shown here are of limited validity
- numFolds: related to the previous point; here the number of folds was set to 3, but with a larger dataset this could be increased to 10
- Features: more categorical features could be used
- Outlier detection: no efforts were made to detect outliers, which would be important to study in the complete dataset
- Imbalance handling: techniques such as over-/under-sampling or different train/test ratios could be useful to manage this point (see the sketch after this list)
- Missing data imputation: rows/accounts with missing values were deleted, but such values could be imputed in a later study
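As an example of the imbalance handling mentioned above, a sketch of naive oversampling of the churned class (the ratio and seed are arbitrary choices):

```python
# Upsample the minority (churned) class to roughly match the majority
majority = train.filter("label = 0")
minority = train.filter("label = 1")
ratio = majority.count() / minority.count()
balanced_train = majority.union(
    minority.sample(withReplacement=True, fraction=ratio, seed=42)
)
```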
The repository contains the following files:
- mini_sparkify_event_data.rar
- README.md
- requirements.txt
- Sparkify.ipynb
The packages required to run the notebook are:
- matplotlib==2.1.0
- pyspark==2.4.3
- seaborn==0.8.1
- python==3.6.3