This is the final project for the Data Science Nanodegree. The accompanying post can be found here.
This repository addresses the question: given a dataset containing user activity in Sparkify, a music streaming service, is it possible to predict which users will churn?
Due to limited computational resources, only a subset of the total data was used, though the procedures described here should remain valid for the full dataset.
Exploratory data analysis was performed in terms of gender, popular artists, account level, and pages clicked.
Churn was then defined as those users who visited the Cancellation Confirmation page. With this churn definition, some initial plots were made in terms of gender, level, and lifetime in the service.
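As an illustration, here is a minimal sketch of this churn labelling in PySpark, assuming the raw event log is loaded as a dataframe `df` with `userId` and `page` columns (the exact notebook code may differ):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Flag individual cancellation events
flagged = df.withColumn(
    "churn_event",
    F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0),
)

# A user is labelled as churned if any of their events is a cancellation
user_window = Window.partitionBy("userId")
labelled = flagged.withColumn("churn", F.max("churn_event").over(user_window))
```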
Afterwards, the original dataset was cleaned and transformed into useful features:
- lifetime: time in the service before cancelling
- total_songs: total songs listened
- num_thumb_up: number of thumbs up given in the service
- num_thumb_down: number of thumbs down given in the service
- add_to_playlist: total number of songs added to playlists
- add_friend: number of friends added
- avg_songs_played: average number of songs listened per session
- gender: gender of the user
This resulted in a dataframe with only 192 complete rows, one per unique account.
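A minimal sketch of how such per-user features can be aggregated with PySpark follows; the `ts` column name and the page labels are assumptions based on the standard Sparkify log, not taken from the notebook:

```python
from pyspark.sql import functions as F

features = labelled.groupBy("userId").agg(
    # Lifetime in days, assuming `ts` holds millisecond timestamps
    ((F.max("ts") - F.min("ts")) / (1000 * 60 * 60 * 24)).alias("lifetime"),
    F.sum(F.when(F.col("page") == "NextSong", 1).otherwise(0)).alias("total_songs"),
    F.sum(F.when(F.col("page") == "Thumbs Up", 1).otherwise(0)).alias("num_thumb_up"),
    F.sum(F.when(F.col("page") == "Thumbs Down", 1).otherwise(0)).alias("num_thumb_down"),
    F.max("churn").alias("label"),
).dropna()  # drop incomplete rows, mirroring the cleaning step described above
```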
A pipeline was then introduced to perform feature scaling (using a standard scaler) followed by model fitting with cross-validation over an empty parameter grid. In this step, the metrics used to assess a model's fit were accuracy and F1 score, the latter because of the imbalance in the output classes (churned vs. not churned); a sketch of this pipeline follows the model list below.
The different models tested with their default settings are as follows:
- Logistic Regression
- Gradient Boosted Trees
- Random Forest
- Decision Tree Classifier
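A minimal sketch of such a pipeline with PySpark ML, here using Logistic Regression as a stand-in estimator (the feature column names and split are assumptions; the empty grid means cross-validation only scores the default model):

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

feature_cols = ["lifetime", "total_songs", "num_thumb_up", "num_thumb_down",
                "add_to_playlist", "add_friend", "avg_songs_played", "gender"]

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=feature_cols, outputCol="raw_features"),
    StandardScaler(inputCol="raw_features", outputCol="features"),
    LogisticRegression(labelCol="label", featuresCol="features"),
])

# Empty grid: CrossValidator is used purely to estimate the default model's fit
cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=ParamGridBuilder().build(),
    evaluator=MulticlassClassificationEvaluator(labelCol="label", metricName="f1"),
    numFolds=3,
)

train, test = features.randomSplit([0.8, 0.2], seed=42)
cv_model = cv.fit(train)
```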
The best-performing model was then selected and further tuned; the conclusions are based on that model.
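For the tuning step, a sketch of a parameter grid for the Decision Tree Classifier (the specific grid values are illustrative, not necessarily those used in the notebook):

```python
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.tuning import ParamGridBuilder

dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")

# Illustrative grid; deeper trees overfit quickly on ~192 rows
param_grid = (
    ParamGridBuilder()
    .addGrid(dt.maxDepth, [3, 5, 10])
    .addGrid(dt.impurity, ["gini", "entropy"])
    .build()
)
```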
Most of the trained models achieved moderate success in predicting churn behaviour, beating the baseline model.
Given the processed features, the best model to tune further was the Decision Tree Classifier, with an accuracy of 0.769 and an F1 score of 0.734 on the test set. Using F1 helped avoid the pitfall of relying on accuracy alone to evaluate an imbalanced dataset.
Further tuning led to overfitting on the training set, which dropped test-set performance to an accuracy of 0.718 and an F1 score of 0.675, most likely due to the small size of the data.
In terms of feature importance, the three most relevant features in the default Decision Tree Classifier are, in order, 'lifetime', 'add_friend', and 'total_songs', though this might change if the full dataset is used.
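Feature importances can be read directly off the fitted tree; a sketch, assuming `dt_model` is the fitted `DecisionTreeClassificationModel` and `feature_cols` is the list fed to the `VectorAssembler`:

```python
# featureImportances is a SparseVector aligned with the assembled columns
scores = dt_model.featureImportances.toArray()
for name, score in sorted(zip(feature_cols, scores), key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```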
There are several aspects of this analysis that could be expanded and that could dramatically change the outcome:
- Full dataset: the sample used is less than 1% of the total dataset (12 GB), so the conclusions shown here are of limited validity
- numFolds: related to the previous point; here the number of folds was set to 3, but with a larger dataset this could be increased to 10
- Features: more categorical features could be used
- Outlier detection: no efforts were made to detect outliers, which would be important to study in the complete dataset
- Imbalance handling: techniques such as over-/under-sampling or different train/test ratios could be useful to manage this point (see the sketch after this list)
- Missing data imputation: rows/accounts with missing values were deleted, but such values could be imputed in a later study
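As an example of the imbalance handling mentioned above, a sketch of naive oversampling of the churned class (the ratio and seed are arbitrary choices):

```python
# Upsample the minority (churned) class to roughly match the majority
majority = train.filter("label = 0")
minority = train.filter("label = 1")
ratio = majority.count() / minority.count()
balanced_train = majority.union(
    minority.sample(withReplacement=True, fraction=ratio, seed=42)
)
```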
The repository contains the following files:
- mini_sparkify_event_data.rar
- README.md
- requirements.txt
- Sparkify.ipynb
The packages required to run the notebook are:
- matplotlib==2.1.0
- pyspark==2.4.3
- seaborn==0.8.1
- python==3.6.3