The main goal of the project is to investigate the best way to predict the popularity of a Spotify track (multi-class classification) using several supervised machine learning models and techniques.

kggold4/final-project-machine-learning

Final Project Machine Learning Course

By: @kggold4 @TalSomech

Spotify dataset classification

Dataset link

Mark: 96

Code

Our main project is in the spotify_classification.ipynb notebook. See also the utils.py file, which contains our utility functions.

Goal

The goal of this project is to train supervised machine learning models that classify the popularity of a Spotify song into three classes:

  1. high popular
  2. medium popular
  3. non popular

Dataset:

Features:

  1. acousticness (Ranges from 0 to 1)
  2. artists (List of artists mentioned)
  3. danceability (Ranges from 0 to 1)
  4. duration_ms (Integer typically ranging from 200k to 300k)
  5. energy (Ranges from 0 to 1)
  6. explicit (0 = No explicit content, 1 = Explicit content) - Categorical.
  7. id (Id of track generated by Spotify) - Numerical.
  8. id_artists.
  9. instrumentalness (Ranges from 0 to 1).
  10. key (All keys on octave encoded as values ranging from 0 to 11, starting on C as 0, C# as 1 and so on…).
  11. liveness (Ranges from 0 to 1).
  12. loudness (Float typically ranging from -60 to 0).
  13. mode (0 = Minor, 1 = Major).
  14. name (Name of the song).
  15. popularity (Ranges from 0 to 100).
  16. release_date (Date of release mostly in yyyy-mm-dd format, however precision of date may vary).
  17. speechiness (Ranges from 0 to 1).
  18. tempo (Float typically ranging from 50 to 150).
  19. time_signature.
  20. valence (Ranges from 0 to 1).
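Several features (duration_ms, loudness, tempo) live on very different scales from the 0 to 1 features. The sketch below shows one common way to bring a column into the 0 to 1 range, min-max scaling. It is an assumption for illustration: the notebook's actual normalization method is not stated here, and the function name and sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical loudness values in dB (not taken from the dataset).
loudness = [-60.0, -30.0, -15.0, 0.0]
scaled_loudness = min_max_scale(loudness)
print(scaled_loudness)
```

Distance-based models such as K-nearest neighbors are sensitive to feature scale, which is the usual reason for a step like this.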

Correlations between popularity and the other features

correlations
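As a rough illustration of how such a correlation can be computed, here is a minimal sketch of the Pearson correlation coefficient between one feature and popularity. Whether the figure uses Pearson correlation is an assumption, and the function name and sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values (not taken from the dataset): a perfectly linear relation.
energy = [0.2, 0.4, 0.6, 0.8]
popularity = [10, 30, 50, 70]
r = pearson(energy, popularity)
print(round(r, 3))
```

Values near 1 or -1 indicate a strong linear relation with popularity; values near 0 indicate little linear relation.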

NOTE: while preparing the data, we map the popularity score (x) to a class label as follows:

| class | popularity score (x) | class value |
| --- | --- | --- |
| high popular | 70 <= x | 2 |
| medium popular | 40 <= x < 70 | 1 |
| non popular | x < 40 | 0 |
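The thresholds above can be sketched as a small function. The function name is hypothetical and this is not necessarily how the notebook implements the mapping; only the thresholds and class values come from the table.

```python
def popularity_to_class(popularity):
    """Map a popularity score (0 to 100) to a class value.

    2 = high popular   (70 <= x)
    1 = medium popular (40 <= x < 70)
    0 = non popular    (x < 40)
    """
    if popularity >= 70:
        return 2
    if popularity >= 40:
        return 1
    return 0


# Boundary values on both sides of each threshold.
labels = [popularity_to_class(p) for p in (0, 39, 40, 69, 70, 100)]
print(labels)
```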

Distribution of the popularity classes (the number of samples per class is unbalanced):

pie
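The share of each class shown in a chart like this can be computed from the class labels. The following is a minimal sketch; the function name and the labels are hypothetical and do not reflect the real class proportions in the dataset.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}


# Hypothetical labels (0 = non popular, 1 = medium popular, 2 = high popular).
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
shares = class_shares(labels)
print(shares)
```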

Model results (unbalanced number of samples per class):

| model | accuracy |
| --- | --- |
| KNeighbors Classifier | 74.20 % |
| Logistic Regression | 72.32 % |
| XGB Classifier | 77.74 % |
| MLP Classifier | 70.82 % |
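For reference, accuracy is the fraction of samples whose predicted class equals the true class. A minimal sketch follows; the function name and the labels are hypothetical and are not the notebook's predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical true and predicted class values.
y_true = [0, 0, 0, 1, 1, 2, 0, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0, 0]
acc = accuracy(y_true, y_pred)
print(f"{acc:.2%}")
```

On unbalanced data this number can be inflated by the most frequent class, which is one reason to also evaluate on balanced data.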

Model results (balanced number of samples per class):

| model | accuracy |
| --- | --- |
| KNeighbors Classifier | 59.35 % |
| Logistic Regression | 60.06 % |
| XGB Classifier | 65.41 % |
| MLP Classifier | 64.16 % |
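One common way to obtain balanced data is random undersampling: keep as many samples from every class as the smallest class has. The sketch below illustrates that idea only. How the notebook actually balances the data is not stated here, so the technique, the function name, and the sample data are all assumptions.

```python
import random
from collections import defaultdict


def undersample(rows, labels, seed=0):
    """Randomly drop samples so that every class has the same count."""
    if not rows:
        raise ValueError("cannot balance an empty dataset")
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    # Size of the smallest class decides how many samples each class keeps.
    n = min(len(group) for group in by_class.values())
    balanced_rows, balanced_labels = [], []
    for label in sorted(by_class):
        for row in rng.sample(by_class[label], n):
            balanced_rows.append(row)
            balanced_labels.append(label)
    return balanced_rows, balanced_labels


# Hypothetical data: 6 samples of class 0, 3 of class 1, 1 of class 2.
rows = [[i] for i in range(10)]
labels = [0] * 6 + [1] * 3 + [2]
balanced_rows, balanced_labels = undersample(rows, labels)
print(balanced_labels)
```

Undersampling discards data from the larger classes, so a drop in overall accuracy compared with the unbalanced setting is not unusual.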

Comparison of model results on the regular (unbalanced) and balanced training data (shown in the notebook):

results

Conclusions:

It is very difficult to predict the popularity of Spotify tracks with the data in our dataset. Even after cleaning and normalizing the data and creating both balanced and unbalanced training sets for our models, their accuracy remains moderate.
