Netflix is a subscription-based streaming service that gives its members access to a vast library of movies and TV shows. With such a large catalog, it can be challenging for users to find content that matches their preferences. To address this, Netflix uses data analysis and machine learning techniques such as clustering to group its content into similar categories. This project uses unsupervised machine learning algorithms to cluster Netflix movies and TV shows based on attributes such as genre, cast, and plot.
The Netflix Movies and TV Shows Clustering project aims to improve the user experience on Netflix by supporting personalized content recommendations. Organizing the content library into clusters of similar titles makes it possible to suggest titles that are more likely to match a user's interests, which can increase user engagement and satisfaction.
- TV-MA is the most common rating on Netflix, so content aimed at mature audiences makes up the largest share of the catalog.
- The United States is the country with the highest number of productions available on Netflix, followed by India and the United Kingdom.
- Dramas, Comedies, and Documentaries are the most common genres of content on Netflix.
- The correlation heatmap shows a moderate positive correlation between the duration of a movie and its release year.
- A content-based recommender system was built using cosine similarity to make personalized recommendations to users based on the type of show they watched.
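The cosine-similarity recommender mentioned in the last finding can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual implementation: the catalog, the titles (`Show A` to `Show D`), the tag strings, and the helper names `vectorize`, `cosine_similarity`, and `recommend` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy catalog: titles and tag strings are made up for illustration.
CATALOG = {
    "Show A": "crime drama detective thriller",
    "Show B": "crime thriller detective mystery",
    "Show C": "romantic comedy wedding",
    "Show D": "stand-up comedy special",
}


def vectorize(text):
    """Turn a whitespace-separated tag string into a term-count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(title, catalog, top_n=2):
    """Return the top_n (title, score) pairs most similar to `title`, best first."""
    query = vectorize(catalog[title])
    scores = [
        (other, cosine_similarity(query, vectorize(text)))
        for other, text in catalog.items()
        if other != title
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


if __name__ == "__main__":
    for name, score in recommend("Show A", CATALOG):
        print(f"{name}: {score:.3f}")
```

On real data, the tag strings would typically be built from fields such as genre, cast, and description, and raw term counts are often replaced by TF-IDF weights so that very common words contribute less to the similarity.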
| Model | Number of clusters | Silhouette Score | Calinski-Harabasz Score | Davies-Bouldin Score |
|---|---|---|---|---|
| K-Means Clustering | 7 | 0.00500 | 22.0021 | 10.7600 |
| Hierarchical Clustering | 5 | 0.00048 | 18.1425 | 12.1666 |
| DBSCAN Clustering | 17 | -0.01480 | 2.8595 | 1.4252 |
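The three scores in the table are internal validation metrics available in scikit-learn. The silhouette score ranges from -1 to 1 and higher is better, a higher Calinski-Harabasz score indicates better-separated clusters, and a lower Davies-Bouldin score is better. The sketch below shows one way to compute them. It runs on synthetic data from `make_blobs`, not the Netflix dataset, and the helper `evaluate_clustering` and its handling of noise points are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def evaluate_clustering(X, labels):
    """Return the three internal validation scores for one clustering.

    Points labelled -1 (DBSCAN noise) are dropped first; this is an
    assumption of the sketch, not necessarily what the project did.
    """
    X = np.asarray(X)
    labels = np.asarray(labels)
    keep = labels != -1
    X_kept, labels_kept = X[keep], labels[keep]
    n_clusters = len(np.unique(labels_kept))
    # The metrics need at least 2 clusters and fewer clusters than points.
    if n_clusters < 2 or n_clusters >= len(labels_kept):
        return None
    return {
        "silhouette": float(silhouette_score(X_kept, labels_kept)),
        "calinski_harabasz": float(calinski_harabasz_score(X_kept, labels_kept)),
        "davies_bouldin": float(davies_bouldin_score(X_kept, labels_kept)),
    }


# Synthetic stand-in for the project's feature matrix.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

scores = evaluate_clustering(X, labels)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Because the three metrics use different scales and directions, they are easier to compare across models for the same data than to interpret in isolation.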
- Python: Used for data analysis, preprocessing, and model building.
- Pandas: Employed for data manipulation and analysis.
- Matplotlib and Seaborn: Utilized for data visualization.
- Scikit-learn: Used to implement machine learning algorithms such as K-Means, hierarchical (agglomerative) clustering, and PCA.
- K-Means Clustering
- Agglomerative Clustering
- DBSCAN Clustering
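As a rough illustration of how the three algorithms listed above can be fitted with scikit-learn, the sketch below runs each of them on the same synthetic data. The data from `make_blobs`, the hyperparameter values (`n_clusters=3`, `eps=0.3`, `min_samples=5`), and the `summarize` helper are assumptions chosen for the example, not the project's tuned settings.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project's feature matrix (an assumption).
X_raw, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X_raw)

# Hyperparameters are illustrative, not the project's tuned values.
models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=42),
    "Agglomerative": AgglomerativeClustering(n_clusters=3, linkage="ward"),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),
}


def summarize(labels):
    """Count clusters and noise points; DBSCAN marks noise with label -1."""
    n_noise = int((labels == -1).sum())
    n_clusters = len(set(labels.tolist()) - {-1})
    return n_clusters, n_noise


results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    results[name] = summarize(labels)
    n_clusters, n_noise = results[name]
    print(f"{name}: {n_clusters} clusters, {n_noise} noise points")
```

K-Means and agglomerative clustering take the number of clusters as an input, whereas DBSCAN infers it from the density parameters `eps` and `min_samples` and may leave some points unassigned as noise. This is one reason the cluster counts can differ between models.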
- Clustering helps Netflix provide personalized recommendations to users, improving user engagement.
- Understanding user preferences through clustering enables Netflix to optimize content production and licensing decisions.
- Unsupervised learning techniques are essential for analyzing large datasets and deriving meaningful insights without labeled data.
This project was completed as part of the Data Science Trainee program at AlmaBetter.