In this project, we explore the reasons for a movie's failure by analyzing over 42,000 films, looking at everything from box office numbers to plot patterns. Using data from Wikipedia summaries, IMDb, and TV Tropes, we investigate what really makes a movie stumble: poor timing, problematic storytelling, or casting choices. We are particularly interested in how factors such as cast diversity, directors' track records, and genre choices influence a film's success or failure. By analyzing movie budgets, audience ratings, and overused plot devices, we aim to uncover the patterns that put films at risk.
- What metrics (e.g., low ratings, limited number of ratings, revenue vs. budget) best indicate movie failure?
- How do actor demographics and lack of diversity impact audience disengagement and contribute to box office underperformance?
- Is thematic consistency in director filmographies a predictor of movie failure?
- How does genre choice influence a movie's failure, particularly in different cultural contexts?
- How does poor release timing (e.g., season, holiday periods) affect a movie's likelihood of failing?
- What recurring plot patterns appear most frequently in critically panned films?
- Which trope combinations consistently lead to negative reception by genre?
Our main dataset is the CMU Movie Summary Corpus, which contains 42K movie plot summaries extracted from Wikipedia. We complement it with the following additional datasets:
| Dataset | Description |
|---|---|
| IMDb Non-Commercial | Movie and TV show data including titles, ratings, crew, and cast. |
| TV Tropes | 30K narrative tropes with 1.9M examples, linked to IMDb metadata. |
| TMDB (Kaggle) | 1M movies with metadata including cast, crew, budget, and revenue. |
| Ethnicity ID to name mapping | Ethnicity IDs and their corresponding names, retrieved from Wikidata with a SPARQL query (see the sketch below the table). |
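The query below is only a minimal sketch of such a lookup. It assumes the ethnicity IDs in `character.metadata.tsv` are Freebase IDs (Wikidata property P646); the listed IDs are purely illustrative, and the actual query used to build `wikidata_ethnicities.csv` may differ.

```python
# Minimal sketch: map Freebase ethnicity IDs to English labels via Wikidata.
# Assumes the IDs are Freebase IDs (property P646); the IDs below are illustrative.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?freebaseId ?ethnicityLabel WHERE {
  VALUES ?freebaseId { "/m/0x67" "/m/041rx" }   # illustrative IDs
  ?ethnicity wdt:P646 ?freebaseId .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql", agent="movie-failure-project")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["freebaseId"]["value"], row["ethnicityLabel"]["value"])
```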
To create our main dataset, we inspected the CMU Movie Summary Corpus and identified gaps in the data, such as missing revenue figures. To address this, we merged it with the TMDB dataset using movie titles and release years as common identifiers. The resulting dataset includes 49,516 movies. The IMDb ID column is important because it serves as a unique identifier for each movie, enabling us to merge the result with the tropes dataset. Additionally, we created a file linking directors and actors to movies, using data from IMDb and CMU, to support cast and crew analysis.
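As a rough illustration of the merge step (not the exact code in `preprocess_data.py`), the join key can be built from a normalized title plus the release year. The CMU column names follow the corpus documentation; the TMDB column names are assumptions based on the Kaggle dump.

```python
import pandas as pd

# Sketch of the CMU/TMDB merge on normalized title + release year.
# movie.metadata.tsv has no header row; names follow the CMU corpus documentation.
cmu_cols = ["wiki_id", "freebase_id", "title", "release_date", "box_office",
            "runtime", "languages", "countries", "genres"]
cmu = pd.read_csv("data/cmu/movie.metadata.tsv", sep="\t", names=cmu_cols)
tmdb = pd.read_csv("data/tmdb/TMDB_movie_dataset_v11.csv")  # assumed columns: title, release_date, ...

for df in (cmu, tmdb):
    df["release_year"] = pd.to_datetime(df["release_date"], errors="coerce").dt.year
    df["title_key"] = df["title"].str.lower().str.strip()

cmu_tmdb = cmu.merge(tmdb, on=["title_key", "release_year"], suffixes=("_cmu", "_tmdb"))
cmu_tmdb.to_csv("data/cmu_tmdb.csv", index=False)
```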
To reproduce these preprocessed files, place the necessary datasets in the `data` folder, navigate to `src/scripts`, and run `python preprocess_data.py`.
We first calculated key financial metrics: profit, defined as $\text{revenue} - \text{budget}$, and Return on Investment (ROI), computed as $\text{ROI} = (\text{revenue} - \text{budget}) / \text{budget}$.
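A minimal sketch of these computations, assuming the merged file exposes the TMDB `budget` and `revenue` columns:

```python
import pandas as pd

movies = pd.read_csv("data/cmu_tmdb.csv")
# Keep only rows with usable financial data before computing ratios.
movies = movies[(movies["budget"] > 0) & (movies["revenue"] > 0)].copy()

movies["profit"] = movies["revenue"] - movies["budget"]
movies["roi"] = movies["profit"] / movies["budget"]
```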
Current analysis examines metric distributions (ratings, revenue, profit ratios) through histograms and kernel density estimation, investigates relationships through scatter plots of audience metrics versus financial performance, and quantifies correlations between vote_average, vote_count, revenue, budget, and profit through matrix analysis. Future work could develop a composite failure score combining financial and reception metrics and employ clustering and machine learning for pattern identification.
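Continuing from the sketch above, the correlation step could look like this (a seaborn heatmap over the five metrics named in the text):

```python
import matplotlib.pyplot as plt
import seaborn as sns

metrics = ["vote_average", "vote_count", "revenue", "budget", "profit"]
corr = movies[metrics].corr()  # Pearson correlation matrix

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between reception and financial metrics")
plt.tight_layout()
plt.show()
```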
To address how actor demographic diversity impacts movie failure, we plan to use multiple regression analysis to quantify the impact of gender, ethnicity, and age diversity on failure metrics (revenue and average rating), expressed mathematically as

$$\text{FailureMetric} = \beta_0 + \beta_1\,\text{GenderDiversity} + \beta_2\,\text{EthnicityDiversity} + \beta_3\,\text{AgeDiversity} + \varepsilon$$
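A possible OLS specification with statsmodels is sketched below; the per-movie diversity features, and the file they are stored in, are hypothetical and would be derived from `movie_actors.csv`.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical feature table: one row per movie with precomputed diversity metrics.
features = pd.read_csv("data/movie_diversity_features.csv")

model = smf.ols(
    "vote_average ~ gender_diversity + ethnicity_diversity + age_std",
    data=features,
).fit()
print(model.summary())  # coefficients and p-values for each diversity metric
```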
Clustering algorithms (e.g., k-means) will group movies based on diversity metrics, identifying clusters linked to high failure rates. For visualization, we will use interactive parallel coordinates plots to simultaneously visualize multiple diversity metrics alongside failure indicators and identify trends or patterns across movies. The interactivity will enable highlighting specific movie samples.
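A k-means sketch over standardized diversity metrics, reusing the same hypothetical feature table as above:

```python
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

diversity_cols = ["gender_diversity", "ethnicity_diversity", "age_std"]
X = StandardScaler().fit_transform(features[diversity_cols])

features["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Share of low-rated movies (vote_average < 6.0) per cluster as a simple failure rate.
print(features.groupby("cluster")["vote_average"].apply(lambda s: (s < 6.0).mean()))

# Interactive parallel coordinates: diversity metrics alongside the rating, colored by cluster.
fig = px.parallel_coordinates(features, dimensions=diversity_cols + ["vote_average"], color="cluster")
fig.show()
```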
A director's filmography can be characterized by the diversity of genres in their films. A failure indicator for each director can be constructed by averaging the revenues or ratings of their films across genres. The first phase of the analysis involves assembling these profiles. The next step is to cluster them to identify patterns in film failure related to directors' filmographies. Clustering techniques such as k-means (or k-medoids) are used to group directors based on their filmographies, and the silhouette score is used to evaluate cluster quality and help identify distinct career patterns. Cluster centroids and medoids are displayed to illustrate the typical patterns or trends found within each group.
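The sketch below outlines the profile construction and a silhouette-based choice of the number of clusters; the column names assumed for `movie_directors_actors.csv` are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumed columns: director, genres, vote_average, revenue.
films = pd.read_csv("data/movie_directors_actors.csv")

# One profile per director: filmography size, genre diversity, average reception and revenue.
profiles = films.groupby("director").agg(
    n_films=("director", "size"),
    n_genres=("genres", "nunique"),
    mean_rating=("vote_average", "mean"),
    mean_revenue=("revenue", "mean"),
).dropna()

X = StandardScaler().fit_transform(profiles)
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```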
Current analysis uses violin plots for profit distributions, scatter plots for rating-popularity relationships, ROI analysis, and 5-year moving averages for genre evolution. Further refinements could include: regional market segmentation to compare genre performance across cultures, developing a composite risk score combining financial and critical metrics, analyzing genre hybridization effects on failure rates, and identifying genre-specific budget thresholds for optimal risk-return profiles. This would create a more comprehensive understanding of how genres perform in different contexts and market conditions.
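Continuing from the financial-metrics sketch, the 5-year genre trend could be computed roughly as follows (assuming the genres column has been parsed into lists):

```python
# Mean ROI per genre and year, smoothed with a 5-year rolling window.
genre_year = (
    movies.explode("genres")
          .groupby(["genres", "release_year"])["roi"]
          .mean()
          .reset_index()
)
genre_year["roi_5y"] = (
    genre_year.groupby("genres")["roi"]
              .transform(lambda s: s.rolling(window=5, min_periods=1).mean())
)
```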
Current analysis employs violin plots for seasonal and monthly distributions, temporal trend analysis, and success/failure rate tracking. Potential enhancements include: analyzing holiday-specific effects, creating a competition index based on concurrent releases, examining genre-timing interactions, studying regional variations in optimal release windows, and developing a predictive model incorporating marketing spend and critical reviews. This would provide deeper insights into how timing decisions impact movie performance across different contexts.
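A minimal seasonal view, again reusing the `movies` frame and assuming a `release_date` column is available after the merge:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

movies["release_month"] = pd.to_datetime(movies["release_date"], errors="coerce").dt.month
sns.violinplot(data=movies, x="release_month", y="vote_average")
plt.title("Rating distribution by release month")
plt.tight_layout()
plt.show()
```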
To investigate the relationship between narrative tropes and audience reception, we established a rating threshold of 6.0 on a 10-point scale to distinguish between low- and high-rated films. Our first step was to identify the 20 most common tropes in low-rated movies. We then analyzed tropes within specific genres, focusing on Horror, Adventure, and Comedy films for this initial analysis, and calculated the ratio of each trope's occurrence in low-rated versus high-rated films. The results were visualized using bar plots showing tropes that might contribute to negative audience reception. Next steps include completing the plots for all genres and analyzing combinations of tropes.
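A sketch of the ratio computation, assuming `cmu_tropes.csv` has one (movie, trope) pair per row together with the movie's average rating (column names are illustrative):

```python
import pandas as pd

tropes = pd.read_csv("data/cmu_tropes.csv")
tropes["low_rated"] = tropes["vote_average"] < 6.0

# Count how often each trope occurs in low- vs high-rated films.
counts = tropes.groupby(["trope", "low_rated"]).size().unstack(fill_value=0)
counts.columns = ["high_rated", "low_rated"]  # False -> high, True -> low

# Smoothed low/high ratio: values > 1 flag tropes over-represented in low-rated films.
counts["low_high_ratio"] = (counts["low_rated"] + 1) / (counts["high_rated"] + 1)
print(counts.sort_values("low_high_ratio", ascending=False).head(20))
```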
| Deliverable | Expected Date |
|---|---|
| Data preprocessing | 13/11/2024 |
| Data analysis | 14/11/2024 |
| Website setup | 22/11/2024 |
| Group visualizations | 13/12/2024 |
| Storytelling | 19/12/2024 |
- JX: Questions 1, 2
- RL: Questions 4, 5
- RW: Question 3
- AZ: Questions 6, 7
- AO: Questions 6, 7
├── data <- Project data files
│ │ cmu_tmdb.csv
│ │ movie_actors.csv
│ │ movie_directors_actors.csv
│ │ cmu_tropes.csv
│ │ wikidata_ethnicities.csv
│ │
│ ├───cmu
│ │ character.metadata.tsv
│ │ movie.metadata.tsv
│ │ name.clusters.txt
│ │ plot_summaries.txt
│ │ tvtropes.clusters.txt
│ │
│ ├───imdb
│ │ name.basics.tsv
│ │ title.basics.tsv
│ │ title.crew.tsv
│ │ title.principals.tsv
│ │ title.ratings.tsv
│ │
│ ├───tmdb
│ │ TMDB_movie_dataset_v11.csv
│ │
│ └───tropes
│ film_imdb_match.csv
│ tropes.csv
│
├── src <- Source code
│ ├── utils <- Utility directory
│ ├── scripts <- Scripts (e.g., preprocess_data.py)
│
├── results.ipynb <- Well-structured notebook showing the results
│
├── .gitignore <- List of files ignored by git
├── pip_requirements.txt <- File for installing python dependencies
└── README.md