diff --git a/.DS_Store b/.DS_Store
new file mode 100644
index 0000000..0aaa74b
Binary files /dev/null and b/.DS_Store differ
diff --git a/AssignmentDprep.md b/AssignmentDprep.md
new file mode 100644
index 0000000..4cfbc9d
--- /dev/null
+++ b/AssignmentDprep.md
@@ -0,0 +1,381 @@
+## Research Motivation
+
+The relationship between the number of episodes a TV show is set to have
+and its average rating is a crucial yet insufficiently studied area in
+the field of media research. Since competition among streaming platforms
+and TV networks is rising, uncovering and understanding any factor that
+may influence TV show ratings is paramount for optimizing content.
+Moreover, as adult shows may benefit from having more episodes due to
+possibly having more complex or mature story lines, researching whether
+the effect of episode count on ratings differs for this genre offers
+additional value to this research. This study therefore aims to answer
+the question: “To what extent does the number of a TV show’s episodes
+impact its average rating, and does this differ between adult titles and
+non-adult titles?” The insights gained from this research could assist
+producers in making more informed decisions with regard to episode
+count when creating content.
+
+A multiple linear regression will be the applied research method, with
+average show rating as the dependent variable. The independent variables
+will consist of the continuous variable “number of episodes”, as well as
+the dummy variable “adult title” (with 1 for adult shows, 0 for
+non-adult shows). By including the interaction term episodesXadult, we
+can also assess a potential difference in effect between adult and
+non-adult titles. This linear regression method effectively addresses
+the objective of this research as it quantifies the impact of episode
+count on ratings while also allowing an interaction term to assess
+whether this effect differs for the adult genre.
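The regression described above could be sketched in base R as follows. This is a minimal illustration and not the project's own code: the data are simulated, the coefficient values used to generate them are arbitrary, and the column names `episode_count`, `isAdult` and `averageRating` are assumed to match the merged data set shown later in the report. The model is rating = b0 + b1 * episodes + b2 * adult + b3 * (episodes x adult) + error, where b3 captures whether the episode effect differs for adult titles.

```r
# Simulated stand-in data; the numbers are illustrative, not results.
set.seed(1)
n <- 200
shows <- data.frame(
  episode_count = sample(1:200, n, replace = TRUE),
  isAdult       = rbinom(n, 1, 0.2)   # dummy: 1 = adult title, 0 = non-adult
)
shows$averageRating <- 6.5 +
  0.002 * shows$episode_count +
  0.3   * shows$isAdult -
  0.004 * shows$episode_count * shows$isAdult +
  rnorm(n, sd = 0.8)

# `episode_count * isAdult` expands to both main effects plus their
# interaction (the "episodesXadult" term from the text).
model <- lm(averageRating ~ episode_count * isAdult, data = shows)

# Four coefficients: intercept, episode_count, isAdult, episode_count:isAdult
print(coef(model))
```

The coefficient named `episode_count:isAdult` is the interaction: its sign and size indicate how the slope on episode count changes for adult titles relative to non-adult titles.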
+
+## Data exploration
+
+This report provides an overview of the three IMDb datasets that we are
+using in our research. We explore the raw data files and explain the
+variables to understand the structure and content of the data.
+
+The following packages are required for this project:
+
+    library(tidyr)
+    library(dplyr)
+    library(readr)
+    library(knitr)
+    library(ggplot2)
+    library(kableExtra)
+
+### Load the data files
+
+Load the ‘title basics’, ‘title ratings’ & ‘title episode’ datasets.
+
+### Explanation of the data files
+
+#### title.basics.tsv.gz
+
+This file contains basic information about the movie and TV show titles
+in the IMDb database.
+
+Variables in title.basics
+
+| Variable | Description |
+|---|---|
+| tconst | Alphanumeric unique identifier of the title. |
+| titleType | Type of title (e.g., movie, short, tvSeries, tvEpisode). |
+| primaryTitle | The most popular title at the time of release. |
+| originalTitle | Title in the original language. |
+| isAdult | Indicates whether the title is adult content (0: No, 1: Yes). |
+| startYear | The year the title was first released. |
+| endYear | The year the title ended (NA for non-series). |
+| runtimeMinutes | Runtime of the title in minutes. |
+| genres | Includes up to three genres associated with the title. |
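The report does not show how the files are read. A minimal sketch in base R is given here, under stated assumptions: `read_imdb` is a hypothetical helper name, the project itself may use `readr` instead, and the two sample rows are a stand-in written to a temporary file so the example runs without downloading anything. The IMDb files are tab-separated and encode missing values as the literal string `\N`.

```r
# Hypothetical helper (not from the project): read one IMDb TSV file.
read_imdb <- function(path) {
  read.delim(
    path,
    sep = "\t",
    quote = "",            # titles may contain quote characters
    na.strings = "\\N",    # IMDb writes missing values as \N
    stringsAsFactors = FALSE
  )
}

# Tiny stand-in file with the title.basics column layout.
sample_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres",
  "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\tDocumentary,Short",
  "tt0000005\tshort\tBlacksmith Scene\tBlacksmith Scene\t0\t1893\t\\N\t1\tComedy,Short"
), sample_path)

title_basics <- read_imdb(sample_path)
str(title_basics)
```

Setting `na.strings` is the important part: without it, columns such as `endYear` would be read as text containing `\N` rather than as missing values.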
+
+View the first rows of the data.
+
+    ## # A tibble: 6 × 9
+    ##   tconst    titleType primaryTitle           originalTitle          isAdult startYear endYear runtimeMinutes genres
+    ## 1 tt0000001 short     Carmencita             Carmencita                   0      1894      NA              1 Documentar…
+    ## 2 tt0000002 short     Le clown et ses chiens Le clown et ses chiens       0      1892      NA              5 Animation,…
+    ## 3 tt0000003 short     Pauvre Pierrot         Pauvre Pierrot               0      1892      NA              5 Animation,…
+    ## 4 tt0000004 short     Un bon bock            Un bon bock                  0      1892      NA             12 Animation,…
+    ## 5 tt0000005 short     Blacksmith Scene       Blacksmith Scene             0      1893      NA              1 Comedy,Sho…
+    ## 6 tt0000006 short     Chinese Opium Den      Chinese Opium Den            0      1894      NA              1 Short
+
+Figure 1 compares the number of titles flagged as adult content
+(‘is adult’ = 1) with the number of titles that are not. The vast
+majority of titles are not adult content: more than 10 million titles
+lack the ‘is adult’ flag, whereas around 350,000 titles carry it.
+
+#### title.ratings.tsv.gz
+
+This file contains user ratings and the number of votes for each title.
+
+Variables in title.ratings
+
+| Variable | Description |
+|---|---|
+| tconst | Alphanumeric unique identifier of the title. |
+| averageRating | Weighted average of all user ratings. |
+| numVotes | Number of votes the title has received. |
+
+View the first rows of the data.
+
+    ## # A tibble: 6 × 3
+    ##   tconst    averageRating numVotes
+    ## 1 tt0000001           5.7     2088
+    ## 2 tt0000002           5.6      283
+    ## 3 tt0000003           6.5     2092
+    ## 4 tt0000004           5.4      184
+    ## 5 tt0000005           6.2     2825
+    ## 6 tt0000006           5        196
+
+Analyse the data
+
+![Figure 2: Distribution of Average
+Ratings](AssignmentDprep_files/figure-markdown_strict/unnamed-chunk-8-1.png)
+
+Figure 2 shows the distribution of average ratings across the titles.
+Most ratings fall between 6.0 and 8.0, with a peak around 7.5. Low
+ratings are comparatively rare.
+
+Number of Titles per Voting Category
+
+| vote_category | count |
+|---|---|
+| 0-100 | 1107998 |
+| 101-1,000 | 278778 |
+| 1,001-10,000 | 75587 |
+| 10,001-50,000 | 10496 |
+| 50,001-100,000 | 2149 |
+| 100,001+ | 2796 |
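One way such voting categories could be derived is with `cut()`, sketched here in base R. The break points are taken from the category labels in the table; the vote counts are illustrative values, and the variable names are assumptions rather than the project's own code.

```r
# Illustrative vote counts, not the full data set.
num_votes <- c(18, 100, 101, 2088, 10496, 75000, 250000)

# Upper bounds of each category, matching the labels in the table.
breaks <- c(0, 100, 1000, 10000, 50000, 100000, Inf)
labels <- c("0-100", "101-1,000", "1,001-10,000",
            "10,001-50,000", "50,001-100,000", "100,001+")

# Intervals are closed on the right, so 100 falls in "0-100" and
# 101 falls in "101-1,000"; include.lowest keeps a count of 0 in the
# first category.
vote_category <- cut(num_votes, breaks = breaks, labels = labels,
                     include.lowest = TRUE)

print(table(vote_category))
```

Because the intervals are right-closed, each boundary value (100, 1,000, and so on) is assigned to the lower category, which agrees with the way the labels are written.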
+
+Table 3 shows how many votes the titles received. The majority have 100
+votes or fewer, and only about 5,000 titles have more than 50,000 votes.
+
+#### title.episode.tsv.gz
+
+This file contains information about TV show episodes.
+
+Variables in title.episode
+
+| Variable | Description |
+|---|---|
+| tconst | Alphanumeric identifier of the episode. |
+| parentTconst | Identifier of the parent TV series. |
+| seasonNumber | The season number the episode belongs to. |
+| episodeNumber | The episode number within the season. |
+
+View the first rows of the data.
+
+    ## # A tibble: 6 × 4
+    ##   tconst    parentTconst seasonNumber episodeNumber
+    ## 1 tt0031458 tt32857063             NA            NA
+    ## 2 tt0041951 tt0041038               1             9
+    ## 3 tt0042816 tt0989125               1            17
+    ## 4 tt0042889 tt0989125              NA            NA
+    ## 5 tt0043426 tt0040051               3            42
+    ## 6 tt0043631 tt0989125               2            16
+
+Number of TV Series per Episode Category
+
+| episode_category | count |
+|---|---|
+| 1-5 | 71583 |
+| 6-10 | 48359 |
+| 11-20 | 35832 |
+| 21-50 | 27500 |
+| 51-100 | 12024 |
+| 100+ | 14438 |
+
+Summary Statistics for Number of Episodes per TV Series
+
+| Minimum | Maximum | Mean | Median |
+|---|---|---|---|
+| 1 | 18593 | 40.56 | 8 |
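Episode counts per series and their summary statistics could be computed along these lines. This is a base R sketch and not the project's own pipeline: it uses only the six sample rows of title.episode printed above, so its numbers will not match the table, and `episode_counts` and `summary_stats` are assumed names.

```r
# The six sample rows of title.episode printed in the report.
title_episode <- data.frame(
  tconst       = c("tt0031458", "tt0041951", "tt0042816",
                   "tt0042889", "tt0043426", "tt0043631"),
  parentTconst = c("tt32857063", "tt0041038", "tt0989125",
                   "tt0989125", "tt0040051", "tt0989125"),
  stringsAsFactors = FALSE
)

# Count the episodes that belong to each parent series.
episode_counts <- aggregate(tconst ~ parentTconst, data = title_episode,
                            FUN = length)
names(episode_counts)[2] <- "episode_count"

# The same four statistics as in the summary table.
summary_stats <- c(
  Minimum = min(episode_counts$episode_count),
  Maximum = max(episode_counts$episode_count),
  Mean    = mean(episode_counts$episode_count),
  Median  = median(episode_counts$episode_count)
)

print(episode_counts)
print(summary_stats)
```

Reporting the mean and the median side by side is useful here because a few very long-running series pull the mean far above the median.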
+
+Tables 5 and 6 show how many episodes the TV series have. The longest
+series in the dataset has 18,593 episodes. The mean is 40.56 episodes
+per TV series, while the median is 8.
+
+### Merging the data
+
+View the first rows of the merged data set.
+
+    ## # A tibble: 6 × 13
+    ##   tconst    titleType primaryTitle           originalTitle isAdult startYear endYear runtimeMinutes genres episode_count
+    ## 1 tt0035599 tvSeries  Voice of Firestone Te… Voice of Fir…       0      1943    1947             15 <NA>               1
+    ## 2 tt0035803 tvSeries  The German Weekly Rev… Die Deutsche…       0      1940    1945             NA Docum…             8
+    ## 3 tt0038276 tvSeries  You Are an Artist      You Are an A…       0      1946    1955             15 Talk-…             7
+    ## 4 tt0039120 tvSeries  Americana              Americana           0      1947    1949             30 Famil…             4
+    ## 5 tt0039121 tvSeries  Birthday Party         Birthday Par…       0      1947    1949             30 Family            NA
+    ## 6 tt0039122 tvSeries  The Borden Show        The Borden S…       0      1947      NA             30 Comed…             6
+    ## # ℹ 3 more variables: averageRating, numVotes, vote_category
+
+### Cleaning the data
+
+View the first rows of the cleaned data set.
+
+    ## # A tibble: 6 × 9
+    ##   tconst    titleType primaryTitle             originalTitle  isAdult episode_count averageRating numVotes vote_category
+    ## 1 tt0035803 tvSeries  The German Weekly Review Die Deutsche …       0             8           8         63 0-100
+    ## 2 tt0039120 tvSeries  Americana                Americana            0             4           2.7       18 0-100
+    ## 3 tt0039123 tvSeries  Kraft Theatre            Kraft Televis…       0           587           8        224 101-1,000
+    ## 4 tt0039125 tvSeries  Public Prosecutor        Public Prosec…       0            18           5.9       35 0-100
+    ## 5 tt0040021 tvSeries  Actor's Studio           Actor's Studio       0            65           6.9       93 0-100
+    ## 6 tt0040028 tvSeries  Talent Scouts            Talent Scouts        0            55           6.1       26 0-100
+
+### Filter TV series with a minimum of 25 votes
+
+View the first rows of the filtered data set.
+
+    ## # A tibble: 6 × 9
+    ##   tconst    titleType primaryTitle             originalTitle  isAdult episode_count averageRating numVotes vote_category
+    ## 1 tt0035803 tvSeries  The German Weekly Review Die Deutsche …       0             8           8         63 0-100
+    ## 2 tt0039123 tvSeries  Kraft Theatre            Kraft Televis…       0           587           8        224 101-1,000
+    ## 3 tt0039125 tvSeries  Public Prosecutor        Public Prosec…       0            18           5.9       35 0-100
+    ## 4 tt0040021 tvSeries  Actor's Studio           Actor's Studio       0            65           6.9       93 0-100
+    ## 5 tt0040028 tvSeries  Talent Scouts            Talent Scouts        0            55           6.1       26 0-100
+    ## 6 tt0040034 tvSeries  Candid Camera            Candid Camera        0            13           7        157 101-1,000
+
+### References
+
+IMDb Datasets:
diff --git a/AssignmentDprep.pdf b/AssignmentDprep.pdf
new file mode 100644
index 0000000..4feee09
Binary files /dev/null and b/AssignmentDprep.pdf differ
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..38e6d04
--- /dev/null
+++ b/README.md
@@ -0,0 +1,38 @@
+# IMDb: How many episodes should you make?
+This project estimates how the number of episodes affects the rating of a TV show, and whether this differs between adult and non-adult titles. We created a tool to accurately predict the average rating. This tool helps producers make more informed decisions with regard to the episode count when creating TV shows.
+
+## Research question
+"To what extent does the number of a TV show's episodes impact its average rating, and does this differ between adult titles and non-adult titles?"
+
+## Research motivation
+The relationship between the number of episodes a TV show is set to have and its average rating is a crucial yet insufficiently studied area in the field of media research. Since competition among streaming platforms and TV networks is rising, uncovering and understanding any factor that may influence TV show ratings is paramount for optimizing content. Moreover, as adult shows may benefit from having more episodes due to possibly having more complex or mature story lines, researching whether the effect of episode count on ratings differs for this genre offers additional value to this research. This study therefore aims to answer the question: "To what extent does the number of a TV show's episodes impact its average rating, and does this differ between adult titles and non-adult titles?" The insights gained from this research could assist producers in making more informed decisions with regard to episode count when creating content. In addition, findings on how episode count affects viewer ratings can help the academic community better understand the psychology of liking, and can help economists design more accurate models of the media industry.
+
+## Research method
+A multiple linear regression will be the applied research method, with average show rating as the dependent variable. The independent variables will consist of the continuous variable "number of episodes", as well as the dummy variable "adult title" (with 1 for adult shows, 0 for non-adult shows). By including the interaction term episodesXadult, we can also assess a potential difference in effect between adult and non-adult titles. This linear regression method effectively addresses the objective of this research as it quantifies the impact of episode count on ratings while also allowing an interaction term to assess whether this effect differs for the adult genre.
+
+## Results
+
+~~
+
+## Repository overview
+
+~~
+
+## Running instructions
+
+~~
+
+
+## Resources
+#### IMDb datasets
+https://datasets.imdbws.com/title.basics.tsv.gz
+https://datasets.imdbws.com/title.episode.tsv.gz
+https://datasets.imdbws.com/title.ratings.tsv.gz
+
+## About
+#### Authors
+Team 7:
+- [Martijn Hendriks](https://github.com/MartijnHendriks), e-mail: m.hendriks@tilburguniversity.edu
+- [Mauro de Kort](https://github.com/Maurodekort), e-mail: m.dekort_3@tilburguniversity.edu
+- [Sem Niezink](https://github.com/semniezinktil), e-mail: s.d.niezink@tilburguniversity.edu
+- [Ruben van der Thiel](https://github.com/rubenvanderthiel), e-mail: r.r.t.vdrthiel@tilburguniversity.edu
diff --git a/gen/.DS_Store b/gen/.DS_Store
new file mode 100644
index 0000000..5257205
Binary files /dev/null and b/gen/.DS_Store differ
diff --git a/src/.DS_Store b/src/.DS_Store
new file mode 100644
index 0000000..e3c05cc
Binary files /dev/null and b/src/.DS_Store differ