course-dprep / investigating_imdb_ratings

Tilburg University project team investigating IMDb ratings for movies and tv series.

Home Page: https://github.com/course-dprep/investigating_imdb_ratings

Makefile 15.84% R 84.16%

imdb movies ratings tvseries

investigating_imdb_ratings's Introduction

Contributors:

Lex Vogels
Floris van Berloo
Kay van der Linden
Thomas Gadellaa
Mehmet Eren Erdoğan

Investigating the predictors of IMDb Ratings for Movies & Series

Project Description

In the current digital era of entertainment, rating systems carry considerable weight, so it is crucial to understand which factors might influence these ratings. External factors such as viewers' satisfaction, the perceived quality of the title (movie or tv series), and where the title is watched might all affect the rating a viewer gives. From the perspective of creators, however, it is more interesting to look at factors they can decide on themselves. What is the influence of certain actors on title ratings? Which genres tend to be rated higher? And do consumers enjoy longer or shorter titles?

The current research aims to deliver useful insights for content makers about which factors actually influence ratings and whether there are differences between the way in which movies are rated versus tv series. Therefore, we have formulated the following research question:

What is the effect of title genres, actors' rating, runtime of the title, and time elapsed since release on the average rating of the title, and how does the effect depend on whether the title is a movie or a tv series?

Data Availability and Provenance Statements

The project uses publicly available data from https://datasets.imdbws.com. The data is published by IMDb, an organisation that is independent of the creators of the movies and tv series we study. IMDb is also one of the most popular platforms for rating movies worldwide. For these reasons, the information extracted from IMDb can be considered relatively trustworthy. In addition, a dataset that was previously created using information from https://www.the-numbers.com/ is used to measure the "star power" of actors. Since this information is also publicly available, we can use it for our research. The dataset is downloaded from a Dropbox link that can be shared with and used by future researchers.

Summary of Availability

All data sources are publicly available: the IMDb datasets via IMDb Developer and the star power dataset (derived from the-numbers) via Dropbox. The research can therefore easily be replicated and extended in future projects on IMDb ratings and the variables that affect the rating of a movie and/or tv series.

Details on each Data Source

To analyse and predict the ratings of movies and tv series, we use four separate datasets that are prepared and merged so that the analysis can be run on one final dataset. The four datasets are listed below, with details on which variables each contains and which of them we consider valuable for our research. The descriptions therefore also cover part of the cleaning process.

Title basics: From the Title basics dataset we use several columns, as independent variable, moderator, or control variable. Runtime and Genres are two of our independent variables. Title Type functions as a moderator (movies vs. tv series). Finally, we use the year the title was released as a control variable because, for example, titles released longer ago have generally received more reviews from so-called "laggards" (late adopters), which might result in lower average ratings than for titles released very recently.
Title ratings: The Title ratings dataset provides our dependent variable, Average Rating, as well as Number of Votes, which is also used in our analysis.
Name basics: From the Name basics dataset we extract which actors/actresses are linked to which movie and/or tv series titles. In this way, we can analyse the effect of these people on the rating of a title, so the actors/actresses linked to a title form one of the independent variables.
Star power: This dataset is used to compute the average ranking of all ranked actors in a movie/series. Additionally, we create a dummy variable indicating whether actors are considered "superstars".
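
As a minimal sketch of how one of the IMDb files could be loaded in R: the files on datasets.imdbws.com are tab-separated and mark missing values with the literal string `\N`. The two rows below are made-up stand-ins for title.basics.tsv, and the object names are hypothetical rather than taken from the project's scripts.

```r
# Made-up stand-in for title.basics.tsv (tab-separated, \N marks missing values).
raw <- paste(
  "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\tgenres",
  "tt0000001\tmovie\tExample One\t1999\t\\N\tDrama,Romance",
  "tt0000002\ttvSeries\tExample Two\t\\N\t45\tComedy",
  sep = "\n"
)

title_basics <- read.delim(
  text = raw,
  na.strings = "\\N",       # turn IMDb's \N marker into NA
  quote = "",               # titles may contain quote characters
  stringsAsFactors = FALSE
)

str(title_basics)
```

For a downloaded file, `text = raw` would be replaced by the path to the .tsv file.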

Repository overview

├── README.md
├── makefile
├── .gitignore
├── data
├── gen
│   ├── analysis
│   ├── data-preparation
│   └── paper
└── src
    ├── analysis
    ├── data-preparation
    └── paper

Dataset list and variable structure of final dataset

In our research, we have used the following data sets:

title_basics.tsv
title_ratings.tsv
name_basics.tsv
starPower.csv

Prior to data cleaning, we explored the data sets extensively. To learn more about the data sets, their variable structure, and summary statistics, knit the following .Rmd file to HTML: src/data-preparation/r_markdown_data_acquisition.Rmd.

Listed below are all 17 variables that remain after cleaning the datasets and running the analyses:

Variable	Description
tconst	Identifier variable
averageRating	The rating of a TV series/movie
numvotes	The number of votes on the rating
titleType	Specifies whether it is a TV series or a movie
primaryTitle	Title of a TV series/movie
startYear	Year of release
runtimeMinutes	Duration of a TV series/movie in minutes
genres	All different genres of the TV series/movie
time_since_release	Years since release, calculated as the current year (2023) - startYear
Drama	Dummy variable for the genre Drama
Comedy	Dummy variable for the genre Comedy
Documentary	Dummy variable for the genre Documentary
Romance	Dummy variable for the genre Romance
Action	Dummy variable for the genre Action
Other	Dummy variable for the genre Other
mean_ranking	Mean ranking of the ranked actors in the title, according to the starPower dataset
is_superstar	Dummy variable for superstars, according to the starPower dataset
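
The derived columns in the table can be sketched as follows. This is an illustration on made-up rows, not the project's cleaning script; in particular, the definition of `Other` (a title with none of the five main genres) is an assumption.

```r
# Made-up example rows with the IMDb-style comma-separated genres column.
titles <- data.frame(
  tconst = c("tt1", "tt2", "tt3"),
  startYear = c(2010, 1995, 2021),
  genres = c("Drama,Romance", "Comedy", "Horror,Thriller"),
  stringsAsFactors = FALSE
)

# Years since release, relative to the reference year named in the table.
reference_year <- 2023
titles$time_since_release <- reference_year - titles$startYear

# One 0/1 dummy per main genre, based on whether the genre name occurs in the string.
main_genres <- c("Drama", "Comedy", "Documentary", "Romance", "Action")
for (g in main_genres) {
  titles[[g]] <- as.integer(grepl(g, titles$genres, fixed = TRUE))
}

# Assumed definition: "Other" flags titles that carry none of the five main genres.
titles$Other <- as.integer(rowSums(titles[main_genres]) == 0)
```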

Running the code

To run the code, follow these instructions:

Fork this repository
Open your command line / terminal and run the following code:

git clone https://github.com/{your username}/investigating_imdb_ratings

Set your working directory to investigating_imdb_ratings and run the following command:

make

In our repository, make is structured as follows:

a. Firstly, there are three makefiles. The makefile in the root of the repository starts the data-preparation and analysis steps.

b. The data-preparation makefile ensures that everything is cleaned and merged properly, step by step.

c. Finally, the analysis makefile runs the analysis.

To remove all raw data and generated files created during the process, run the following command in the command line / terminal:

make clean

Analysis

Once the data set was ready for analysis, we ran linear regressions to evaluate the effect of actors on averageRating and to compare which variables contribute to the averageRating of tv series (tvSeries) and movies in our data set.

Initial Regression Output:

Looking at the output, we notice that all variables except the Comedy genre have a significant effect on the average rating.

We tried different transformations and found the Box-Cox transformation to be the most suitable.

In the transformed model, mean_ranking and is_superstar are negatively related to the dependent variable.
The Romance and Action genres also affect averageRating negatively.
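
To illustrate the Box-Cox step, here is a hedged sketch in base R on synthetic data. It picks the exponent lambda by a simple grid search over the profile log-likelihood and then refits the model on the transformed rating. The predictors, the grid, and all object names are assumptions for illustration; the project's own analysis may well rely on a package routine such as `MASS::boxcox` instead.

```r
set.seed(1)
n <- 200
d <- data.frame(
  mean_ranking = runif(n, 1, 1000),
  runtimeMinutes = runif(n, 20, 180)
)
# Synthetic ratings clipped to [1, 10]; Box-Cox requires a strictly positive response.
d$averageRating <- pmin(10, pmax(1, 7 - 0.002 * d$mean_ranking + rnorm(n, sd = 0.8)))

# Box-Cox transform: (y^lambda - 1) / lambda, with log(y) as the limit at lambda = 0.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of lambda for a fixed set of predictors.
profile_loglik <- function(lambda, data) {
  data$y_bc <- box_cox(data$averageRating, lambda)
  fit <- lm(y_bc ~ mean_ranking + runtimeMinutes, data = data)
  n_obs <- nrow(data)
  rss <- sum(residuals(fit)^2)
  -n_obs / 2 * log(rss / n_obs) + (lambda - 1) * sum(log(data$averageRating))
}

lambdas <- seq(-2, 4, by = 0.1)
loglik <- vapply(lambdas, profile_loglik, numeric(1), data = d)
best_lambda <- lambdas[which.max(loglik)]

# Refit the regression on the transformed rating.
d$rating_bc <- box_cox(d$averageRating, best_lambda)
fit_bc <- lm(rating_bc ~ mean_ranking + runtimeMinutes, data = d)
summary(fit_bc)
```

Coefficients of the refitted model are on the transformed scale, so their signs can be compared with the untransformed model but their magnitudes cannot.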

Regression Output for the Series

In the linear regression model for tvSeries, is_superstar plays no significant role in the rating of the series.
The Drama genre contributes positively and significantly to the average rating of the series.
Romance is also significant, but in the negative direction.

For detailed information, see the file gen/analysis/output/linear_regression_analysis.html.
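
One way to compare movies and series is to fit the same regression separately per titleType. The sketch below does this on synthetic data; the set of predictors and the simulated effects are assumptions for illustration only and do not reproduce the project's results.

```r
set.seed(42)
n <- 300
titles <- data.frame(
  titleType = sample(c("movie", "tvSeries"), n, replace = TRUE),
  Drama = rbinom(n, 1, 0.4),
  Romance = rbinom(n, 1, 0.2),
  is_superstar = rbinom(n, 1, 0.1),
  stringsAsFactors = FALSE
)
# Synthetic ratings, only there to make the example runnable.
titles$averageRating <- 6.5 + 0.4 * titles$Drama - 0.3 * titles$Romance +
  rnorm(n, sd = 0.9)

# Fit the same model once per title type.
fits <- lapply(
  split(titles, titles$titleType),
  function(d) lm(averageRating ~ Drama + Romance + is_superstar, data = d)
)

# Coefficients side by side: one column per title type.
coef_table <- sapply(fits, coef)
coef_table
```

An alternative design would be a single pooled model with titleType interaction terms, which tests the moderation directly instead of comparing two separate fits.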

investigating_imdb_ratings's People

Forkers

goikonomou25 vscanturk

investigating_imdb_ratings's Issues

create the new gen folders

Make sure the gen folder is shown again by creating it manually, as Hannes showed us.
Make empty .gitkeep files

Floris RMarkdown Issue

Description

Evaluate the name.basics

Deliverables

R markdown

Lex RMarkdown Issue

Think about the threshold for the number of votes

Think about the threshold for the number of votes: are we going to filter for a minimum number of votes? Talk to the team via WhatsApp when done.

Create visualisation for our Makefile

Deliverable: A first version and idea creation on how to build the Makefile

Goal is to have a setup for the makefile such that later on we can easily create project automation.

Create new Genre Variable in the Merged Dataset

In the IMDb dataset the genres are specified in one column. We want to create dummy variables for the 5 most popular genres (-> data-driven approach). Furthermore, we leave one genre out, e.g., Comedy, which is set as the baseline for our regression.

filter for extreme average ratings (<1, >10)

check if all average ratings are between 1 and 10
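
A small sketch of such a range check in R, on made-up rows (the object names are hypothetical):

```r
ratings <- data.frame(
  tconst = c("tt1", "tt2", "tt3", "tt4", "tt5"),
  averageRating = c(7.4, 0.5, 10, 11.2, NA),
  stringsAsFactors = FALSE
)

# Keep only rows with a non-missing rating inside the valid 1-10 range.
in_range <- !is.na(ratings$averageRating) &
  ratings$averageRating >= 1 &
  ratings$averageRating <= 10

ratings_ok <- ratings[in_range, ]
n_dropped <- sum(!in_range)
```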

Delete all template folders

Floris clean-up

This issue is meant for Floris.

In the issue list, there are multiple duplicates and issues that have to do with Floris. Right now, I (Kay) do not know which of these should be kept and which can be deleted. Therefore, Floris should clean his issues and branches, to keep the repository clean.

Deliverable: Floris updates/deletes/finishes his issues and branches.

De-stringing the 'knownfortitles' column

De-string the knownfortitles column to convert the wide dataset to a long dataset.
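
A minimal base R sketch of this wide-to-long step, on made-up rows. The column names nconst and knownForTitles follow the IMDb files; the object names are hypothetical.

```r
names_wide <- data.frame(
  nconst = c("nm1", "nm2"),
  knownForTitles = c("tt1,tt2,tt3", "tt2"),
  stringsAsFactors = FALSE
)

# Split the comma-separated title ids into one character vector per person.
title_list <- strsplit(names_wide$knownForTitles, ",", fixed = TRUE)

# Repeat each person id once per title, giving one row per (person, title) pair.
names_long <- data.frame(
  nconst = rep(names_wide$nconst, lengths(title_list)),
  tconst = unlist(title_list),
  stringsAsFactors = FALSE
)
```

The long format keeps every title a person is known for, so it can be joined to the title data on tconst directly.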

new branch Floris

Add update log (in ReadMe)

In the self-study material they recommend adding when the last update was (see image). Maybe we can also include a short update log of what is/was done on which date.

Upload Starpower dataset

From my thesis I have received a dataset that contains the unique identifier of the crew of a title. In this dataset the 'Rank' and the 'Earnings' are specified per unique identifier. We can use both these variables for the regression.

Important
Update the Readme so it also specifies that we use this dataset.

Update README

Fully update README with current project updates.

Remove the originalTitle, isAdult, filter the titleType & split the genre.

Description

Deal with data wrangling of title_basics dataset. It should be cleaned to be merged.

Deliverables

A clean dataset.

Creating issues first week

Goal

Creating issues for the team members

Deliverable

Issues for coaching sessions 1 & 2

Discuss for the weekly zoom meeting

Description

Let's make a tradition of having a zoom meeting each week. The time and the day will be decided after Tuesday's class.

Deliverables

Zoom meeting time and day.

Ask Hannes about the "title split" & duplicate movies.

Description

During the merge of our datasets, we split the 'knownforTitles' column into 4 and merged 'title_1' with 'tconst'. We should ask whether this process is correct.

In addition, we should also ask about removing the duplicate movies with the unique() function.

To be honest, I think these won't make crucial differences but asking Hannes about them is a +1 for the group so...

Deliverables

Nothing to be honest.

Update the README & try to push it.

Description

The README that's in the master branch contains almost nothing. A group member should focus on building that README according to the rules in the website. However, to do so, the project scope must be clearly defined.

Notes

Whoever edits the README, please try to keep it as simple as possible, both in terms of project scope and the README in general.

Deliverables

A fresh README.

Push the script.

Description

Hey, you can find the script with small comments on it. You need to manually download the other databases from imdb if you want to fool around with them.

Deliverables

Script.

Filter the primaryProfession column for actor or actress

In the name_basics dataset, use a filter on this column to identify actors and actresses. We can connect those to another dataset in order to specify whether an actor/actress is a superstar or not.
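
As a sketch, the filter could look like this in base R. The primaryProfession column holds a comma-separated list, so the pattern matches "actor" or "actress" as a whole list entry; the rows and object names are made up.

```r
people <- data.frame(
  nconst = c("nm1", "nm2", "nm3"),
  primaryProfession = c("actor,producer", "director,writer", "soundtrack,actress"),
  stringsAsFactors = FALSE
)

# TRUE when "actor" or "actress" appears as a complete entry of the list.
is_performer <- grepl("(^|,)(actor|actress)(,|$)", people$primaryProfession)

actors <- people[is_performer, ]
```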

Ask Hannes about README push

Description

While we uploaded the README with Git Push, no pull requests showed up in the master branch. We need to ask whether we made a mistake with the git push.

Required

Explanation to the group via Whatsapp about Hannes' response.

Check title principals data set: are we still going to use it and are there any useful variables?

The title_principals dataset seems unnecessary after we split the "knownfortitles" column in the name_basics dataset. Can we still use the title_principals dataset, or not?

Format the Rmarkdown properly and implement code in it

Description

After all the merging and obtaining the final data, we have to gather around the code and give it a convenient shape in the R markdown file.

Deliverables

A clean, end-to-end running R markdown file.

Merge name_basics with starPower datasets

First we are going to merge the name_basics dataset with the starPower dataset in order to determine to what extent this merge will be successful and useful.
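
A hedged sketch of such a trial merge in base R. It assumes both datasets share the person identifier nconst and that the starPower data has a Rank column; the rows are made up.

```r
name_basics <- data.frame(
  nconst = c("nm1", "nm2", "nm3"),
  primaryName = c("Person A", "Person B", "Person C"),
  stringsAsFactors = FALSE
)
star_power <- data.frame(
  nconst = c("nm1", "nm3"),
  Rank = c(12, 90),
  stringsAsFactors = FALSE
)

# Left join: keep every person, with NA where no star power entry exists.
merged <- merge(name_basics, star_power, by = "nconst", all.x = TRUE)

# Share of people that received a rank, as a quick measure of merge success.
match_rate <- mean(!is.na(merged$Rank))
```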

Play around with the dataset.

Description

You can find the dataset and the accompanying script in the dataset branch. You have to manually download the constituent parts, such as title.principals, from imdb.com. Feel free to filter, aggregate, compare, and come up with new ideas on how to develop it further.

Deliverables

Altered datasets if possible?

Have a discussion on whether to create a dataset for series too.

Description

Are we going to compare the movies with series? We have to decide if this will take too much time and effort.

Deliverables

If decided on, an issue for creating the series dataset.

Update the About section and README

This is pretty self-explanatory, but as Hannes mentioned during the coaching session of 19 September, we need to have a clear and concise description in the About section (which can be found under Code).

Additionally, delete "old text" section from the README

Floris RMarkdown issue

Setting up the readme structure

Goal

Get a structure for the readme file and writing an introduction

Deliverables

A readme structure uploaded to github

Housekeep the Project

Description

Just remove the unnecessary issues, branches and update the dashboard!

Deliverables

A clean project interface.

create new variable: mean ranking

After the datasets are merged, a new dataset needs to be made: group by title and create a new variable with the mean of the rankings of the actors.
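
The group-by step can be sketched with base R's aggregate(). The column name Rank and the rows below are assumptions for illustration.

```r
cast <- data.frame(
  tconst = c("tt1", "tt1", "tt2", "tt2", "tt3"),
  nconst = c("nm1", "nm2", "nm1", "nm3", "nm4"),
  Rank = c(10, 30, 5, NA, 200),
  stringsAsFactors = FALSE
)

# One row per title; the formula interface drops rows with a missing Rank,
# so unranked actors do not enter the mean.
mean_rank <- aggregate(Rank ~ tconst, data = cast, FUN = mean)
names(mean_rank)[2] <- "mean_ranking"
```

The result has one row per title, which also makes the title the unit of analysis for the later merge.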

Create new Gen folder

When I was deleting the template folders, the gen folder was accidentally deleted, so now I will make a new gen folder.

Push the data and script.

Description

The steps that we ran through during our zoom session can be found in the script. It has only a few steps and is easy to replicate. The dataset is also pushed.

Deliverables

Script code and dataset.

Delete .NA rows from the merged dataset

After Floris and I have merged the datasets, we will remove the unusable lines of data.

make unit of analysis title for merged data set (actor rankings)

Now that we have the merged data set with the actor rankings, make titles the unit of analysis by applying a unique function on the titles, so that every title is shown once with the mean ranking of its actors. After this, the datasets can be merged by title.

Thomas - Markdown (Title_principals)

Write a short analysis of the IMDb dataset: title_principals.tsv.
This is done in a separate branch which then will be pushed to the master branch.

Meren RMarkdown Issue

Description

Issue for working on Markdown data.

Deliverables

Update on the R Markdown

Searching for Boxofficemojo data

Goal

Search for boxofficemojo data to further extend the variables

Deliverable

Dataset of marketing spend by different movie/series producers

Create new Dummy var - Superstar (In the star power dataset)

Based on the rank of an actor, I will create a new dummy variable where 1 = superstar and 0 = not a superstar. The Red file has already been created in a previous file, and is then updated in this branch.
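
A sketch of the dummy in R. The cutoff of 100 is a hypothetical value, since the issue does not state the threshold that was used, and the code assumes that a lower Rank means a bigger star.

```r
star_power <- data.frame(
  nconst = c("nm1", "nm2", "nm3"),
  Rank = c(12, 450, 90),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: the actual threshold is not given in the source.
superstar_cutoff <- 100

# 1 = superstar, 0 = not a superstar (assuming lower Rank = bigger star).
star_power$is_superstar <- as.integer(star_power$Rank <= superstar_cutoff)
```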

Create a new years since release variable in the title_basics dataset

As a second created variable we want to introduce a variable that measures the time passed since the release date. Since the title_basics dataset has the variable "startYear" we can use this to create the new variable "time_since_release".

create new gen folder

Thomas RMarkdown Issue

Description

Principles dataset

Deliverables

R markdown

Delete the birthyear and year of death column from title_name_basics

These columns won't be useful in our analysis. Hence why we have to delete them.

Push data and script together

Description

I'm trying to push both the script and the data together. If I fail, ignore this.

Deliverables

Script and the dataset.

put the .gitignore back in the repository

Hannes helped me in class with a git problem, yet he was accidentally working in the main branch.
Before I was able to switch branches I had to push some files. In this process the .gitignore file was accidentally removed, so I am putting it back now.

Deciding on the variables

Goal

Deciding on the variables to use for the dataset

Deliverables

Sourcecode for the variables

Create a branch Floris

Create the R Markdown File

Description

We have already created a nice script file but it needs some updates like displaying summary statistics and comments on variables etc. Someone should take the script and structure it as Hannes showed us in the class.

Deliverables

An R Markdown file.

Push data and the script together

Description

I am trying to push both the script and the data altogether. This is an experiment.

Deliverables

Script and the database
