
investigating_imdb_ratings's Introduction


Contributors:

  • Lex Vogels
  • Floris van Berloo
  • Kay van der Linden
  • Thomas Gadellaa
  • Mehmet Eren Erdoğan

Investigating the predictors of IMDb Ratings for Movies & Series

Project Description

In the current digital era of entertainment, where rating systems carry significant weight, it is crucial to understand which factors might influence these ratings. External factors such as viewers' satisfaction, the perceived quality of the title (movie or TV series), and the place where the title is watched may all affect the rating a viewer gives. From the perspective of creators, however, it is more interesting to look at factors they can decide on themselves. What is the influence of certain actors on title ratings? Which genres tend to be rated higher? And do consumers enjoy longer or shorter titles?

The current research aims to deliver useful insights for content makers about which factors influence ratings and whether movies and TV series are rated differently. We have therefore formulated the following research question:

What is the effect of title genres, actors' rating, runtime of the title, and time elapsed since release on the average rating of the title, and how does the effect depend on whether the title is a movie or a TV series?

Data Availability and Provenance Statements

The project uses openly available data from https://datasets.imdbws.com. The data is published by IMDb, an organisation that is independent of the creators of the movies and TV series we are researching. IMDb is also one of the most popular platforms for rating movies worldwide. For these reasons, the information extracted from IMDb can be considered relatively trustworthy. In addition, a dataset previously created from information on https://www.the-numbers.com/ is used to measure the "star power" of actors. Since this information is also publicly available, we can use it for our research. The dataset was downloaded from a Dropbox link that can be shared with and used by future researchers.

Summary of Availability

All data sources used are publicly available: the IMDb files via IMDb Developer and the star power dataset (based on the-numbers) via Dropbox. The research can therefore easily be replicated and extended in future projects on IMDb ratings and on the variables that affect the rating of a movie and/or TV series.

Details on each Data Source

To analyse and predict the ratings of movies and TV series, we use four separate datasets that are prepared and merged so that the analysis can be run on one final dataset. The four datasets are listed below, with details on which variables each dataset contains and which of them we believe are valuable to our research. This also explains part of the cleaning process.

  1. Title basics: From the Title basics dataset we use several columns, either as independent variable, moderator, or control variable. Runtime and Genres are two of our independent variables. Title Type functions as a moderator (movies vs. TV series). Finally, we use the year the title was released as a control variable because, for example, titles released longer ago have generally received more reviews from so-called "laggards" (late adopters), which might result in lower average ratings than for titles released very recently.
  2. Title ratings: The Title ratings dataset provides our dependent variable, Average Rating. Number of Votes is also used in our analysis.
  3. Name basics: From the Name basics dataset we extract which actors/actresses are linked to which movie and/or TV series titles. This lets us analyse the effect of these people on the rating of a title, so the actor/actress is one of the independent variables.
  4. Star power: This dataset is used to compute the average ranking of all ranked actors in a movie/series. Additionally, we create a dummy variable for whether actors are considered "superstars".
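As a rough illustration of the merge step described above, the sketch below joins a few Title basics rows with Title ratings rows on the shared tconst identifier. It is written in Python with only the standard library, whereas the project itself works in R. The sample rows and the helper names read_tsv and merge_on_tconst are made up for this example, and the inner join is an assumption about how unrated titles are handled.

```python
import csv
import io

# Tiny in-memory stand-ins for title_basics.tsv and title_ratings.tsv.
# The rows are invented for illustration; IMDb marks missing values with \N.
BASICS_TSV = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\tgenres\n"
    "tt0000001\tmovie\tExample Movie\t2010\t120\tDrama,Romance\n"
    "tt0000002\ttvSeries\tExample Series\t2018\t\\N\tComedy\n"
    "tt0000003\tmovie\tUnrated Movie\t1999\t95\tAction\n"
)
RATINGS_TSV = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t7.4\t1500\n"
    "tt0000002\t8.1\t320\n"
)


def read_tsv(text):
    """Parse TSV text into a list of dicts, turning the \\N marker into None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {key: (None if value == "\\N" else value) for key, value in row.items()}
        for row in reader
    ]


def merge_on_tconst(basics, ratings):
    """Inner join on tconst: keep only titles present in both tables."""
    ratings_by_id = {row["tconst"]: row for row in ratings}
    merged = []
    for row in basics:
        rating = ratings_by_id.get(row["tconst"])
        if rating is not None:
            merged.append({**row, **rating})
    return merged


merged = merge_on_tconst(read_tsv(BASICS_TSV), read_tsv(RATINGS_TSV))
for row in merged:
    print(row["tconst"], row["titleType"], row["averageRating"], row["runtimeMinutes"])
```

The title without a rating is dropped by the join, and the missing runtime of the series is kept as None so that it can be handled explicitly during cleaning.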

Repository overview

├── README.md
├── makefile
├── .gitignore
├── data
├── gen
│   ├── analysis
│   ├── data-preparation
│   └── paper
└── src
    ├── analysis
    ├── data-preparation
    └── paper

Dataset list and variable structure of final dataset

In our research, we have used the following data sets:

  1. title_basics.tsv
  2. title_ratings.tsv
  3. name_basics.tsv
  4. starPower.csv

Prior to data cleaning, we explored the datasets extensively. To learn more about the datasets, their variable structure, and summary statistics, knit the following file to HTML: src/data-preparation/r_markdown_data_acquisition.Rmd.

Listed below are all 17 variables after cleaning the datasets and running the analyses:

| Variable | Description |
| --- | --- |
| tconst | Identifier variable |
| averageRating | The rating of a TV series/movie |
| numvotes | The number of votes on the rating |
| titleType | Specifies whether it is a TV series or a movie |
| primaryTitle | Title of a TV series/movie |
| startYear | Year of release |
| runtimeMinutes | Duration of a TV series/movie in minutes |
| genres | All different genres of the TV series/movie |
| time_since_release | Calculation of the current year (2023) - startYear |
| Drama | Dummy variable for the genre Drama |
| Comedy | Dummy variable for the genre Comedy |
| Documentary | Dummy variable for the genre Documentary |
| Romance | Dummy variable for the genre Romance |
| Action | Dummy variable for the genre Action |
| Other | Dummy variable for the genre Other |
| mean_ranking | The ranking of an actor according to IMDb data |
| is_superstar | Dummy variable for superstars, according to the starPower dataset |
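The derived columns in the table above (time_since_release and the genre dummies) could be computed along the following lines. This is a minimal Python sketch using only the standard library, not the project's R code. The helper names are hypothetical, and the rule for Other (1 when a title has at least one genre outside the five named ones) is an assumption.

```python
# The five genres that get their own dummy column in the table above.
TOP_GENRES = ("Drama", "Comedy", "Documentary", "Romance", "Action")
REFERENCE_YEAR = 2023  # the "current year" stated in the variable table


def time_since_release(start_year, reference_year=REFERENCE_YEAR):
    """Years elapsed between the release year and the reference year."""
    return reference_year - int(start_year)


def genre_dummies(genres):
    """Turn a comma-separated genres string into 0/1 dummy columns.

    Assumption: Other is 1 when the title has at least one genre outside
    the five named ones; the project may define it differently.
    """
    listed = {g.strip() for g in genres.split(",") if g.strip()}
    dummies = {name: int(name in listed) for name in TOP_GENRES}
    dummies["Other"] = int(bool(listed - set(TOP_GENRES)))
    return dummies


# Invented example row, using the column names from the table above.
row = {"startYear": "2010", "genres": "Drama,Romance,Thriller"}
row["time_since_release"] = time_since_release(row["startYear"])
row.update(genre_dummies(row["genres"]))
print(row)
```

When such dummies enter a regression, one of them is normally left out as the baseline category so that the remaining coefficients are read relative to it.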

Running the code

To run the code, follow these instructions:

  1. Fork this repository.
  2. Open your command line / terminal and run the following command:
     git clone https://github.com/{your username}/investigating_imdb_ratings
  3. Set your working directory to investigating_imdb_ratings and run the following command:
     make
  4. In our repository, make is structured as follows:

a. There are three makefiles. The makefile in the root of the repository starts data-preparation and analysis.

Makefiles

b. The data-preparation makefile has the following structure, ensuring that everything is cleaned and merged properly step by step:

Data preparation structure

c. Finally, the analysis is run according to the structure below:

Analysis structure

  5. To remove all raw and generated data files created during the process, run the following command in the command line / terminal:
     make clean

Analysis

Once the dataset was ready for analysis, we ran linear regression analyses to evaluate the effect of actors on averageRating and to compare what contributes to the averageRating of tvSeries versus movies in our dataset.
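To make the idea of comparing movies and tvSeries concrete, the sketch below fits a separate least-squares line per title type. It is a deliberately simplified Python illustration with a single predictor (runtimeMinutes) and invented toy numbers. The project's actual models are multi-variable regressions run in R, and the helper fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Toy rows (titleType, runtimeMinutes, averageRating).
# These numbers are invented and say nothing about the real data.
rows = [
    ("movie", 90, 6.0), ("movie", 120, 7.0), ("movie", 150, 8.0),
    ("tvSeries", 30, 8.0), ("tvSeries", 45, 7.5), ("tvSeries", 60, 7.0),
]

fits = {}
for title_type in ("movie", "tvSeries"):
    xs = [runtime for kind, runtime, _ in rows if kind == title_type]
    ys = [rating for kind, _, rating in rows if kind == title_type]
    fits[title_type] = fit_line(xs, ys)

for title_type, (intercept, slope) in fits.items():
    print(f"{title_type}: intercept={intercept:.2f}, slope={slope:.4f}")
```

Different slopes per group are what a moderating effect of titleType would look like. A single model with an interaction term is an equivalent way to express the same comparison.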

Initial Regression Output:

First Regression

  • Looking at the output, all variables have a significant effect on the average rating except the Comedy genre.

We tried several transformations and found the Box-Cox transformation to be the most suitable. Here is the output for the transformed variable:

Box-Cox Regression

  • In this model, the mean_ranking and is_superstar variables are negatively related to the dependent variable.

  • The Romance and Action genres also affect averageRating negatively.
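For readers unfamiliar with the Box-Cox transformation mentioned above, here is a minimal Python sketch of the transform itself. The lambda value and the ratings are hypothetical. The lambda the project actually used is not stated here, and in practice it is usually estimated from the data rather than chosen by hand.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value y for a given lambda.

    (y**lam - 1) / lam for lam != 0, and log(y) for lam == 0.
    """
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam


ratings = [2.5, 6.0, 7.4, 9.1]  # invented ratings for illustration
lam = 2.0  # hypothetical lambda, not the project's estimate
transformed = [box_cox(r, lam) for r in ratings]
print(transformed)
```

Because the dependent variable is transformed, coefficients from the transformed model are on the transformed scale and cannot be read directly as rating points.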

Regression Output for the Series

Series Linear Regression

  • In the linear regression model for tvSeries, is_superstar plays no significant role in the rating of the series.

  • The Drama genre contributes positively and significantly to the average rating of the series.

  • Romance is also significant, but in a negative direction.

For detailed information, see the file gen/analysis/output/linear_regression_analysis.html.


investigating_imdb_ratings's Issues

Create the new gen folders

Make sure the gen folder is shown again by creating it manually, as Hannes showed us.
Make empty .gitkeep files.

Create visualisation for our Makefile

Deliverable: A first version and idea creation on how to build the Makefile

Goal is to have a setup for the makefile such that later on we can easily create project automation.

Create new Genre Variable in the Merged Dataset

In the IMDb dataset, the genres are specified in one column. We want to create dummy variables by splitting the genres into the 5 most popular ones (a data-driven approach). Furthermore, we leave one genre out, e.g. Comedy, which serves as the baseline for our regression.

Floris clean-up

This issue is meant for Floris.

In the issue list, there are multiple duplicates and issues that have to do with Floris. Right now, I (Kay) do not know which of these should be kept and which can be deleted. Therefore, Floris should clean his issues and branches, to keep the repository clean.

Deliverable: Floris updates/deletes/finishes his issues and branches.

Add update log (in ReadMe)

In the self-study material they recommend to add when the last update was (see image). Maybe we can also include a short update log of what is/was done on which date.

image

Upload Starpower dataset

From my thesis I have received a dataset that contains the unique identifier of the crew of a title. In this dataset, the 'Rank' and the 'Earnings' are specified per unique identifier. We can use both of these variables in the regression.

Important
Update the Readme so it also specifies that we use this dataset.

Update README

Fully update README with current project updates.

Discuss for the weekly zoom meeting

Description

Let's make a tradition of having a Zoom meeting each week. The time and day will be decided after Tuesday's class.

Deliverables

Zoom meeting time and day.

Ask Hannes about the "title split" & duplicate movies.

Description

During the merge of our datasets, we split the 'knownForTitles' column into 4 and merged 'title_1' with 'tconst'. We should ask whether this process is correct.

In addition, we should also ask about removing duplicate movies with the unique() function.

To be honest, I think these won't make crucial differences but asking Hannes about them is a +1 for the group so...

Deliverables

Nothing to be honest.

Update the README & try to push it.

Description

The README that's in the master branch contains almost nothing. A group member should focus on building that README according to the rules in the website. However, to do so, the project scope must be clearly defined.

Notes

Whoever edits the README, please keep it as simple as possible, both in terms of project scope and the README in general.

Deliverables

A fresh README.

Push the script.

Description

Hey, you can find the script with small comments on it. You need to manually download the other databases from imdb if you want to fool around with them.

Deliverables

Script.

Ask Hannes about README push

Description

While we uploaded the README with Git Push, no pull requests showed up in the master branch. We need to ask whether we made a mistake with the git push.

Required

Explanation to the group via Whatsapp about Hannes' response.

Play around with the dataset.

Description

You can find the dataset and the accompanying script in the dataset branch. You have to manually download the constituent parts from imdb.com, such as title.principal and others. Feel free to filter, aggregate, compare, and come up with new ideas on how to develop it further.

Deliverables

Altered datasets if possible?

Update the About section and README

This is pretty self-explanatory, but as Hannes mentioned during the coaching session of 19 September, we need to have a clear and concise description in the About section (which can be found under Code).

Additionally, delete the "old text" section from the README.

*Housekeep* the Project

Description

Just remove the unnecessary issues, branches and update the dashboard!

Deliverables

A clean project interface.

Create new variable: mean ranking

After the datasets are merged, a new dataset needs to be made: group by title and create a new variable with the mean of the rankings of the actors.

Create new gen folder

When I was deleting the template folders, the gen folder was accidentally deleted, so I will now make a new gen folder.

Push the data and script.

Description

The steps that we ran through during our Zoom session can be found in the script. It has only a few steps and is easy to replicate. The dataset is also pushed.

Deliverables

Script code and dataset.

Make title the unit of analysis for the merged dataset (actor rankings)

Now that we have the merged dataset with the actor rankings, make the titles the unit of analysis by applying a unique function on the titles, so that every title is shown once with the mean ranking of its actors. After this, the datasets can be merged by title.
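The grouping step described in this issue could look roughly like the following Python sketch (standard library only; the project itself uses R). The rows, the rank field name, and the helper mean_ranking_per_title are hypothetical and only illustrate collapsing actor-level rows to one row per title.

```python
from collections import defaultdict
from statistics import mean

# Invented actor-level rows: one row per (title, actor) pair with the actor's rank.
actor_rows = [
    {"tconst": "tt0000001", "nconst": "nm0000001", "rank": 10},
    {"tconst": "tt0000001", "nconst": "nm0000002", "rank": 30},
    {"tconst": "tt0000002", "nconst": "nm0000003", "rank": 5},
]


def mean_ranking_per_title(rows):
    """Group actor rows by title and return {tconst: mean rank of its actors}."""
    ranks_by_title = defaultdict(list)
    for row in rows:
        ranks_by_title[row["tconst"]].append(row["rank"])
    return {tconst: mean(ranks) for tconst, ranks in ranks_by_title.items()}


mean_ranking = mean_ranking_per_title(actor_rows)
print(mean_ranking)
```

The result has exactly one entry per title, so it can be joined to the title-level data on tconst without creating duplicate rows.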

Thomas - Markdown (Title_principals)

Write a short analysis of the IMDb dataset title_principals.tsv.
This is done in a separate branch which then will be pushed to the master branch.

Meren RMarkdown Issue

Description

Issue for working on Markdown data.

Deliverables

Update on the R Markdown

Searching for Boxofficemojo data

Goal

Search for boxofficemojo data to further extend the variables

Deliverable

Dataset of marketing spend by different movie/series producers

Push data and script together

Description

I'm trying to push both the script and the data together. If I fail, ignore this.

Deliverables

Script and the dataset.

Put the .gitignore back in the repository

Hannes helped me in class with a git problem, but he was accidentally working in the main branch.
Before I was able to switch branches I had to push some files; in this process the .gitignore file was accidentally removed, so I am putting it back now.

Create the R Markdown File

Description

We have already created a nice script file but it needs some updates like displaying summary statistics and comments on variables etc. Someone should take the script and structure it as Hannes showed us in the class.

Deliverables

An R Markdown file.

Push data and the script together

Description

I am trying to push both the script and the data altogether. This is an experiment.

Deliverables

Script and the database
