The main objective of this project is to clean and analyze the IMDb movie dataset to gain insight into the performance and characteristics of popular movies. The analysis covers ratings, revenue, genres, and the performance of directors and actors.
- Load the Dataset: Read the IMDb movie dataset using pandas.
- Remove Unnecessary Columns: Remove the 'Description' column as it is not required for the analysis.
- Handle Missing Values:
  - Revenue (Millions): Fill missing values with the column's 30th percentile.
  - Metascore: Fill missing values with the column's 45th percentile.
- Data Visualization:
  - Created a heatmap with seaborn to show where values are missing in the dataset.
  - Produced an alternative view of the missing values with the missingno library.
- Insights and Analysis:
  - Analyzed the top movies based on the number of votes received.
  - Calculated average metrics such as Metascore, number of votes, duration, and revenue.
  - Explored the distribution of movies across different genres.
  - Identified the top directors based on the revenue generated by their movies.
  - Listed the most popular actors.
  - Examined the total revenue generated by movies in different years.
  - Compared the average Metascores (critical ratings) of different genres.
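The vote, average, director, and yearly revenue steps above can be sketched in pandas as follows. This is a minimal illustration rather than the project's actual analysis code: the rows are made-up toy data, and the column names `Title`, `Director`, `Year`, and `Votes` are assumed to match the CSV (only `Revenue (Millions)` and `Metascore` are confirmed by this README).

```python
import pandas as pd

# Toy rows with made-up values; column names are assumed to match the CSV.
movies = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Director": ["Dir X", "Dir Y", "Dir X", "Dir Z"],
    "Year": [2014, 2014, 2015, 2016],
    "Votes": [900, 1500, 300, 1200],
    "Revenue (Millions)": [100.0, 50.0, 200.0, 25.0],
    "Metascore": [70.0, 60.0, 80.0, 50.0],
})

# Top movies by number of votes (top 2 here; the README lists the top 5).
top_by_votes = movies.nlargest(2, "Votes")["Title"].tolist()

# Average metrics across all movies.
averages = movies[["Metascore", "Votes", "Revenue (Millions)"]].mean()

# Total revenue per director, highest first.
revenue_by_director = (
    movies.groupby("Director")["Revenue (Millions)"]
    .sum()
    .sort_values(ascending=False)
)

# Total revenue per release year.
revenue_by_year = movies.groupby("Year")["Revenue (Millions)"].sum()

print(top_by_votes)
print(averages)
print(revenue_by_director)
print(revenue_by_year)
```

On the real data, the same calls would run against the cleaned DataFrame instead of the toy `movies` frame.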
A dashboard has been created in Power BI to visualize and analyze the IMDb movie dataset.
- Top Movies by Number of Votes:
  - The top movies by number of votes received are "The Dark Knight," "Inception," "The Dark Knight Rises," "Interstellar," and "The Avengers." Each has received over 1 million votes on IMDb, indicating widespread popularity.
- Average Metrics:
  - The dashboard displays the average Metascore (58.86), average number of votes (169.81K), average duration (113.17 minutes), and average revenue (299.43 million USD) across the movies included in the analysis.
- Genre Performance:
  - The data shows the distribution of movies across different genres, with Action (29.3%), Drama (19.5%), and Comedy (17.5%) being the most prominent genres represented.
- Director Records:
  - The dashboard highlights the top directors based on the revenue generated by their movies, with J.J. Abrams, David Yates, and Christopher Nolan topping the list.
- Top Actors:
  - The dashboard lists the most popular actors, with first names such as Michael, James, Jason, Jennifer, and Robert appearing most frequently.
- Revenue by Year:
  - The dashboard shows the total revenue generated by movies in each year, with peaks in 2008, 2012, and 2016.
- Top Genres Based on Metascores:
  - The dashboard compares the average Metascores (critical ratings) of different genres, showing which genres critics rate most highly.
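The genre and actor figures above depend on columns that hold several values per movie. A rough sketch of one way to compute them in pandas is shown below. It assumes `Genre` and `Actors` are comma-separated strings, uses made-up toy rows, and introduces a hypothetical helper, `explode_column`, that is not part of the project. The README does not say whether the dashboard counted every listed genre or only the first, so this sketch counts every one.

```python
import pandas as pd

# Toy rows with made-up values; "Genre" and "Actors" are assumed to be
# comma-separated strings, as in the original CSV.
movies = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Genre": ["Action,Sci-Fi", "Drama", "Action,Drama"],
    "Actors": ["Ann Lee, Bob Ray", "Bob Ray, Cy Dow", "Ann Lee, Bob Ray"],
    "Metascore": [80.0, 60.0, 40.0],
})


def explode_column(frame, column):
    """Return one row per comma-separated value in `column` (hypothetical helper)."""
    out = frame.assign(**{column: frame[column].str.split(",")})
    out = out.explode(column, ignore_index=True)
    out[column] = out[column].str.strip()  # drop spaces left after the commas
    return out


# Genre distribution as a percentage of all genre labels.
per_genre = explode_column(movies, "Genre")
genre_share = per_genre["Genre"].value_counts(normalize=True) * 100

# Average Metascore per genre, highest first.
metascore_by_genre = (
    per_genre.groupby("Genre")["Metascore"].mean().sort_values(ascending=False)
)

# Number of movies each actor appears in, counted by full name.
per_actor = explode_column(movies, "Actors")
actor_appearances = per_actor["Actors"].value_counts()

print(genre_share)
print(metascore_by_genre)
print(actor_appearances)
```

Counting by full name keeps different actors who share a first name apart, which a tally of first names alone cannot do.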
- Python
  - pandas
  - numpy
  - matplotlib
  - seaborn
  - missingno
- Power BI for dashboard creation
- IMDB-Movie-Data.csv: Original dataset
- Imdb cleaned.csv: Cleaned dataset after handling missing values
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
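# Load the dataset
df = pd.read_csv("IMDB-Movie-Data.csv")
# Remove the 'Description' column, which the analysis does not use
df = df.drop(columns=['Description'])
# Visualize missing values before filling them; once filled, those gaps no longer appear
plt.figure(figsize=(10, 10))
# annot=True is omitted: annotating every cell of a full-size table is slow and unreadable
sns.heatmap(df.isnull(), cmap='Blues')
plt.show()
# Alternative visualization using missingno
msno.matrix(df)
plt.show()
# Fill missing 'Revenue (Millions)' values with the column's 30th percentile
p30 = np.nanpercentile(df['Revenue (Millions)'], 30)
# Assign the result back; chained fillna(..., inplace=True) is deprecated in recent pandas
df['Revenue (Millions)'] = df['Revenue (Millions)'].fillna(p30)
# Fill missing 'Metascore' values with the column's 45th percentile
p45 = np.nanpercentile(df['Metascore'], 45)
df['Metascore'] = df['Metascore'].fillna(p45)
# Save the cleaned dataset without writing the pandas index as an extra column
df.to_csv('Imdb cleaned.csv', index=False)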
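To see what a percentile-based fill does, here is a small worked example on a toy series. The values are made up, and only the technique (`np.nanpercentile` followed by `fillna`) mirrors the script above.

```python
import numpy as np
import pandas as pd

# Toy revenue values with one gap; the numbers are made up for illustration.
revenue = pd.Series([10.0, 20.0, np.nan, 30.0, 40.0], name="Revenue (Millions)")

# The 30th percentile is computed over the non-missing values only.
p30 = np.nanpercentile(revenue, 30)

# Replace the gap with that percentile.
filled = revenue.fillna(p30)

print(p30)
print(filled.tolist())
print(revenue.mean(), filled.mean())
```

With NumPy's default linear interpolation, the 30th percentile of 10, 20, 30, 40 is 19, which is below the median of 25. Filling gaps with it therefore lowers the column mean slightly (here from 25 to 23.8), which is worth keeping in mind when reading averages computed from the cleaned file.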