This repository contains database integration and data preprocessing scripts for storing and organizing information about movies, streaming platforms, box office details, writers, stars, and directors in a MySQL database. It also outlines how the pipeline will be deployed and monitored.
The datasets come from Kaggle.com and the IMDb developer website and are provided as CSV files.
Final_Movie_Industry_kaggle.csv
Final_Movie_streaming_kaggle.csv
Final_director_New.csv
Final_star_New.csv
Final_writer_New.csv
title_movie.basics.csv
- Multiple CSV files were loaded into dataframes for preprocessing and analysis.
- Each CSV file was read with pandas' pd.read_csv() function into one of the dataframes df_streaming, df_title, df_director, df_star, df_writer, and df_movie. These dataframes hold information about movies, directors, stars, writers, box office details, and streaming platforms.
- Database connection parameters (db_username, db_password, db_host, and db_database) are used to establish a connection to a MySQL database.
- The MySQL connection is created with the create_engine function from the SQLAlchemy library.
- Each dataframe was inserted into its corresponding MySQL table with the to_sql method; the if_exists='replace' parameter replaces a table's contents if the table already exists.
- The structure of each table is defined by SQL statements.
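The loading-and-insertion steps above can be sketched as follows. This is a minimal illustration, not the repository's actual script: the connection parameters, the `director` schema, and the sample rows are hypothetical, and an in-memory SQLite engine stands in for MySQL so the sketch runs anywhere.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical parameters: the real scripts build a MySQL URL from db_username,
# db_password, db_host, and db_database, e.g.
#   create_engine(f"mysql+pymysql://{db_username}:{db_password}@{db_host}/{db_database}")
# An in-memory SQLite engine stands in here so the sketch runs anywhere.
engine = create_engine("sqlite:///:memory:")

# The repository defines each table with SQL DDL; this schema is illustrative only.
with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE director (director_id VARCHAR(15) PRIMARY KEY, name VARCHAR(255))"
    ))

# In the real pipeline each dataframe comes from pd.read_csv(), e.g.
#   df_director = pd.read_csv("Final_director_New.csv")
# A small stand-in dataframe is used here.
df_director = pd.DataFrame({"director_id": ["nm0000001", "nm0000002"],
                            "name": ["Alice", "Bob"]})

# if_exists="replace" drops and recreates the table if it already exists,
# so re-running the pipeline overwrites stale data.
df_director.to_sql("director", con=engine, if_exists="replace", index=False)
```

Note one consequence of `if_exists="replace"`: to_sql recreates the table with column types inferred by pandas, so the hand-written DDL types only survive if the insert mode is changed (e.g. to `if_exists="append"` after a manual truncate).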
- The original datasets used '\N' to denote missing values. These were replaced with NaN (Not a Number) to make them compatible with Pandas for further analysis and processing.
- Columns with more than 90% missing values were removed to streamline the datasets. This step enhances the datasets' usability by focusing on more complete and relevant information.
- Columns containing multiple categorical values separated by commas were split. Each category now appears in separate rows, aligning with the principles of a relational database.
- The processed datasets were saved as new CSV files, preserving their original naming convention with the addition of '_New' to denote the processed state.
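The preprocessing steps above (replacing '\N' markers, dropping mostly-empty columns, and splitting comma-separated categories into rows) can be sketched like this. The column names and sample values are hypothetical stand-ins for the real datasets.

```python
import numpy as np
import pandas as pd

def preprocess(df):
    """Replace the '\\N' missing-value marker with NaN and drop mostly-empty columns."""
    df = df.replace(r"\N", np.nan)             # the source files use the literal string '\N'
    return df.loc[:, df.isna().mean() <= 0.9]  # drop columns with >90% missing values

def split_categories(df, column):
    """Give each comma-separated category its own row (first normal form)."""
    df = df.copy()
    df[column] = df[column].str.split(",")
    return df.explode(column).reset_index(drop=True)

# Small stand-in for a raw dataset (column names are hypothetical).
raw = pd.DataFrame({
    "title": ["A", "B"],
    "genres": ["Drama,Comedy", r"\N"],
    "budget": [r"\N", r"\N"],                  # entirely missing, so it gets dropped
})

clean = split_categories(preprocess(raw), "genres")
# Saved under the '_New' naming convention, e.g.:
# clean.to_csv("Final_movie_New.csv", index=False)
```

After these steps, `clean` has one genre per row and the all-missing `budget` column removed, matching the relational layout the bullets describe.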
- Data: Contains CSV files with information on movies, directors, stars, writers, box office details, and streaming platforms.
- Scripts: Contains Python scripts for reading the CSV files and inserting the data into a MySQL database.
- SQL: Contains SQL scripts for creating the database tables.
The processed datasets can be directly used for various data science projects, including but not limited to:
- Exploratory Data Analysis (EDA)
- Building recommendation systems
- Analyzing trends in the film and television industry
The datasets are also structured to be compatible with MySQL, making them suitable for database-related projects and learning exercises.
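As a taste of the exploratory analysis the processed files support, the sketch below aggregates gross earnings by genre. The dataframe stands in for a processed CSV (e.g. one loaded with pd.read_csv), and its column names are hypothetical.

```python
import pandas as pd

# Stand-in for a processed dataset; in practice this would come from something like
#   movies = pd.read_csv("Final_Movie_Industry_kaggle.csv")
# Column names here are illustrative only.
movies = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "genre": ["Drama", "Comedy", "Drama", "Action"],
    "gross": [100, 100, 250, 80],
})

# Because the splitting step leaves one genre per row, grouping is straightforward:
top_genres = movies.groupby("genre")["gross"].sum().sort_values(ascending=False)
print(top_genres.head())
```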
After developing and testing the film industry data analysis pipeline, the next phase is deploying it on a cloud service for greater scalability, accessibility, and reliability. Deploying on a platform such as Azure or AWS offers efficient resource utilization, easier collaboration, and consistent performance for users. Key steps include planning the deployment strategy and choosing between Azure and AWS based on project needs, team expertise, and budget. Robust security measures, including access controls and encryption, are essential to safeguard sensitive data. The goal is a cloud-deployed, scalable, and robust data pipeline that supports better-informed decision-making and an improved user experience.
Feel free to fork this repository and adapt the preprocessing scripts to your specific needs. Contributions to further improve the scripts or to extend the functionality are welcome.
This project is dedicated to the public domain under the Creative Commons CC0 1.0 Universal license: https://creativecommons.org/publicdomain/zero/1.0/