This repository contains database integration and data preprocessing scripts for storing and organizing information about movies, streaming platforms, box office details, writers, stars, and directors in a MySQL database. It also outlines how the pipeline will be deployed and monitored.
The datasets come from Kaggle.com and the IMDb developer website and are provided as CSV files.
Final_Movie_Industry_kaggle.csv
Final_Movie_streaming_kaggle.csv
Final_director_New.csv
Final_star_New.csv
Final_writer_New.csv
title_movie.basics.csv
- Multiple CSV files were loaded into dataframes for preprocessing and analysis.
- Each CSV file was read with pandas' pd.read_csv() function into one of the dataframes df_streaming, df_title, df_director, df_star, df_writer, and df_movie. These dataframes hold information about movies, directors, stars, writers, box office details, and streaming platforms.
- Database connection parameters (db_username, db_password, db_host, and db_database) are used to establish a connection to a MySQL database.
- The MySQL connection is created with the create_engine function from the SQLAlchemy library.
- Each dataframe was inserted into its corresponding MySQL table with the to_sql method; the if_exists='replace' parameter replaces a table's contents if the table already exists.
- The structure of each table is defined by SQL statements.
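The loading-and-insertion steps above can be sketched as follows. This is a minimal illustration, not the repository's actual script: the connection parameters, the `director` schema, and the sample rows are hypothetical, and an in-memory SQLite engine stands in for MySQL so the sketch runs anywhere.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical parameters: the real scripts build a MySQL URL from db_username,
# db_password, db_host, and db_database, e.g.
#   create_engine(f"mysql+pymysql://{db_username}:{db_password}@{db_host}/{db_database}")
# An in-memory SQLite engine stands in here so the sketch runs anywhere.
engine = create_engine("sqlite:///:memory:")

# The repository defines each table with SQL DDL; this schema is illustrative only.
with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE director (director_id VARCHAR(15) PRIMARY KEY, name VARCHAR(255))"
    ))

# In the real pipeline each dataframe comes from pd.read_csv(), e.g.
#   df_director = pd.read_csv("Final_director_New.csv")
# A small stand-in dataframe is used here.
df_director = pd.DataFrame({"director_id": ["nm0000001", "nm0000002"],
                            "name": ["Alice", "Bob"]})

# if_exists="replace" drops and recreates the table if it already exists,
# so re-running the pipeline overwrites stale data.
df_director.to_sql("director", con=engine, if_exists="replace", index=False)
```

Note one consequence of `if_exists="replace"`: to_sql recreates the table with column types inferred by pandas, so the hand-written DDL types only survive if the insert mode is changed (e.g. to `if_exists="append"` after a manual truncate).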
- The original datasets used '\N' to denote missing values. These were replaced with NaN (Not a Number) to make them compatible with Pandas for further analysis and processing.
- Columns with more than 90% missing values were removed to streamline the datasets. This step enhances the datasets' usability by focusing on more complete and relevant information.
- Columns containing multiple categorical values separated by commas were split. Each category now appears in separate rows, aligning with the principles of a relational database.
- The processed datasets were saved as new CSV files, preserving their original naming convention with the addition of '_New' to denote the processed state.
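The preprocessing steps above (replacing '\N' markers, dropping mostly-empty columns, and splitting comma-separated categories into rows) can be sketched like this. The column names and sample values are hypothetical stand-ins for the real datasets.

```python
import numpy as np
import pandas as pd

def preprocess(df):
    """Replace the '\\N' missing-value marker with NaN and drop mostly-empty columns."""
    df = df.replace(r"\N", np.nan)             # the source files use the literal string '\N'
    return df.loc[:, df.isna().mean() <= 0.9]  # drop columns with >90% missing values

def split_categories(df, column):
    """Give each comma-separated category its own row (first normal form)."""
    df = df.copy()
    df[column] = df[column].str.split(",")
    return df.explode(column).reset_index(drop=True)

# Small stand-in for a raw dataset (column names are hypothetical).
raw = pd.DataFrame({
    "title": ["A", "B"],
    "genres": ["Drama,Comedy", r"\N"],
    "budget": [r"\N", r"\N"],                  # entirely missing, so it gets dropped
})

clean = split_categories(preprocess(raw), "genres")
# Saved under the '_New' naming convention, e.g.:
# clean.to_csv("Final_movie_New.csv", index=False)
```

After these steps, `clean` has one genre per row and the all-missing `budget` column removed, matching the relational layout the bullets describe.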
- Data: Contains CSV files with information on movies, directors, stars, writers, box office details, and streaming platforms.
- Scripts: Contains Python scripts for reading the CSV files and inserting the data into a MySQL database.
- SQL: Contains SQL scripts for creating the database tables.
The processed datasets can be directly used for various data science projects, including but not limited to:
- Exploratory Data Analysis (EDA)
- Building recommendation systems
- Analyzing trends in the film and television industry
The datasets are also structured to be compatible with MySQL, making them suitable for database-related projects and learning exercises.
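As a taste of the exploratory analysis the processed files support, the sketch below aggregates gross earnings by genre. The dataframe stands in for a processed CSV (e.g. one loaded with pd.read_csv), and its column names are hypothetical.

```python
import pandas as pd

# Stand-in for a processed dataset; in practice this would come from something like
#   movies = pd.read_csv("Final_Movie_Industry_kaggle.csv")
# Column names here are illustrative only.
movies = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "genre": ["Drama", "Comedy", "Drama", "Action"],
    "gross": [100, 100, 250, 80],
})

# Because the splitting step leaves one genre per row, grouping is straightforward:
top_genres = movies.groupby("genre")["gross"].sum().sort_values(ascending=False)
print(top_genres.head())
```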
After developing and testing the film industry data analysis pipeline, the next phase is deploying it on a cloud service for greater scalability, accessibility, and reliability. Deploying on a platform such as Azure or AWS offers efficient resource utilization, easier collaboration, and consistent performance for users. Key steps include planning the deployment strategy and choosing between Azure and AWS based on project needs, team expertise, and budget. Robust security measures, including access controls and encryption, are essential to safeguard sensitive data. The goal is a cloud-deployed, scalable, and robust data pipeline that supports better-informed decision-making and an improved user experience.
Feel free to fork this repository and adapt the preprocessing scripts to your specific needs. Contributions to further improve the scripts or to extend the functionality are welcome.
This project is dedicated to the public domain under the Creative Commons CC0 1.0 Universal license: https://creativecommons.org/publicdomain/zero/1.0/