ADS-507-Project

Overview

This repository contains database integration and data preprocessing scripts for storing and organizing information about movies, streaming platforms, box office details, writers, stars, and directors in a MySQL database. It also describes how the pipeline will be deployed and monitored.

Data Source

The datasets are available from Kaggle.com and the IMDb developer website as CSV files.

Final_Movie_Industry_kaggle.csv

Final_Movie_streaming_kaggle.csv

Final_director_New.csv

Final_star_New.csv

Final_writer_New.csv

title_movie.basics.csv

Preprocessing

  1. Multiple CSV files were loaded into pandas dataframes for preprocessing.
  2. The dataframes (df_streaming, df_title, df_director, df_star, df_writer, and df_movie) were read with the pd.read_csv() function from the pandas library. They contain information about movies, directors, stars, writers, box office details, and streaming platforms.
  3. Database connection parameters (db_username, db_password, db_host, and db_database) are used to establish a connection to a MySQL database.
  4. The MySQL database connection is established using the create_engine function from the SQLAlchemy library.
  5. All the dataframes were inserted into the corresponding MySQL tables using the to_sql method. The if_exists='replace' parameter replaces a table with the new data if the table already exists.
  6. The structure of each table is defined by SQL statements.
  7. The original datasets used '\N' to denote missing values. These were replaced with NaN (Not a Number) so that pandas treats them as missing in further analysis and processing.
  8. Columns with more than 90% missing values were removed to streamline the datasets. This step enhances the datasets' usability by focusing on more complete and relevant information.
  9. Columns containing multiple categorical values separated by commas were split so that each category appears in its own row, in line with the principles of a relational database.
  10. The processed datasets were saved as new CSV files, preserving their original naming convention with the addition of '_New' to denote the processed state.
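
Steps 7 to 10 above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the function names clean_frame and processed_path, the list_columns argument, and the column names in the usage note are hypothetical, and the 90% threshold is taken from step 8.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def clean_frame(df: pd.DataFrame, list_columns=(), max_missing=0.90) -> pd.DataFrame:
    """Apply steps 7-9 to one dataframe (hypothetical helper)."""
    # Step 7: the raw files mark missing values with the literal string '\N'.
    df = df.replace(r"\N", np.nan)

    # Step 8: drop columns whose share of missing values exceeds the threshold.
    missing_share = df.isna().mean()
    df = df.loc[:, missing_share <= max_missing]

    # Step 9: split comma-separated cells so each value gets its own row.
    for col in list_columns:
        if col in df.columns:
            df = df.assign(**{col: df[col].str.split(",")}).explode(col)
            # Remove whitespace left over around each split value.
            df = df.assign(**{col: df[col].str.strip()})
    return df.reset_index(drop=True)


def processed_path(raw_path: str) -> Path:
    """Step 10: keep the original file name and append '_New' (hypothetical helper)."""
    path = Path(raw_path)
    return path.with_name(f"{path.stem}_New{path.suffix}")
```

A call such as clean_frame(pd.read_csv("some_file.csv"), list_columns=["genres"]).to_csv(processed_path("some_file.csv"), index=False) would then write the processed copy; the file and column names here are placeholders.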

File Structure

  • Data: Contains CSV files with information on movies, directors, stars, writers, box office details, and streaming platforms.
  • Scripts: Contains Python scripts for reading CSV files and inserting data into a MySQL database.
  • SQL: Contains SQL scripts for creating database tables.

Usage

The processed datasets can be directly used for various data science projects, including but not limited to:

  • Exploratory Data Analysis (EDA)
  • Building recommendation systems
  • Analyzing trends in the film and television industry

The processed datasets are also structured to be compatible with MySQL, making them suitable for database-related projects and learning exercises.
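
For database-related use, loading the dataframes into MySQL with create_engine and to_sql (steps 3 to 5 of Preprocessing) might look like the sketch below. It is an illustration under stated assumptions rather than the project's script: the helper names, the environment variable names (DB_USERNAME, DB_PASSWORD, DB_HOST, DB_DATABASE), and the choice of the PyMySQL driver are all assumptions, as is reading credentials from the environment instead of hardcoding them.

```python
import os

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.engine import URL


def mysql_url_from_env() -> URL:
    """Build a MySQL connection URL from environment variables (names are assumed)."""
    return URL.create(
        drivername="mysql+pymysql",  # assumes the PyMySQL driver is installed
        username=os.environ["DB_USERNAME"],
        password=os.environ["DB_PASSWORD"],
        host=os.environ.get("DB_HOST", "localhost"),
        database=os.environ["DB_DATABASE"],
    )


def load_tables(frames: dict, engine) -> None:
    """Write each dataframe to the table named by its dictionary key."""
    for table_name, df in frames.items():
        # if_exists="replace" drops and recreates the table if it already exists.
        df.to_sql(table_name, con=engine, if_exists="replace", index=False)
```

With the variables set, load_tables({"star": df_star}, create_engine(mysql_url_from_env())) would replace the star table with the contents of df_star; the table name is a placeholder.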

Deployment and Monitoring Pipeline

After the film industry data analysis pipeline has been developed and tested, the next phase is to deploy it on a cloud service such as Azure or AWS for better scalability, accessibility, and reliability. Cloud deployment allows efficient resource utilization, collaborative work, and consistent performance for users. Key steps include planning the deployment strategy and choosing between Azure and AWS based on project needs, team expertise, and budget. Robust security measures, including access controls and encryption, are needed to safeguard sensitive data. Together, these steps move the project to a scalable, cloud-deployed data pipeline that supports more informed decision-making and a better user experience.

Contribution

Feel free to fork this repository and adapt the preprocessing scripts to your specific needs. Contributions to further improve the scripts or to extend the functionality are welcome.

License

This project is released under the Creative Commons CC0 1.0 Universal public domain dedication: https://creativecommons.org/publicdomain/zero/1.0/

Contributors

vpierson100, amyou518, smahmudrahat
