web-scraping's Introduction

Web scraping project on Walt Disney movies

Description:

We scrape movie information (producers, directors, cast, budget, etc.) for Walt Disney movies from their Wikipedia pages. The project is divided into five exercises:

  1. Get movie information on a famous Walt Disney Movie: Toy Story 3:
    • Code stored in Toy_story_3.py
  2. Get the list of all Walt Disney movies:
    • In addition to retrieving the movie list from the link above, we pull the movie-specific details for each movie, following step 1.
    • Code stored in Disney_movies.py
    • Data stored in JSON file Disney_data.json
  3. Data cleaning: The cleaned data is stored in the JSON file Disney_data_clean.json (or alternatively, in a pickle file disney_movie_data_cleaned_more.pickle)
    • Convert dates into datetime objects:
      • Preferred format: June 27, 1941
      • Edge case covered: "27 June 1941" or "June 27, 1941 ( 1941-06-27 ) [1]"
    • Convert Running time into an integer
      • Originally a string: "64 minutes"
      • Converted into an integer with the minutes_to_integer(running_time) function
    • Remove the wiki references/citations: [1], [2], [3], etc.
      • Present in many places.
      • Removed the citations and ignored the cited weblinks
    • Repair the inconsistencies in the "Starring" list --> Split up the long strings, e.g., movie "The Great Locomotive Chase"
      • No comma separation between names
      • Edge case 1: "Produced by": "David Blocker Larry Brezner Mark Frost"
      • Edge case 2: "Starring": "Shia LaBeouf Stephen Dillane Peter Firth Elias Koteas",
    • Replace number range in the "Budget" and "Box office" with numbers
      • Original, in string format: "$950,000" or "$267.4 million"
      • Edge cases: "$3.355 million (worldwide rentals) [2]" or "$950,000 [2]" or "$3.7 million (U.S. rental) + $575,000 (foreign rental) [3]" or "60 million Norwegian Kroner (around $8.7 million in 1989)"
      • All monetary values converted to float with the money_conversion(money) method
  4. Merge the cleaned movie list data (from step 3) with IMDb/Metascore/Rotten Tomatoes ratings of the Disney movies: Instead of scraping these websites for ratings separately, we use a publicly available API (e.g., the OMDb API).
    • Data file saved as pickle file Disney_data_cleaned.pickle
  5. Finally, save the data in JSON and CSV formats:
    • Two new data files saved: Disney_data_final.json and Disney_data_final.csv
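
For step 1, the core task is turning a Wikipedia infobox table into a dictionary. The sketch below is one possible way to do it with BeautifulSoup, not the code in Toy_story_3.py. The function name parse_infobox and the hand-written sample_html (a tiny stand-in for a real page) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def parse_infobox(html):
    """Return a dict of label -> value from the first infobox table in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return {}

    info = {}
    for row in infobox.find_all("tr"):
        header = row.find("th")
        cell = row.find("td")
        if header is None:
            continue  # rows without a label (e.g. the poster image) are skipped
        if cell is None:
            # A header-only row is treated as the movie title (assumption).
            info.setdefault("title", header.get_text(" ", strip=True))
            continue
        items = cell.find_all("li")
        if items:
            # Bulleted cells (e.g. "Starring") become a list of strings.
            value = [li.get_text(" ", strip=True) for li in items]
        else:
            value = cell.get_text(" ", strip=True)
        info[header.get_text(" ", strip=True)] = value
    return info


# Hand-written stand-in for the infobox markup; a real page is much larger.
sample_html = """
<table class="infobox vevent">
  <tr><th colspan="2">Toy Story 3</th></tr>
  <tr><th>Directed by</th><td>Lee Unkrich</td></tr>
  <tr><th>Starring</th><td><ul><li>Tom Hanks</li><li>Tim Allen</li></ul></td></tr>
  <tr><th>Running time</th><td>103 minutes</td></tr>
</table>
"""
movie = parse_infobox(sample_html)
print(movie)
```

In a live run, the HTML would come from something like requests.get(url).text for the movie's Wikipedia URL, and step 2 would repeat the same call for every movie in the list.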
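
The date conversion in step 3 can be sketched as follows. This is a minimal version that covers only the formats quoted above; the names clean_date and date_conversion are hypothetical, and month names are parsed with the default (English) locale.

```python
import re
from datetime import datetime

# The two layouts mentioned in step 3: "June 27, 1941" and "27 June 1941".
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y")


def clean_date(text):
    """Drop citation markers like [1] and parenthesised parts like ( 1941-06-27 )."""
    text = re.sub(r"\[\d+\]", "", text)
    text = re.sub(r"\(.*?\)", "", text)
    return " ".join(text.split())


def date_conversion(text):
    """Return a datetime for a supported date string, or None if nothing matches."""
    cleaned = clean_date(text)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


for raw in ("June 27, 1941", "27 June 1941", "June 27, 1941 ( 1941-06-27 ) [1]"):
    print(raw, "->", date_conversion(raw))
```

Returning None for unknown layouts keeps the cleaning loop running, so unparsed dates can be inspected afterwards instead of stopping the whole run.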
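
For the running-time conversion in step 3, only the name minutes_to_integer(running_time) comes from the project. The body below is a guess at a simple implementation that takes the first run of digits in the string.

```python
import re


def minutes_to_integer(running_time):
    """Return the leading number of minutes as an int, or None if there is none."""
    if not isinstance(running_time, str):
        return None  # assumption: missing or non-string values are left empty
    match = re.search(r"\d+", running_time)
    return int(match.group()) if match else None


print(minutes_to_integer("64 minutes"))
```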
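
Citation markers such as [1] or [2] can appear in plain strings and inside lists, so a recursive helper is one convenient way to remove them everywhere. The sketch below assumes each movie is a dictionary of strings and lists; strip_citations is a hypothetical name.

```python
import re

# Matches a citation marker such as [1] together with any whitespace before it.
CITATION = re.compile(r"\s*\[\d+\]")


def strip_citations(value):
    """Remove [n] markers from strings, recursing into lists and dicts."""
    if isinstance(value, str):
        return CITATION.sub("", value).strip()
    if isinstance(value, list):
        return [strip_citations(item) for item in value]
    if isinstance(value, dict):
        return {key: strip_citations(item) for key, item in value.items()}
    return value


record = {"Budget": "$950,000 [2]", "Starring": ["Fess Parker [1]", "Jeffrey Hunter"]}
print(strip_citations(record))
```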
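
The run-together names in "Starring" and "Produced by" cannot be split reliably once they are a single string. One possible fix, assuming the names are separated by <br> tags in the page markup (the README does not say so), is to collect the text pieces at parsing time. The helper name cell_to_values is hypothetical.

```python
from bs4 import BeautifulSoup


def cell_to_values(cell_html):
    """Return one string, or a list of strings when the cell holds several pieces."""
    cell = BeautifulSoup(cell_html, "html.parser")
    # stripped_strings yields each text fragment separately, so names divided
    # by <br> tags stay apart instead of being glued into one long string.
    values = [text.replace("\xa0", " ") for text in cell.stripped_strings]
    return values[0] if len(values) == 1 else values


print(cell_to_values("<td>David Blocker<br/>Larry Brezner<br/>Mark Frost</td>"))
```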
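
For the "Budget" and "Box office" fields, only the name money_conversion(money) comes from the project. The sketch below shows one way such a function could work. Two choices are assumptions rather than the project's documented behaviour: when a string holds several dollar amounts, the first one is used, and when it holds a range, the lower bound is used.

```python
import re

NUMBER = r"\d+(?:,\d{3})*(?:\.\d+)?"
# A dollar amount, an optional range end (e.g. "$1.5-2 million"), and an
# optional scale word.
AMOUNT = re.compile(
    rf"\$\s*({NUMBER})(?:\s*(?:-|–|to)\s*\$?{NUMBER})?\s*(thousand|million|billion)?",
    re.IGNORECASE,
)
SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9}


def money_conversion(money):
    """Return the first dollar amount in `money` as a float, or None."""
    if not isinstance(money, str):
        return None
    match = AMOUNT.search(money)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    word = match.group(2)
    return value * SCALE[word.lower()] if word else value


examples = [
    "$950,000 [2]",
    "$3.355 million (worldwide rentals) [2]",
    "$3.7 million (U.S. rental) + $575,000 (foreign rental) [3]",
    "60 million Norwegian Kroner (around $8.7 million in 1989)",
]
for text in examples:
    print(text, "->", money_conversion(text))
```

Because the pattern requires a dollar sign, the Norwegian Kroner figure is skipped and the dollar estimate in parentheses is picked up instead. Results are floats, so values such as 267.4 million may carry ordinary floating-point rounding.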
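
Step 4 can be sketched with the OMDb API's title lookup, which takes an apikey and a t (title) query parameter and returns JSON. The function names and the sample response values below are made up for illustration; a real call needs your own API key, and fetch_omdb is defined but not called here.

```python
import json
import urllib.parse
import urllib.request

OMDB_URL = "http://www.omdbapi.com/"


def build_omdb_url(title, api_key):
    """Build the lookup URL for a movie title."""
    query = urllib.parse.urlencode({"apikey": api_key, "t": title})
    return f"{OMDB_URL}?{query}"


def fetch_omdb(title, api_key):
    """Call the API and return the decoded JSON response (needs network access)."""
    with urllib.request.urlopen(build_omdb_url(title, api_key), timeout=10) as resp:
        return json.load(resp)


def extract_ratings(omdb_info):
    """Pick the IMDb, Metascore and Rotten Tomatoes values out of a response."""
    ratings = {
        "imdb": omdb_info.get("imdbRating"),
        "metascore": omdb_info.get("Metascore"),
        "rotten_tomatoes": None,
    }
    for entry in omdb_info.get("Ratings", []):
        if entry.get("Source") == "Rotten Tomatoes":
            ratings["rotten_tomatoes"] = entry.get("Value")
    return ratings


# Placeholder response with invented values, shaped like an OMDb reply.
sample_response = {
    "Title": "Example Movie",
    "imdbRating": "7.5",
    "Metascore": "80",
    "Ratings": [{"Source": "Rotten Tomatoes", "Value": "90%"}],
}
print(build_omdb_url("Toy Story 3", "YOUR_KEY"))
print(extract_ratings(sample_response))
```

The extracted values are strings, so they would still need converting to numbers before being merged into the cleaned movie records.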
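
For step 5, datetime objects are not JSON-serializable by default, so they need converting on the way out. The sketch below uses only the standard library (json and csv); with Pandas, DataFrame.to_csv would be an alternative for the CSV part. The records and field names are invented sample data, and the files are written to a temporary directory.

```python
import csv
import json
import os
import tempfile
from datetime import datetime


def _json_default(obj):
    """Serialize datetime objects as ISO 8601 strings."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def save_json(movies, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(movies, f, ensure_ascii=False, indent=2, default=_json_default)


def save_csv(movies, path):
    # Collect every key that appears in any record, in first-seen order.
    fieldnames = []
    for movie in movies:
        for key in movie:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for movie in movies:
            writer.writerow({
                key: value.isoformat() if isinstance(value, datetime) else value
                for key, value in movie.items()
            })


# Invented sample records; the field names are hypothetical.
movies = [
    {"title": "Toy Story 3", "running_time": 103, "release_date": datetime(2010, 6, 18)},
    {"title": "Placeholder Movie", "running_time": 64},
]

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "Disney_data_final.json")
    csv_path = os.path.join(tmp, "Disney_data_final.csv")
    save_json(movies, json_path)
    save_csv(movies, csv_path)
    with open(json_path, encoding="utf-8") as f:
        loaded = json.load(f)
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

print(loaded)
print(rows)
```

Fields missing from a record are written as empty CSV cells, and every value read back from the CSV is a string.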

Python packages

The exercises listed above use the following Python packages:

  • Requests
  • BeautifulSoup
  • JSON
  • regex
  • Pytest
  • Pickle
  • urllib
  • Pandas

Disclaimer: I checked the rules on web scraping in Wikipedia's robots.txt, and this project complies with them. Original source: YouTube tutorial by Keith Galli

web-scraping's People

Contributors

sumitdeole
