
movie_rating_prediction's Introduction

===================================================================================

Predict IMDB movie rating

by Chuan Sun (sundeepblue at gmail dot com)

https://twitter.com/sundeepblue

Scrapy project @ NYC Data Science Academy

8/14/2016

===================================================================================

STEP 1:

Fetch a list of 5000 movie titles and budgets from www.the-numbers.com

This step will generate a JSON file 'movie_budget.json'

$ scrapy crawl movie_budget -o movie_budget.json

===================================================================================

STEP 2:

Load 5000+ movie titles from the JSON file 'movie_budget.json'

Then search for those titles on the IMDB website to get the actual IMDB movie links

It will generate a JSON file 'fetch_imdb_url.json' containing movie-link pairs

$ scrapy crawl fetch_imdb_url -o fetch_imdb_url.json
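
A minimal sketch of the first half of this step, not the spider's actual code. It assumes each record in 'movie_budget.json' stores the title under a key named "movie_name" (a hypothetical key, check the real output) and that titles are looked up through IMDB's "find" search page with the query parameters shown (also an assumption).

```python
import json
from urllib.parse import urlencode


def load_titles(path, key="movie_name"):
    """Return the non-empty titles from a Scrapy JSON export.

    `key` is a hypothetical field name; adjust it to the real item field.
    """
    with open(path, encoding="utf-8") as f:
        records = json.load(f)  # `scrapy crawl ... -o x.json` writes one JSON array
    return [r[key] for r in records if r.get(key)]


def imdb_search_url(title):
    """Build a title-search URL (assumed endpoint and parameters)."""
    return "https://www.imdb.com/find?" + urlencode({"q": title, "s": "tt"})
```

A spider could then request imdb_search_url(title) for every title and keep the first matching movie link.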

===================================================================================

STEP 3:

Scrape information for 5000+ IMDB movies

This step will load the JSON file 'fetch_imdb_url.json', go into each movie page, and grab data

This step will generate a JSON file 'imdb_output.json' (20M) containing detailed info of 5000+ movies

It will also download all available posters for all movies.

A total of 4907 posters can be downloaded (998MB). I am not sure whether I can upload all of those posters to GitHub, so I only uploaded a few. The code shows how to use Scrapy to grab them all.

$ scrapy crawl imdb -o imdb_output.json

===================================================================================

STEP 4:

Perform face detection to count the number of faces in each poster

This step will save the results into the JSON file 'image_and_facenumber_pair_list.json'

$ python detect_faces_from_posters.py
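
A rough sketch of the bookkeeping around this step, not the contents of 'detect_faces_from_posters.py'. The detector is passed in as a function because the README does not say which one is used. The record keys "image" and "facenumber" and both function names are made up for illustration.

```python
import json
import os


def count_faces_in_posters(poster_dir, detect_faces):
    """Pair each poster file name with the number of faces found in it.

    `detect_faces` is any callable that takes an image path and returns
    a list of face bounding boxes (hypothetical interface).
    """
    pairs = []
    for name in sorted(os.listdir(poster_dir)):
        if name.lower().endswith((".jpg", ".jpeg", ".png")):
            boxes = detect_faces(os.path.join(poster_dir, name))
            pairs.append({"image": name, "facenumber": len(boxes)})
    return pairs


def save_pairs(pairs, path="image_and_facenumber_pair_list.json"):
    """Write the image/face-count pairs as one JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(pairs, f, indent=2)
```

One possible detect_faces is a thin wrapper around an OpenCV Haar cascade (cv2.CascadeClassifier and its detectMultiScale method), but any detector that returns one box per face would fit.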

===================================================================================

STEP 5:

Load two JSON files 'imdb_output.json' and 'image_and_facenumber_pair_list.json'

Parse all variables into a valid format.

Generate a final CSV table containing 28 variables that can be loaded in R or Pandas

The output will be a CSV file 'movie_metadata.csv' (1.5MB)

"movie_title"
"color"
"num_critic_for_reviews"
"movie_facebook_likes"
"duration"
"director_name"
"director_facebook_likes"
"actor_3_name"
"actor_3_facebook_likes"
"actor_2_name"
"actor_2_facebook_likes"
"actor_1_name"
"actor_1_facebook_likes"
"gross"
"genres"
"num_voted_users"
"cast_total_facebook_likes"
"facenumber_in_poster"
"plot_keywords"
"movie_imdb_link"
"num_user_for_reviews"
"language"
"country"
"content_rating"
"budget"
"title_year"
"imdb_score"
"aspect_ratio"

$ python parse_scraped_data.py
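
A simplified sketch of the merge described above, not the contents of 'parse_scraped_data.py'. It assumes each movie record names its poster file under a key "poster_file" and that the face-count file holds records with keys "image" and "facenumber". All three keys are hypothetical.

```python
import csv
import json


def merge_to_csv(movies_path, faces_path, csv_path, columns):
    """Join movie records with poster face counts and write one CSV row per movie."""
    with open(movies_path, encoding="utf-8") as f:
        movies = json.load(f)
    with open(faces_path, encoding="utf-8") as f:
        # Map poster file name -> face count (hypothetical record keys).
        faces = {p["image"]: p["facenumber"] for p in json.load(f)}

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        # Fields not listed in `columns` are dropped; missing ones are left empty.
        writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for movie in movies:
            row = dict(movie)
            row["facenumber_in_poster"] = faces.get(movie.get("poster_file"), "")
            writer.writerow(row)
    return len(movies)
```

Passing the 28 variable names listed above as columns would give a header row in that order. Movies without a matched poster get an empty "facenumber_in_poster" cell.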

===================================================================================

STEP 6:

Load the 'movie_metadata.csv' file in RStudio, then perform EDA and LASSO regression

$ > open RStudio

$ > load the file 'movie_rating_prediction.R'
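
For readers who want to see what LASSO regression does, here is a small pure-Python sketch using coordinate descent with soft-thresholding. It is an illustration only and is not a translation of 'movie_rating_prediction.R'. It assumes the features and the target are already centered (no intercept is fitted) and minimizes (1/(2n)) * ||y - X*beta||^2 + lam * ||beta||_1.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Fit LASSO coefficients for rows X (list of lists) and targets y."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta
```

The L1 penalty sets the coefficients of weak predictors exactly to zero, which is why LASSO is often used to pick out the variables that matter most for "imdb_score".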


movie_rating_prediction's Issues

ImportError: bad magic number in 'movie': b'\x03\xf3\r\n'

Hello @sundeepblue
Would you be able to help me with the error below? I have never coded before and am working on a project to do exactly what you have achieved here. When you have a moment, can you assist?

MacBook-Air:movie mac$ scrapy crawl movie_budget -o movie_budget.json
Traceback (most recent call last):
  File "/Users/mac/anaconda/bin/scrapy", line 6, in
    sys.exit(scrapy.cmdline.execute())
  File "/Users/mac/anaconda/lib/python3.6/site-packages/scrapy/cmdline.py", line 109, in execute
    settings = get_project_settings()
  File "/Users/mac/anaconda/lib/python3.6/site-packages/scrapy/utils/project.py", line 68, in get_project_settings
    settings.setmodule(settings_module_path, priority='project')
  File "/Users/mac/anaconda/lib/python3.6/site-packages/scrapy/settings/__init__.py", line 292, in setmodule
    module = import_module(module)
  File "/Users/mac/anaconda/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "", line 978, in _gcd_import
  File "", line 961, in _find_and_load
  File "", line 936, in _find_and_load_unlocked
  File "", line 205, in _call_with_frames_removed
  File "", line 978, in _gcd_import
  File "", line 961, in _find_and_load
  File "", line 950, in _find_and_load_unlocked
  File "", line 655, in _load_unlocked
  File "", line 674, in exec_module
  File "", line 888, in get_code
  File "", line 455, in _validate_bytecode_header
ImportError: bad magic number in 'movie': b'\x03\xf3\r\n'
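
A possible cause, offered as a guess rather than a confirmed diagnosis: the bytes b'\x03\xf3\r\n' are the bytecode magic number written by Python 2.7, while the traceback shows Scrapy running under Python 3.6. That usually means stale .pyc files from a Python 2 run are sitting next to the sources. The sketch below deletes them. The helper name remove_stale_pyc is made up for illustration.

```python
import os


def remove_stale_pyc(root="."):
    """Delete every .pyc file under `root` and return the removed paths."""
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".pyc"):
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
    return removed
```

Calling remove_stale_pyc() from the project directory and rerunning the crawl should get past this particular ImportError. Since the project dates from 2016, the sources themselves may still expect Python 2, so further errors are possible.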
