Amazon Crawler

A relatively simple Amazon.com crawler written in Python. It has the following features:

  • supports hundreds of simultaneous requests, depending on the machine's limits
  • supports using proxy servers
  • supports scaling to multiple machines that orchestrate the crawl and stay in sync
  • can be paused and restarted without losing its place
  • logs progress and warning conditions to a file for later analysis

It was used to pull more than one million products and their images from Amazon in a few hours.

Getting It Set Up

After you get a local copy of this codebase (either downloaded as a zip or cloned with git), you'll need to install the Python dependencies:

pip install -r requirements.txt

Then you'll need to go into the settings.py file and update a number of values:

  • Database Name, Host and User - Connection information for storing products in a postgres database
  • Redis Host, Port and Database - Connection information for storing the URL queue in redis
  • Proxy List as well as User, Password and Port - Connection information for your list of proxy servers
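
As a rough illustration of the values listed above, a settings.py might be laid out something like the sketch below. Every variable name and value here is a placeholder assumption rather than the project's actual configuration; only max_threads is named elsewhere in this README. Check the real settings.py for the names it uses.

```python
# Hypothetical settings.py layout; all names and values are placeholders.

# Postgres: where product records are stored
database = "amazon_crawler"
host = "localhost"
user = "postgres"

# Redis: where the URL queue lives
redis_host = "localhost"
redis_port = 6379
redis_db = 0

# Proxies: list of proxy hosts plus shared credentials and port
proxies = ["proxy1.example.com", "proxy2.example.com"]
proxy_user = "username"
proxy_pass = "password"
proxy_port = 8080

# Number of crawler threads to spin up
max_threads = 200
```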

Once you've updated all of your connection information, run the following at the command line to set up the Postgres table that will store the product records:

python models.py

The fields that are stored for each product are the following:

  • title
  • product_url (URL for the detail page)
  • listing_url (URL of the subcategory listing page we found this product on)
  • price
  • primary_img (the URL to the full-size primary product image)
  • crawl_time (the timestamp of when the crawl began)
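
To make the record shape concrete, here is a minimal sketch of the six fields as a Python dataclass. The class name ProductRecord and the field types are assumptions for illustration; the actual table definition lives in models.py and may differ.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional


@dataclass
class ProductRecord:
    """Hypothetical in-memory mirror of one row in the products table."""
    title: str
    product_url: str            # URL of the detail page
    listing_url: str            # subcategory listing page the product was found on
    price: Optional[str]        # kept as text here; the real column type may differ
    primary_img: Optional[str]  # URL of the full-size primary image
    crawl_time: datetime        # timestamp of when the crawl began


# Example values are made up for illustration.
crawl_started = datetime(2024, 1, 1, 12, 0, 0)
record = ProductRecord(
    title="Example Product",
    product_url="https://www.amazon.com/dp/EXAMPLE",
    listing_url="https://www.amazon.com/s?example-listing",
    price="$19.99",
    primary_img="https://images.example.com/full.jpg",
    crawl_time=crawl_started,
)
print(list(asdict(record)))
```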

How it Works

You begin the crawler for the first time by running:

python crawler.py start

This runs a function that reads all of the category URLs stored in the start-urls.txt file and expands each one into the subcategory URLs it finds on that category page, typically hundreds in total. Each of these subcategory URLs is placed in the redis queue that holds the frontier of listing URLs to be crawled.
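
The seeding step can be sketched roughly as follows. This is not the project's code: a plain deque stands in for the redis queue, and the function names seed_queue and fake_find_subcategory_urls are hypothetical. The stub replaces the real work of fetching a category page and parsing its links.

```python
from collections import deque
from typing import Callable, Iterable, List


def seed_queue(
    category_urls: Iterable[str],
    find_subcategory_urls: Callable[[str], List[str]],
    queue: deque,
) -> int:
    """Expand each category URL into subcategory URLs and enqueue them.

    Returns the number of URLs added. `queue` stands in for the redis list.
    """
    added = 0
    for category_url in category_urls:
        category_url = category_url.strip()
        if not category_url:
            continue  # skip blank lines from the start-URL file
        for sub_url in find_subcategory_urls(category_url):
            queue.append(sub_url)
            added += 1
    return added


def fake_find_subcategory_urls(category_url: str) -> List[str]:
    # Stub: a real crawler would fetch the page and parse its links.
    return [f"{category_url}/sub{i}" for i in range(3)]


frontier = deque()
lines = ["https://example.com/music\n", "\n", "https://example.com/sports\n"]
count = seed_queue(lines, fake_find_subcategory_urls, frontier)
print(count, len(frontier))
```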

Then the program spins up the number of threads defined in settings.max_threads and each one of those threads pops a listing URL from the queue, makes a request to it and then stores the (usually) 10-12 products it finds on the listing page. It also looks for the "next page" URL and puts that in the queue.
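
A minimal sketch of that worker loop is shown below, under several assumptions: Python's in-process queue.Queue stands in for the redis queue, a list stands in for the database, and fetch_listing is a hypothetical callable that returns the products on a page plus the next-page URL (or None). The real crawler's structure may differ.

```python
import queue
import threading
from typing import Callable, List, Optional, Tuple

# (products found on the page, URL of the next page or None)
PageResult = Tuple[List[str], Optional[str]]


def crawl(
    start_urls: List[str],
    fetch_listing: Callable[[str], PageResult],
    max_threads: int,
) -> List[str]:
    """Drain a queue of listing URLs with a pool of worker threads."""
    frontier: "queue.Queue[Optional[str]]" = queue.Queue()
    for url in start_urls:
        frontier.put(url)

    products: List[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            url = frontier.get()
            if url is None:  # sentinel: no more work
                frontier.task_done()
                return
            try:
                found, next_url = fetch_listing(url)
                with lock:
                    products.extend(found)  # stand-in for a database insert
                if next_url is not None:
                    frontier.put(next_url)  # follow pagination
            except Exception:
                pass  # a real crawler would log the failure here
            finally:
                frontier.task_done()

    threads = [threading.Thread(target=worker) for _ in range(max_threads)]
    for t in threads:
        t.start()
    frontier.join()         # wait until every queued URL has been processed
    for _ in threads:
        frontier.put(None)  # then tell each worker to exit
    for t in threads:
        t.join()
    return products


def fake_fetch_listing(url: str) -> PageResult:
    # Stub: each listing has two pages with two products per page.
    found = [f"{url}#item{i}" for i in range(2)]
    next_url = None if url.endswith("?page=2") else f"{url}?page=2"
    return found, next_url


results = crawl(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch_listing,
    max_threads=4,
)
print(len(results))
```

Putting the next-page URL back on the queue before marking the current URL as done keeps the queue's pending count above zero, so the main thread does not stop waiting while pagination is still being followed.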

Restarting the crawler

If you're restarting the crawler and don't want it to go back to the beginning, run it without the start argument:

python crawler.py

This will skip the step of populating the URL queue with subcategory links, and assumes that there are already URLs stored in redis from a previous instance of the crawler.

This is convenient when you update the crawler or parsing logic in ways that only affect a few pages, since you don't have to go back to the beginning and redo all of your previous crawling work.
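
The difference between the two invocations can be sketched as a simple check on the command-line arguments. The function name should_seed is hypothetical, and the real crawler.py may parse its arguments differently.

```python
import sys
from typing import List


def should_seed(argv: List[str]) -> bool:
    """Return True when the first command-line argument is 'start'."""
    return len(argv) > 1 and argv[1] == "start"


if __name__ == "__main__":
    if should_seed(sys.argv):
        print("seeding the queue from the start URLs, then crawling")
    else:
        print("resuming from URLs already in the queue")
```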

Piping Output to a Logfile

If you'd like to redirect the logging output into a logfile for later analysis, run the crawler with:

python crawler.py [start] > /var/log/crawler.log
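
Note that > redirects only standard output. If you adapt the crawler to use Python's logging module, keep in mind that logging.StreamHandler writes to standard error by default, so those lines would not reach the file. The sketch below is one way to send log lines to standard output instead; the logger name and format are assumptions, not the project's own logging setup.

```python
import logging
import sys


def make_logger(name: str = "crawler") -> logging.Logger:
    """Build a logger that writes timestamped lines to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


log = make_logger()
log.info("crawled listing page: %s", "https://example.com/listing")
log.warning("no products found on: %s", "https://example.com/empty")
```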

Known Limitations

Amazon uses many different styles of markup depending on the category and product type. This crawler focused mostly on the "Music, Movies & Games" category as well as the "Sports & Outdoors" category.

The extractors for finding product listings and their details will likely need to be changed to crawl different categories, or as the site's markup changes over time.
