
Scrapy & AutoExtract API integration

This library integrates ScrapingHub's AI Enabled Automatic Data Extraction into a Scrapy spider using a downloader middleware. The middleware adds the result of AutoExtract to response.meta['autoextract'] for consumption by the spider.

Installation

pip install scrapy-autoextract

scrapy-autoextract requires Python 3.5+

Configuration

  1. Add the AutoExtract downloader middleware in the settings file:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_autoextract.AutoExtractMiddleware': 543,
    }
    

Note that this should be the last downloader middleware to be executed.

Usage

The middleware is opt-in and must be explicitly enabled per request with the {'autoextract': {'enabled': True}} request meta. All of the options below can be set either in the project settings file or, for specific spiders, in the custom_settings dict.
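For example, here is a minimal sketch of enabling AutoExtract for a single request; the spider name and URL are placeholders:

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'  # placeholder spider name

    def start_requests(self):
        # Enable AutoExtract only for this request; the page type can be
        # passed through the same 'autoextract' meta key.
        yield scrapy.Request(
            'https://example.com/some-product',  # placeholder URL
            meta={'autoextract': {'enabled': True, 'pageType': 'product'}},
            callback=self.parse,
        )

    def parse(self, response):
        yield response.meta['autoextract']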

Available settings (a per-spider configuration sketch follows this list):

  • AUTOEXTRACT_USER [mandatory] is your AutoExtract API key.
  • AUTOEXTRACT_URL [optional] is the AutoExtract service URL. Defaults to autoextract.scrapinghub.com.
  • AUTOEXTRACT_TIMEOUT [optional] sets the timeout for AutoExtract responses. Defaults to 660 seconds. It can also be defined per request by setting download_timeout in request.meta.
  • AUTOEXTRACT_PAGE_TYPE [mandatory] defines the kind of document to be extracted. Currently available options are "product" and "article". It can also be defined on spider.page_type, or via the {'autoextract': {'pageType': '...'}} request meta. This is required so the AutoExtract classifier knows what kind of page to extract.
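As a sketch of the per-spider configuration, the same options can go into custom_settings; the API key value is a placeholder:

import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'articles'  # placeholder spider name
    custom_settings = {
        'AUTOEXTRACT_USER': '<your AutoExtract API key>',  # placeholder
        'AUTOEXTRACT_PAGE_TYPE': 'article',
        # Optional settings, shown with their documented defaults:
        # 'AUTOEXTRACT_URL': 'autoextract.scrapinghub.com',
        # 'AUTOEXTRACT_TIMEOUT': 660,
    }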

Within the spider, consuming the AutoExtract result is as easy as:

def parse(self, response):
    yield response.meta['autoextract']

Limitations

When using the AutoExtract middleware, there are some limitations.

  • The incoming spider request is rendered by AutoExtract, not just downloaded by Scrapy, which can change the result: the IP is different, the headers are different, etc.
  • Only GET requests are supported.
  • Custom headers and cookies are not supported (i.e. the Scrapy features to set them don't work).
  • Proxies are not supported (they would work incorrectly, sitting between Scrapy and AutoExtract instead of between AutoExtract and the website).
  • The AutoThrottle extension can work incorrectly for AutoExtract requests, because AutoExtract response times can be much larger than the time required to download a page, so it's best to set AUTOTHROTTLE_ENABLED=False in the settings.
  • Redirects are handled by AutoExtract, not by Scrapy, so redirect middlewares might have no effect.
  • Retries should be disabled, because AutoExtract handles them internally (use RETRY_ENABLED=False in the settings). There is one exception: if too many requests are sent in a short amount of time, AutoExtract returns HTTP code 429; in that case it's best to use RETRY_HTTP_CODES=[429]. A sketch of these settings follows this list.
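A minimal settings sketch reflecting the AutoThrottle and retry points above; treat it as a starting point rather than a definitive configuration:

# settings.py
# AutoExtract responses can take much longer than a normal page download,
# which confuses AutoThrottle.
AUTOTHROTTLE_ENABLED = False

# Either disable retries entirely, since AutoExtract retries internally...
RETRY_ENABLED = False
# ...or, if you hit HTTP 429 (too many requests), keep retries only for it:
# RETRY_ENABLED = True
# RETRY_HTTP_CODES = [429]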
