Client interface for Scrapinghub API

Usage

The scrapinghub module is a Python library for communicating with the Scrapinghub API.

First, you connect to Scrapinghub:

>>> from scrapinghub import Connection
>>> conn = Connection('APIKEY')
>>> conn
Connection('APIKEY')

You can list the projects available to your account:

>>> conn.project_ids()
[u'123', u'456']

And select a particular project to work with:

>>> project = conn['123']
>>> project
Project(Connection('APIKEY'), '123')
>>> project.id
'123'
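
Since project_ids() returns plain id strings, the two calls combine naturally when you want to walk over every project in the account (a small sketch):

>>> for project_id in conn.project_ids():
...     project = conn[project_id]
...     print(project.id)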

To schedule a spider run (it returns the job id):

>>> project.schedule('myspider', arg1='val1')
u'4ca37770a1a3a24c45000005'
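
Because schedule() returns the job id, you can pass it straight to project.job() (covered below) to keep a handle on the run. A minimal sketch, with myspider and arg1 as placeholders:

>>> job_id = project.schedule('myspider', arg1='val1')
>>> job = project.job(job_id)
>>> job.info['spider']
'myspider'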

To get the list of spiders in the project:

>>> project.spiders()
[u'spider1', u'spider2']
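
The returned names can be fed back into schedule(); for example, a sketch that kicks off one run of every spider in the project:

>>> for name in project.spiders():
...     job_id = project.schedule(name)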

To get all finished jobs:

>>> jobs = project.jobs(state='finished')

jobs is a JobSet. JobSet objects are iterable and yield Job objects when iterated, so you typically use them like this:

>>> for job in jobs:
...     # do something with each job, e.g. print its id
...     print(job.id)

Or, if you just want to get the job ids:

>>> [x.id for x in jobs]
[u'4c916f80e8bd6f68c2000000', u'4c9170fae8bd6f6cac000000', u'4c9190bde8bd6f761c000000']
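
Other query parameters can be passed to jobs() in the same way; assuming the endpoint also accepts a spider filter (an assumption, not shown above), fetching the finished runs of a single spider would look like:

>>> jobs = project.jobs(spider='myspider', state='finished')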

To select a specific job:

>>> job = project.job('4cdacfe7a1a3a27d7a000000')
>>> job.id
'4cdacfe7a1a3a27d7a000000'

To retrieve all scraped items from a job:

>>> for item in job.items():
...     # each item is a plain dict of scraped fields
...     print(item)
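
Because each item is a plain dict, exporting a job's output to a JSON Lines file needs only the standard library (a sketch; the file name is arbitrary):

>>> import json
>>> with open('items.jl', 'w') as f:
...     for item in job.items():
...         f.write(json.dumps(item) + '\n')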

To retrieve all log entries from a job:

>>> for logitem in job.log():
...     # logitem is a dict with logLevel, message and time keys
...     print(logitem['message'])
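
For instance, to print only the error entries (assuming the API reports them with logLevel set to 'ERROR', which is an assumption):

>>> for logitem in job.log():
...     if logitem['logLevel'] == 'ERROR':
...         print(logitem['message'])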

To get job info:

>>> job.info['spider']
'myspider'
>>> job.info['started_time']
'2010-09-28T15:09:57.629000'
>>> job.info['tags']
[]
>>> job.info['fields_count']['description']
1253

To mark a job with the tag consumed:

>>> job.update(add_tag='consumed')

To mark several jobs with the tag consumed (JobSet also supports the update() method):

>>> project.jobs(state='finished').update(add_tag='consumed')
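
A common pattern is to process only the jobs that have not been consumed yet, then tag them. A sketch, assuming the jobs endpoint accepts a lacks_tag filter and using a hypothetical process() helper:

>>> for job in project.jobs(state='finished', lacks_tag='consumed'):
...     for item in job.items():
...         process(item)  # process() is a placeholder for your own handling
...     job.update(add_tag='consumed')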

To delete a job:

>>> job.delete()

To delete several jobs (JobSet also supports the delete() method):

>>> project.jobs(state='finished').delete()
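
Combining tags with deletion gives a simple cleanup routine; assuming the endpoint accepts a has_tag filter (an assumption, like lacks_tag above):

>>> project.jobs(state='finished', has_tag='consumed').delete()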
