
python-scrapinghub's Introduction

Client interface for Scrapinghub API


scrapinghub is a Python library for communicating with the Scrapinghub API.

Requirements

  • Python 2.7 or Python 3.5+

Installation

The quick way:

pip install scrapinghub

You can also install the library with MessagePack support, which provides better response times and improved bandwidth usage:

pip install scrapinghub[msgpack]

Documentation

Documentation is available online via Read the Docs or in the docs directory.

python-scrapinghub's People

Contributors

alexcepoi, andresp99999, bertinatto, burnzz, chekunkov, dangra, ejulio, elacuesta, eliasdorneles, gallaecio, hermit-crab, immerrr, jesuslosada, kalessin, laurentsenta, manycoding, ms5, noviluni, omab, pablohoffman, pardo, pawelmhm, qrilka, rafaelcapucho, redapple, shaneaevans, torymur, victor-torres, void, vshlapakov


python-scrapinghub's Issues

Jobs.iter_last start and start_after parameters undocumented

If I look at the jobs.iter documentation I would assume that start is an integer count of recent jobs to skip, but I get this error

Invalid "start" value: Key '0' is not a spider key, must have 2 part(-s)

which makes it seem like it might be expecting a job id here minus the project part?
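For reference, the call was roughly along these lines (the project id is hypothetical, and start=0 reflects my integer-offset assumption):

project = client.get_project(12345)  # hypothetical project id
last_jobs = list(project.jobs.iter_last(start=0))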

I can't find the code for start_after/startafter anywhere. It could be either a number of jobs to skip or a timestamp. If it's a timestamp, I'm not sure if "after" means "after in sort order" or "after chronologically (job time > start time)".

requests.exceptions.ReadTimeout on `jobq.list`

So, for one of the projects I'm working on, I'm writing an automation script that decides how many and which jobs to send to Scrapinghub.

To do precisely that, I need to get the number of pending or running jobs from Scrapinghub. So I decided to keep this simple, and this was the code I came up with (using the jobq API):

    def get_num_pending_or_running_jobs():
        num_pending_jobs = len(list(project.jobq.list(state='pending')))
        num_running_jobs = len(list(project.jobq.list(state='running')))
        return num_pending_jobs + num_running_jobs

This resulted in a ReadTimeout after about a couple of hours of running:

requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='storage.scrapinghub.com', port=443): Read timed out. (read timeout=60.0)

I can't afford this, as getting the list of pending or running jobs is absolutely necessary for the project I'm working on.
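In the meantime I'm considering a simple retry wrapper (untested sketch; the retry count and delay are arbitrary):

import time

import requests


def list_jobs_with_retry(project, state, attempts=3, delay=30):
    # Retry project.jobq.list() a few times if the read times out.
    for attempt in range(attempts):
        try:
            return list(project.jobq.list(state=state))
        except requests.exceptions.ReadTimeout:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)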

Does anyone know why this is happening? Does this happen very often?

Any help would be appreciated

Collections key not found with library

I'm curious about the difference between Collection.get() and Collection.iter(key=[KEY])

>>> key = '456/789'
>>> store = project.collections.get_store('trump')
>>> store.set({'_key': key, 'value': 'abc'})
>>> print(store.list(key=[key]))

[{'value': 'abc', '_key': '456/789'}]  # https://storage.scrapinghub.com/collections/9328/s/trump?key=456%2F789&meta=_key

>>> try:
>>>     print(store.get(key))
>>> except scrapinghub.client.exceptions.NotFound as e:
>>>     print(getattr(e, 'http_error', e))

404 Client Error: Not Found for url: https://storage.scrapinghub.com/collections/9328/s/trump/456/789

I assume that Collection.get() is a handy shortcut for the key-filtered .iter() function, so I guess the point of my issue is that .get() will raise an exception if given bad input, for example keys containing slashes.

`TypeError: __init__() got an unexpected keyword argument 'encoding'`

Hi there!

When doing:

pip install scrapinghub[msgpack]

and then running the script below, I get the following error:

from os import environ

import msgpack
import scrapinghub as sh

print(
    "",
    f"msgpack.version = {msgpack.version!r}",
    f"sh.__version__ = {sh.__version__!r}",
    "",
    sep="\n",
)

job = sh.ScrapinghubClient(environ.get("SH_APIKEY")).get_job("432787/1/1")
job.items.list(count=1)

$ SH_APIKEY=... python foobar.py

msgpack.version = (1, 0, 0)
sh.__version__ = '2.3.0'

Traceback (most recent call last):
  File "foobar.py", line 15, in <module>
    job.items.list(count=1)
  File "/home/alexander-matsievsky/.miniconda3/envs/demandmatrix/lib/python3.8/site-packages/scrapinghub/client/proxy.py", line 39, in list
    return list(self.iter(*args, **kwargs))
  File "/home/alexander-matsievsky/.miniconda3/envs/demandmatrix/lib/python3.8/site-packages/scrapinghub/client/proxy.py", line 114, in iter
    for entry in self._origin.iter_values(
  File "/home/alexander-matsievsky/.miniconda3/envs/demandmatrix/lib/python3.8/site-packages/scrapinghub/hubstorage/serialization.py", line 28, in mpdecode
    unpacker = Unpacker(encoding='utf8')
  File "msgpack/_unpacker.pyx", line 317, in msgpack._cmsgpack.Unpacker.__init__
TypeError: __init__() got an unexpected keyword argument 'encoding'
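The traceback points at msgpack 1.0, which removed the encoding argument from Unpacker, so pinning msgpack<1.0 works around it. A rough compatibility shim would look like this (a sketch, not the library's actual fix):

import msgpack


def make_unpacker():
    # msgpack >= 0.5 accepts raw=False (decode bytes to str);
    # older releases only understood encoding='utf8'.
    try:
        return msgpack.Unpacker(raw=False)
    except TypeError:
        return msgpack.Unpacker(encoding='utf8')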

Allow exporting items data in CSV format

I am using the client to request items, but I would like to limit the fields returned because some of them are way too big. So I only need a few fields, but I want all the items. For example, this works fine for CSV if I declare the fields parameter:

$ curl -uAPIKEY: "https://storage.scrapinghub.com/items/244066/83/3?format=csv&fields=name,venue"

"Ole Miss","Vaught Hemingway Stadium"
"Kansas State","Bill Snyder Family Stadium"
"LSU","Tiger Stadium"

But when I try it with the client I get:

Python 3.6.3 (default, Oct  3 2017, 21:45:48)
>>> import scrapinghub
>>> scrapinghub.__version__
'2.0.3'
>>> client = scrapinghub.ScrapinghubClient(APIKEY)
>>> job = client.get_job('244066/83/3')
>>> items = job.items.list(format='csv', fields=['name,venue'])

requests.exceptions.HTTPError: 406 Client Error: Not Acceptable for url:
https://storage.scrapinghub.com/items/244066/83/3?format=csv&fields=name%2Cvenue

scrapinghub.client.exceptions.ScrapinghubAPIError: No acceptable
content types matching header 'application/x-msgpack' and format 'csv'
The following are supported: application/x-msgpack, application/xml,
text/csv, application/json, application/x-jsonlines

Ok, so let's try without msgpack:

>>> client = scrapinghub.ScrapinghubClient(APIKEY, use_msgpack=False)
>>> job = client.get_job('244066/83/3')
>>> items = job.items.list(format='csv', fields=['name,venue'])

File "scrapinghub/hubstorage/serialization.py", line 25, in jldecode
    yield loads(line)
json.decoder.JSONDecodeError: Extra data: line 1 column 11 (char 10)

So the problem is that the response is assumed by the client to be JSON and it tries to decode the string:

'"Ole Miss","Vaught Hemingway Stadium"'

Ok, let's try it with json now:

>>> items = job.items.list(format='json', fields=['name,venue'])
>>> items

[[{'name': 'Ole Miss', 'venue': 'Vaught Hemingway Stadium',
   'venue_address': 'All-American Dr, University, MS 38677, EUA',
   'date': 1542857400000.0,...

Well, there's no error, but we get all the fields instead of just the two we requested; effectively, the fields parameter is ignored.

So maybe we could patch scrapinghub/hubstorage/resourcetype.py:apirequest() to check for the csv format and bypass the JSON decoding, but it would actually be better if the backend API supported this field-subset declaration for other formats, namely JSON.

I see that the API supports max_fields, and we know that CSV supports field limiting, so maybe it's not a big deal to get the API to support field limiting for JSON as well.
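In the meantime, a plain HTTP request mirroring the curl example above works for CSV (sketch; APIKEY is a placeholder):

import requests

resp = requests.get(
    'https://storage.scrapinghub.com/items/244066/83/3',
    params={'format': 'csv', 'fields': 'name,venue'},
    auth=(APIKEY, ''),
)
print(resp.text)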

Exception raised when using latest version of requests library.

When using the scrapinghub library with the latest version of requests (1.1.0) I got
the following exception:

  File "export_csv.py", line 16, in <module>
    main()
  File "export_csv.py", line 10, in main
    print project.spiders()
  File "/home/andresport/code/scrapinghub/slybot_test/lib/python2.6/site-packages/scrapinghub.py", line 189, in spiders
    result = self._get('spiders', 'json', params)
  File "/home/andresport/code/scrapinghub/slybot_test/lib/python2.6/site-packages/scrapinghub.py", line 154, in _get
    return self._request_proxy._get(method, format, params, headers, raw)
  File "/home/andresport/code/scrapinghub/slybot_test/lib/python2.6/site-packages/scrapinghub.py", line 73, in _get
    return self._request(url, None, headers, format, raw)
  File "/home/andresport/code/scrapinghub/slybot_test/lib/python2.6/site-packages/scrapinghub.py", line 100, in _request
    auth=self.auth, prefetch=False)
  File "/home/andresport/code/scrapinghub/slybot_test/lib/python2.6/site-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/home/andresport/code/scrapinghub/slybot_test/lib/python2.6/site-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'prefetch'

The problem seems to be that python-scrapinghub depends on the latest version of requests: https://github.com/scrapinghub/python-scrapinghub/blob/master/requirements.txt, however the 'prefetch' parameter is no longer used in the latest version: https://github.com/kennethreitz/requests/blob/master/requests/sessions.py.

It used to be in previous versions: https://github.com/kennethreitz/requests/blob/3c0b94047c1ccfca4ac4f2fe32afef0ae314094e/requests/sessions.py

The script that triggers this exception is the following:

#!/usr/bin/env python
from scrapinghub import Connection

APIKEY = 'xxxx'
PROJECT_ID = xxx

def main():
    conn = Connection(APIKEY)
    project = conn[PROJECT_ID]
    print project.spiders()
    jobs = project.jobs(state='finished')
    for job in jobs:
        print job

if __name__ == '__main__':
    main()

Make msg_pack default

Considering the huge difference in performance, I propose to make it a default option.

It seems like it will require cleaning the "msgpack not installed" messages from the code, and updating the tests, setup.py and tox.

And docs.

Adding a tag actually replaces all existing tags with the new one.

I pull jobs lacking the consumed tag:

jobs = sh.jobs project: '4024', state: 'finished', lacks_tag: 'consumed'

Jobs, when pulled, are tagged with "consumed":

`curl -u #{ENV['SCRAPINGHUB_API_KEY']}: https://dash.scrapinghub.com/api/jobs/update.json -d project=4024 -d job=#{job['id']} -d add_tag=consumed`

When I respond to the pulled data, they're tagged with approved or rejected:

`curl -u #{ENV['SCRAPINGHUB_API_KEY']}: https://dash.scrapinghub.com/api/jobs/update.json -d project=4024 -d job=#{job['id']} -d add_tag=approved`

But, after being tagged with approved or rejected, jobs lose their consumed tag, and are again returned when querying for jobs missing the consumed tag.

jobs = sh.jobs project: '4024', state: 'finished', lacks_tag: 'consumed'

How is this aspect of the API meant to be used?

Cannot get spider settings

I can set specific spider settings in the Scrapinghub UI but cannot get them with this library.

I use the spider settings to store some keys that need to match a database key, that are then used while loading the scraped data in an ETL.

I was expecting something like:

spider = project.spiders.get('myspider')
spider.settings()

or

spider = project.spiders.get('myspider')
spider.metadata.get('spider_settings')

Related issues:
Issue #84 was about accessing spider settings from the job, which was possible through job.metadata.get('job_settings').

Do startts and endts limit results or adjust them?

Again primarily a question about jobs.iter. Unfortunately this is somewhat related to #138 and #136, so I'll repeat some comments.

Let's say ts is the timestamp for June 13 2019.

If I do jobs.iter(startts=ts, count=100) will this

  • Return 100 jobs where the most recent job is the first job chronologically before ts?
  • Return 100 jobs where the oldest job is the first job chronologically after ts?
  • Return 100 jobs starting from the most recent job, where no job is before ts?

Again, similar question about endts.

Method to iterate all jobs/items/whatever

The AWS boto client has a Paginator to help with iterating within API result limits, which, while clunky, is very nice to have since it's hard to get wrong.

A method to iterate/list all results, or else a Paginator that hides the pagination parameters (which are easy to get wrong), would be super helpful. The 1000-job limit is an issue in every project I've needed to use this in.
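For items, a rough helper along these lines could page through everything (a sketch built on assumptions: item keys have the form <jobkey>/<n> and the endpoint accepts such a key as start, neither of which I can confirm from the docs):

def iter_all_items(job, page_size=1000):
    offset = 0
    while True:
        batch = job.items.list(start='{}/{}'.format(job.key, offset), count=page_size)
        if not batch:
            break
        for item in batch:
            yield item
        offset += len(batch)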

Long intervals during resource iteration can lead to issues

Hello.

Recently there was issue #121, for which a batch-read workaround was implemented. I am now experiencing what I believe to be the same or a similar issue, but while using JSON instead of msgpack. Basically, when I do for item in job.items.iter(..., count=X, ...): and there are long intervals during iteration, the count can end up being ignored. I was able to reproduce it with the following snippet:

import time

from scrapinghub import ScrapinghubClient

sh_client = ScrapinghubClient(APIKEY, use_msgpack=False)
take = 10_000
job_id = '168012/276/1'
for i, item in enumerate(sh_client.get_job(job_id).items.iter(count=take, meta='_key')):
    print(f'\r{i} ({item["_key"]})', end='')

    if i == 3000:
        print('\nsleeping')
        time.sleep(60*3)

    if i > take:
        print('\nWTF')
        break

With the sleep part removed, the WTF section does not fire and the iterator stops on the 168012/276/1/9999th item.

This seems to be more of a Scrapy Cloud API platform problem, but I am reporting it here to track nonetheless.

For now I am assuming resource/collections iteration is not robust if any delays are possible client-side during retrieval (I haven't tested any other potential issues), and I will make a habit of either preloading everything at once (.list()) or using .list_iter() when it makes sense.

Add a check for jobs running locally or in the cloud

Currently, we may have settings for spiders running locally and other settings for spiders running in Scrapy Cloud (Dash).
Usually, I add a check like if 'SHUB_JOBKEY' not in os.environ:.
However, it may not be the best one, and if for some reason this env var is deprecated, I will need to update my checks.
It would be nice to have this kind of check, though I'm not sure whether it belongs in this library or somewhere else.
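Something as small as this would already help (a sketch based on the check above; it assumes Scrapy Cloud keeps setting SHUB_JOBKEY for platform jobs):

import os


def running_on_scrapy_cloud():
    # Scrapy Cloud sets SHUB_JOBKEY for jobs running on the platform.
    return 'SHUB_JOBKEY' in os.environ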

How to retrieve only items "_key". Is nodata supported in `items.iter()`?

I found an example with which I can retrieve only the specified meta from job items:
https://doc.scrapinghub.com/api/items.html?highlight=metadata#examples

This could be handy for parallelizing downloads and for fixing the issue with filters and collections (scrapinghub/arche#13): fetch only the keys (which I think should be fast), break them into batches, and then download the data. The problem is that we cannot properly divide items before download once a filter is applied, or for collections which have disordered or duplicated keys.

Naturally, getting just the keys should be much faster and would solve the issues above. I tested locally with curl for 13kk (13 million) keys:
CPU times: user 2.76 s, sys: 1.56 s, total: 4.33 s
Wall time: 9min 55s

That's not really fast, but anyway: is nodata supported?
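What I have in mind, assuming the client simply forwards extra parameters such as nodata to the HTTP API (which is exactly what I'm unsure about):

keys = [item['_key'] for item in job.items.iter(meta='_key', nodata=1)]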

jobs.run(spider, job_args = {arg1: val1}) can't have val1 as a list.

I have tried to pass a list as the val1 argument, and I was unable to run my spider for multiple links, which I pass as a value in job_args.

The way I got around that was to use repr to convert my list to a string, and then on the spider side evaluate the string using ast.literal_eval and run the spider.

Can somebody help me with a more "scrapy" solution?
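For reference, the workaround looks roughly like this (the argument name and URLs are illustrative):

import ast

# scheduling side: serialize the list into a single string argument
links = ['https://example.com/a', 'https://example.com/b']
project.jobs.run('myspider', job_args={'links': repr(links)})

# spider side (inside the spider, where the argument is an attribute):
# links = ast.literal_eval(self.links)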

iter behaving like list?

Based on the description in the docs, I'd expect the iter method on Items to instantly return a generator, then perform network requests as next is called on it, a bit like a MongoDB cursor. But instead it blocks whilst performing a lot of network activity and using substantial amounts of memory before returning the generator, which suggests it's loading the entire collection of items in the background, and is more like how I'd expect the list method to work. Is this correct? If this is how iter is supposed to work, this means it's almost impossible to work with even moderately large collections of scraped items.

No handlers could be found for logger "scrapinghub"

Is this the proper behavior for logging?

>>> import scrapinghub
>>> scrapinghub.__version__
'1.4.0'
>>> scrapinghub.logger
<logging.Logger object at 0x24540d0>
>>> scrapinghub.logger.info('adsf')
>>> scrapinghub.logger.error('adsf')
No handlers could be found for logger "scrapinghub"
>>> 
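The usual fix for libraries would be to attach a NullHandler, so applications only see output once they configure logging themselves (sketch):

import logging

logging.getLogger('scrapinghub').addHandler(logging.NullHandler())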

Include stats property to Job instance

More than once I have needed to get the stats of a job using the scrapinghub library, but it is not well documented that this information can be found using job.metadata.get("scrapystats") (the information is there, but it took me a while to figure out how to get the job stats).

As this seems (at least in my recent usage of the library) to be something that is used frequently, what do you think about including a new property in Job that returns the dictionary with the job stats (similar to items.Items.stats())?
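Something along these lines is what I have in mind (a sketch of the proposed addition to the existing Job class, assuming self.metadata is available as it is today; not existing library code):

class Job:

    @property
    def stats(self):
        # convenience wrapper over the metadata field mentioned above
        return self.metadata.get('scrapystats')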

Cannot run Spiders from Script

Hi,

I have a Scrapy project which works great. I am trying to migrate it to Scrapinghub.
I want to be able to launch spiders from a script (see code below), but it is not working (the spider's parse() function is never reached):

SCRIPT:

    def main():
        ...
        yield crawler.crawl(quotes_spider.QuotesSpider)
        crawler.start()

Is it possible to do it this way? If so, how? If not, how can I run a script which calls Spiders?

Thank you

Getting multiple errors in app console on executing scraper

Hi,

I have added a scraper in Scrapinghub. After executing the scraper I can see an entry for my job id in the completed jobs section. The error column in the completed jobs section shows 2 errors.
On checking the job log I got the error details, which are added below.

Error 1

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 185, in open_spider
    uri = self.urifmt % self._get_uri_params(spider)
KeyError: 'date'

Error 2

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 210, in item_scraped
    slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
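A guess at the cause, based on the first traceback (this is an assumption, not confirmed): the feed export URI template references a %(date)s placeholder, which Scrapy's feed export does not fill; it only provides %(time)s, %(name)s and spider attributes. For example:

# illustrative settings: %(date)s is not provided by Scrapy and raises KeyError: 'date'
FEED_URI = 's3://my-bucket/%(name)s/%(date)s.json'

# using the supported %(time)s placeholder avoids it
FEED_URI = 's3://my-bucket/%(name)s/%(time)s.json'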

Please let me know if there is any fix for this.

Thanks in advance.

'Items' object is not callable

I am using python-scrapinghub. Now I am facing two problems.

  1. I am getting an error:

Traceback (most recent call last):
  File "shop_info/test.py", line 58, in <module>
    for item in job.items():
TypeError: 'Items' object is not callable

  2. I want to pass a JSON object as an argument. I tried with this:

    project.jobs.run('shop_info', job_args={'input_data': input_data})

input_data is a JSON object.
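For reference, the new-style client exposes these as follows (a sketch; job argument values are strings, so the JSON object needs serializing):

import json

# 1. `items` is an object in the 2.x client; iterate via .iter() (or .list())
for item in job.items.iter():
    print(item)

# 2. serialize the JSON object to a string and parse it back on the spider side
project.jobs.run('shop_info', job_args={'input_data': json.dumps(input_data)})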

Difference between jobs.iter and jobs.iter_last

The documentation for jobs.iter says

by default :meth:Jobs.iter returns maximum last 1000 results.

This implies that it will return the most recent jobs (the first 1000, sorted in decreasing chronological order).

The documentation for jobs.iter_last says

Iterate through last jobs for each spider.

This implies that it will return the jobs with the highest chronological value (that is to say, the first n jobs sorted in decreasing chronological order).

I would guess that they're not both returning the latest jobs; however, with limited job retention I can't imagine a situation where you'd actually want the results in increasing chronological order.

Cannot access the Scrapinghub overwritten spider settings

Hello,

I would like to access the settings overwritten for a spider job through the Scrapinghub interface. Ideally, I would like those to be accessible as an attribute of scrapinghub.client.jobs.Job, because they can be different for different jobs of the same spider.

I saw that through the RESTful API you can specify the job settings when running jobs (https://doc.scrapinghub.com/api/jobs.html#run-json), but I didn't find job_settings as an attribute of any class.

Is there a way to access the job settings through python-scrapinghub?
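For reference, the per-job settings mentioned in the related issue above are reachable through job metadata (sketch; the job key is hypothetical):

job = client.get_job('123/1/56')
job_settings = job.metadata.get('job_settings')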

Return timestamp values as datetime objects

Several Scrapinghub API endpoints accept or return timestamps, currently as UNIX timestamps in milliseconds.

It would be great to have those values as datetime.datetime objects in the results so that consumers of python-scrapinghub calls do not have to convert them for interpretation.

Passing datetime.datetime objects in methods allowing filtering on timestamps, e.g. where startts and endts arguments are supported, would be very handy too.
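For reference, this is the conversion consumers currently have to do by hand (milliseconds since the UNIX epoch to an aware datetime and back):

from datetime import datetime, timezone

ts_ms = 1542857400000  # value as returned by the API today
dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
ts_back = int(dt.timestamp() * 1000)  # e.g. for startts/endts arguments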

Documentation conflicts regarding jobs.iter limits

In the documentation for jobs.iter it says

retrieve all jobs for a spider:
>>> spider.jobs.iter()

However, if there's an implicit limit of 1000 results then this will not return all jobs for a spider; it only does so when the spider has 1000 or fewer jobs. I don't believe there's a method in the API to get all jobs for a spider at this time.

project.jobs close_reason support needed

I would like to get the last "finished" job for a spider.

But if I do:

project.jobs(spider='myspider', state='finished', count=-1)

I will only get jobs with a state of finished but this may include jobs with a close_reason of shutdown or something other than "finished".

I would like to be able to do:

project.jobs(spider='myspider', close_reason='finished', count=-1)

which would of course assume that state is finished as well.

How to run a job?

I can't see how to run a job. There are two examples in the docs. In the project section:

For example, to schedule a spider run (it returns a job object):

>>> project.jobs.run('spider1', job_args={'arg1':'val1'})
<scrapinghub.client.Job at 0x106ee12e8>>

and in the spider section:

Like project instance, spider instance has jobs field to work with the spider's jobs.

To schedule a spider run:

>>> spider.jobs.run(job_args={'arg1:'val1'})
<scrapinghub.client.Job at 0x106ee12e8>>

Neither works; both throw AttributeError: 'Jobs' object has no attribute 'run'.

Retries are not done for collections/items/requests/logs iterators.

Good day. We have a project that makes extensive use of the API, and at certain times "429 Too many requests for user" errors occur and are raised with no indication of any retries, exponential or otherwise, taking place. In fact, it seems no HTTP errors are retried for resource iterators whatsoever:
https://github.com/scrapinghub/python-scrapinghub/blob/master/scrapinghub/hubstorage/resourcetype.py#L134-L143
Although I see other places do handle such:
https://github.com/scrapinghub/python-scrapinghub/blob/master/scrapinghub/hubstorage/client.py#L20-L37
Is this by design?

UPD:
So at this point: https://github.com/scrapinghub/python-scrapinghub/blob/master/scrapinghub/hubstorage/client.py#L113-L116
the iterator request is executed as-is, without retrier wrapping, as it is deemed not idempotent. At a later point, here: https://github.com/scrapinghub/python-scrapinghub/blob/master/scrapinghub/hubstorage/resourcetype.py#L134-L143, it is not handled at all (scrapinghub.client.exceptions.ScrapinghubAPIError('Too many requests for user') in our case).

collections.get_store is not working as documented

Upon going through the collections doc, we see (step 2): call .get_store(<somename>) to create or access the named collection you want (the collection will be created automatically if it doesn't exist); you get a "store" object back.
But when you try this:

>>> store = collections.get_store('store_which_does_not_exist')
>>> store.get('key_which_does_not_exist')
DEBUG:https://storage.scrapinghub.com:443 "GET /collections/462630/s/store_which_does_not_exist/key_which_does_not_exist HTTP/1.1" 404 46
2021-02-04 13:33:20 [urllib3.connectionpool] DEBUG: https://storage.scrapinghub.com:443 "GET /collections/462630/s/store_which_does_not_exist/key_which_does_not_exist HTTP/1.1" 404 46
DEBUG:<Response [404]>: b'unknown collection store_which_does_not_exist\n'
2021-02-04 13:33:20 [HubstorageClient] DEBUG: <Response [404]>: b'unknown collection store_which_does_not_exist\n'
*** scrapinghub.client.exceptions.NotFound: unknown collection store_which_does_not_exist

When we .set some value on a store which doesn't exist, the store is created and then the values are stored.

>>> store.set({'_key': 'some_key', 'value': 'some_value'})
DEBUG:https://storage.scrapinghub.com:443 "POST /collections/462630/s/store_which_does_not_exist HTTP/1.1" 200 0
2021-02-04 13:36:56 [urllib3.connectionpool] DEBUG: https://storage.scrapinghub.com:443 "POST /collections/462630/s/store_which_does_not_exist HTTP/1.1" 200 0
According to the docs, shouldn't the store be created when we call .get_store()?

Serialization error when item contains timezone-aware datetime object

When using dateparser>=0.6, the parser produces timezone-aware datetime objects if there's a timezone in the input string.

Example spider:

# -*- coding: utf-8 -*-
from datetime import datetime

import scrapy
import dateparser



class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        return {'url': response.url,
                'ts': dateparser.parse(
                        datetime.utcnow().isoformat()+'Z')}

Error seen on Scrapy Cloud:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python3.6/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/local/lib/python3.6/site-packages/sh_scrapy/extension.py", line 47, in item_scraped
    self._write_item(item)
  File "/usr/local/lib/python3.6/site-packages/sh_scrapy/writer.py", line 78, in write_item
    self._write('ITM', item)
  File "/usr/local/lib/python3.6/site-packages/sh_scrapy/writer.py", line 46, in _write
    default=jsondefault
  File "/usr/local/lib/python3.6/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/local/lib/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/lib/python3.6/site-packages/scrapinghub/hubstorage/serialization.py", line 43, in jsondefault
    delta = o - EPOCH
TypeError: can't subtract offset-naive and offset-aware datetimes
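A possible spider-side workaround (sketch): normalize parsed datetimes to naive UTC before yielding the item, so the serializer's naive-epoch subtraction keeps working.

from datetime import datetime, timezone

import dateparser

ts = dateparser.parse(datetime.utcnow().isoformat() + 'Z')
naive_utc_ts = ts.astimezone(timezone.utc).replace(tzinfo=None)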

Use SHUB_JOBAUTH environment variable in utils.parse_auth method

Currently, the parse_auth method tries to get the API Key from the SH_APIKEY environment variable, which needs to be manually set either in the spider's code or in the [docker] image's code. A common practice is to create dummy users and associate them with the project so that real contributors don't have to share their API Keys.

Another option is to use the credentials provided by the SHUB_JOBAUTH, defined during runtime when executing jobs in the Scrapy Cloud platform.

Although it's possible to use it with Collections and Frontera, this is not a regular Dash API Key but a JWT token generated at runtime by the JobQ service, which only works for a part of our API endpoints (JobQ/Hubstorage).

I'd like to contribute with a Pull Request adding support for this ephemeral API Key.
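The behaviour I have in mind, roughly (a sketch of the proposal, not the library's implementation):

import os


def resolve_auth():
    apikey = os.environ.get('SH_APIKEY')
    if apikey:
        return apikey
    # JWT-style token that Scrapy Cloud sets at runtime; as noted above,
    # it only works for the JobQ/Hubstorage endpoints.
    return os.environ.get('SHUB_JOBAUTH')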

msgpack errors when using iter() with intervals between each batch call

Good Day!

I've encountered this peculiar issue when trying to save memory by processing the items in chunks. Here's a stripped-down version of the code to reproduce the issue:

import pandas as pd

from scrapinghub import ScrapinghubClient

def read_job_items_by_chunk(jobkey, chunk=10000):
    """In order to prevent OOM issues, the jobs' data must be read in
    chunks.

    This will return a generator of pandas DataFrames.
    """

    client = ScrapinghubClient("APIKEY123")

    item_generator = client.get_job(jobkey).items.iter()

    while item_generator:
        yield pd.DataFrame(
            [next(item_generator) for _ in range(chunk)]
        )

for df_chunk in read_job_items_by_chunk('123/123/123'):
    # having a small chunk-size like 10000 won't have any problems
    pass

for df_chunk in read_job_items_by_chunk('123/123/123', chunk=25000):
    # having a big chunk-size like 25000 will throw out errors like the one below
    pass

Here's the common error it throws:

<omitted stack trace above>

    [next(item_generator) for _ in range(chunk)]
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
    _path, requests_params, **apiparams
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
    for obj in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
  File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 67: invalid start byte

Moreover, it throws out a different error when using a much bigger chunk-size, like 50000:

<omitted stack trace above>

    [next(item_generator) for _ in range(chunk)]
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
    _path, requests_params, **apiparams
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
    for obj in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
TypeError: unhashable type: 'dict'

I find that the workaround/solution for this would be to have a lower value for chunk. So far, 1000 works great.

This uses scrapy:1.5 stack in Scrapy Cloud.

I'm guessing this might have something to do with the long waiting time that happens when processing the pandas DataFrame chunk, and when the next batch of items are being iterated, the server might have deallocated the pointer to it or something.

May I ask if there might be a solution for this, since a much bigger chunk size would help with the speed of our jobs?
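One direction I may try is the batch-read approach from the earlier issue (a sketch; it assumes list_iter() accepts a chunksize argument):

def read_job_items_by_chunk_v2(jobkey, chunk=10000):
    client = ScrapinghubClient("APIKEY123")
    # each iteration fetches a whole chunk in one request, so there is no
    # long-lived streaming response to time out between chunks
    for batch in client.get_job(jobkey).items.list_iter(chunksize=chunk):
        yield pd.DataFrame(batch)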

I've marked it as bug for now as this is quite an unexpected/undocumented behavior.

Cheers!

Improve the documentation about the possible parameters of items.iter()

I’m trying to understand the possible parameters of Job.items.iter() but it’s not that clear to me:

  • Why is count documented as a parameter on its own? (I assume the rest of the pagination parameters are assumed to be part of apiparameters)
  • What is requests_params for?
  • I see other parameters mentioned in the documentation that don’t seem part of the Items API, pagination or meta. For example, filter (in the 4th example of the Items documentation, with list instead of iter but I assume they accept the same arguments)

Update tests deps to latest versions and unpin deps

#110 shows that there are some old dependencies, so the idea is:

  1. Update all test dependencies (pytest and others) to the latest versions
  2. Unpin all the versions (since libraries that are not updated are prone to security vulnerabilities)

Unclear interaction between iter skip values and start/end times

For jobs.iter but could apply elsewhere.

If start=10 and startts=100000 will it

  • skip 10 results after time 100000 or will it
  • return results that are at least 10 from the start and after time 100000
  • return results that are at least 10 from the most recent and no older than time 100000 (re: ambiguity in #136 )?

The same question about endts and start.

Obtaining error messages

Based upon the documentation I have read, I don't see anything for obtaining error messages when there are errors from executing a spider. Did I overlook something?
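For reference, the closest thing I can see today is filtering the job log client-side (a sketch; it assumes log entries expose numeric 'level' and 'message' fields):

import logging

job = client.get_job('123/1/2')  # hypothetical job key
errors = [
    entry for entry in job.logs.iter()
    if entry.get('level', 0) >= logging.ERROR
]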

Omitting `_key` hangs `_BatchWriter`

Failing

# reprex.py
import os

import scrapinghub

store = (
    scrapinghub.ScrapinghubClient(os.getenv('SH_APIKEY'))
    .get_project(1234567890)
    .collections.get_store("ok_to_mess_around_with")
)

writer = store.create_writer()

writer.write({"foo": "bar"})
print("write")

writer.write({"fizz": "buzz"})
print("write")

writer.flush()
print("flush")

$ python reprex.py

write
write
^CTraceback (most recent call last):
...
KeyboardInterrupt

Passing

import os

import scrapinghub

store = (
    scrapinghub.ScrapinghubClient(os.getenv('SH_APIKEY'))
    .get_project(1234567890)
    .collections.get_store("ok_to_mess_around_with")
)

writer = store.create_writer()

writer.write({"_key": "foo", "foo": "bar"})
print("write")

writer.write({"_key": "fizz", "fizz": "buzz"})
print("write")

writer.flush()
print("flush")

$ python reprex.py

write
write
flush

iter() not handling meta=None gracefully

I ran into a bit of trouble with the small code snippet below:

def get_spider_items(jobkey, meta=None):
    CLIENT.get_job(jobkey).items.iter(meta=meta)  # CLIENT was instantiated before

get_spider_items("1/2/3")

This results in something like:

  File "/app/python/lib/python3.6/site-packages/scrapinghub/client/proxy.py", line 113, in iter
    drop_key = '_key' not in apiparams.get('meta', [])
TypeError: argument of type 'NoneType' is not iterable

A workaround would be something like the one below, although it would be much better if the package handled this:

meta = derive_something_with_meta() or []
get_spider_items("1/2/3", meta=meta)

class for job's key

I have written a JobKey class that represents the key to a job on the ScrapingHub service, in project_id/spider_id/job_number format. It also offers methods to quickly get the ID of the spider/project, a tuple representation, etc.

You can find its code here.

Maybe it can be somehow modified/improved to get included into the python-scrapinghub library?
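A minimal sketch of the idea (not the author's implementation), e.g. JobKey.parse('123/45/6').spider_id == 45:

from collections import namedtuple


class JobKey(namedtuple('JobKey', ['project_id', 'spider_id', 'job_number'])):

    @classmethod
    def parse(cls, key):
        project_id, spider_id, job_number = (int(part) for part in key.split('/'))
        return cls(project_id, spider_id, job_number)

    def __str__(self):
        return '{}/{}/{}'.format(self.project_id, self.spider_id, self.job_number)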
