
python-scrapinghub's Introduction

Client interface for Scrapinghub API


scrapinghub is a Python library for communicating with the Scrapinghub API.

Requirements

  • Python 2.7 or Python 3.5+

Installation

The quick way:

pip install scrapinghub

You can also install the library with MessagePack support, which provides better response times and improved bandwidth usage:

pip install scrapinghub[msgpack]

Documentation

Documentation is available online via Read the Docs or in the docs directory.

python-scrapinghub's People

Contributors

alexcepoi, andresp99999, bertinatto, burnzz, chekunkov, dangra, ejulio, elacuesta, eliasdorneles, gallaecio, hermit-crab, immerrr, jesuslosada, kalessin, laurentsenta, manycoding, ms5, noviluni, omab, pablohoffman, pardo, pawelmhm, qrilka, rafaelcapucho, redapple, shaneaevans, torymur, victor-torres, void, vshlapakov


python-scrapinghub's Issues

Jobs.iter_last start and start_after parameters undocumented

If I look at the jobs.iter documentation I would assume that start is an integer count of recent jobs to skip, but I get this error

Invalid "start" value: Key '0' is not a spider key, must have 2 part(-s)

which makes it seem like it might be expecting a job id here minus the project part?
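For reference, the call was roughly along these lines (the project id is hypothetical, and start=0 reflects my integer-offset assumption):

project = client.get_project(12345)  # hypothetical project id
last_jobs = list(project.jobs.iter_last(start=0))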

I can't find the code for start_after/startafter anywhere. It could be either a number of jobs to skip or a timestamp. If it's a timestamp, I'm not sure if "after" means "after in sort order" or "after chronologically (job time > start time)".

requests.exceptions.ReadTimeout on `jobq.list`

So, for one of the projects I'm working on, I'm writing an automation script that decides how many and which jobs to send to Scrapinghub.

To do precisely that, I need to get the number of pending or running jobs from Scrapinghub. So I decided to keep this simple, and this was the code I came up with (using the jobq API):

    def get_num_pending_or_running_jobs():
        num_pending_jobs = len(list(project.jobq.list(state='pending')))
        num_running_jobs = len(list(project.jobq.list(state='running')))
        return num_pending_jobs + num_running_jobs

This resulted in a ReadTimeout after about a couple of hours of running:

requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='storage.scrapinghub.com', port=443): Read timed out. (read timeout=60.0)

I can't afford this, as getting the list of pending or running jobs is absolutely necessary for the project I'm working on.
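In the meantime I'm considering a simple retry wrapper (untested sketch; the retry count and delay are arbitrary):

import time

import requests


def list_jobs_with_retry(project, state, attempts=3, delay=30):
    # Retry project.jobq.list() a few times if the read times out.
    for attempt in range(attempts):
        try:
            return list(project.jobq.list(state=state))
        except requests.exceptions.ReadTimeout:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)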

Does anyone know why this is happening? Does this happen very often?

Any help would be appreciated

Collections key not found with library

I'm curious about the difference between Collection.get() and Collection.iter(key=[KEY])

>>> key = '456/789'
>>> store = project.collections.get_store('trump')
>>> store.set({'_key': key, 'value': 'abc'})
>>> print(store.list(key=[key]))

[{'value': 'abc', '_key': '456/789'}]  # https://storage.scrapinghub.com/collections/9328/s/trump?key=456%2F789&meta=_key

>>> try:
>>>     print(store.get(key))
>>> except scrapinghub.client.exceptions.NotFound as e:
>>>     print(getattr(e, 'http_error', e))

404 Client Error: Not Found for url: https://storage.scrapinghub.com/collections/9328/s/trump/456/789

I assume that Collection.get() is a handy shortcut for the key-filtered .iter() function, so I guess the point of my issue is that .get() will raise an exception if given bad input, for example keys containing slashes.

`TypeError: __init__() got an unexpected keyword argument 'encoding'`

Hi there!

When doing:

pip install scrapinghub[msgpack]

and then running the script below, I get the following error:

from os import environ

import msgpack
import scrapinghub as sh

print(
    "",
    f"msgpack.version = {msgpack.version!r}",
    f"sh.__version__ = {sh.__version__!r}",
    "",
    sep="\n",
)

job = sh.ScrapinghubClient(environ.get("SH_APIKEY")).get_job("432787/1/1")
job.items.list(count=1)

$ SH_APIKEY=... python foobar.py

msgpack.version = (1, 0, 0)
sh.__version__ = '2.3.0'

Traceback (most recent call last):
  File "foobar.py", line 15, in <module>
    job.items.list(count=1)
  File "/home/alexander-matsievsky/.miniconda3/envs/demandmatrix/lib/python3.8/site-packages/scrapinghub/client/proxy.py", line 39, in list
    return list(self.iter(*args, **kwargs))
  File "/home/alexander-matsievsky/.miniconda3/envs/demandmatrix/lib/python3.8/site-packages/scrapinghub/client/proxy.py", line 114, in iter
    for entry in self._origin.iter_values(
  File "/home/alexander-matsievsky/.miniconda3/envs/demandmatrix/lib/python3.8/site-packages/scrapinghub/hubstorage/serialization.py", line 28, in mpdecode
    unpacker = Unpacker(encoding='utf8')
  File "msgpack/_unpacker.pyx", line 317, in msgpack._cmsgpack.Unpacker.__init__
TypeError: __init__() got an unexpected keyword argument 'encoding'
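The traceback points at msgpack 1.0, which removed the encoding argument from Unpacker, so pinning msgpack<1.0 works around it. A rough compatibility shim would look like this (a sketch, not the library's actual fix):

import msgpack


def make_unpacker():
    # msgpack >= 0.5 accepts raw=False (decode bytes to str);
    # older releases only understood encoding='utf8'.
    try:
        return msgpack.Unpacker(raw=False)
    except TypeError:
        return msgpack.Unpacker(encoding='utf8')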

Allow exporting items data in CSV format

I am using the client to request items, but I would like to limit the fields returned because some of them are way too big. So I only need a few fields, but I want all the items. For example, this works fine for CSV if I declare the fields parameter:

$ curl -uAPIKEY: "https://storage.scrapinghub.com/items/244066/83/3?format=csv&fields=name,venue"

"Ole Miss","Vaught Hemingway Stadium"
"Kansas State","Bill Snyder Family Stadium"
"LSU","Tiger Stadium"

But when I try it with the client I get:

Python 3.6.3 (default, Oct  3 2017, 21:45:48)
>>> import scrapinghub
>>> scrapinghub.__version__
'2.0.3'
>>> client = scrapinghub.ScrapinghubClient(APIKEY)
>>> job = client.get_job('244066/83/3')
>>> items = job.items.list(format='csv', fields=['name,venue'])

requests.exceptions.HTTPError: 406 Client Error: Not Acceptable for url:
https://storage.scrapinghub.com/items/244066/83/3?format=csv&fields=name%2Cvenue

scrapinghub.client.exceptions.ScrapinghubAPIError: No acceptable
content types matching header 'application/x-msgpack' and format 'csv'
The following are supported: application/x-msgpack, application/xml,
text/csv, application/json, application/x-jsonlines

Ok, so let's try without msgpack:

>>> client = scrapinghub.ScrapinghubClient(APIKEY, use_msgpack=False)
>>> job = client.get_job('244066/83/3')
>>> items = job.items.list(format='csv', fields=['name,venue'])

File "scrapinghub/hubstorage/serialization.py", line 25, in jldecode
    yield loads(line)
json.decoder.JSONDecodeError: Extra data: line 1 column 11 (char 10)

So the problem is that the response is assumed by the client to be JSON and it tries to decode the string:

'"Ole Miss","Vaught Hemingway Stadium"'

Ok, let's try it with json now:

>>> items = job.items.list(format='json', fields=['name,venue'])
>>> items

[[{'name': 'Ole Miss', 'venue': 'Vaught Hemingway Stadium',
   'venue_address': 'All-American Dr, University, MS 38677, EUA',
   'date': 1542857400000.0,...

Well, there's no error, but we get all the fields instead of just the two we requested; effectively, the fields parameter is ignored.

So maybe we could patch scrapinghub/hubstorage/resourcetype.py:apirequest() to check for the csv format and bypass the JSON decoding, but it would actually be better if the backend API supported this field-subset declaration for other formats, namely JSON.

I see that the API supports max_fields, and we know that CSV supports field limiting, so maybe it's not a big deal to get the API to support field limiting for JSON as well.
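In the meantime, a plain HTTP request mirroring the curl example above works for CSV (sketch; APIKEY is a placeholder):

import requests

resp = requests.get(
    'https://storage.scrapinghub.com/items/244066/83/3',
    params={'format': 'csv', 'fields': 'name,venue'},
    auth=(APIKEY, ''),
)
print(resp.text)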

Exception raised when using latest version of requests library.

When using the scrapinghub library with the latest version of requests (1.1.0) I got
the following exception:

  File "export_csv.py", line 16, in <module>
    main()
  File "export_csv.py", line 10, in main
    print project.spiders()
  File "/home/andresport/code/scrapinghub/slybot_test/lib/python2.6/site-packages/scrapinghub.py", line 189, in spiders
    result = self._get('spiders', 'json', params)
  File "/home/andresport/code/scrapinghub/slybot_test/lib/python2.6/site-packages/scrapinghub.py", line 154, in _get
    return self._request_proxy._get(method, format, params, headers, raw)
  File "/home/andresport/code/scrapinghub/slybot_test/lib/python2.6/site-packages/scrapinghub.py", line 73, in _get
    return self._request(url, None, headers, format, raw)
  File "/home/andresport/code/scrapinghub/slybot_test/lib/python2.6/site-packages/scrapinghub.py", line 100, in _request
    auth=self.auth, prefetch=False)
  File "/home/andresport/code/scrapinghub/slybot_test/lib/python2.6/site-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/home/andresport/code/scrapinghub/slybot_test/lib/python2.6/site-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'prefetch'

The problem seems to be that python-scrapinghub depends on the latest version of requests: https://github.com/scrapinghub/python-scrapinghub/blob/master/requirements.txt, however the 'prefetch' parameter is no longer used in the latest version: https://github.com/kennethreitz/requests/blob/master/requests/sessions.py.

It used to be in previous versions: https://github.com/kennethreitz/requests/blob/3c0b94047c1ccfca4ac4f2fe32afef0ae314094e/requests/sessions.py

The script that triggers this exception is the following:

#!/usr/bin/env python
from scrapinghub import Connection

APIKEY = 'xxxx'
PROJECT_ID = xxx

def main():
    conn = Connection(APIKEY)
    project = conn[PROJECT_ID]
    print project.spiders()
    jobs = project.jobs(state='finished')
    for job in jobs:
        print job

if __name__ == '__main__':
    main()

Make msg_pack default

Considering the huge difference in performance, I propose to make it a default option.

It seems like it will require cleaning the "msgpack not installed" messages from the code, and updating the tests, setup.py and tox.

And docs.

Adding a tag actually replaces all existing tags with the new one.

I pull jobs lacking the consumed tag:

jobs = sh.jobs project: '4024', state: 'finished', lacks_tag: 'consumed'

Jobs, when pulled, are tagged with "consumed":

`curl -u #{ENV['SCRAPINGHUB_API_KEY']}: https://dash.scrapinghub.com/api/jobs/update.json -d project=4024 -d job=#{job['id']} -d add_tag=consumed`

When I respond to the pulled data, they're tagged with approved or rejected:

`curl -u #{ENV['SCRAPINGHUB_API_KEY']}: https://dash.scrapinghub.com/api/jobs/update.json -d project=4024 -d job=#{job['id']} -d add_tag=approved`

But, after being tagged with approved or rejected, jobs lose their consumed tag, and are again returned when querying for jobs missing the consumed tag.

jobs = sh.jobs project: '4024', state: 'finished', lacks_tag: 'consumed'

How is this aspect of the API meant to be used?

Cannot get spider settings

I can set specific spider settings in the Scrapinghub UI but cannot get them with this library.

I use the spider settings to store some keys that need to match a database key, that are then used while loading the scraped data in an ETL.

I was expecting something like:

spider = project.spiders.get('myspider')
spider.settings()

or

spider = project.spiders.get('myspider')
spider.metadata.get('spider_settings')

Related issues:
Issue #84 was about accessing spider settings from the job, which was possible through job.metadata.get('job_settings').

Do startts and endts limit results or adjust them?

Again primarily a question about jobs.iter. Unfortunately this is somewhat related to #138 and #136, so I'll repeat some comments.

Let's say ts is the timestamp for June 13 2019.

If I do jobs.iter(startts=ts, count=100) will this

  • Return 100 jobs where the most recent job is the first job chronologically before ts?
  • Return 100 jobs where the oldest job is the first job chronologically after ts?
  • Return 100 jobs starting from the most recent job, where no job is before ts?

Again, similar question about endts.

Method to iterate all jobs/items/whatever

The AWS boto client has a Paginator to help with iterating within API result limits, which, while clunky, is very nice to have since it's hard to get wrong.

A method to iterate/list all results, or else a Paginator that hides the pagination parameters (which are easy to get wrong), would be super helpful. The 1000-job limit is an issue in every project I've needed to use this in.
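For items, a rough helper along these lines could page through everything (a sketch built on assumptions: item keys have the form <jobkey>/<n> and the endpoint accepts such a key as start, neither of which I can confirm from the docs):

def iter_all_items(job, page_size=1000):
    offset = 0
    while True:
        batch = job.items.list(start='{}/{}'.format(job.key, offset), count=page_size)
        if not batch:
            break
        for item in batch:
            yield item
        offset += len(batch)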

Long intervals during resource iteration can lead to issues

Hello.

Recently there was issue #121, for which a batch-read workaround was implemented. I am now experiencing what I believe to be the same or a similar issue, but while using JSON instead of msgpack. Basically, when I do for item in job.items.iter(..., count=X, ...): and there are long intervals during iteration, the count can end up being ignored. I was able to reproduce it with the following snippet:

import time

from scrapinghub import ScrapinghubClient

sh_client = ScrapinghubClient(APIKEY, use_msgpack=False)
take = 10_000
job_id = '168012/276/1'
for i, item in enumerate(sh_client.get_job(job_id).items.iter(count=take, meta='_key')):
    print(f'\r{i} ({item["_key"]})', end='')

    if i == 3000:
        print('\nsleeping')
        time.sleep(60*3)

    if i > take:
        print('\nWTF')
        break

With the sleep part removed, the WTF section does not fire and the iterator stops on the 168012/276/1/9999th item.

This seems to be more of a Scrapy Cloud API platform problem, but I am reporting it here to track nonetheless.

For now I am assuming resource/collections iteration is not robust if any delays are possible client-side during retrieval (I haven't tested any other potential issues), and I will make a habit of either preloading everything at once (.list()) or using .list_iter() when it makes sense.

Add a check for jobs running locally or in the cloud

Currently, we may have settings for spiders running locally and other settings for spiders running in Scrapy Cloud (Dash).
Usually, I add a check like if 'SHUB_JOBKEY' not in os.environ:.
However, it may not be the best one, and if for some reason this env var is deprecated, I will need to update my checks.
It would be nice to have this kind of check, though I'm not sure whether it belongs in this library or somewhere else.
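Something as small as this would already help (a sketch based on the check above; it assumes Scrapy Cloud keeps setting SHUB_JOBKEY for platform jobs):

import os


def running_on_scrapy_cloud():
    # Scrapy Cloud sets SHUB_JOBKEY for jobs running on the platform.
    return 'SHUB_JOBKEY' in os.environ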

How to retrieve only items "_key". Is nodata supported in `items.iter()`?

I found an example with which I can retrieve only the specified meta from job items:
https://doc.scrapinghub.com/api/items.html?highlight=metadata#examples

This could be handy for parallelizing downloads and for fixing the issue with filters and collections (scrapinghub/arche#13): fetch only the keys (which I think should be fast), break them into batches, and then download the data. The problem is that we cannot properly divide items before download once a filter is applied, or for collections which have disordered or duplicated keys.

Naturally, getting just the keys should be much faster and would solve the issues above. I tested locally with curl for 13kk (13 million) keys:
CPU times: user 2.76 s, sys: 1.56 s, total: 4.33 s
Wall time: 9min 55s

That's not really fast, but anyway: is nodata supported?
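What I have in mind, assuming the client simply forwards extra parameters such as nodata to the HTTP API (which is exactly what I'm unsure about):

keys = [item['_key'] for item in job.items.iter(meta='_key', nodata=1)]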

jobs.run(spider, job_args = {arg1: val1}) can't have val1 as a list.

I have tried to pass a list as the val1 argument, and I was unable to run my spider for multiple links, which I pass as a value in job_args.

The way I got around that was to use repr to convert my list to a string, and then on the spider side evaluate the string using ast.literal_eval and run the spider.

Can somebody help me with a more "scrapy" solution?
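For reference, the workaround looks roughly like this (the argument name and URLs are illustrative):

import ast

# scheduling side: serialize the list into a single string argument
links = ['https://example.com/a', 'https://example.com/b']
project.jobs.run('myspider', job_args={'links': repr(links)})

# spider side (inside the spider, where the argument is an attribute):
# links = ast.literal_eval(self.links)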

iter behaving like list?

Based on the description in the docs, I'd expect the iter method on Items to instantly return a generator, then perform network requests as next is called on it, a bit like a MongoDB cursor. But instead it blocks whilst performing a lot of network activity and using substantial amounts of memory before returning the generator, which suggests it's loading the entire collection of items in the background, and is more like how I'd expect the list method to work. Is this correct? If this is how iter is supposed to work, this means it's almost impossible to work with even moderately large collections of scraped items.

No handlers could be found for logger "scrapinghub"

Is this the proper behavior for logging?

>>> import scrapinghub
>>> scrapinghub.__version__
'1.4.0'
>>> scrapinghub.logger
<logging.Logger object at 0x24540d0>
>>> scrapinghub.logger.info('adsf')
>>> scrapinghub.logger.error('adsf')
No handlers could be found for logger "scrapinghub"
>>> 
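The usual fix for libraries would be to attach a NullHandler, so applications only see output once they configure logging themselves (sketch):

import logging

logging.getLogger('scrapinghub').addHandler(logging.NullHandler())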

Include stats property to Job instance

More than once I have needed to get the stats of a job using the scrapinghub library, but it is not well documented that this information can be found using job.metadata.get("scrapystats") (the information is there, but it took me a while to figure out how to get the job stats).

As this seems (at least in my recent usage of the library) to be something that is used frequently, what do you think about including a new property in Job that returns the dictionary with the job stats (similar to items.Items.stats())?
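Something along these lines is what I have in mind (a sketch of the proposed addition to the existing Job class, assuming self.metadata is available as it is today; not existing library code):

class Job:

    @property
    def stats(self):
        # convenience wrapper over the metadata field mentioned above
        return self.metadata.get('scrapystats')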

Cannot run Spiders from Script

Hi,

I have a Scrapy project which works great. I am trying to migrate it to Scrapinghub.
I want to be able to launch spiders from a script (see code below), but it is not working (the spider's parse() function is never reached):

SCRIPT:

    def main():
        ...
        yield crawler.crawl(quotes_spider.QuotesSpider)
        crawler.start()

Is it possible to do it this way? If so, how? If not, how can I run a script which calls Spiders?

Thank you

Getting multiple errors in app console on executing scraper

Hi,

I have added a scraper in Scrapinghub. After executing the scraper I can see an entry for my job id in the completed jobs section. The error column in the completed jobs section shows 2 errors.
On checking the job log I got the error details, which are added below.

Error 1

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 185, in open_spider
    uri = self.urifmt % self._get_uri_params(spider)
KeyError: 'date'

Error 2

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 210, in item_scraped
    slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
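A guess at the cause, based on the first traceback (this is an assumption, not confirmed): the feed export URI template references a %(date)s placeholder, which Scrapy's feed export does not fill; it only provides %(time)s, %(name)s and spider attributes. For example:

# illustrative settings: %(date)s is not provided by Scrapy and raises KeyError: 'date'
FEED_URI = 's3://my-bucket/%(name)s/%(date)s.json'

# using the supported %(time)s placeholder avoids it
FEED_URI = 's3://my-bucket/%(name)s/%(time)s.json'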

Please let me know if there is any fix for this.

Thanks in advance.

'Items' object is not callable

I am using python-scrapinghub. Now I am facing two problems.

  1. I am getting an error:

Traceback (most recent call last):
  File "shop_info/test.py", line 58, in <module>
    for item in job.items():
TypeError: 'Items' object is not callable

  2. I want to pass a JSON object as an argument. I tried with this:

    project.jobs.run('shop_info', job_args={'input_data': input_data})

input_data is a JSON object.
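For reference, the new-style client exposes these as follows (a sketch; job argument values are strings, so the JSON object needs serializing):

import json

# 1. `items` is an object in the 2.x client; iterate via .iter() (or .list())
for item in job.items.iter():
    print(item)

# 2. serialize the JSON object to a string and parse it back on the spider side
project.jobs.run('shop_info', job_args={'input_data': json.dumps(input_data)})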

Difference between jobs.iter and jobs.iter_last

The documentation for jobs.iter says

by default :meth:Jobs.iter returns maximum last 1000 results.

This implies that it will return the most recent jobs (the first 1000, sorted in decreasing chronological order).

The documentation for jobs.iter_last says

Iterate through last jobs for each spider.

This implies that it will return the jobs with the highest chronological value (that is to say, the first n jobs sorted in decreasing chronological order).

I would guess that they're not both returning the latest jobs; however, with limited job retention I can't imagine a situation where you'd actually want the results in increasing chronological order.

Cannot access the Scrapinghub overwritten spider settings

Hello,

I would like to access the settings overwritten for a spider job through the Scrapinghub interface. Ideally, I would like those to be accessible as an attribute of scrapinghub.client.jobs.Job, because they can be different for different jobs of the same spider.

I saw that through the RESTful API you can specify the job settings when running jobs (https://doc.scrapinghub.com/api/jobs.html#run-json), but I didn't find job_settings as an attribute of any class.

Is there a way to access the job settings through python-scrapinghub?
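For reference, the per-job settings mentioned in the related issue above are reachable through job metadata (sketch; the job key is hypothetical):

job = client.get_job('123/1/56')
job_settings = job.metadata.get('job_settings')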

Return timestamp values as datetime objects

Several Scrapinghub API endpoints accept or return timestamps, currently as UNIX timestamps in milliseconds.

It would be great to have those values as datetime.datetime objects in the results so that consumers of python-scrapinghub calls do not have to convert them for interpretation.

Passing datetime.datetime objects in methods allowing filtering on timestamps, e.g. where startts and endts arguments are supported, would be very handy too.
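For reference, this is the conversion consumers currently have to do by hand (milliseconds since the UNIX epoch to an aware datetime and back):

from datetime import datetime, timezone

ts_ms = 1542857400000  # value as returned by the API today
dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
ts_back = int(dt.timestamp() * 1000)  # e.g. for startts/endts arguments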

Documentation conflicts regarding jobs.iter limits

In the documentation for jobs.iter it says

retrieve all jobs for a spider:
>>> spider.jobs.iter()

However, if there's an implicit limit of 1000 results then this will not return all jobs for a spider; it only does so when the spider has 1000 or fewer jobs. I don't believe there's a method in the API to get all jobs for a spider at this time.

project.jobs close_reason support needed

I would like to get the last "finished" job for a spider.

But if I do:

project.jobs(spider='myspider', state='finished', count=-1)

I will only get jobs with a state of finished but this may include jobs with a close_reason of shutdown or something other than "finished".

I would like to be able to do:

project.jobs(spider='myspider', close_reason='finished', count=-1)

which would of course assume that state is finished as well.

How to run a job?

I can't see how to run a job. There are two examples in the docs. In the project section:

For example, to schedule a spider run (it returns a job object):

>>> project.jobs.run('spider1', job_args={'arg1':'val1'})
<scrapinghub.client.Job at 0x106ee12e8>>

and in the spider section:

Like project instance, spider instance has jobs field to work with the spider's jobs.

To schedule a spider run:

>>> spider.jobs.run(job_args={'arg1:'val1'})
<scrapinghub.client.Job at 0x106ee12e8>>

Neither works; both throw AttributeError: 'Jobs' object has no attribute 'run'.

Retries are not done for collections/items/requests/logs iterators.

Good day. We have a project that makes extensive use of the API, and at certain times "429 Too many requests for user" errors occur and are raised with no indication of any retries, exponential or otherwise, taking place. In fact, it seems no HTTP errors are retried for resource iterators whatsoever:
https://github.com/scrapinghub/python-scrapinghub/blob/master/scrapinghub/hubstorage/resourcetype.py#L134-L143
Although I see other places do handle such:
https://github.com/scrapinghub/python-scrapinghub/blob/master/scrapinghub/hubstorage/client.py#L20-L37
Is this by design?

UPD:
So at this point: https://github.com/scrapinghub/python-scrapinghub/blob/master/scrapinghub/hubstorage/client.py#L113-L116
the iterator request is executed as-is, without retrier wrapping, as it is deemed not idempotent. At a later point, here: https://github.com/scrapinghub/python-scrapinghub/blob/master/scrapinghub/hubstorage/resourcetype.py#L134-L143, it is not handled at all (scrapinghub.client.exceptions.ScrapinghubAPIError('Too many requests for user') in our case).

collections.get_store is not working as documented

Upon going through the collections doc, we see (step 2): call .get_store(<somename>) to create or access the named collection you want (the collection will be created automatically if it doesn't exist); you get a "store" object back.
But when you try this:

>>> store = collections.get_store('store_which_does_not_exist')
>>> store.get('key_which_does_not_exist')
DEBUG:https://storage.scrapinghub.com:443 "GET /collections/462630/s/store_which_does_not_exist/key_which_does_not_exist HTTP/1.1" 404 46
2021-02-04 13:33:20 [urllib3.connectionpool] DEBUG: https://storage.scrapinghub.com:443 "GET /collections/462630/s/store_which_does_not_exist/key_which_does_not_exist HTTP/1.1" 404 46
DEBUG:<Response [404]>: b'unknown collection store_which_does_not_exist\n'
2021-02-04 13:33:20 [HubstorageClient] DEBUG: <Response [404]>: b'unknown collection store_which_does_not_exist\n'
*** scrapinghub.client.exceptions.NotFound: unknown collection store_which_does_not_exist

When we .set some value on a store which doesn't exist, the store is created and then the values are stored.

>>> store.set({'_key': 'some_key', 'value': 'some_value'})
DEBUG:https://storage.scrapinghub.com:443 "POST /collections/462630/s/store_which_does_not_exist HTTP/1.1" 200 0
2021-02-04 13:36:56 [urllib3.connectionpool] DEBUG: https://storage.scrapinghub.com:443 "POST /collections/462630/s/store_which_does_not_exist HTTP/1.1" 200 0
According to the docs, shouldn't the store be created when we call .get_store()?

Serialization error when item contains timezone-aware datetime object

When using dateparser>=0.6, the parser produces timezone-aware datetime objects if there's a timezone in the input string.

Example spider:

# -*- coding: utf-8 -*-
from datetime import datetime

import scrapy
import dateparser



class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        return {'url': response.url,
                'ts': dateparser.parse(
                        datetime.utcnow().isoformat()+'Z')}

Error seen on Scrapy Cloud:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python3.6/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/local/lib/python3.6/site-packages/sh_scrapy/extension.py", line 47, in item_scraped
    self._write_item(item)
  File "/usr/local/lib/python3.6/site-packages/sh_scrapy/writer.py", line 78, in write_item
    self._write('ITM', item)
  File "/usr/local/lib/python3.6/site-packages/sh_scrapy/writer.py", line 46, in _write
    default=jsondefault
  File "/usr/local/lib/python3.6/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/local/lib/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/lib/python3.6/site-packages/scrapinghub/hubstorage/serialization.py", line 43, in jsondefault
    delta = o - EPOCH
TypeError: can't subtract offset-naive and offset-aware datetimes
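A possible spider-side workaround (sketch): normalize parsed datetimes to naive UTC before yielding the item, so the serializer's naive-epoch subtraction keeps working.

from datetime import datetime, timezone

import dateparser

ts = dateparser.parse(datetime.utcnow().isoformat() + 'Z')
naive_utc_ts = ts.astimezone(timezone.utc).replace(tzinfo=None)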

Use SHUB_JOBAUTH environment variable in utils.parse_auth method

Currently, the parse_auth method tries to get the API Key from the SH_APIKEY environment variable, which needs to be manually set either in the spider's code or in the [docker] image's code. A common practice is to create dummy users and associate them with the project so that real contributors don't have to share their API Keys.

Another option is to use the credentials provided by the SHUB_JOBAUTH, defined during runtime when executing jobs in the Scrapy Cloud platform.

Although it's possible to use it with Collections and Frontera, this is not a regular Dash API Key but a JWT token generated at runtime by the JobQ service, which only works for a part of our API endpoints (JobQ/Hubstorage).

I'd like to contribute with a Pull Request adding support for this ephemeral API Key.
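The behaviour I have in mind, roughly (a sketch of the proposal, not the library's implementation):

import os


def resolve_auth():
    apikey = os.environ.get('SH_APIKEY')
    if apikey:
        return apikey
    # JWT-style token that Scrapy Cloud sets at runtime; as noted above,
    # it only works for the JobQ/Hubstorage endpoints.
    return os.environ.get('SHUB_JOBAUTH')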

msgpack errors when using iter() with intervals between each batch call

Good Day!

I've encountered this peculiar issue when trying to save memory by processing the items in chunks. Here's a stripped-down version of the code to reproduce the issue:

import pandas as pd

from scrapinghub import ScrapinghubClient

def read_job_items_by_chunk(jobkey, chunk=10000):
    """In order to prevent OOM issues, the jobs' data must be read in
    chunks.

    This will return a generator of pandas DataFrames.
    """

    client = ScrapinghubClient("APIKEY123")

    item_generator = client.get_job(jobkey).items.iter()

    while item_generator:
        yield pd.DataFrame(
            [next(item_generator) for _ in range(chunk)]
        )

for df_chunk in read_job_items_by_chunk('123/123/123'):
    # having a small chunk-size like 10000 won't have any problems
    pass

for df_chunk in read_job_items_by_chunk('123/123/123', chunk=25000):
    # having a big chunk-size like 25000 will throw out errors like the one below
    pass

Here's the common error it throws:

<omitted stack trace above>

    [next(item_generator) for _ in range(chunk)]
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
    _path, requests_params, **apiparams
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
    for obj in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
  File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 67: invalid start byte

Moreover, it throws out a different error when using a much bigger chunk-size, like 50000:

<omitted stack trace above>

    [next(item_generator) for _ in range(chunk)]
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
    _path, requests_params, **apiparams
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
    for obj in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
TypeError: unhashable type: 'dict'

I find that the workaround/solution for this would be to have a lower value for chunk. So far, 1000 works great.

This uses scrapy:1.5 stack in Scrapy Cloud.

I'm guessing this might have something to do with the long waiting time that happens when processing the pandas DataFrame chunk, and when the next batch of items are being iterated, the server might have deallocated the pointer to it or something.

May I ask if there might be a solution for this, since a much bigger chunk size would help with the speed of our jobs?
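One direction I may try is the batch-read approach from the earlier issue (a sketch; it assumes list_iter() accepts a chunksize argument):

def read_job_items_by_chunk_v2(jobkey, chunk=10000):
    client = ScrapinghubClient("APIKEY123")
    # each iteration fetches a whole chunk in one request, so there is no
    # long-lived streaming response to time out between chunks
    for batch in client.get_job(jobkey).items.list_iter(chunksize=chunk):
        yield pd.DataFrame(batch)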

I've marked it as bug for now as this is quite an unexpected/undocumented behavior.

Cheers!

Improve the documentation about the possible parameters of items.iter()

I’m trying to understand the possible parameters of Job.items.iter() but it’s not that clear to me:

  • Why is count documented as a parameter on its own? (I assume the rest of the pagination parameters are assumed to be part of apiparameters)
  • What is requests_params for?
  • I see other parameters mentioned in the documentation that don’t seem part of the Items API, pagination or meta. For example, filter (in the 4th example of the Items documentation, with list instead of iter but I assume they accept the same arguments)

Update tests deps to latest versions and unpin deps

#110 shows that there are some old dependencies, so the idea is:

  1. Update all test dependencies (pytest and others) to the latest versions
  2. Unpin all the versions (since libraries that are not updated are prone to security vulnerabilities)

Unclear interaction between iter skip values and start/end times

For jobs.iter but could apply elsewhere.

If start=10 and startts=100000 will it

  • skip 10 results after time 100000 or will it
  • return results that are at least 10 from the start and after time 100000
  • return results that are at least 10 from the most recent and no older than time 100000 (re: ambiguity in #136 )?

The same question about endts and start.

Obtaining error messages

Based upon the documentation I have read, I don't see anything for obtaining error messages when there are errors from executing a spider. Did I overlook something?
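For reference, the closest thing I can see today is filtering the job log client-side (a sketch; it assumes log entries expose numeric 'level' and 'message' fields):

import logging

job = client.get_job('123/1/2')  # hypothetical job key
errors = [
    entry for entry in job.logs.iter()
    if entry.get('level', 0) >= logging.ERROR
]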

Omitting `_key` hangs `_BatchWriter`

Failing

# reprex.py
import os

import scrapinghub

store = (
    scrapinghub.ScrapinghubClient(os.getenv('SH_APIKEY'))
    .get_project(1234567890)
    .collections.get_store("ok_to_mess_around_with")
)

writer = store.create_writer()

writer.write({"foo": "bar"})
print("write")

writer.write({"fizz": "buzz"})
print("write")

writer.flush()
print("flush")

$ python reprex.py

write
write
^CTraceback (most recent call last):
...
KeyboardInterrupt

Passing

import os

import scrapinghub

store = (
    scrapinghub.ScrapinghubClient(os.getenv('SH_APIKEY'))
    .get_project(1234567890)
    .collections.get_store("ok_to_mess_around_with")
)

writer = store.create_writer()

writer.write({"_key": "foo", "foo": "bar"})
print("write")

writer.write({"_key": "fizz", "fizz": "buzz"})
print("write")

writer.flush()
print("flush")

$ python reprex.py

write
write
flush

iter() not handling meta=None gracefully

I ran into a bit of trouble with the small code snippet below:

def get_spider_items(jobkey, meta=None):
    CLIENT.get_job(jobkey).items.iter(meta=meta)  # CLIENT was instantiated before

get_spider_items("1/2/3")

This results in something like:

  File "/app/python/lib/python3.6/site-packages/scrapinghub/client/proxy.py", line 113, in iter
    drop_key = '_key' not in apiparams.get('meta', [])
TypeError: argument of type 'NoneType' is not iterable

A workaround would be something like the one below, although it would be much better if the package handled this:

meta = derive_something_with_meta() or []
get_spider_items("1/2/3", meta=meta)

class for job's key

I have written a JobKey class that represents the key to a job on the ScrapingHub service, in project_id/spider_id/job_number format. It also offers methods to quickly get the ID of the spider/project, a tuple representation, etc.

You can find its code here.

Maybe it can be somehow modified/improved to get included into the python-scrapinghub library?
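A minimal sketch of the idea (not the author's implementation), e.g. JobKey.parse('123/45/6').spider_id == 45:

from collections import namedtuple


class JobKey(namedtuple('JobKey', ['project_id', 'spider_id', 'job_number'])):

    @classmethod
    def parse(cls, key):
        project_id, spider_id, job_number = (int(part) for part in key.split('/'))
        return cls(project_id, spider_id, job_number)

    def __str__(self):
        return '{}/{}/{}'.format(self.project_id, self.spider_id, self.job_number)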
