scrapinghub / scrapylib
Collection of Scrapy utilities (extensions, middlewares, pipelines, etc)
It would be great if this package were available through PyPI.
Write a simple spider that generates a large number of Requests and stores them to HCF, e.g. by reading some site's sitemap and generating several million or more requests from it. The HcfMiddleware will then consume a really large amount of memory.
The current version of HcfMiddleware's source code can be accessed here. The relevant code is at lines 159 and 167:
159: if not request.url in self.new_links[slot]:
...
167: self.new_links[slot].add(request.url)
From this we can see that the middleware maintains its own duplicate filter, which increases the memory overhead. Since the middleware no longer relies on new_links to upload URLs (it uses the batch uploader instead), the new_links attribute serves only two remaining purposes.
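For scale, a sketch of a lower-overhead duplicate filter (an assumption about a possible fix, not the actual middleware code): since the set is only ever used for membership tests, fixed-size digests can be stored instead of full URL strings.

```python
import hashlib

class DigestSet:
    """Duplicate filter that stores 8-byte digests instead of full URLs.

    For sitemap-scale crawls this saves roughly the length of each URL
    (often 50-100+ bytes) per entry, at a negligible collision risk.
    """

    def __init__(self):
        self._seen = set()

    @staticmethod
    def _key(url):
        # sha1 truncated to 8 bytes is plenty for a per-slot dedup set
        return hashlib.sha1(url.encode("utf-8")).digest()[:8]

    def add(self, url):
        self._seen.add(self._key(url))

    def __contains__(self, url):
        return self._key(url) in self._seen
```

A Bloom filter would cut memory further still, at the cost of occasional false positives (some URLs wrongly treated as duplicates).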
I think we should move the Crawlera middleware to its own repository & Python package. Thoughts?
MAGIC_FIELDS is not replacing the item attribute in the yielded Item object
It only adds the new attribute if it is not present, but does not replace the value if that attribute is already set in the yielded item.
I am not sure if this is the desired behavior, but I think it should replace the value when that attribute is already present in the item object, and otherwise add it.
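The difference between the two behaviors, sketched on a plain dict (the field name and values are hypothetical, chosen only for illustration):

```python
MAGIC_VALUE = "http://example.com/current"  # hypothetical magic-field value

item = {"source_url": "http://example.com/stale"}  # attribute already set

# Current behavior: add the attribute only if absent; an existing value wins.
item.setdefault("source_url", MAGIC_VALUE)
current_behavior = item["source_url"]   # "http://example.com/stale"

# Proposed behavior: always overwrite with the magic value.
item["source_url"] = MAGIC_VALUE
proposed_behavior = item["source_url"]  # "http://example.com/current"
```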
Can we release da0f35c as 1.5.1?
It looks like default_input_processor generates strange results if escaped markup such as &gt; appears in an attribute.
Rearranging the composition like this solves the problem:
default_input_processor = MapCompose(
replace_br, remove_tags, unquote_markup, replace_escape, strip, clean_spaces
)
Example html:
<span id="discount_box" data-option-text="Discount <span class="percent" id="buy_percent"></span><span class="percent">%</span>" data-default-text="up to <span class="percent" id="buy_percent">54</span><span class="percent">%</span>" class="discount">up to <span class="percent" id="buy_percent">54</span><span class="percent">%</span></span>
Current result:
[u'%" data-default-text="up to 54%" class="discount">up to 54%']
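The order sensitivity can be shown with simplified stand-ins for the processors (a naive regex tag stripper and stdlib entity unescaping; these are not the actual scrapylib implementations):

```python
import html
import re

def remove_tags(text):
    # naive tag stripper, similar in spirit to the real processor
    return re.sub(r"<[^>]+>", "", text)

text = "price &lt; 100 <b>sale</b>"

# unquote_markup before remove_tags: the unescaped "<" is treated as the
# start of a tag and swallows real text up to the next ">"
bad = remove_tags(html.unescape(text))      # "price sale"

# remove_tags before unquote_markup: only real tags are stripped,
# the escaped "<" survives as a literal character
good = html.unescape(remove_tags(text))     # "price < 100 sale"
```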
Assuming we have an item defined like this:
class TestItem(scrapy.item.Item):
    test_field = scrapy.item.Field()
    constraints = [scrapylib.constraints.RequiredFields('test_field')]
And then assign the value False to the field inside our spider code:
item = TestItem()
item['test_field'] = False
yield item
Make sure scrapylib.constraints.pipeline.ConstraintsPipeline is enabled, and the item will be dropped with the following message:
Dropped: missing field: test_field
Starting from line 35 of scrapylib/constraints/__init__.py:
class RequiredFields(object):
    """Assert that the specified fields are populated and non-empty"""

    def __init__(self, *fields):
        self.fields = fields

    def __call__(self, item):
        for f in self.fields:
            v = item.get(f)
            assert v, "missing field: %s" % f
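Because assert v uses plain truthiness, legitimate falsy values such as False or 0 trip the check above. A sketch of splitting the two concerns (the NonEmptyFields class is hypothetical, not current scrapylib source):

```python
class RequiredFields(object):
    """Assert only that the field (the key) exists, whatever its value."""

    def __init__(self, *fields):
        self.fields = fields

    def __call__(self, item):
        for f in self.fields:
            assert f in item, "missing field: %s" % f


class NonEmptyFields(object):
    """Assert the values are non-empty, which implies the keys exist.

    A len()-based check rejects '' and [] but still accepts False and 0,
    which have no length and are simply required to be non-None.
    """

    def __init__(self, *fields):
        self.fields = fields

    def __call__(self, item):
        for f in self.fields:
            v = item.get(f)
            assert v is not None and (not hasattr(v, "__len__") or len(v) > 0), \
                "empty field: %s" % f
```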
RequiredFields should care only about the existence of the field (the key), not its value. We could add a NonEmptyFields constraint to ensure the non-emptiness of the values, which also implicitly ensures the existence of the field. assert v is bad; assert v or v is False looks better, but is still not good enough. I guess something based on len(v) would do.
Currently a call to hubstorage.frontier.Frontier.read
returns a generator which contains 100 batches, each of which contains 100 links. That makes 100×100 = 10000 links.
When HS_MAX_LINKS does not exceed 10000, things look good; but when it is larger than 10000, e.g. HS_MAX_LINKS = 20000, only 10000 links are generated by the HcfMiddleware.
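One way past the cap is to keep calling read() and deleting the consumed batches until enough links have been collected. A sketch against an in-memory stand-in for the frontier (the read/delete shapes are assumptions based on the numbers in this issue, not the real hubstorage API):

```python
class FakeFrontier:
    """In-memory stand-in: read() returns at most 100 batches of at most
    100 links each; delete() drops consumed batches so the next read()
    yields fresh ones."""

    def __init__(self, links, batch_size=100, max_batches=100):
        self.batches = {
            i: links[i:i + batch_size]
            for i in range(0, len(links), batch_size)
        }
        self.max_batches = max_batches

    def read(self):
        ids = sorted(self.batches)[:self.max_batches]
        return [{"id": i, "requests": self.batches[i]} for i in ids]

    def delete(self, batch_ids):
        for i in batch_ids:
            del self.batches[i]


def iter_links(frontier, max_links):
    """Drain up to max_links, spanning as many read() calls as needed."""
    count = 0
    while count < max_links:
        batches = frontier.read()
        if not batches:
            break  # frontier exhausted
        for batch in batches:
            for link in batch["requests"]:
                yield link
                count += 1
                if count >= max_links:
                    return
        # only delete fully consumed batches, then read the next window
        frontier.delete([b["id"] for b in batches])
```

With HS_MAX_LINKS = 20000 and more than 20000 queued links, this loop yields the full 20000 instead of stopping at the 10000 a single read() returns.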
I just upgraded to Scrapy 1.0.0rc2 and it appears that Crawlera's most recent version still uses scrapy.log as opposed to Python's logging module.
This should be fixed in the next version of Crawlera.
See #72 (comment)
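The usual replacement for scrapy.log is a module-level logger from the stdlib, e.g. (the function below is only an illustration, not Crawlera's actual code):

```python
import logging

# one logger per module, named after the module itself
logger = logging.getLogger(__name__)

def process_request_sketch(request):
    # stdlib logging instead of the deprecated scrapy.log.msg()
    logger.debug("Processing %s through Crawlera", request)
```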
UPDATE 2: I believe this IS a bug, although I'm not sure if it's in Scrapy or DeltaFetch.
I have resolved the issue, or at least implemented a workaround; IMHO this is a bug.
My spider looks like this:
class FunStuffSpider(CrawlSpider):
    name = "test_fun"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/fun/stuff/2012/",
        "http://www.example.com/fun/stuff/2013/",
        "http://www.example.com/fun/stuff/2014/",
    ]
    pipelines = ['fun_pipe']
    rules = (
        Rule(SgmlLinkExtractor(allow=("fun.-.*\.xml",),
                               restrict_xpaths=('//table/tr[*]/td[2]/a',)),
             callback="parse_funstuff", follow=True),
    )

    def parse_funstuff(self, response):
        filename = response.url.split("/")[-1]
        open(filename, 'wb').write(response.body)
With the above spider, scrapy/deltafetch did not think I had scraped anything, even though it was happily downloading the target XML files.
I added the following to the spider:
xxs = XmlXPathSelector(response)
xxs.remove_namespaces()
sites = xxs.select("//funstuff/morefun")
items = []
for site in sites:
    item = FunItem()
    item['stuff'] = site.select("//funstuff/FunStuff/funny/ID").extract()
    items.append(item)
return items
Now DeltaFetch works as expected: it logs both the "GET" and a "Scraped from" line, populates the output XML defined by the "FunItem" in pipelines.py, updates the db file, and on subsequent crawls correctly logs that the URL for each XML file has already been visited and ignores it.
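This matches how DeltaFetch decides what to skip: a request's key is recorded only when its response yields an item, so a callback that only writes files to disk never marks anything as seen. A simplified sketch with stand-in types (not the actual scrapylib source):

```python
class Request:
    def __init__(self, url):
        self.url = url

class Item(dict):
    pass

class DeltaFetchSketch:
    def __init__(self):
        self.db = {}  # stand-in for the Berkeley DB file

    def process_spider_output(self, request, results):
        """Yield a callback's results, dropping already-seen requests and
        marking the current request seen only when an item comes back."""
        for r in results:
            if isinstance(r, Request):
                if r.url in self.db:
                    continue  # skip: fetched on a previous run
            elif isinstance(r, Item):
                self.db[request.url] = True  # item produced: remember URL
            yield r
```

With the original parse_funstuff, which returns nothing, the db stays empty and every URL is recrawled; once the callback returns items, the URL is recorded and skipped on the next run.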
original post below:
I apologize in advance if this is a configuration, user, or PEBKAC issue and not a bug, or if this is not the appropriate forum; please disregard if so. I've not been able to find relevant documentation, nor have other examples of scrapy+deltafetch addressed this. I also have not been able to find any helpful references to dotscrapy, so I am unsure whether it is configured properly. I believe this is a bug since I believe I have it configured correctly, but without reference documentation or other user examples I can't be sure. If this is not a bug, please disregard; however, if anyone could point me to a working configuration or additional troubleshooting steps, it would be most appreciated.
Issue:
With deltafetch enabled, scrapy is still crawling previously crawled urls.
System is RHEL 6.5
[root@hostname ~]# python -V
Python 2.6.6
I've installed deltafetch via pip:
[root@hostname ~]# pip search scrapy
Scrapy - A high-level Python Screen Scraping framework
INSTALLED: 0.18.4
LATEST: 0.22.2
[root@hostname ~]# pip search scrapylib
scrapylib - Scrapy helper functions and processors
INSTALLED: 1.1.3 (latest)
/usr/lib/python2.6/site-packages/scrapylib/deltafetch.py
I've configured my settings.py thus:
SPIDER_MIDDLEWARES = {
    'scrapylib.deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
DOTSCRAPY_ENABLED = True
When I run the spider, DeltaFetch appears to be enabled:
2014-06-20 10:58:00-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware,
DeltaFetch, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
The .scrapy directory is created:
[user@hostname output]$ ls -al ../.scrapy
total 12
drwxrwxr-x. 3 user user 4096 Jun 20 10:58 .
drwxrwxr-x. 6 user user 4096 Jun 20 10:58 ..
drwxrwxr-x. 2 user user 4096 Jun 20 10:58 deltafetch
The db file is being created:
[user@hostname output]$ ls -al ../.scrapy/deltafetch/
total 16
drwxrwxr-x. 2 user user 4096 Jun 20 10:58 .
drwxrwxr-x. 3 user user 4096 Jun 20 10:58 ..
-rw-rw-r--. 1 user user 12288 Jun 20 10:58 spider.db
[user@hostname deltafetch]$ file spider.db
spider.db: Berkeley DB (Hash, version 9, native byte-order)
[user@hostname deltafetch]$
However the .db file appears to have no state data:
[user@hostname deltafetch]$ python
Python 2.6.6 (r266:84292, Nov 21 2013, 10:50:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bsddb
>>> for k, v in bsddb.hashopen("spider.db").iteritems(): print k, v
...
>>>
[user@hostname deltafetch]$ db_dump spider.db
VERSION=3
format=bytevalue
type=hash
db_pagesize=4096
HEADER=END
DATA=END
When I run the spider again, all the same urls are crawled/fetched, and even though there were new items in the crawl, the state db did not appear to get updated, e.g. this is a previously fetched file:
2014-06-20 11:13:56-0400 [spider] DEBUG: Crawled (200)
<GET http://www.example.com/xxx/xxx/xxx/xxx/xxx.xml>
(referer: None)
I can see not only from the log that the files are still being crawled, but the .xml files that I create from the crawl are getting created again.
I've looked at the other related deltafetch questions and they did not address this issue; any assistance is appreciated.