scrapinghub / scrapylib
Collection of Scrapy utilities (extensions, middlewares, pipelines, etc)
It would be great if this package were available through PyPI.
Write a simple spider that generates a large number of Requests and stores them to HCF, e.g. by reading some site's sitemap and generating several million or more requests from it. The HcfMiddleware will then consume a really large amount of memory.
The current version of HcfMiddleware's source code can be accessed here. The relevant code is at lines 159 and 167:
159: if not request.url in self.new_links[slot]:
...
167: self.new_links[slot].add(request.url)
From this we can see that the middleware maintains its own duplicate filter, which increases the memory overhead. Since the middleware no longer relies on new_links to upload URLs (it uses the batch uploader instead), the new_links attribute serves only two remaining purposes.
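For scale, a sketch of a lower-overhead duplicate filter (an assumption about a possible fix, not the actual middleware code): since the set is only ever used for membership tests, fixed-size digests can be stored instead of full URL strings.

```python
import hashlib

class DigestSet:
    """Duplicate filter that stores 8-byte digests instead of full URLs.

    For sitemap-scale crawls this saves roughly the length of each URL
    (often 50-100+ bytes) per entry, at a negligible collision risk.
    """

    def __init__(self):
        self._seen = set()

    @staticmethod
    def _key(url):
        # sha1 truncated to 8 bytes is plenty for a per-slot dedup set
        return hashlib.sha1(url.encode("utf-8")).digest()[:8]

    def add(self, url):
        self._seen.add(self._key(url))

    def __contains__(self, url):
        return self._key(url) in self._seen
```

A Bloom filter would cut memory further still, at the cost of occasional false positives (some URLs wrongly treated as duplicates).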
I think we should move the Crawlera middleware to its own repository & Python package. Thoughts?
MAGIC_FIELDS is not replacing the item attribute in the yielded Item object
It only adds the new attribute if it is not present, but does not replace the value if that attribute is already set in the yielded item.
I am not sure if this is the desired behavior, but I think it should replace the value when that attribute is already present in the item object, and otherwise add it.
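The difference between the two behaviors, sketched on a plain dict (the field name and values are hypothetical, chosen only for illustration):

```python
MAGIC_VALUE = "http://example.com/current"  # hypothetical magic-field value

item = {"source_url": "http://example.com/stale"}  # attribute already set

# Current behavior: add the attribute only if absent; an existing value wins.
item.setdefault("source_url", MAGIC_VALUE)
current_behavior = item["source_url"]   # "http://example.com/stale"

# Proposed behavior: always overwrite with the magic value.
item["source_url"] = MAGIC_VALUE
proposed_behavior = item["source_url"]  # "http://example.com/current"
```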
Can we release da0f35c as 1.5.1?
It looks like default_input_processor generates strange results if escaped markup such as &gt; appears in an attribute.
Rearranging the composition like this solves the problem:
default_input_processor = MapCompose(
replace_br, remove_tags, unquote_markup, replace_escape, strip, clean_spaces
)
Example html:
<span id="discount_box" data-option-text="Discount <span class="percent" id="buy_percent"></span><span class="percent">%</span>" data-default-text="up to <span class="percent" id="buy_percent">54</span><span class="percent">%</span>" class="discount">up to <span class="percent" id="buy_percent">54</span><span class="percent">%</span></span>
Current result:
[u'%" data-default-text="up to 54%" class="discount">up to 54%']
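The order sensitivity can be shown with simplified stand-ins for the processors (a naive regex tag stripper and stdlib entity unescaping; these are not the actual scrapylib implementations):

```python
import html
import re

def remove_tags(text):
    # naive tag stripper, similar in spirit to the real processor
    return re.sub(r"<[^>]+>", "", text)

text = "price &lt; 100 <b>sale</b>"

# unquote_markup before remove_tags: the unescaped "<" is treated as the
# start of a tag and swallows real text up to the next ">"
bad = remove_tags(html.unescape(text))      # "price sale"

# remove_tags before unquote_markup: only real tags are stripped,
# the escaped "<" survives as a literal character
good = html.unescape(remove_tags(text))     # "price < 100 sale"
```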
Assuming we have an item defined like this:
class TestItem(scrapy.item.Item):
    test_field = scrapy.item.Field()
    constraints = [scrapylib.constraints.RequiredFields('test_field')]
And then assign the value False to the field inside our spider code:
item = TestItem()
item['test_field'] = False
yield item
Make sure scrapylib.constraints.pipeline.ConstraintsPipeline is enabled, and the item will be dropped with the following message:
Dropped: missing field: test_field
Starting from line 35 of scrapylib/constraints/__init__.py:
class RequiredFields(object):
    """Assert that the specified fields are populated and non-empty"""

    def __init__(self, *fields):
        self.fields = fields

    def __call__(self, item):
        for f in self.fields:
            v = item.get(f)
            assert v, "missing field: %s" % f
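Because assert v uses plain truthiness, legitimate falsy values such as False or 0 trip the check above. A sketch of splitting the two concerns (the NonEmptyFields class is hypothetical, not current scrapylib source):

```python
class RequiredFields(object):
    """Assert only that the field (the key) exists, whatever its value."""

    def __init__(self, *fields):
        self.fields = fields

    def __call__(self, item):
        for f in self.fields:
            assert f in item, "missing field: %s" % f


class NonEmptyFields(object):
    """Assert the values are non-empty, which implies the keys exist.

    A len()-based check rejects '' and [] but still accepts False and 0,
    which have no length and are simply required to be non-None.
    """

    def __init__(self, *fields):
        self.fields = fields

    def __call__(self, item):
        for f in self.fields:
            v = item.get(f)
            assert v is not None and (not hasattr(v, "__len__") or len(v) > 0), \
                "empty field: %s" % f
```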
RequiredFields should care only about the existence of the field (the key), not its value. We could add a NonEmptyFields constraint to ensure the non-emptiness of the values, which also implicitly ensures the existence of the field. assert v is bad; assert v or v is False looks better, but is still not good enough. I guess something based on len(v) would do.
Currently a call to hubstorage.frontier.Frontier.read
returns a generator which contains 100 batches, each of which contains 100 links. That makes 100×100 = 10000 links.
When HS_MAX_LINKS does not exceed 10000, things look good; but when it is larger than 10000, e.g. HS_MAX_LINKS = 20000, only 10000 links are generated by the HcfMiddleware.
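One way past the cap is to keep calling read() and deleting the consumed batches until enough links have been collected. A sketch against an in-memory stand-in for the frontier (the read/delete shapes are assumptions based on the numbers in this issue, not the real hubstorage API):

```python
class FakeFrontier:
    """In-memory stand-in: read() returns at most 100 batches of at most
    100 links each; delete() drops consumed batches so the next read()
    yields fresh ones."""

    def __init__(self, links, batch_size=100, max_batches=100):
        self.batches = {
            i: links[i:i + batch_size]
            for i in range(0, len(links), batch_size)
        }
        self.max_batches = max_batches

    def read(self):
        ids = sorted(self.batches)[:self.max_batches]
        return [{"id": i, "requests": self.batches[i]} for i in ids]

    def delete(self, batch_ids):
        for i in batch_ids:
            del self.batches[i]


def iter_links(frontier, max_links):
    """Drain up to max_links, spanning as many read() calls as needed."""
    count = 0
    while count < max_links:
        batches = frontier.read()
        if not batches:
            break  # frontier exhausted
        for batch in batches:
            for link in batch["requests"]:
                yield link
                count += 1
                if count >= max_links:
                    return
        # only delete fully consumed batches, then read the next window
        frontier.delete([b["id"] for b in batches])
```

With HS_MAX_LINKS = 20000 and more than 20000 queued links, this loop yields the full 20000 instead of stopping at the 10000 a single read() returns.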
I just upgraded to Scrapy 1.0.0rc2 and it appears that Crawlera's most recent version still uses scrapy.log as opposed to Python's logging module.
This should be fixed in the next version of Crawlera.
See #72 (comment)
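The usual replacement for scrapy.log is a module-level logger from the stdlib, e.g. (the function below is only an illustration, not Crawlera's actual code):

```python
import logging

# one logger per module, named after the module itself
logger = logging.getLogger(__name__)

def process_request_sketch(request):
    # stdlib logging instead of the deprecated scrapy.log.msg()
    logger.debug("Processing %s through Crawlera", request)
```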
UPDATE 2: I believe this IS a bug, although I'm not sure if it's in Scrapy or DeltaFetch.
I have resolved the issue, or at least implemented a workaround; IMHO this is a bug.
My spider looks like this:
class FunStuffSpider(CrawlSpider):
    name = "test_fun"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/fun/stuff/2012/",
        "http://www.example.com/fun/stuff/2013/",
        "http://www.example.com/fun/stuff/2014/",
    ]
    pipelines = ['fun_pipe']
    rules = (
        Rule(SgmlLinkExtractor(allow=("fun.-.*\.xml",),
                               restrict_xpaths=('//table/tr[*]/td[2]/a',)),
             callback="parse_funstuff", follow=True),
    )

    def parse_funstuff(self, response):
        filename = response.url.split("/")[-1]
        open(filename, 'wb').write(response.body)
With the above spider, scrapy/deltafetch did not think I had scraped anything, even though it was happily downloading the target XML files.
I added the following to the spider:
xxs = XmlXPathSelector(response)
xxs.remove_namespaces()
sites = xxs.select("//funstuff/morefun")
items = []
for site in sites:
    item = FunItem()
    item['stuff'] = site.select("//funstuff/FunStuff/funny/ID").extract()
    items.append(item)
return items
Now DeltaFetch works as expected: it logs both the "GET" and a "Scraped from" line, populates the output XML defined by the "FunItem" in pipelines.py, updates the db file, and on subsequent crawls correctly logs that the URL for each XML file has already been visited and ignores it.
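This matches how DeltaFetch decides what to skip: a request's key is recorded only when its response yields an item, so a callback that only writes files to disk never marks anything as seen. A simplified sketch with stand-in types (not the actual scrapylib source):

```python
class Request:
    def __init__(self, url):
        self.url = url

class Item(dict):
    pass

class DeltaFetchSketch:
    def __init__(self):
        self.db = {}  # stand-in for the Berkeley DB file

    def process_spider_output(self, request, results):
        """Yield a callback's results, dropping already-seen requests and
        marking the current request seen only when an item comes back."""
        for r in results:
            if isinstance(r, Request):
                if r.url in self.db:
                    continue  # skip: fetched on a previous run
            elif isinstance(r, Item):
                self.db[request.url] = True  # item produced: remember URL
            yield r
```

With the original parse_funstuff, which returns nothing, the db stays empty and every URL is recrawled; once the callback returns items, the URL is recorded and skipped on the next run.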
original post below:
I apologize in advance if this is a configuration, user, or PEBKAC issue and not a bug, or if this is not the appropriate forum; please disregard if so. I've not been able to find relevant documentation, nor have other examples of scrapy+deltafetch addressed this. I also have not been able to find any helpful references to dotscrapy, so I am unsure whether it is configured properly. I believe this is a bug since I believe I have it configured correctly, but without reference documentation or other user examples I can't be sure. If this is not a bug, please disregard; however, if anyone could point me to a working configuration or additional troubleshooting steps, it would be most appreciated.
Issue:
With deltafetch enabled, scrapy is still crawling previously crawled urls.
System is RHEL 6.5
[root@hostname ~]# python -V
Python 2.6.6
I've installed deltafetch via pip:
[root@hostname ~]# pip search scrapy
Scrapy - A high-level Python Screen Scraping framework
INSTALLED: 0.18.4
LATEST: 0.22.2
[root@hostname ~]# pip search scrapylib
scrapylib - Scrapy helper functions and processors
INSTALLED: 1.1.3 (latest)
/usr/lib/python2.6/site-packages/scrapylib/deltafetch.py
I've configured my settings.py thus:
SPIDER_MIDDLEWARES = {
    'scrapylib.deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
DOTSCRAPY_ENABLED = True
When I run the spider, DeltaFetch appears to be enabled:
2014-06-20 10:58:00-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware,
DeltaFetch, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
The .scrapy directory is created:
[user@hostname output]$ ls -al ../.scrapy
total 12
drwxrwxr-x. 3 user user 4096 Jun 20 10:58 .
drwxrwxr-x. 6 user user 4096 Jun 20 10:58 ..
drwxrwxr-x. 2 user user 4096 Jun 20 10:58 deltafetch
The db file is being created:
[user@hostname output]$ ls -al ../.scrapy/deltafetch/
total 16
drwxrwxr-x. 2 user user 4096 Jun 20 10:58 .
drwxrwxr-x. 3 user user 4096 Jun 20 10:58 ..
-rw-rw-r--. 1 user user 12288 Jun 20 10:58 spider.db
[user@hostname deltafetch]$ file spider.db
spider.db: Berkeley DB (Hash, version 9, native byte-order)
[user@hostname deltafetch]$
However the .db file appears to have no state data:
[user@hostname deltafetch]$ python
Python 2.6.6 (r266:84292, Nov 21 2013, 10:50:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bsddb
>>> for k, v in bsddb.hashopen("spider.db").iteritems(): print k, v
...
>>>
[user@hostname deltafetch]$ db_dump spider.db
VERSION=3
format=bytevalue
type=hash
db_pagesize=4096
HEADER=END
DATA=END
When I run the spider again, all the same urls are crawled/fetched, and even though there were new items in the crawl, the state db did not appear to get updated, e.g. this is a previously fetched file:
2014-06-20 11:13:56-0400 [spider] DEBUG: Crawled (200)
<GET http://www.example.com/xxx/xxx/xxx/xxx/xxx.xml>
(referer: None)
I can see not only from the log that the files are still being crawled, but the .xml files that I create from the crawl are getting created again.
I've looked at the other related deltafetch questions and they did not address this issue; any assistance is appreciated.