UPDATE 2: I believe this IS a bug, although I'm not sure whether it's in scrapy or deltafetch.
I have resolved the issue, or at least implemented a workaround; IMHO this is a bug.
My spider looks like this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class FunStuffSpider(CrawlSpider):
    name = "test_fun"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/fun/stuff/2012/",
        "http://www.example.com/fun/stuff/2013/",
        "http://www.example.com/fun/stuff/2014/",
    ]
    pipelines = ['fun_pipe']
    rules = (
        Rule(SgmlLinkExtractor(allow=(r"fun.-.*\.xml",),
                               restrict_xpaths=('//table/tr[*]/td[2]/a')),
             callback="parse_funstuff", follow=True),
    )

    def parse_funstuff(self, response):
        # save the XML response to disk; note that no items are returned here
        filename = response.url.split("/")[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)
With the above spider, scrapy/deltafetch did not think I had scraped anything, even though it was happily downloading the target XML files.
I added the following to the spider:
    # added at the end of parse_funstuff (XmlXPathSelector comes from
    # scrapy.selector; FunItem is the item class defined in my project)
    xxs = XmlXPathSelector(response)
    xxs.remove_namespaces()
    sites = xxs.select("//funstuff/morefun")
    items = []
    for site in sites:
        item = FunItem()
        item['stuff'] = site.select("//funstuff/FunStuff/funny/ID").extract()
        items.append(item)
    return items
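Putting the two pieces together, the whole callback now looks roughly like this (a sketch of my combined version; XmlXPathSelector comes from scrapy.selector, and FunItem is the item class defined in my own project, so adjust that to your layout):
    def parse_funstuff(self, response):
        # save the raw XML to disk, exactly as before
        filename = response.url.split("/")[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)
        # ...and also return items, which is what makes deltafetch record the url
        xxs = XmlXPathSelector(response)
        xxs.remove_namespaces()
        items = []
        for site in xxs.select("//funstuff/morefun"):
            item = FunItem()
            item['stuff'] = site.select("//funstuff/FunStuff/funny/ID").extract()
            items.append(item)
        return items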
Now deltafetch works as expected: it logs both the "GET" and a "Scraped from" line, populates the output xml defined by the "FunItem" that is set up in pipelines.py, and updates the db file. Subsequent spider crawls correctly log that the url for each xml file has already been visited and ignore it.
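My reading of why this helps (a simplified, illustrative sketch of what I believe a DeltaFetch-style spider middleware does, not scrapylib's actual code): a request is only recorded as "seen" when its callback yields at least one item, so a callback that only writes response.body to disk never marks anything as fetched.
from scrapy.item import BaseItem

class DeltaFetchLikeMiddleware(object):
    """Illustrative only: records a request as seen only if items come back."""

    def __init__(self):
        self.seen = {}  # the real middleware persists this in a Berkeley DB file

    def process_spider_output(self, response, result, spider):
        key = response.request.url  # assumption: the real key is a request fingerprint
        for r in result:
            if isinstance(r, BaseItem):
                # if the callback yields no items, this line never runs,
                # so nothing is ever recorded for the request
                self.seen[key] = True
            yield r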
Original post below:
I apologize in advance if this is a configuration, user, or PEBKAC issue and not a bug, or if this is not the appropriate forum; please disregard if so. I've not been able to find relevant documentation, nor have other examples of scrapy+deltafetch addressed this. I also have not been able to find any helpful references to dotscrapy, so I am unsure whether it is configured properly. I believe this is a bug since I believe I have it configured correctly, but without reference documentation or other user examples I can't be sure. If this is not a bug then please disregard; however, if anyone could point me to a working configuration or additional troubleshooting steps, it would be most appreciated.
Issue:
With deltafetch enabled, scrapy is still crawling previously crawled urls.
System is RHEL 6.5
[root@hostname ~]# python -V
Python 2.6.6
I've installed deltafetch via pip:
[root@hostname ~]# pip search scrapy
Scrapy - A high-level Python Screen Scraping framework
INSTALLED: 0.18.4
LATEST: 0.22.2
[root@hostname ~]# pip search scrapylib
scrapylib - Scrapy helper functions and processors
INSTALLED: 1.1.3 (latest)
/usr/lib/python2.6/site-packages/scrapylib/deltafetch.py
I've configured my settings.py thus:
SPIDER_MIDDLEWARES = {
    'scrapylib.deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
DOTSCRAPY_ENABLED = True
When I run the spider, DeltaFetch appears to be enabled:
2014-06-20 10:58:00-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware,
DeltaFetch, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
The .scrapy directory is created:
[user@hostname output]$ ls -al ../.scrapy
total 12
drwxrwxr-x. 3 user user 4096 Jun 20 10:58 .
drwxrwxr-x. 6 user user 4096 Jun 20 10:58 ..
drwxrwxr-x. 2 user user 4096 Jun 20 10:58 deltafetch
The db file is being created:
[user@hostname output]$ ls -al ../.scrapy/deltafetch/
total 16
drwxrwxr-x. 2 user user 4096 Jun 20 10:58 .
drwxrwxr-x. 3 user user 4096 Jun 20 10:58 ..
-rw-rw-r--. 1 user user 12288 Jun 20 10:58 spider.db
[user@hostname deltafetch]$ file spider.db
spider.db: Berkeley DB (Hash, version 9, native byte-order)
[user@hostname deltafetch]$
However the .db file appears to have no state data:
[user@hostname deltafetch]$ python
Python 2.6.6 (r266:84292, Nov 21 2013, 10:50:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bsddb
>>> for k, v in bsddb.hashopen("spider.db").iteritems(): print k, v
...
>>>
[user@hostname deltafetch]$ db_dump spider.db
VERSION=3
format=bytevalue
type=hash
db_pagesize=4096
HEADER=END
DATA=END
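For reference, this is a slightly more reusable version of the same bsddb check (plain Python 2), packaged so it can be re-run after each crawl to see whether anything has been recorded:
import sys
import bsddb

# open the deltafetch state file read-only and list whatever has been recorded
db = bsddb.hashopen(sys.argv[1] if len(sys.argv) > 1 else "spider.db", "r")
print "%d request(s) recorded" % len(db)
for key, value in db.iteritems():
    print repr(key), repr(value)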
When I run the spider again, all the same urls are crawled/fetched, and even though there were new items in the crawl, the state db did not appear to get updated. E.g., this is a previously fetched file:
2014-06-20 11:13:56-0400 [spider] DEBUG: Crawled (200)
<GET http://www.example.com/xxx/xxx/xxx/xxx/xxx.xml>
(referer: None)
I can see not only from the log that the files are still being crawled, but the .xml files that I create from the crawl are getting created again.
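As a further check (assuming deltafetch keys its entries by scrapy's request fingerprint, which I have not verified), one of the re-crawled urls could be looked up in the db directly:
import bsddb
from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

# assumption: the db key is request_fingerprint(request)
db = bsddb.hashopen("spider.db", "r")
fp = request_fingerprint(Request("http://www.example.com/xxx/xxx/xxx/xxx/xxx.xml"))
print db.has_key(fp)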
I've looked at the other related deltafetch questions and they did not address this issue; any assistance is appreciated.