Comments (5)
From the deltafetch documentation:
...a spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider...
You clearly don't yield any items in your first parse method.
The second, however, is an example of proper usage.
It's impossible for deltafetch to speculate on side effects like `open(filename, 'wb').write(...)`,
so move this logic to an item pipeline.
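To illustrate the distinction being made here, a minimal sketch (the function and class names, item fields, and URL are made up for illustration; in a real project `parse` would be a spider method taking a `Response`, and the pipeline would be enabled via `ITEM_PIPELINES`):

```python
class SaveBodyPipeline:
    """Item pipeline that performs the file write -- the side effect
    deltafetch cannot see when it is done inline in parse()."""

    def process_item(self, item, spider):
        with open(item['filename'], 'wb') as f:
            f.write(item['body'])
        return item


def parse_page(url, body):
    """Spider callback sketch: yield an item instead of writing the
    file here. Because an item is produced, deltafetch records the
    request fingerprint and skips this page on the next crawl.
    (Plain arguments stand in for a Scrapy Response to keep the
    sketch self-contained.)"""
    yield {
        'filename': url.rsplit('/', 1)[-1],
        'body': body,
    }
```

The key point: deltafetch only remembers a request if the callback *yields an item* for it, so side effects buried inside the callback leave nothing for the middleware to track.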
You may be interested in http://doc.scrapy.org/en/latest/topics/jobs.html instead.
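For reference, the jobs feature linked above works by pointing Scrapy at a directory that persists scheduler and dupefilter state between runs; a minimal sketch (the directory name is arbitrary):

```python
# settings.py (can also be passed as -s JOBDIR=... on the command line).
# Persists pending and seen requests between runs, so an interrupted
# crawl can be resumed without revisiting completed pages.
JOBDIR = 'crawls/myspider-run1'
```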
Also, you are using a scrapy version that has known bugs and is no longer supported
(however this doesn't seem to affect this ticket).
This issue is invalid and should be closed.
from scrapylib.
Yeah, I figured that out a year ago. And while it may be clear to you, to someone just starting out with Scrapy and deltafetch, the distinction between an "item" in the Scrapy sense and an item in the everyday sense isn't so clear.
And while it may be impossible for deltafetch to speculate on side effects, it's clearly possible for deltafetch to determine whether a URL has been previously visited; it's just not coded that way. Fine, I get it, and it's easily rectified, as demonstrated in the second configuration.
Thank you for the link to jobs; I'll see if that applies here. I had already solved my issue, as demonstrated in the second configuration, and was simply trying to give back to the project by reporting what I thought was a bug.
And of course I was on an old version; this ticket is over a year old. At the time it was raised, 0.18.4 was only three months behind the then-current release, 0.22.0. Please realize that enterprise folks can't just upgrade entire stacks with every new release. Thanks.
/summons @redapple (close please?)
@empereor, sorry for this late comment. Any user feedback is valuable and appreciated. Thank you.
DeltaFetch now lives in a separate repository, https://github.com/scrapy-plugins/scrapy-deltafetch, and has releases independent of scrapylib (which will gradually reach end of life without any foreseeable updates).
We'll do our best to document the behavior better there: DeltaFetch is about not revisiting pages for which the spider produced items.
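For anyone landing here, enabling the standalone package looks roughly like this (the middleware path and setting name below are taken from the scrapy-deltafetch project; the priority value `100` is a conventional choice, not a requirement):

```python
# settings.py -- enable DeltaFetch from the standalone
# scrapy-deltafetch package rather than from scrapylib.
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
```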
thanks @nyov for the reminder.
Related Issues (12)
- PyPI package HOT 2
- default_input_processor with quoted markup in attribute HOT 2
- Invalid issue (created in wrong repo)
- Unexpected behaviour of constraint `RequiredFields` HOT 1
- `HcfMiddleware` consumes too many memory on large number of URLs HOT 5
- `HcfMiddleware` sometimes gives fewer links than requested.
- Move Crawlera middleware to its own repository HOT 3
- Crawlera.py using scrapy.log HOT 4
- Release 1.5.1 HOT 1
- MAGIC_FIELDS does not replace the item attribute in yielded item object HOT 2
- Extract DeltaFetch to scrapy-plugins HOT 2