Comments (5)
From the deltafetch documentation:
...a spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider...
You clearly don't yield any items in your first parse method.
The second, however, is an example of proper usage.
It's impossible for deltafetch to speculate on side effects like `open(filename, 'wb').write(...)`,
so move this logic to an item pipeline.
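To illustrate the distinction being made here, a minimal sketch (the function and class names, item fields, and URL are made up for illustration; in a real project `parse` would be a spider method taking a `Response`, and the pipeline would be enabled via `ITEM_PIPELINES`):

```python
class SaveBodyPipeline:
    """Item pipeline that performs the file write -- the side effect
    deltafetch cannot see when it is done inline in parse()."""

    def process_item(self, item, spider):
        with open(item['filename'], 'wb') as f:
            f.write(item['body'])
        return item


def parse_page(url, body):
    """Spider callback sketch: yield an item instead of writing the
    file here. Because an item is produced, deltafetch records the
    request fingerprint and skips this page on the next crawl.
    (Plain arguments stand in for a Scrapy Response to keep the
    sketch self-contained.)"""
    yield {
        'filename': url.rsplit('/', 1)[-1],
        'body': body,
    }
```

The key point: deltafetch only remembers a request if the callback *yields an item* for it, so side effects buried inside the callback leave nothing for the middleware to track.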
You may be interested in http://doc.scrapy.org/en/latest/topics/jobs.html instead.
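For reference, the jobs feature linked above works by pointing Scrapy at a directory that persists scheduler and dupefilter state between runs; a minimal sketch (the directory name is arbitrary):

```python
# settings.py (can also be passed as -s JOBDIR=... on the command line).
# Persists pending and seen requests between runs, so an interrupted
# crawl can be resumed without revisiting completed pages.
JOBDIR = 'crawls/myspider-run1'
```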
Also, you are using a scrapy version that has known bugs and is no longer supported
(however this doesn't seem to affect this ticket).
This issue is invalid and should be closed.
from scrapylib.
Yeah, I figured that out a year ago. And while it may be clear to you, to someone just starting out with Scrapy and deltafetch, the distinction between an "item" in the Scrapy sense and an item in the everyday sense isn't so clear.
And while it may be impossible for deltafetch to speculate on side effects, it's clearly possible for deltafetch to determine whether a URL has been previously visited; it's just not coded that way. Fine, I get it, and it's easily rectified, as demonstrated in the second configuration.
Thank you for the link to jobs; I'll see if that applies here. I had already solved my issue, as demonstrated in the second configuration, and was simply trying to give back to the project by reporting what I thought was a bug.
And of course I was on an old version; this ticket is over a year old. At the time it was raised, 0.18.4 was only three months behind the then-current release, 0.22.0. Please realize that enterprise folks can't just upgrade entire stacks with every new release. Thanks.
/summons @redapple (close please?)
@empereor, sorry for this late comment. Any user feedback is valuable and appreciated. Thank you.
DeltaFetch now lives in a separate repository, https://github.com/scrapy-plugins/scrapy-deltafetch, and has releases independent of scrapylib (which will gradually reach end of life without any foreseeable updates).
We'll do our best to document the behavior better there: DeltaFetch is about not revisiting pages for which the spider produced items.
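For anyone landing here, enabling the standalone package looks roughly like this (the middleware path and setting name below are taken from the scrapy-deltafetch project; the priority value `100` is a conventional choice, not a requirement):

```python
# settings.py -- enable DeltaFetch from the standalone
# scrapy-deltafetch package rather than from scrapylib.
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
```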
thanks @nyov for the reminder.
Related Issues (12)
- PyPI package HOT 2
- default_input_processor with quoted markup in attribute HOT 2
- Invalid issue (created in wrong repo)
- Unexpected behaviour of constraint `RequiredFields` HOT 1
- `HcfMiddleware` consumes too many memory on large number of URLs HOT 5
- `HcfMiddleware` sometimes gives fewer links than requested.
- Move Crawlera middleware to its own repository HOT 3
- Crawlera.py using scrapy.log HOT 4
- Release 1.5.1 HOT 1
- MAGIC_FIELDS does not replace the item attribute in yielded item object HOT 2
- Extract DeltaFetch to scrapy-plugins HOT 2