Comments (5)

Digenis commented on June 28, 2024

From the deltafetch documentation:
...a spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider...
You clearly don't yield any items in your first parse method.
The second, however, is an example of proper usage.
It's impossible for deltafetch to speculate on side effects like open(filename, 'wb').write(...
so move this logic to the pipeline.
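
A minimal sketch of what moving the file write into a pipeline could look like. The callback yields an item, and the pipeline performs the side effect in process_item(item, spider), which is the standard Scrapy pipeline hook. The class name SavePagePipeline, the item keys "filename" and "body", the "pages" default directory, and the stand-in parse function taking a URL and a body are hypothetical; a real spider callback would receive a Response object.

```python
import os


class SavePagePipeline:
    """Hypothetical item pipeline: writes each item's body to disk."""

    def __init__(self, out_dir="pages"):
        # Assumed output directory; a real project might read it from settings.
        self.out_dir = out_dir

    def process_item(self, item, spider):
        # The side effect lives here, not in the spider callback.
        os.makedirs(self.out_dir, exist_ok=True)
        path = os.path.join(self.out_dir, item["filename"])
        with open(path, "wb") as f:
            f.write(item["body"])
        # Returning the item passes it on to any later pipeline.
        return item


def parse(response_url, response_body):
    """Stand-in for a spider callback: yields an item instead of writing a file."""
    name = response_url.rstrip("/").split("/")[-1] + ".html"
    yield {"filename": name, "body": response_body}
```

Because the callback now yields an item for every page it handles, deltafetch has something to record for that page.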
You may be interested in http://doc.scrapy.org/en/latest/topics/jobs.html instead.
Also, you are using a Scrapy version that has known bugs and is no longer supported
(although this doesn't seem to affect this ticket).

This issue is invalid and should be closed.

from scrapylib.

empereor commented on June 28, 2024

Yeah. Figured that out a year ago. And while it may be clear to you, to someone just starting to use scrapy and deltafetch, the distinction between an "item" in the scrapy sense and an item in the everyday sense isn't so clear.

And while it may be impossible for deltafetch to speculate on side effects, it is clearly possible for deltafetch to determine whether a URL has been previously visited; it's just not coded that way. Fine, I get it. That is easily rectified, as demonstrated in the second configuration.

Thank you for the link to jobs; I'll see if that applies here. I had already solved my issue, as demonstrated in the second configuration, and was simply trying to give back to the project by reporting what I thought was a bug.

And of course I was on an old version: this was over a year ago. At the time the bug was raised, 0.18.4 was only three months behind the current release, which was 0.22.0. Please realize that enterprise folks can't just go upgrading entire stacks with every new release. Thanks.

nyov commented on June 28, 2024

/summons @redapple (close please?)

redapple commented on June 28, 2024

@empereor, sorry for this late comment. Any user feedback is valuable and appreciated. Thank you.

DeltaFetch now lives in a separate repository, https://github.com/scrapy-plugins/scrapy-deltafetch, and has releases independent of scrapylib (which will gradually reach end of life, with no updates foreseen).
We'll do our best to document the behavior better there: it's about not visiting pages for which the spider produced items.
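
As a rough illustration of that behavior, here is a simplified model using only the standard library. It is not the scrapy-deltafetch implementation: the class name SeenItemsFilter, the use of plain URL strings for requests and dicts for items, the in-memory set, and the SHA-1 key are all assumptions made for this sketch.

```python
import hashlib


class SeenItemsFilter:
    """Toy model: skip requests to pages that produced items before."""

    def __init__(self):
        # Keys of pages that yielded at least one item (kept in memory here;
        # a real plugin would persist this between crawls).
        self.seen = set()

    def _key(self, url):
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def process_output(self, response_url, results):
        """Filter what a callback yielded for the page at response_url."""
        for result in results:
            if isinstance(result, str):
                # A follow-up request, modeled as a URL string.
                if self._key(result) in self.seen:
                    continue  # that page produced an item earlier: skip it
            elif isinstance(result, dict):
                # An item: remember the page it came from.
                self.seen.add(self._key(response_url))
            yield result
```

In this model, a page whose callback only writes a file and yields nothing is never recorded, so requests to it are not filtered on the next crawl. That matches the distinction discussed in this thread between a page that was visited and a page that produced items.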

redapple commented on June 28, 2024

Thanks, @nyov, for the reminder.
