Comments (5)
I'll be working on a PR for solving this issue.
from scrapylib.
I don't think your memory issues come from this middleware.
- Are you using SitemapSpider along with this middleware?
- Are you reading requests as you upload them with the same spider?
from scrapylib.
Sorry I mistakenly clicked on the "Close and comment" button just now.. :(
Thanks for the reply @nramirezuy :)
- Yes. (But hacked into its `_parse_sitemap` and `_parse_urlset` methods)
- No.
I'm working with a hacked `SitemapSpider`: a typical job sends ~500 requests and gets ~10 million URLs to store to HCF. Wouldn't there be ~10 million URLs stored in `HcfMiddleware.new_links`? What I want is to eliminate this overhead. :)
from scrapylib.
What do you mean by hacked, what are the changes made to the spider? `SitemapSpider` is extremely memory consuming because of big sitemap files, which implies massive responses.
I would suggest not removing `new_links`, but changing it to use the same technique `Scheduler` uses, and making that optional through a setting.
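To illustrate the idea, here is a minimal sketch of buffering links on disk instead of in memory, in the spirit of the disk queues Scrapy's scheduler can use when `JOBDIR` is set. This is not the middleware's actual code: `DiskLinkBuffer`, `add`, and `drain` are hypothetical names, it uses only the standard library, and unlike an in-memory set it does not deduplicate.

```python
import json
import tempfile


class DiskLinkBuffer:
    """Hypothetical append-only buffer that keeps pending links on disk."""

    def __init__(self):
        # Anonymous temp file; it is deleted automatically when closed.
        self._file = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        self._count = 0

    def add(self, slot, fingerprint):
        # One JSON document per line, so reads and writes stay streaming.
        self._file.write(json.dumps({"slot": slot, "fp": fingerprint}) + "\n")
        self._count += 1

    def __len__(self):
        return self._count

    def drain(self, batch_size=1000):
        """Yield lists of at most batch_size links.

        The buffer is reset only once the generator is fully consumed.
        """
        self._file.seek(0)
        batch = []
        for line in self._file:
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
        # Everything was read back: empty the file for reuse.
        self._file.seek(0)
        self._file.truncate()
        self._count = 0

    def close(self):
        self._file.close()
```

With something like this, only one batch of links is held in memory at a time while uploading, at the cost of extra disk I/O, which is why making it optional by a setting seems reasonable.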
from scrapylib.
> What do you mean by hacked, what are the changes made to the spider? `SitemapSpider` is extremely memory consuming because of big sitemap files, which implies massive responses.

Setting `request.meta["use_hcf"] = True` for requests that come from the sitemap. Thus there won't be massive responses, only a massive number of URLs to be stored to HCF.

> I would suggest not removing `new_links`, but changing it to use the same technique `Scheduler` uses, and making that optional through a setting.

Sure. This is exactly what I do in PR #57 :)
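
For readers unfamiliar with the `use_hcf` flag mentioned above, a rough sketch of the tagging step follows. `mark_for_hcf` is a hypothetical helper, not part of scrapylib, and the `SimpleNamespace` objects are stand-ins for `scrapy.Request` (assumption: only the `meta` dict matters here), so the snippet runs without Scrapy installed.

```python
from types import SimpleNamespace


def mark_for_hcf(requests):
    """Set meta["use_hcf"] = True on each request and pass it through."""
    for request in requests:
        request.meta["use_hcf"] = True
        yield request


# Stand-ins for the requests a sitemap callback would yield (hypothetical URLs).
fake_requests = [
    SimpleNamespace(url="http://example.com/page%d" % i, meta={})
    for i in range(3)
]
marked = list(mark_for_hcf(fake_requests))
```

In a spider, the same loop would wrap whatever the hacked sitemap callbacks yield, so those requests are handed to HCF rather than downloaded by the same job.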
from scrapylib.
Related Issues (12)
- PyPI package HOT 2
- default_input_processor with quoted markup in attribute HOT 2
- With deltafetch enabled, scrapy is still crawling previously crawled urls HOT 5
- Invalid issue (created in wrong repo)
- Unexpected behaviour of constraint `RequiredFields` HOT 1
- `HcfMiddleware` sometimes gives fewer links than requested.
- Move Crawlera middleware to its own repositoy HOT 3
- Crawlera.py using scrapy.log HOT 4
- Release 1.5.1 HOT 1
- MAGIC_FIELDS does not replace the item attribute in yielded item object HOT 2
- Extract DeltaFetch to scrapy-plugins HOT 2