How to reproduce this issue Write a simple spider to generate a la

`HcfMiddleware` consumes too much memory on a large number of URLs (scrapylib issue, open)

scrapinghub commented on June 14, 2024

`HcfMiddleware` consumes too much memory on a large number of URLs

from scrapylib.

Comments (5)

starrify commented on June 14, 2024

I'll be working on a PR for solving this issue.

nramirezuy commented on June 14, 2024

I don't think your memory issues come from this middleware.

Are you using SitemapSpider along with this middleware?
Are you reading requests as you upload them with the same spider?

starrify commented on June 14, 2024

Sorry, I mistakenly clicked the "Close and comment" button just now. :(

Thanks for the reply @nramirezuy :)

Yes, but with its `_parse_sitemap` and `_parse_urlset` methods hacked.
No.

I'm working with a hacked SitemapSpider: a typical job sends ~500 requests and gets ~10 million URLs to store in HCF. Wouldn't that leave ~10 million URLs stored in `HcfMiddleware.new_links`? That overhead is what I want to eliminate. :)

nramirezuy commented on June 14, 2024

What do you mean by hacked, what are the changes made to the spider? SitemapSpider is extremely memory consuming because of big sitemap files, which implies massive responses.

I would suggest not removing `new_links`, but changing it to use the same technique the Scheduler uses, and making that behavior optional through a setting.
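For illustration, here is a minimal sketch of the general idea of buffering links on disk rather than in memory. It is an assumption about what such a change could look like, not scrapylib code: `DiskBackedLinkBuffer`, `batch_size`, and `iter_batches` are hypothetical names, and the sketch uses a plain temporary file rather than the disk queues Scrapy's scheduler uses when `JOBDIR` is set. Unlike an in-memory set, it does not deduplicate URLs.

```python
import os
import tempfile


class DiskBackedLinkBuffer:
    """Hypothetical helper: keep discovered URLs in a temp file, not in memory."""

    def __init__(self, batch_size=1000):
        # delete=False so the file is removed explicitly in close().
        self._file = tempfile.NamedTemporaryFile(
            mode="w+", encoding="utf-8", suffix=".links", delete=False
        )
        self.batch_size = batch_size
        self.count = 0

    def add(self, url):
        # One URL per line; only the file's write buffer stays in memory.
        self._file.write(url + "\n")
        self.count += 1

    def iter_batches(self):
        # Read the stored URLs back in fixed-size batches for uploading.
        self._file.flush()
        self._file.seek(0)
        batch = []
        for line in self._file:
            batch.append(line.rstrip("\n"))
            if len(batch) >= self.batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    def close(self):
        # Close and remove the backing file.
        name = self._file.name
        self._file.close()
        os.remove(name)
```

With something like this, memory use would be bounded by `batch_size` rather than by the total number of links, at the cost of extra disk I/O. A setting could then choose between the in-memory and on-disk variants.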

starrify commented on June 14, 2024

> What do you mean by hacked, what are the changes made to the spider? SitemapSpider is extremely memory consuming because of big sitemap files, which implies massive responses.

I set `request.meta["use_hcf"] = True` on requests that come from the sitemap. That way there are no massive responses, only a massive number of URLs to be stored in HCF.
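A rough sketch of that kind of change, under stated assumptions: `requests_from_sitemap` and `StubRequest` are hypothetical names, and `StubRequest` only stands in for `scrapy.Request` so the example runs without Scrapy installed. The `use_hcf` meta key is the one mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class StubRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy."""
    url: str
    meta: dict = field(default_factory=dict)


def requests_from_sitemap(urls, make_request=StubRequest):
    """Yield one request per sitemap URL, flagged for storage in HCF."""
    for url in urls:
        request = make_request(url)
        # Flag the request so it is stored in HCF instead of being downloaded.
        request.meta["use_hcf"] = True
        yield request
```

In a real spider, `make_request` would presumably be `scrapy.Request` (with a callback), and the generator would be called from the overridden sitemap parsing methods.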