puerkitobio / fetchbot
A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
License: BSD 3-Clause "New" or "Revised" License
Feature request:
I'd like to fetch in a "browser-like" way that permits, say, 4, 6, or 8 goroutines to request content from a given domain.
i.e.: GET index.html, extract all the image/CSS/JavaScript URLs, then fetch those with up to N goroutines.
It would be awesome if I could do that with this library.
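The "fetch assets with up to N goroutines" part can be sketched as a plain worker pool; this is a generic pattern, not fetchbot's API, and fetch here is a hypothetical stand-in for the real HTTP request:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAssets fetches each asset URL using up to n concurrent workers.
// fetch is a hypothetical stand-in for the real HTTP request.
func fetchAssets(urls []string, n int, fetch func(string) string) []string {
	jobs := make(chan string)
	results := make(chan string, len(urls))

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- fetch(u)
			}
		}()
	}
	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	close(results)

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	assets := []string{"/a.css", "/b.js", "/c.png"}
	got := fetchAssets(assets, 2, func(u string) string { return "fetched " + u })
	fmt.Println(len(got)) // prints 3
}
```

The index.html fetch would run first, the extracted URLs would then be fed to a pool like this, with n capping per-domain concurrency.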
I have the following scenario: q.Close() does not help, as even though it prevents new items from being added to the queue, no method is offered for draining the queue of the existing items. I'm not sure how to approach doing it, otherwise I'd offer a patch, but could a q.Drain() be added... or better still a q.Cancel(), which first calls q.Close() and then drains the queue before returning and releasing q.Block()?
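The close-then-drain behavior being requested can be sketched on a hypothetical channel-backed queue (the types and method names here are illustrative, not fetchbot's internals):

```go
package main

import "fmt"

// Queue is a hypothetical channel-backed work queue.
type Queue struct {
	items chan string
	done  chan struct{}
}

func NewQueue(size int) *Queue {
	return &Queue{items: make(chan string, size), done: make(chan struct{})}
}

func (q *Queue) Send(s string) { q.items <- s }

// Cancel closes the queue to new items, discards whatever is still
// pending, and then releases anyone waiting in Block().
func (q *Queue) Cancel() int {
	close(q.items)
	n := 0
	for range q.items { // drain remaining buffered items
		n++
	}
	close(q.done) // release Block()
	return n
}

// Block waits until the queue is cancelled.
func (q *Queue) Block() { <-q.done }

func main() {
	q := NewQueue(10)
	q.Send("a")
	q.Send("b")
	fmt.Println(q.Cancel()) // prints 2
}
```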
I don't see any pattern for failing a queued page from a handler (i.e.: couldn't get the page data, so re-queue it).
https://github.com/temoto/robotstxt-go seems to have been renamed to https://github.com/temoto/robotstxt.
I am running the "full" example found in "example" folder with a seed of https://moz.com/top500/domains. I am getting a bunch of i/o timeouts and can't trace what is wrong. For instance:
[ERR] HEAD http://bbc.com - Head http://www.bbc.com/: dial tcp: lookup www.bbc.com on 127.0.1.1:53: read udp 127.0.0.1:53755->127.0.1.1:53: i/o timeout
I have also tried the other examples and get the same result. I am running Ubuntu 16.04 and Go 1.8 linux/amd64.
Thank you.
I'm currently crawling hundreds of sub/vanity-domains which are on the same server (IP).
Unfortunately the crawl delay doesn't get respected since it considers them all distinct hosts. Eventually I get flagged by the firewall and all my connections get rejected.
Is there any opposition to implementing the following: add an option for f.hosts to allow tracking of channels by domain instead of the entire string from url.Host, i.e. use "mysite.com" as the key for both "sample.mysite.com" and "another.mysite.com"?
Thoughts?
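A naive version of that key normalization is easy to sketch; note that simply taking the last two labels mishandles multi-part TLDs like "co.uk", so real code would likely want golang.org/x/net/publicsuffix (EffectiveTLDPlusOne) instead:

```go
package main

import (
	"fmt"
	"strings"
)

// domainKey naively collapses a host to its last two labels so that
// subdomains share one key. A production version should use
// golang.org/x/net/publicsuffix to handle TLDs like "co.uk" correctly.
func domainKey(host string) string {
	labels := strings.Split(host, ".")
	if len(labels) <= 2 {
		return host
	}
	return strings.Join(labels[len(labels)-2:], ".")
}

func main() {
	fmt.Println(domainKey("sample.mysite.com"))  // mysite.com
	fmt.Println(domainKey("another.mysite.com")) // mysite.com
}
```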
This isn't an issue but a question.
Is it possible to somehow limit the URLs from a given seed to a set "depth" from the root? So I could set something like 3, and then a URL like example.org/1/2/3/ would be accessed but a URL like example.org/1/2/3/4/ would not.
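One way to do this is to check the number of path segments before enqueueing a URL in your handler; a minimal sketch (not a fetchbot feature, just a helper you could write yourself):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// withinDepth reports whether the URL's path has at most maxDepth segments.
func withinDepth(rawurl string, maxDepth int) bool {
	u, err := url.Parse(rawurl)
	if err != nil {
		return false
	}
	path := strings.Trim(u.Path, "/")
	if path == "" {
		return true // root is always within depth
	}
	return len(strings.Split(path, "/")) <= maxDepth
}

func main() {
	fmt.Println(withinDepth("http://example.org/1/2/3/", 3))   // true
	fmt.Println(withinDepth("http://example.org/1/2/3/4/", 3)) // false
}
```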
First off, thanks for your contributions to this package!
I'm using Cancel() to stop the bot, and a little while later I start another one.
I found that the fetchbot.sliceIQ goroutine never stops: it leaks.
for _, v := range pending {
next <- v
}
This part will never end, because the receiving loop in processChan() exited early due to the Cancel() request.
I hope this can be fixed.
Sorry about my bad English, by the way.
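A common fix for this kind of leak is to make the draining send abort when a done channel closes, so the goroutine can never block forever on a receiver that has gone away (a sketch of the pattern, not fetchbot's actual code):

```go
package main

import "fmt"

// drain forwards pending items to next, but gives up as soon as done
// is closed, so the goroutine cannot leak when the receiver stops.
func drain(pending []int, next chan<- int, done <-chan struct{}) int {
	sent := 0
	for _, v := range pending {
		select {
		case next <- v:
			sent++
		case <-done:
			return sent // receiver cancelled; stop instead of blocking forever
		}
	}
	return sent
}

func main() {
	next := make(chan int) // unbuffered and never received from
	done := make(chan struct{})
	close(done) // simulate Cancel()
	fmt.Println(drain([]int{1, 2, 3}, next, done)) // prints 0
}
```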
I'm wondering what's the best solution to add a random delay between each cmd; any suggestions?
Hi,
I'm trying to understand your full sample, but it seems to block even when there are no more pages to fetch from the seed. I think the URL is kept in the seed even after fetching, but is not re-fetched; I'm not sure I've understood it all.
Can you confirm?
Stéphane
Go modules are now the official dependency management system for Go.
Please add go.mod and go.sum files to the repo.
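A minimal go.mod for this repo might look like the following (the module path is assumed from the repo name, and the Go version is illustrative):

```
module github.com/PuerkitoBio/fetchbot

go 1.13
```

Running `go mod tidy` afterwards fills in the require block from the imports and generates go.sum.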
Hello,
I am using example/full/main.go in my crawler and scraper. The link has the implementation. When I run the code, it crawls and scrapes as expected, but it consumes too much memory and creates too many goroutines, which results in the program exiting after some time (approx. an hour in my case).
Please help me understand where I'm wrong and what I need to take care of. I believe I'm missing or have misunderstood something, which is causing this. I am a newbie, so please feel free to ask for more explanation/clarification if needed.
Thank you.
Hi there,
I am not so sure how to use HeaderProvider. It says to implement this on the Command, but I can't seem to get hold of the Command object.
Thanks
Currently the Mux prioritizes the Handler with the most specific Path, but if multiple Handlers are registered with the same Path, but with different criteria such as ContentType, the selected handler is undefined.
For example, I want a generic handler for a specific Host, and a more specific handler for a specific content type. If the content type matches, I want that more specific handler to be selected.
I suggest returning an additional value from ResponseMatcher.match that is the number of matching criteria. When selecting a ResponseMatcher in Mux.Handle, the handler with the highest number of matching criteria should be selected. The path length can then be used as a tie-breaker.
This would change behavior slightly in some cases, but mostly only where the behavior was already undefined.
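The proposed selection rule can be sketched with a simplified stand-in for the matcher type (not fetchbot's actual ResponseMatcher):

```go
package main

import "fmt"

// matcher is a simplified stand-in for fetchbot's ResponseMatcher.
type matcher struct {
	name     string
	criteria int // number of criteria that matched
	pathLen  int // length of the registered path, used as tie-breaker
}

// selectMatcher picks the matcher with the most matching criteria,
// breaking ties by the longest path.
func selectMatcher(ms []matcher) matcher {
	best := ms[0]
	for _, m := range ms[1:] {
		if m.criteria > best.criteria ||
			(m.criteria == best.criteria && m.pathLen > best.pathLen) {
			best = m
		}
	}
	return best
}

func main() {
	ms := []matcher{
		{name: "host-only", criteria: 1, pathLen: 1},
		{name: "host+content-type", criteria: 2, pathLen: 1},
	}
	fmt.Println(selectMatcher(ms).name) // prints host+content-type
}
```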
First off, thank you a bunch for your contributions to fetchbot, goquery, and purell! I've been messing with fetchbot and I wanted to suggest a possible feature.
I would love to be able to somehow specify a handler function when enqueueing a request. When I'm doing a targeted scrape, I often have all the context for what kind of link I'm about to queue at that moment.
This is the general workflow of another scraper I have used in Python called "scrapy":
scrapy.Request(link_url, callback=self.parse_genre_links)
You can see there that when queuing the URL, I specified the callback handler as the function "parse_genre_links". There are 4-5 different "page types" that I'm scraping from a website. I usually create a function per page type and handle the parsing in it.
I believe the way you are expected to handle this in fetchbot is to use something like the Muxer to specify handlers based on criteria as responses come in. You can then hand off the handling to whatever functions you want for a given page.
My issue with this approach is that my context has been lost. The only thing I have to go off of is the URL itself and depending on the website, this isn't always sufficient for differentiating what type of page this is and how I should parse it. I did have the context for what this link was when I was enqueueing it.
One other minor feature request I had was to allow for more Matcher types. I really like how clean the Mux'ing is when I have unique URL prefixes, where I can simply do a Path matcher to split out my handlers. However, sometimes I'd like to match URLs based on something like a query parameter. Perhaps there is a way to supply a Matcher function where I return a predicate indicating whether a given request matches? Or, if we can't do that, perhaps add Matchers that let me do something like PathContains that doesn't do a strict prefix match.
Is there any reason why there is no way to view the current queue size? Handling counters manually, incrementing when sending a request and decrementing when the request is filled, seems error-prone when you should just be able to ask fetchbot how many outstanding requests are in its queue.
Hi, is there any way to set how many URLs can be retrieved in parallel within the same fetcher?