
fetchbot's Introduction


Package fetchbot provides a simple and flexible web crawler that follows the robots.txt policies and crawl delays.

It is very much a rewrite of gocrawl, with a simpler API, fewer built-in features, and at the same time more flexibility. As with Go itself, sometimes less is more!

Installation

To install, simply run in a terminal:

go get github.com/PuerkitoBio/fetchbot

The package has a single external dependency, robotstxt. It also integrates code from the iq package.

The API documentation is available on godoc.org.

Changes

  • 2019-09-11 (v1.2.0): update the robotstxt dependency (its import path/repository URL has changed; issue #31, thanks to @michael-stevens for raising the issue).
  • 2017-09-04 (v1.1.1): fix a goroutine leak when cancelling a Queue (issue #26, thanks to @ryu-koui for raising the issue).
  • 2017-07-06 (v1.1.0): add Queue.Done to get the done channel of the queue, allowing callers to wait on it in a select statement (thanks to @DennisDenuto).
  • 2015-07-25 (v1.0.0): add a Cancel method on the Queue, to close and drain it without requesting any pending commands, unlike Close, which waits for all pending commands to be processed (thanks to @buro9 for the feature request).
  • 2015-07-24: add HandlerCmd and call the Command's Handler function if it implements the Handler interface, bypassing the Fetcher's handler. Support a Custom matcher on the Mux, using a predicate (thanks to @mmcdole for the feature requests).
  • 2015-06-18: add a Scheme criterion on the muxer (thanks to @buro9).
  • 2015-06-10: add a DisablePoliteness field on the Fetcher to optionally bypass robots.txt checks (thanks to @oli-g).
  • 2014-07-04: change the type of Fetcher.HttpClient from *http.Client to the Doer interface. There is a low chance of breaking existing code, but it is possible if someone used the fetcher's client to run other requests (e.g. f.HttpClient.Get(...)).

Usage

The following example (taken from /example/short/main.go) shows how to create and start a Fetcher, one way to send commands, and how to stop the fetcher once all commands have been handled.

package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/fetchbot"
)

func main() {
	f := fetchbot.New(fetchbot.HandlerFunc(handler))
	queue := f.Start()
	queue.SendStringHead("http://google.com", "http://golang.org", "http://golang.org/doc")
	queue.Close()
}

func handler(ctx *fetchbot.Context, res *http.Response, err error) {
	if err != nil {
		fmt.Printf("error: %s\n", err)
		return
	}
	fmt.Printf("[%d] %s %s\n", res.StatusCode, ctx.Cmd.Method(), ctx.Cmd.URL())
}

A more complex and complete example can be found in the repository, at /example/full/.

Fetcher

A Fetcher is an instance of a web crawler, independent of other Fetchers. It receives Commands via the Queue, executes the requests, and calls a Handler to process the responses. A Command is an interface that tells the Fetcher which URL to fetch and which HTTP method to use (e.g. "GET", "HEAD", ...).

A call to Fetcher.Start() returns the Queue associated with this Fetcher. This is the thread-safe object that can be used to send commands, or to stop the crawler.

Both the Command and the Handler are interfaces, and may be implemented in various ways. They are defined like so:

type Command interface {
	URL() *url.URL
	Method() string
}
type Handler interface {
	Handle(*Context, *http.Response, error)
}

A Context is a struct that holds the Command and the Queue, so that the Handler always knows which Command initiated this call, and has a handle to the Queue.
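For illustration, here is a minimal sketch of a handler that uses the Queue held by the Context to enqueue more URLs as it processes responses. It assumes the same imports as the usage example above, that the Context exposes the Queue as ctx.Q, and that a Queue.SendStringGet method exists alongside the other SendString* helpers mentioned below; check the API documentation to confirm the exact names.

// Sketch: enqueue additional URLs from within a handler, via the Queue
// carried by the Context (assumed to be the ctx.Q field).
func enqueueHandler(ctx *fetchbot.Context, res *http.Response, err error) {
	if err != nil {
		fmt.Printf("error: %s\n", err)
		return
	}
	// In a real crawler the links would be extracted from res.Body
	// (e.g. with goquery); they are hard-coded to keep the sketch short.
	links := []string{"http://golang.org/pkg/", "http://golang.org/doc/"}
	for _, link := range links {
		// SendStringGet is assumed to behave like SendStringHead in the
		// example above, but issuing GET requests, and to return the
		// number of commands enqueued and an error.
		if _, err := ctx.Q.SendStringGet(link); err != nil {
			fmt.Printf("enqueue error: %s\n", err)
		}
	}
}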

A Handler is similar to the net/http Handler, and middleware-style combinations can be built on top of it. A HandlerFunc type is provided so that simple functions with the right signature can be used as Handlers (like net/http.HandlerFunc), and there is also a multiplexer Mux that can be used to dispatch calls to different Handlers based on some criteria.
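As a sketch of how the Mux can be wired in (drawn from the patterns in /example/full/; the matcher methods used here, NewMux, HandleErrors, Response, Method, ContentType and Handler, should be verified against the API documentation), the following fragment would replace the Fetcher setup in main:

// Sketch: dispatch responses to different handlers based on criteria.
mux := fetchbot.NewMux()

// Handle all errors in a single place.
mux.HandleErrors(fetchbot.HandlerFunc(func(ctx *fetchbot.Context, res *http.Response, err error) {
	fmt.Printf("[ERR] %s %s - %s\n", ctx.Cmd.Method(), ctx.Cmd.URL(), err)
}))

// Handle GET responses with an HTML content type.
mux.Response().Method("GET").ContentType("text/html").
	Handler(fetchbot.HandlerFunc(func(ctx *fetchbot.Context, res *http.Response, err error) {
		fmt.Printf("[%d] %s %s\n", res.StatusCode, ctx.Cmd.Method(), ctx.Cmd.URL())
	}))

// The Mux is itself a Handler, so it is passed directly to the Fetcher.
f := fetchbot.New(mux)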

Command-related Interfaces

The Fetcher recognizes a number of interfaces that the Command may implement, for more advanced needs.

  • BasicAuthProvider: Implement this interface to specify the basic authentication credentials to set on the request.

  • CookiesProvider: If the Command implements this interface, the provided Cookies will be set on the request.

  • HeaderProvider: Implement this interface to specify the headers to set on the request.

  • ReaderProvider: Implement this interface to set the body of the request, via an io.Reader.

  • ValuesProvider: Implement this interface to set the body of the request, as form-encoded values. If the Content-Type is not specifically set via a HeaderProvider, it is set to "application/x-www-form-urlencoded". ReaderProvider and ValuesProvider should be mutually exclusive as they both set the body of the request. If both are implemented, the ReaderProvider interface is used.

  • Handler: Implement this interface if the Command's response should be handled by a specific callback function. By default, the response is handled by the Fetcher's Handler, but if the Command implements this, this handler function takes precedence and the Fetcher's Handler is ignored.

Since Command is an interface, it can be a custom struct that holds additional information, such as an ID for the URL (e.g. from a database) or a depth counter so that the crawl stops at a certain depth (see the sketch below). For basic commands that don't require additional information, the package provides the Cmd struct, which implements the Command interface. This is the Command implementation used by the various Queue.SendString* methods.
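For illustration, here is a minimal sketch of such a custom Command, carrying a depth counter and also implementing HeaderProvider. The Header() http.Header method signature is assumed from the interface's description above; check the package documentation to confirm it.

// Sketch: a custom Command carrying extra state and custom headers.
type PageCmd struct {
	U     *url.URL
	Depth int // e.g. lets the crawler stop enqueueing past a chosen depth
}

// Command interface.
func (c *PageCmd) URL() *url.URL  { return c.U }
func (c *PageCmd) Method() string { return "GET" }

// HeaderProvider (assumed signature): the returned headers are set on the
// outgoing request by the Fetcher.
func (c *PageCmd) Header() http.Header {
	h := make(http.Header)
	h.Set("Accept-Language", "en")
	return h
}

Such a command would be enqueued directly on the Queue (assuming it exposes a Send method that accepts a Command) rather than through the SendString* helpers.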

There is also a convenience HandlerCmd struct for commands that should be handled by a specific callback function. It is a Command that also implements the Handler interface; a custom Command can achieve the same effect by implementing Handler itself, as sketched below.
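Reusing the hypothetical PageCmd from the sketch above, the following method would then be called for its responses instead of the Fetcher's Handler:

// Sketch: per-command handling. Because PageCmd now also implements
// Handler, its responses bypass the Fetcher's Handler.
func (c *PageCmd) Handle(ctx *fetchbot.Context, res *http.Response, err error) {
	if err != nil {
		fmt.Printf("error at depth %d: %s\n", c.Depth, err)
		return
	}
	fmt.Printf("[%d] depth=%d %s\n", res.StatusCode, c.Depth, ctx.Cmd.URL())
}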

Fetcher Options

The Fetcher has a number of fields that provide further customization (a short configuration sketch follows the list):

  • HttpClient: By default, the Fetcher uses the net/http default Client to make requests. A different client can be set on the Fetcher.HttpClient field.

  • CrawlDelay: This value is used only if there is no delay specified by the robots.txt of a given host.

  • UserAgent: Sets the user agent string to use for the requests and to validate against the robots.txt entries.

  • WorkerIdleTTL: Sets the duration that a worker goroutine can wait without receiving new commands to fetch. If the idle time-to-live is reached, the worker goroutine is stopped and its resources are released. This can be especially useful for long-running crawlers.

  • AutoClose: If true, closes the queue automatically once the number of active hosts reaches 0.

  • DisablePoliteness: If true, ignores the robots.txt policies of the hosts.
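Here is that configuration sketch, setting a few of these fields before starting the Fetcher. The field types are assumptions based on their descriptions (durations for CrawlDelay and WorkerIdleTTL, booleans for the flags), the seed URL and user-agent string are placeholders, and it assumes the handler function and imports from the usage example plus "time".

// Sketch: customizing a Fetcher before starting it.
f := fetchbot.New(fetchbot.HandlerFunc(handler))
f.UserAgent = "mybot/1.0 (+http://example.com/bot)" // placeholder user-agent string
f.CrawlDelay = 2 * time.Second                      // used when robots.txt specifies no delay
f.WorkerIdleTTL = 30 * time.Second                  // stop idle per-host workers after 30s
f.AutoClose = true                                  // close the queue once no host remains active
// f.DisablePoliteness = true                       // would skip robots.txt checks entirely

q := f.Start()
q.SendStringGet("http://example.com/") // placeholder seed URL
q.Block() // wait until the queue is closed (here, automatically via AutoClose)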

Unlike gocrawl, fetchbot does not keep track of already-visited URLs, and it does not normalize URLs. Both are outside the scope of this package: all commands sent on the Queue will be fetched. Normalization can easily be done (e.g. using purell) before sending the Command to the Fetcher. How to keep track of visited URLs depends on the use-case of the specific crawler, but for an example, see /example/full/main.go, or the sketch below.
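As one way to track visited URLs (a simplified version of what the full example does; the map, mutex and helper function are application code, not part of the fetchbot API, and it again assumes ctx.Q and SendStringGet as above, plus the "sync" import):

// Sketch: remember which URLs have been enqueued so each one is fetched
// at most once.
var (
	mu      sync.Mutex
	visited = map[string]bool{}
)

func enqueueOnce(ctx *fetchbot.Context, rawurl string) {
	// Normalization (e.g. with purell) would happen here, before the lookup.
	mu.Lock()
	seen := visited[rawurl]
	visited[rawurl] = true
	mu.Unlock()
	if seen {
		return
	}
	if _, err := ctx.Q.SendStringGet(rawurl); err != nil {
		fmt.Printf("enqueue error: %s\n", err)
	}
}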

License

The BSD 3-Clause license, the same as the Go language. The iq package source code is under the CDDL-1.0 license (details in the source file).

fetchbot's People

Contributors

adamslevy, buro9, c4s4, dennisdenuto, mna, oli-g, waitingkuo


fetchbot's Issues

Expose queue size

Is there any reason why there is no way to view the current queue size? Handling counters manually, incrementing when sending a request and decrementing when the request is fulfilled, seems error-prone when you should just be able to ask fetchbot how many outstanding requests are in its queue.

limit the depth

This isn't an issue but a question.

Is it possible to somehow limit the URLs within a given seed to a set "depth" from the root? So I could set something like "3", and then a URL like example.org/1/2/3/ would be accessed, but a URL like example.org/1/2/3/4/ would not be.
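The package does not provide this out of the box, but one possible approach (a sketch; withinDepth and maxDepth are hypothetical application code) is to check the path depth of each URL before enqueueing it:

// Sketch: only enqueue URLs whose path has at most maxDepth segments.
func withinDepth(u *url.URL, maxDepth int) bool {
	p := strings.Trim(u.Path, "/")
	if p == "" {
		return true // root path
	}
	return len(strings.Split(p, "/")) <= maxDepth
}

With maxDepth set to 3, example.org/1/2/3/ passes the check while example.org/1/2/3/4/ does not; the crawler would simply skip enqueueing URLs that fail it. A depth-from-seed limit (number of link hops rather than path segments) would instead be carried on a custom Command, as described in the Fetcher section above.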

Crawling sub-domains which share the same server

I'm currently crawling hundreds of sub/vanity-domains which are on the same server (IP).

Unfortunately the crawl delay doesn't get respected since it considers them all distinct hosts. Eventually I get flagged by the firewall and all my connections get rejected.

Is there any opposition to implementing the following:

Add an option for f.hosts to allow tracking of channels by domain instead of the entire string from url.Host, i.e. use "mysite.com" as the key for both "sample.mysite.com" and "another.mysite.com".

Thoughts?

Handler and Matcher Design

First off, thank you a bunch for your contributions to fetchbot, goquery, and purell! I've been messing with fetchbot and I wanted to suggest a possible feature.

I would love to be able to somehow specify a handler function when enqueueing a request. When I'm doing a targeted scrape, I often have all the context for what kind of link I'm about to queue at that moment.

This is the general workflow of another scraper I have used in Python called "scrapy":

scrapy.Request(link_url, callback=self.parse_genre_links)

You can see that when queuing the URL, I specified the callback handler as the function "parse_genre_links". There are 4-5 different "page types" that I'm scraping from a website. I usually create a function per page type and handle the parsing in it.

I believe the way you are expected to handle this in fetchbot is to use something like the Muxer to specify handlers based on criteria as responses come in. You can then hand off the handling to whatever functions you want for a given page.

My issue with this approach is that my context has been lost. The only thing I have to go off of is the URL itself and depending on the website, this isn't always sufficient for differentiating what type of page this is and how I should parse it. I did have the context for what this link was when I was enqueueing it.

One other minor feature request I had was to allow for more Matcher types. I really like how clean the Mux'ing is when I have unique URL prefixes where I can simply do a Path matcher to split out my handlers. However, sometimes I'd like to match the URLs based on something like a query parameter. Perhaps there is a way to supply a Matcher function where I return a predicate to indicate if a given request matches? Or, if we can't do that, perhaps adding Matchers that let me do something like PathContains that won't do a strict prefix match.

The ability to use multiple goroutines per host

Feature request.

Trying to fetch in a "browser-like" way that permits 4, 6 or 8 goroutines to request content from a given domain.

i.e. GET index.html, extract all the image/CSS/JavaScript URLs, then fetch those with up to N goroutines.

Would be awesome if I could do that with this library.

q.Block even if seed empty

Hi,

I'm trying to understand your full example, but q.Block seems to stay blocked even when there are no more pages to fetch from the seed.
I think the URL is kept in the seed even after it has been fetched, but it is not re-fetched. I'm not sure I understand it all.
Can you confirm?

Stéphane

HeaderProvider example

Hi there,

I am not so sure how to use HeaderProvider. It says to implement this on the Command, but I can't seem to get a hold of the Command object.

Thanks

Update package to use go.mod

Go modules are now the official dependency management system for Go.

Please add go.mod and go.sum files to the repository.

Drain queue

I have the following scenario:

  1. Start crawling a large website
  2. Queue 1,000 URLs
  3. Need to cancel the crawl ASAP (for whatever reason)

q.Close() does not help: even though it prevents new items from being added to the queue, no method is offered for draining the queue of the existing items.

I'm not sure how to approach this, otherwise I'd offer a patch, but could a q.Drain() be added... or better still, a q.Cancel() that first calls q.Close() and then drains the queue before returning and releasing q.Block()?

Parallelize queue

Hi, is there any way to set how many URLs can be retrieved in parallel by the same fetcher?

Need help identifying an issue in my implementation.

Hello,

I am using example/full/main.go in my crawler and scraper. The linked code has the implementation. When I run it, it crawls and scrapes as expected, but it consumes too much memory and creates too many goroutines, which causes the program to exit after some time (approximately an hour in my case).

Please help me understand where I'm going wrong and what I need to take care of. I believe I'm missing or misunderstanding something that is causing this. I am a newbie, so please feel free to ask for more explanation or clarification if needed.

Thank you.

Mux does not prioritize most specific matching Handler

Currently the Mux prioritizes the Handler with the most specific Path, but if multiple Handlers are registered with the same Path, but with different criteria such as ContentType, the selected handler is undefined.

For example, I want a generic handler for a specific Host, and a more specific handler for a specific content type. If the content type matches, I want that more specific handler to be selected.

I suggest returning an additional parameter from ResponseMatcher.match that is the number of matching criteria. When selecting a ResponseMatcher in Mux.Handle the handler with the highest number of matching criteria should be selected. The path length can then be used as a tie breaker.

This would change behavior slightly in some cases, but mostly only where the behavior was already undefined.

Getting lots of i/o timeouts

I am running the "full" example found in "example" folder with a seed of https://moz.com/top500/domains. I am getting a bunch of i/o timeouts and can't trace what is wrong. For instance:

[ERR] HEAD http://bbc.com - Head http://www.bbc.com/: dial tcp: lookup www.bbc.com on 127.0.1.1:53: read udp 127.0.0.1:53755->127.0.1.1:53: i/o timeout

I have also tried the other examples and get the same result. I am running Ubuntu 16.04 and Go1.8 linux/amd64.

thank you

Cancel() causes a goroutine leak

First off, thanks for your contributions to this package!
I'm using Cancel() to stop the bot for some reason, and a little while later I start another one.
I found that the fetchbot.sliceIQ goroutine never stops. It leaks.

	for _, v := range pending {
		next <- v
	}

This part never ends, because the receiving loop in processChan() broke out as soon as Cancel() was requested.
I hope this can be fixed.
Sorry about my bad English, by the way.
