
fetchbot's People

Contributors

adamslevy, buro9, c4s4, dennisdenuto, mna, oli-g, waitingkuo

fetchbot's Issues

The ability to use multiple goroutines per host

Feature request.

Trying to fetch in a browser-like way that permits 4, 6, or 8 goroutines to request content from a given domain.

i.e. GET index.html, extract all the image/CSS/JavaScript URLs, then fetch those in up to N goroutines.

Would be awesome if I could do that with this library.
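For illustration, here is a minimal sketch of the kind of per-host parallelism being asked for, done with plain net/http outside the library (the function name and the limit are made up for the example; this is not fetchbot API):

    package main

    import (
        "net/http"
        "sync"
    )

    // fetchAssets fetches a page's extracted image/CSS/JS URLs with up to n
    // concurrent workers, roughly the way a browser opens several connections
    // to one host.
    func fetchAssets(urls []string, n int) {
        var wg sync.WaitGroup
        sem := make(chan struct{}, n) // bounds concurrency at n goroutines
        for _, u := range urls {
            wg.Add(1)
            go func(u string) {
                defer wg.Done()
                sem <- struct{}{}        // acquire a slot
                defer func() { <-sem }() // release it
                if resp, err := http.Get(u); err == nil {
                    resp.Body.Close()
                }
            }(u)
        }
        wg.Wait()
    }

    func main() {
        fetchAssets([]string{"https://example.com/a.css", "https://example.com/b.js"}, 6)
    }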

Drain queue

I have the following scenario:

  1. Start crawling a large website
  2. Queue 1,000 URLs
  3. Need to cancel the crawl ASAP (for whatever reason)

q.Close() does not help: although it prevents new items from being added to the queue, no method is offered to drain the existing items.

I'm not sure how to approach it myself, otherwise I'd offer a patch, but could a q.Drain() be added... or better still a q.Cancel() that first calls q.Close(), then drains the queue before returning and releasing q.Block()?
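For illustration, the desired usage would look roughly like this (Cancel is the hypothetical method being requested here, and f is assumed to be the fetcher returned by fetchbot.New):

    q := f.Start()
    // ... seed the crawl; say 1,000 URLs end up queued ...
    // The crawl must stop ASAP:
    q.Cancel() // hypothetical: Close() the queue, drop pending items, release Block()
    q.Block()  // would then return promptly instead of working through the backlog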

Getting lots of i/o timeouts

I am running the "full" example found in "example" folder with a seed of https://moz.com/top500/domains. I am getting a bunch of i/o timeouts and can't trace what is wrong. For instance:

[ERR] HEAD http://bbc.com - Head http://www.bbc.com/: dial tcp: lookup www.bbc.com on 127.0.1.1:53: read udp 127.0.0.1:53755->127.0.1.1:53: i/o timeout

I have also tried the other examples and get the same result. I am running Ubuntu 16.04 and Go 1.8 linux/amd64.

Thank you.
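For what it's worth, the failures shown are DNS lookups timing out against the local stub resolver (127.0.1.1 is typically the dnsmasq/systemd stub on Ubuntu), so one way to narrow this down independently of fetchbot is a standalone lookup. A diagnostic sketch, not part of the library:

    package main

    import (
        "context"
        "fmt"
        "net"
        "time"
    )

    func main() {
        // If this also times out, the problem is the local resolver
        // configuration, not the crawler.
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        addrs, err := net.DefaultResolver.LookupHost(ctx, "www.bbc.com")
        fmt.Println(addrs, err)
    }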

Crawling sub-domains which share the same server

I'm currently crawling hundreds of sub/vanity-domains which are on the same server (IP).

Unfortunately, the crawl delay isn't respected across them, since the fetcher treats each one as a distinct host. Eventually I get flagged by the firewall and all my connections are rejected.

Is there any opposition to implementing the following:

Add an option for f.hosts to allow tracking of channels by domain instead of the entire string from url.Host, i.e. use "mysite.com" as the key for both "sample.mysite.com" and "another.mysite.com".

Thoughts?
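For illustration, deriving such a shared key could look like this. A sketch using golang.org/x/net/publicsuffix, describing the proposed keying rather than existing fetchbot behavior (hostKey is an illustrative name):

    package main

    import (
        "fmt"
        "net/url"

        "golang.org/x/net/publicsuffix"
    )

    // hostKey reduces a URL's host to its registrable domain (eTLD+1), so that
    // "sample.mysite.com" and "another.mysite.com" both map to "mysite.com".
    func hostKey(u *url.URL) string {
        if d, err := publicsuffix.EffectiveTLDPlusOne(u.Hostname()); err == nil {
            return d
        }
        return u.Host // fall back to the full host on parse errors
    }

    func main() {
        u, _ := url.Parse("https://sample.mysite.com/page")
        fmt.Println(hostKey(u)) // mysite.com
    }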

limit the depth

This isn't an issue but a question.

Is it possible to limit the URLs crawled from a given seed to a set "depth" from the root? For example, with a depth of 3, a URL like example.org/1/2/3/ would be fetched, but example.org/1/2/3/4/ would not.
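Nothing in this issue says fetchbot has such an option, but a handler could enforce one before enqueueing links. A minimal sketch (pathDepth and the limit of 3 are illustrative):

    package main

    import (
        "fmt"
        "net/url"
        "strings"
    )

    // pathDepth counts path segments, so link-extraction code can skip
    // enqueueing URLs deeper than a chosen limit.
    func pathDepth(u *url.URL) int {
        p := strings.Trim(u.Path, "/")
        if p == "" {
            return 0
        }
        return len(strings.Split(p, "/"))
    }

    func main() {
        u, _ := url.Parse("http://example.org/1/2/3/4/")
        fmt.Println(pathDepth(u) <= 3) // false: too deep, would not be enqueued
    }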

Cancel() make goroutine leak

First off, thanks for your contributions to this package!
I'm using Cancel() to stop the bot for some reason and then, a little while later, start another one.
I found that the fetchbot.sliceIQ goroutine never stops; it leaks.

    // In sliceIQ, after the main loop exits, the remaining pending items
    // are flushed to the consumer:
    for _, v := range pending {
        next <- v // blocks forever once nothing reads from next
    }

This loop never finishes because the receiving loop in processChan() has already broken out in response to Cancel(), so nothing reads from next anymore.
I hope this can be fixed.
Sorry about my English, by the way.
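One possible shape for a fix, sketched only (the done channel is hypothetical and the real fetchbot internals may differ): make the flush abort when the consumer signals it has gone away, instead of blocking on a channel nobody reads.

    for _, v := range pending {
        select {
        case next <- v:
        case <-done: // hypothetical "consumer stopped" signal set by Cancel()
            return
        }
    }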

q.Block even if seed empty

Hi,

I'm trying to understand your full example, but it seems to stay blocked even when there are no more pages to fetch from the seed. I think the URL is kept in the seed even after it has been fetched, but is not fetched again... I'm not sure I've understood it all.
Can you confirm?

Stéphane

Update package to use go.mod

Go modules are now the official dependency management system for Go.

Please add a go.mod and go.sum file to the repo.
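For reference, the requested go.mod would look roughly like this, assuming the canonical import path github.com/PuerkitoBio/fetchbot; the Go version directive is illustrative, and running go mod tidy would add any require lines and generate go.sum:

    module github.com/PuerkitoBio/fetchbot

    go 1.13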

Need help to identify the issue in the implementation.

Hello,

I am basing my crawler and scraper on example/full/main.go (the link has the implementation). When I run the code, it crawls and scrapes as expected, but it consumes too much memory and creates too many goroutines, which makes the program exit after a while (approximately an hour in my case).

Please help me understand where I'm going wrong and what I need to take care of. I believe I'm missing or misunderstanding something, which is causing this. I'm a newbie, so please feel free to ask for any further explanation or clarification.

Thank you.
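Not specific to fetchbot, but a generic way to see where goroutines and memory accumulate is to expose the standard pprof endpoints and log the goroutine count. A diagnostic sketch (startDiagnostics is an illustrative helper; call it at the top of main and inspect http://localhost:6060/debug/pprof/ while the numbers grow):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/ handlers
        "runtime"
        "time"
    )

    func startDiagnostics() {
        go func() { log.Println(http.ListenAndServe("localhost:6060", nil)) }()
        go func() {
            for range time.Tick(30 * time.Second) {
                log.Printf("goroutines: %d", runtime.NumGoroutine())
            }
        }()
    }

    func main() {
        startDiagnostics()
        select {} // the real program would run the crawler here
    }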

HeaderProvider example

Hi there,

I'm not sure how to use HeaderProvider. The docs say to implement it on the Command, but I can't seem to get hold of a Command object.

Thanks
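For what it's worth, here is a sketch of one way this could be done, assuming the documented Command, Cmd, and HeaderProvider types behave as described: define your own command type that embeds *fetchbot.Cmd and adds a Header method, then enqueue it with Queue.Send instead of SendStringGet (headerCmd and enqueueWithHeader are illustrative names, not fetchbot API):

    package main

    import (
        "net/http"
        "net/url"

        "github.com/PuerkitoBio/fetchbot"
    )

    // headerCmd embeds *fetchbot.Cmd (which satisfies Command) and adds a
    // Header method, so the fetcher should treat it as a HeaderProvider.
    type headerCmd struct {
        *fetchbot.Cmd
        header http.Header
    }

    func (c *headerCmd) Header() http.Header { return c.header }

    // enqueueWithHeader builds a custom command carrying extra headers.
    func enqueueWithHeader(q *fetchbot.Queue, rawurl string) error {
        u, err := url.Parse(rawurl)
        if err != nil {
            return err
        }
        h := http.Header{}
        h.Set("X-Custom", "value")
        return q.Send(&headerCmd{Cmd: &fetchbot.Cmd{U: u, M: "GET"}, header: h})
    }

    func main() {
        f := fetchbot.New(fetchbot.HandlerFunc(func(ctx *fetchbot.Context, res *http.Response, err error) {}))
        q := f.Start()
        _ = enqueueWithHeader(q, "https://example.com/")
        q.Close()
        q.Block()
    }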

Mux does not prioritize most specific matching Handler

Currently the Mux prioritizes the Handler with the most specific Path, but if multiple Handlers are registered with the same Path and different criteria such as ContentType, the selected handler is undefined.

For example, I want a generic handler for a specific Host, and a more specific handler for a specific content type. If the content type matches, I want that more specific handler to be selected.

I suggest returning an additional parameter from ResponseMatcher.match that is the number of matching criteria. When selecting a ResponseMatcher in Mux.Handle the handler with the highest number of matching criteria should be selected. The path length can then be used as a tie breaker.

This would change behavior slightly in some cases, but mostly only where the behavior was already undefined.
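A sketch of the proposed selection rule (the names below are illustrative, not fetchbot's actual internals): each matcher reports how many of its criteria matched, the highest count wins, and path length breaks ties.

    package mux

    import "github.com/PuerkitoBio/fetchbot"

    // match records how well one registered handler matched a response.
    type match struct {
        criteria int              // number of matching criteria (host, path, content type, ...)
        pathLen  int              // length of the registered path prefix, used as tie breaker
        handler  fetchbot.Handler // handler to run if this match wins
    }

    // pick keeps the more specific of two matches.
    func pick(best, cand match) match {
        if cand.criteria > best.criteria ||
            (cand.criteria == best.criteria && cand.pathLen > best.pathLen) {
            return cand
        }
        return best
    }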

Handler and Matcher Design

First off, thank you a bunch for your contributions to fetchbot, goquery, and purell! I've been messing with fetchbot and I wanted to suggest a possible feature.

I would love to be able to somehow specify a handler function when enqueueing a request. When I'm doing a targeted scrape, I often have all the context for what kind of link I'm about to queue at that moment.

This is the general workflow of another scraper I have used in Python called "scrapy":

scrapy.Request(link_url, callback=self.parse_genre_links)

You can see there that, when queuing the URL, I specify the callback handler as the function "parse_genre_links". There are 4-5 different "page types" that I'm scraping from a website. I usually create a function per page type and handle the parsing in it.

I believe the way you are expected to handle this in fetchbot is to use something like the Muxer to specify handlers based on criteria as responses come in. You can then hand off the handling to whatever functions you want for a given page.

My issue with this approach is that my context has been lost. The only thing I have to go off of is the URL itself and depending on the website, this isn't always sufficient for differentiating what type of page this is and how I should parse it. I did have the context for what this link was when I was enqueueing it.

One other minor feature request I had was to allow for more Matcher types. I really like how clean the Mux'ing is when I have unique URL prefixes where I can simply do a Path matcher to split out my handlers. However, sometimes I'd like to match the URLs based on something like a query parameter. Perhaps there is a way to supply a Matcher function where I return a predicate to indicate if a given request matches? Or, if we can't do that, perhaps adding Matchers that let me do something like PathContains that won't do a strict prefix match.
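On the first point, here is a sketch of how per-request context could be carried today with a custom Command, assuming the documented Cmd, Context, and HandlerFunc types work as described (callbackCmd, dispatch, enqueue, and parseGenreLinks are made-up names for the example):

    package main

    import (
        "net/http"
        "net/url"

        "github.com/PuerkitoBio/fetchbot"
    )

    // callbackCmd carries its own parse function with the request, so the
    // context known at enqueue time is available again when the response arrives.
    type callbackCmd struct {
        *fetchbot.Cmd
        parse func(*fetchbot.Context, *http.Response)
    }

    // dispatch is the single fetchbot Handler; it hands each response to the
    // callback attached when the URL was enqueued.
    func dispatch(ctx *fetchbot.Context, res *http.Response, err error) {
        if err != nil {
            return
        }
        if c, ok := ctx.Cmd.(*callbackCmd); ok && c.parse != nil {
            c.parse(ctx, res) // e.g. parseGenreLinks, parseAlbumPage, ...
        }
    }

    // enqueue attaches a callback, playing the role of scrapy.Request(url, callback=...).
    func enqueue(q *fetchbot.Queue, rawurl string, parse func(*fetchbot.Context, *http.Response)) error {
        u, err := url.Parse(rawurl)
        if err != nil {
            return err
        }
        return q.Send(&callbackCmd{Cmd: &fetchbot.Cmd{U: u, M: "GET"}, parse: parse})
    }

    func main() {
        f := fetchbot.New(fetchbot.HandlerFunc(dispatch))
        q := f.Start()
        _ = enqueue(q, "https://example.com/genres", func(ctx *fetchbot.Context, res *http.Response) {
            // parse genre links here, enqueueing follow-ups with their own callbacks
        })
        q.Close()
        q.Block()
    }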

Expose queue size

Is there any reason there is no way to view the current queue size? Handling counters manually, incrementing when sending a request and decrementing when the request is fulfilled, seems error-prone when you should just be able to ask fetchbot how many outstanding requests are in its queue.
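For context, the manual workaround described above looks roughly like this (countingQueue is a hypothetical user-side wrapper, not fetchbot API); a built-in length method would replace it:

    package crawlstats

    import (
        "sync/atomic"

        "github.com/PuerkitoBio/fetchbot"
    )

    // countingQueue wraps a fetchbot Queue and tracks outstanding requests.
    type countingQueue struct {
        q       *fetchbot.Queue
        pending int64
    }

    // Send enqueues a command and bumps the counter; the response handler must
    // remember to call Done for every command, which is the error-prone part.
    func (c *countingQueue) Send(cmd fetchbot.Command) error {
        atomic.AddInt64(&c.pending, 1)
        if err := c.q.Send(cmd); err != nil {
            atomic.AddInt64(&c.pending, -1)
            return err
        }
        return nil
    }

    // Done is called from the response handler once a request has been handled.
    func (c *countingQueue) Done() { atomic.AddInt64(&c.pending, -1) }

    // Len reports the number of requests sent but not yet handled.
    func (c *countingQueue) Len() int64 { return atomic.LoadInt64(&c.pending) }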

Parallelize queue

Hi, is there any way to set how many URLs can be retrieved in parallel by the same fetcher?
