
supercrawler's People

Contributors

brendonboshell · cbess · dependabot[bot] · hjr3 · mrrefactoring · simoncpu · taina0407


supercrawler's Issues

Set Referer

Hi, great tool! Is there any way to set a referrer, i.e. the URL of the page that led the crawler to the current page?

URL: https://example.com
Destination: https://example.com/page1

Also, is it possible to make the crawler scan all links but store only the external links in the database?

Thank you very much
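
For the static-header half of this, a minimal sketch, assuming the Crawler accepts a request options object that it forwards to the underlying HTTP library (an assumption; check the README for your version). A per-page Referer, i.e. the URL of the linking page, would need support inside the crawler itself.

var supercrawler = require("supercrawler");

// Sketch only: assumes `request` options (including headers) are
// forwarded to the underlying HTTP client on every fetch.
var crawler = new supercrawler.Crawler({
  interval: 1000,
  request: {
    headers: {
      Referer: "https://example.com" // static value for all requests
    }
  }
});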

Confusion over Sitemap parser

I am a little confused over how the sitemap parser works.

var supercrawler = require("supercrawler");

var sp = supercrawler.handlers.sitemapsParser({
  urlFilter: function (url) {
    // keep only URLs that do not contain "de"
    return url.indexOf("de") === -1;
  }
});

This looks like it filters the links found inside the sitemap itself (rather than filtering a list of sitemaps). Am I right?
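
For reference, here is how such a parser is typically wired up, assuming the addHandler API and the urlFilter option of sitemapsParser behave as in the snippet above (a sketch, not a confirmed answer):

var supercrawler = require("supercrawler");

var crawler = new supercrawler.Crawler({ interval: 1000 });

// The handler runs on sitemap documents; urlFilter is applied to each
// URL discovered inside the sitemap before it is queued.
crawler.addHandler("application/xml", supercrawler.handlers.sitemapsParser({
  urlFilter: function (url) {
    return url.indexOf("de") === -1;
  }
}));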

.....

Hi Brendon!

  1. Thanks for making this!

  2. What is the best way to pause a crawl that is in progress?
    ...and then (when the user decides) to continue from exactly where it left off?

Cheers!!
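
A sketch of one way this can work today, assuming crawler.start()/crawler.stop() and a persistent DbUrlList as described in the README: stopping does not clear the queue, so starting again continues from where it left off.

var supercrawler = require("supercrawler");

var crawler = new supercrawler.Crawler({
  // Persistent queue: crawl state survives stop/start (and restarts).
  urlList: new supercrawler.DbUrlList({
    db: {
      database: "crawler",
      username: "root",
      password: "password",
      sequelizeOpts: { dialect: "sqlite", storage: "./crawl.db" }
    }
  }),
  interval: 1000
});

crawler.start();

// Pause: stop fetching, but leave the queue intact.
function pause() { crawler.stop(); }

// Resume: carry on from wherever the queue left off.
function resume() { crawler.start(); }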

Controlling how deep to crawl

I might have missed it, but in the crawler options I didn't see a way to control how deep the crawler goes. Is that possible via a parameter, or would I have to implement that logic in the htmlLinkParser handler via urlFilter?
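
There is no documented depth parameter, so one hedged workaround is to approximate depth from the URL path inside htmlLinkParser's urlFilter (a sketch: path depth is only a proxy, since true link depth would require tracking the referring page):

var supercrawler = require("supercrawler");
var url = require("url");

var MAX_PATH_DEPTH = 3; // illustrative limit

var crawler = new supercrawler.Crawler({ interval: 1000 });

crawler.addHandler("text/html", supercrawler.handlers.htmlLinkParser({
  hostnames: ["example.com"],
  urlFilter: function (link) {
    // Approximate depth by counting non-empty path segments,
    // e.g. https://example.com/a/b/c has depth 3.
    var path = url.parse(link).pathname || "/";
    return path.split("/").filter(Boolean).length <= MAX_PATH_DEPTH;
  }
}));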

Add event for 'url queued'

I'd like to see when one of supercrawler's handlers queues up another URL to crawl. It would be nice to add an event for this.
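
Something like the following is what I have in mind (the event name and payload are hypothetical; supercrawler does not currently emit this):

// Hypothetical API: fired whenever a URL discovered by a handler
// is inserted into the url list.
crawler.on("urlqueued", function (queuedUrl, foundOnUrl) {
  console.log("queued %s (found on %s)", queuedUrl, foundOnUrl);
});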

Script doesn't exit when using RedisUrlList

Problem: Script doesn't exit when using RedisUrlList.

Steps to replicate:
Run the following code:

'use strict';

const supercrawler = require('supercrawler');
const crawler = new supercrawler.Crawler({
    urlList: new supercrawler.RedisUrlList({
        redis: {
            host: 'redis-server.example.org'
        }
    })
});

console.log('Script should exit after this.');

Expected behavior:
Script should stop after running.

Actual behavior:
Script runs indefinitely.

Workaround:
Call process.exit() to terminate the script.

BTW, I'm using AWS ElastiCache for Redis, just in case this detail is needed. :)
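
For a cleaner exit than a bare process.exit(), a sketch that stops the crawler when the queue is finished, assuming a "urllistcomplete"-style event exists in your version (an assumption; check the README). The open Redis connection is what keeps the event loop alive, so the process still has to exit explicitly:

var supercrawler = require('supercrawler');

var crawler = new supercrawler.Crawler({
  urlList: new supercrawler.RedisUrlList({
    redis: { host: 'redis-server.example.org' }
  })
});

crawler.start();

// Assumed event: fires when the URL list is permanently exhausted.
crawler.on('urllistcomplete', function () {
  crawler.stop();
  process.exit(0); // the open Redis client otherwise keeps Node alive
});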

Adding TypeScript definitions

I'm currently using supercrawler from TypeScript, and I have described the base types in my repository. Could you add them to the main package?

I think they need a little polishing before being added. I'd appreciate your help.

URLs get statusCode NULL although there were no errors crawling

Hello, thanks for the app. I was using it to crawl the links of my website, but I found that it stores statusCode NULL for crawled links in the database even though there are no errors at all.

Why does it act like that? I would consider a request successful, and expect its success statusCode to be logged, if there were no errors. Also, strangely, not all successful requests end up with a statusCode of NULL.

I also checked the access log while the crawler was working and found 200 responses for URLs that are logged with a statusCode of NULL.

[screenshot attached]

Using a proxies list

Hi, some websites cannot be crawled because they block access. How can I plug a proxy into the crawler?

Thank you
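
A minimal sketch, assuming the Crawler forwards a request options object (including a proxy setting) to the underlying HTTP library (an assumption; check the README for your version). Rotating through a list would additionally need per-request options:

var supercrawler = require("supercrawler");

// Sketch only: assumes `request` options are passed through to the
// HTTP client; the proxy URL below is illustrative.
var crawler = new supercrawler.Crawler({
  interval: 1000,
  request: {
    proxy: "http://user:pass@proxy.example.org:8080"
  }
});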

SQL Injection Vulnerability

Sequelize, at the version currently specified, has an SQL injection vulnerability. It should be updated ASAP to a version >= 5.8.11.

Crawling binary files

supercrawler picks up ALL links on a page. If there are links to movie files, images, or other large files, it will add those URLs to the queue. The URLs then get passed to request, which tries to download them.
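
A hedged workaround until the crawler can skip them natively: reject known binary extensions in htmlLinkParser's urlFilter (a sketch; the extension list is illustrative, and a HEAD-request check on Content-Type would be more robust):

var supercrawler = require("supercrawler");
var url = require("url");

// Illustrative denylist of extensions we never want to fetch.
var BINARY_RE = /\.(jpe?g|png|gif|mp4|avi|mov|zip|gz|pdf|exe)$/i;

var crawler = new supercrawler.Crawler({ interval: 1000 });

crawler.addHandler("text/html", supercrawler.handlers.htmlLinkParser({
  hostnames: ["example.com"],
  urlFilter: function (link) {
    // Drop links whose path ends in a known binary extension.
    return !BINARY_RE.test(url.parse(link).pathname || "");
  }
}));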

How to avoid being seen as a robot?

I am trying to crawl a website, but a captcha appears after a certain number of pages have been crawled, so I open my browser and fill in the captcha manually.
Is there a way to avoid doing that, or at least to increase the number of pages I can crawl before getting this message? A cooldown? How?
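
Slowing down usually helps more than anything; a sketch using the documented interval, concurrentRequestsLimit and userAgent options (the values are illustrative, and note that deliberately evading bot detection may violate a site's terms of service):

var supercrawler = require("supercrawler");

var crawler = new supercrawler.Crawler({
  // Poll the queue every 5 s and fetch one page at a time, which
  // keeps the request rate low enough for many rate limiters.
  interval: 5000,
  concurrentRequestsLimit: 1,
  // Identify the bot honestly rather than impersonating a browser.
  userAgent: "Mozilla/5.0 (compatible; mybot/1.0; +https://example.org/bot)"
});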

How to periodically crawl again

I see in the database there is a field nextRetryDate, which I assume is used for failed requests.

How would one make it so that a domain/URL is "re-crawled", let's say, every day?

At the moment, once a URL is finished, it won't be touched again.
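
One hedged workaround at the database level, assuming DbUrlList keeps its queue in a table with a nextRetryDate column as described above (table and column names may differ by version): periodically reset nextRetryDate so finished URLs become eligible again.

// Sketch only: schema assumptions as noted above.
var Sequelize = require("sequelize");
var sequelize = new Sequelize("crawler", "root", "password", {
  dialect: "mysql"
});

// Make every URL eligible to be crawled again right now; run this
// once a day from cron or a timer.
function requeueAll() {
  return sequelize.query("UPDATE url SET nextRetryDate = NOW()");
}

setInterval(requeueAll, 24 * 60 * 60 * 1000);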

Expose response object

First of all, thanks a lot for this nifty crawler. I am really enjoying your API design!

Nevertheless, would it be possible to expose the entire response object in the handler API? Currently supercrawler only exposes contentType, the response body and the URL.

This would allow us to extract a lot more information:

  • crawl response headers
  • handle HTTP status codes/messages
  • perform custom redirects

I can create a PR if such a change would be acceptable.
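
One possible shape for this (hypothetical; not the current handler signature) would be a context object that also carries the raw response:

// Hypothetical handler API: the context exposes the raw response in
// addition to body, url and contentType.
crawler.addHandler("text/html", function (context) {
  console.log(context.response.statusCode);
  console.log(context.response.headers["location"]);
  // custom redirect handling could inspect context.response here
});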
