
supercrawler's People

Contributors

brendonboshell · cbess · dependabot[bot] · hjr3 · mrrefactoring · simoncpu · taina0407


supercrawler's Issues

Set Referer

Hi, great tool! Is there any way to set a referrer, i.e. the URL of the page that led the crawler to the current page?

URL: https://example.com
Destination: https://example.com/page1

Also, is it possible to make the crawler scan all links but store only the external links in the database?

Thank you very much
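
For the static-header half of this, a minimal sketch, assuming the Crawler accepts a request options object that it forwards to the underlying HTTP library (an assumption; check the README for your version). A per-page Referer, i.e. the URL of the linking page, would need support inside the crawler itself.

var supercrawler = require("supercrawler");

// Sketch only: assumes `request` options (including headers) are
// forwarded to the underlying HTTP client on every fetch.
var crawler = new supercrawler.Crawler({
  interval: 1000,
  request: {
    headers: {
      Referer: "https://example.com" // static value for all requests
    }
  }
});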

Confusion over Sitemap parser

I am a little confused over how the sitemap parser works.

var supercrawler = require("supercrawler");

var sp = supercrawler.handlers.sitemapsParser({
  urlFilter: function (url) {
    // keep only URLs that do not contain "de"
    return url.indexOf("de") === -1;
  }
});

This looks like it filters the links found inside the sitemap itself (rather than filtering a list of sitemaps). Am I right?
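
For reference, here is how such a parser is typically wired up, assuming the addHandler API and the urlFilter option of sitemapsParser behave as in the snippet above (a sketch, not a confirmed answer):

var supercrawler = require("supercrawler");

var crawler = new supercrawler.Crawler({ interval: 1000 });

// The handler runs on sitemap documents; urlFilter is applied to each
// URL discovered inside the sitemap before it is queued.
crawler.addHandler("application/xml", supercrawler.handlers.sitemapsParser({
  urlFilter: function (url) {
    return url.indexOf("de") === -1;
  }
}));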

.....

Hi Brendon!

  1. Thanks for making this!

  2. What is the best way to pause a crawl that is in progress?
    ...and then (when the user decides) to continue from exactly where it left off?

Cheers!!
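
A sketch of one way this can work today, assuming crawler.start()/crawler.stop() and a persistent DbUrlList as described in the README: stopping does not clear the queue, so starting again continues from where it left off.

var supercrawler = require("supercrawler");

var crawler = new supercrawler.Crawler({
  // Persistent queue: crawl state survives stop/start (and restarts).
  urlList: new supercrawler.DbUrlList({
    db: {
      database: "crawler",
      username: "root",
      password: "password",
      sequelizeOpts: { dialect: "sqlite", storage: "./crawl.db" }
    }
  }),
  interval: 1000
});

crawler.start();

// Pause: stop fetching, but leave the queue intact.
function pause() { crawler.stop(); }

// Resume: carry on from wherever the queue left off.
function resume() { crawler.start(); }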

Controlling how deep to crawl

I might have missed it, but in the crawler options I didn't see a way to control how deep the crawler goes. Is that possible via a parameter, or would I have to implement that logic in the htmlLinkParser handler via urlFilter?
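
There is no documented depth parameter, so one hedged workaround is to approximate depth from the URL path inside htmlLinkParser's urlFilter (a sketch: path depth is only a proxy, since true link depth would require tracking the referring page):

var supercrawler = require("supercrawler");
var url = require("url");

var MAX_PATH_DEPTH = 3; // illustrative limit

var crawler = new supercrawler.Crawler({ interval: 1000 });

crawler.addHandler("text/html", supercrawler.handlers.htmlLinkParser({
  hostnames: ["example.com"],
  urlFilter: function (link) {
    // Approximate depth by counting non-empty path segments,
    // e.g. https://example.com/a/b/c has depth 3.
    var path = url.parse(link).pathname || "/";
    return path.split("/").filter(Boolean).length <= MAX_PATH_DEPTH;
  }
}));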

Add event for 'url queued'

I'd like to see when one of supercrawler's handlers queues up another URL to crawl. It would be nice to add an event for this.
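
Something like the following is what I have in mind (the event name and payload are hypothetical; supercrawler does not currently emit this):

// Hypothetical API: fired whenever a URL discovered by a handler
// is inserted into the url list.
crawler.on("urlqueued", function (queuedUrl, foundOnUrl) {
  console.log("queued %s (found on %s)", queuedUrl, foundOnUrl);
});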

Script doesn't exit when using RedisUrlList

Problem: Script doesn't exit when using RedisUrlList.

Steps to replicate:
Run the following code:

'use strict';

const supercrawler = require('supercrawler');
const crawler = new supercrawler.Crawler({
    urlList: new supercrawler.RedisUrlList({
        redis: {
            host: 'redis-server.example.org'
        }
    })
});

console.log('Script should exit after this.');

Expected behavior:
Script should stop after running.

Actual behavior:
Script runs indefinitely.

Workaround:
Call process.exit() to terminate the script.

BTW, I'm using AWS ElastiCache for Redis, just in case this detail is needed. :)
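
For a cleaner exit than a bare process.exit(), a sketch that stops the crawler when the queue is finished, assuming a "urllistcomplete"-style event exists in your version (an assumption; check the README). The open Redis connection is what keeps the event loop alive, so the process still has to exit explicitly:

var supercrawler = require('supercrawler');

var crawler = new supercrawler.Crawler({
  urlList: new supercrawler.RedisUrlList({
    redis: { host: 'redis-server.example.org' }
  })
});

crawler.start();

// Assumed event: fires when the URL list is permanently exhausted.
crawler.on('urllistcomplete', function () {
  crawler.stop();
  process.exit(0); // the open Redis client otherwise keeps Node alive
});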

Adding TypeScript definitions

I'm currently using supercrawler from TypeScript, and I have described the base types in my repository. Could you add them to the main package?

I think they need a little polishing before being added. I'd appreciate your help.

URLs get statusCode NULL although there were no errors crawling

Hello, thanks for the app. I was using it to crawl the links of my website, but I found that it stores statusCode NULL for crawled links in the database even though there are no errors at all.

Why does it act like that? I would consider a request successful, and expect its success statusCode to be logged, if there were no errors. Also, strangely, not all successful requests end up with a statusCode of NULL.

I also checked the access log while the crawler was working and found 200 responses for URLs that are logged with a statusCode of NULL.

[screenshot attached]

Using a proxies list

Hi, some websites cannot be crawled because they block access. How can I plug a proxy into the crawler?

Thank you
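
A minimal sketch, assuming the Crawler forwards a request options object (including a proxy setting) to the underlying HTTP library (an assumption; check the README for your version). Rotating through a list would additionally need per-request options:

var supercrawler = require("supercrawler");

// Sketch only: assumes `request` options are passed through to the
// HTTP client; the proxy URL below is illustrative.
var crawler = new supercrawler.Crawler({
  interval: 1000,
  request: {
    proxy: "http://user:pass@proxy.example.org:8080"
  }
});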

SQL Injection Vulnerability

Sequelize, at the version currently specified, has an SQL injection vulnerability. It should be updated ASAP to a version >= 5.8.11.

Crawling binary files

supercrawler picks up ALL links on a page. If there are links to movie files, images, or other large files, it will add those URLs to the queue. The URLs then get passed to request, which tries to download them.
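
A hedged workaround until the crawler can skip them natively: reject known binary extensions in htmlLinkParser's urlFilter (a sketch; the extension list is illustrative, and a HEAD-request check on Content-Type would be more robust):

var supercrawler = require("supercrawler");
var url = require("url");

// Illustrative denylist of extensions we never want to fetch.
var BINARY_RE = /\.(jpe?g|png|gif|mp4|avi|mov|zip|gz|pdf|exe)$/i;

var crawler = new supercrawler.Crawler({ interval: 1000 });

crawler.addHandler("text/html", supercrawler.handlers.htmlLinkParser({
  hostnames: ["example.com"],
  urlFilter: function (link) {
    // Drop links whose path ends in a known binary extension.
    return !BINARY_RE.test(url.parse(link).pathname || "");
  }
}));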

How to avoid being seen as a robot?

I am trying to crawl a website, but a captcha appears after a certain number of pages have been crawled, so I open my browser and fill in the captcha manually.
Is there a way to avoid doing that, or at least to increase the number of pages I can crawl before getting this message? A cooldown? How?
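
Slowing down usually helps more than anything; a sketch using the documented interval, concurrentRequestsLimit and userAgent options (the values are illustrative, and note that deliberately evading bot detection may violate a site's terms of service):

var supercrawler = require("supercrawler");

var crawler = new supercrawler.Crawler({
  // Poll the queue every 5 s and fetch one page at a time, which
  // keeps the request rate low enough for many rate limiters.
  interval: 5000,
  concurrentRequestsLimit: 1,
  // Identify the bot honestly rather than impersonating a browser.
  userAgent: "Mozilla/5.0 (compatible; mybot/1.0; +https://example.org/bot)"
});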

How to periodically crawl again

I see in the database there is a field nextRetryDate, which I assume is used for failed requests.

How would one make it so that a domain/URL is "re-crawled", let's say, every day?

At the moment, once a URL is finished, it won't be touched again.
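
One hedged workaround at the database level, assuming DbUrlList keeps its queue in a table with a nextRetryDate column as described above (table and column names may differ by version): periodically reset nextRetryDate so finished URLs become eligible again.

// Sketch only: schema assumptions as noted above.
var Sequelize = require("sequelize");
var sequelize = new Sequelize("crawler", "root", "password", {
  dialect: "mysql"
});

// Make every URL eligible to be crawled again right now; run this
// once a day from cron or a timer.
function requeueAll() {
  return sequelize.query("UPDATE url SET nextRetryDate = NOW()");
}

setInterval(requeueAll, 24 * 60 * 60 * 1000);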

Expose response object

First of all, thanks a lot for this nifty crawler. I am really enjoying your API design!

Nevertheless, would it be possible to expose the entire response object in the handler API? Currently supercrawler only exposes contentType, the response body and the URL.

This would allow us to extract a lot more information:

  • crawl response headers
  • handle HTTP status codes/messages
  • perform custom redirects

I can create a PR if such a change would be acceptable.
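
One possible shape for this (hypothetical; not the current handler signature) would be a context object that also carries the raw response:

// Hypothetical handler API: the context exposes the raw response in
// addition to body, url and contentType.
crawler.addHandler("text/html", function (context) {
  console.log(context.response.statusCode);
  console.log(context.response.headers["location"]);
  // custom redirect handling could inspect context.response here
});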
