WebCrawler

A simple web crawler CLI application.

The aim of the application is to enumerate all the links on each page of a given domain. The result of the crawl is written to a JSON file in the results directory, named after the crawled domain, e.g. https://wiprodigital.com -> /results/wiprodigital.com.json (a sample has been included).
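The domain-to-file mapping described above could be sketched as follows. This is an illustration only: resultPathFor and its resultsDir parameter are hypothetical names, not taken from the application's source.

```javascript
const path = require('path');
const { URL } = require('url');

// Hypothetical helper: derive the results file path from the start URL.
// Only the hostname is used, so paths and query strings are ignored.
function resultPathFor(startUrl, resultsDir = 'results') {
  const { hostname } = new URL(startUrl);
  // path.posix keeps forward slashes regardless of the host OS.
  return path.posix.join(resultsDir, `${hostname}.json`);
}

console.log(resultPathFor('https://wiprodigital.com'));
```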

There are some caveats:

The application will automatically exclude JS / CSS URLs
The application will not crawl external URLs
The application will not crawl sub-domain URLs e.g. test.wiprodigital.com
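The three caveats above amount to a filter applied to every discovered link. A minimal sketch of such a filter, assuming an exact hostname match and a simple extension check (shouldCrawl and EXCLUDED_EXTENSIONS are hypothetical names, and the real application may decide differently):

```javascript
const { URL } = require('url');

// Assumption: only these extensions count as JS / CSS assets.
const EXCLUDED_EXTENSIONS = ['.js', '.css'];

// Hypothetical filter: returns true if a discovered link should be crawled.
function shouldCrawl(link, startUrl) {
  let target;
  try {
    target = new URL(link, startUrl); // resolves relative links against the start URL
  } catch (err) {
    return false; // unparseable link
  }
  const start = new URL(startUrl);

  // Skip non-web schemes such as mailto: or tel:.
  if (target.protocol !== 'http:' && target.protocol !== 'https:') return false;

  // An exact hostname match rejects both external and sub-domain URLs.
  if (target.hostname !== start.hostname) return false;

  // Check the path only, so query strings such as ?v=2 do not hide the extension.
  const pathname = target.pathname.toLowerCase();
  return !EXCLUDED_EXTENSIONS.some((ext) => pathname.endsWith(ext));
}

console.log(shouldCrawl('/about', 'https://wiprodigital.com'));
console.log(shouldCrawl('https://test.wiprodigital.com/', 'https://wiprodigital.com'));
```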

Installing

Node v7.6+ required

npm i

Running

npm start <domain>

Note - <domain> may also be set via the START_URL environment variable. If both are supplied, the CLI argument takes precedence.
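The precedence rule can be sketched as below. resolveStartUrl is a hypothetical helper, and the assumption that the domain arrives as the first argument after the script name may not match how the application reads its input.

```javascript
// Hypothetical helper: pick the start URL, preferring the CLI argument
// over the START_URL environment variable.
function resolveStartUrl(argv, env) {
  const cliValue = argv[2]; // argv[0] is the node binary, argv[1] is the script
  const value = cliValue || env.START_URL;
  if (!value) {
    throw new Error('No domain supplied: pass <domain> or set START_URL');
  }
  return value;
}

// Typical call site (assumed): resolveStartUrl(process.argv, process.env)
console.log(
  resolveStartUrl(['node', 'index.js', 'https://a.example'], { START_URL: 'https://b.example' })
);
```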

Testing

npm test

Notes

Applied SRP for scraping (HttpPage) and traversing (WebCrawler)
Made use of Map to dedupe links
Used recursion to traverse pages
Took the decision not to create an "exporter" or "serializer" class, given the native support in Node for JSON serialization & file exporting. However, if the application needed to support various export types, introducing a common interface would perhaps be a good approach.
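To illustrate how recursion and a Map can work together, here is a minimal sketch that runs against an in-memory site rather than real HTTP. SITE, scrape and crawl are hypothetical stand-ins and are not the HttpPage / WebCrawler classes themselves; keying the Map by page URL is one possible reading of the dedupe note.

```javascript
// In-memory stand-in for a website: page -> links found on that page.
const SITE = {
  '/': ['/about', '/contact', '/about'],
  '/about': ['/', '/team'],
  '/contact': ['/'],
  '/team': ['/about'],
};

// Hypothetical scraper: resolves to the links found on a page.
function scrape(url) {
  return Promise.resolve(SITE[url] || []);
}

// Recursively visit pages. The Map is both the visited set and the result:
// a page that is already a key is never scraped a second time.
async function crawl(url, pages = new Map()) {
  if (pages.has(url)) return pages; // already crawled
  pages.set(url, []); // mark as visited before recursing, which breaks cycles

  const links = await scrape(url);
  pages.set(url, Array.from(new Set(links))); // drop duplicate links on the page

  for (const link of pages.get(url)) {
    await crawl(link, pages);
  }
  return pages;
}

crawl('/').then((pages) => {
  pages.forEach((links, page) => console.log(page, '->', links.join(', ')));
});
```

Marking a page as visited before awaiting its links is what stops two pages that link to each other from recursing forever.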

TODO

Improve performance and speed of crawl e.g. run scrapes in parallel, use pm2 to scale out (although we need to be wary of race conditions, would need a mutex of some kind)
Include additional processing options e.g. max page depth, rate-limiting (protect against 429 errors)
Decouple HTML parsing from HttpPage class (maybe down the line we want to move away from cheerio)
Move results serialization / file exporting to separate classes
Avoid crawling CMS-related URLs (/xmlrpc.php, /wp-json etc.)
Better handling of erroneous but valid URLs e.g. http://domain.com//a/b/c; the crawler currently treats //a as the domain itself
Better hashtag URL processing (although the page is the same, they may pull dynamic content)
Better file name validation
Include stats e.g. total links found, pages crawled, crawl times etc.
Include more unit tests (happy-day, edge-case, error scenarios)
Include integration tests (validate against a real URL)
Introduce Babel to leverage ECMAScript features newer than those Node v7.6 supports natively (e.g. Object.fromEntries, which is ES2019)
Improve parameter validation (or better yet, use TypeScript)
Improve instrumentation, utilise remote services like Loggly, Prometheus or equivalent
Perf tests against readily available libraries such as crawler, to make sure the wheel is being reinvented for good reason
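As one example of how the double-slash item above might be tackled, the sketch below collapses repeated slashes in the path of an absolute URL before it is queued. collapseSlashes is a hypothetical helper; it does not cover relative links that begin with //, which the URL parser treats as protocol-relative.

```javascript
const { URL } = require('url');

// Hypothetical normaliser: collapse runs of slashes in the path only,
// leaving the scheme, host and query string untouched.
function collapseSlashes(link) {
  const url = new URL(link);
  url.pathname = url.pathname.replace(/\/{2,}/g, '/');
  return url.href;
}

console.log(collapseSlashes('http://domain.com//a/b/c'));
```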
