websight

A simple crawler that fetches all pages in a given website and prints the links between them.

📣 Note that this project was purpose-built for a coding challenge (see problem statement) and is not meant for production use (unless you aren't web scale yet).

๐Ÿ› ๏ธ Setup

Before you run this app, make sure you have Node.js installed. yarn is recommended, but can be used interchangeably with npm. If you'd prefer running everything inside a Docker container, see the Docker setup section.

git clone https://github.com/paambaati/websight
cd websight
yarn install && yarn build

๐Ÿ‘ฉ๐Ÿปโ€๐Ÿ’ป Usage

yarn start <website>
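
For example, to crawl a site and print its link tree (any reachable URL works as the argument):

yarn start https://example.com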

🧪 Tests & Coverage

yarn run coverage

๐Ÿณ Docker Setup

docker build -t websight .
docker run -ti websight <website>

📦 Executable Binary

yarn bundle && yarn binary

This produces standalone executable binaries for both Linux and macOS.

🧩 Design

                                            +---------------------+                        
                                            |   Link Extractor    |                        
                                            | +-----------------+ |                        
                                            | |                 | |                        
                                            | |   URL Resolver  | |                        
                                            | |                 | |                        
                                            | +-----------------+ |                        
                    +-----------------+     | +-----------------+ |     +-----------------+
                    |                 |     | |                 | |     |                 |
                    |     Crawler     +---->+ |     Fetcher     | +---->+     Sitemap     |
                    |                 |     | |                 | |     |                 |
                    +-----------------+     | +-----------------+ |     +-----------------+
                                            | +-----------------+ |                        
                                            | |                 | |                        
                                            | |     Parser      | |                        
                                            | |                 | |                        
                                            | +-----------------+ |                        
                                            +---------------------+                        

The Crawler class recursively fetches all pages (via LinkExtractor) and the URLs found in them in a fast, non-deterministic order, and saves them in Sitemap. When crawling is complete[1], the sitemap is printed as an ASCII tree.

The LinkExtractor class is a thin orchestrating wrapper around three core components:

  1. URLResolver includes logic for resolving relative URLs and normalizing them, along with utility methods for filtering out external URLs (see the sketch after this list).
  2. Fetcher takes a URL, fetches it, and returns the response as a Stream. Streams can be read in small buffered chunks, which avoids holding very large HTML documents in memory.
  3. Parser parses the HTML stream (returned by Fetcher) in chunks, emitting a link event for each page URL and an asset event for each static asset found in the HTML.
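
For illustration, URLResolver-style resolve/normalize/filter logic might look like the following TypeScript sketch built on Node's WHATWG URL API. The function name and the exact normalization rules here are assumptions, not the project's actual code:

function resolveAndFilter(href: string, pageUrl: string): string | undefined {
  let resolved: URL;
  try {
    // Resolve relative hrefs (e.g. "/about" or "../a.html") against the page URL.
    resolved = new URL(href, pageUrl);
  } catch {
    return undefined; // skip malformed hrefs
  }
  resolved.hash = ''; // normalize: drop fragments so /page#a and /page#b dedupe
  // Filter out external URLs; we only crawl the starting host.
  if (resolved.host !== new URL(pageUrl).host) return undefined;
  return resolved.toString();
}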

1 Crawler.crawl() is an async function that never resolves, because it is technically impossible to detect when we've finished crawling. In most runtimes, we'd have to implement some kind of idle polling to detect completion; in Node.js, however, the main process runs to completion as soon as the event loop has no more tasks to execute. This is why we finally print the sitemap in the process's beforeExit event. ↩
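
A minimal sketch of this trick (the Crawler and Sitemap constructor/method names below are illustrative assumptions, not the project's actual API):

import { Crawler } from './crawler';
import { Sitemap } from './sitemap';

const sitemap = new Sitemap();
const crawler = new Crawler(process.argv[2], sitemap);

// Kick off the crawl; this promise intentionally never resolves.
void crawler.crawl();

// Node.js emits 'beforeExit' once the event loop drains, i.e. when no fetches
// or parses remain in flight. That is the "crawling is complete" signal.
process.on('beforeExit', () => console.log(sitemap.toTree()));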

๐ŸŽ Optimizations

  1. Streams all the way down.

    The key workloads in this system are HTTP fetches (I/O-bound) and HTML parses (CPU-bound), and either can be time-consuming and/or memory-hungry. To better parallelize the crawls and use as little memory as possible, the got library's streaming API and the very fast htmlparser2 are used (see the sketch after this list).

  2. Keep-Alive connections.

    The Fetcher class uses a global keepAlive agent to reuse sockets, since we're only crawling a single domain. This avoids re-establishing a TCP connection for each request.
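
A minimal sketch combining both optimizations, not the project's exact code (the agent option shape assumes got v11 or later): a shared keep-alive agent, and a streamed fetch fed chunk-by-chunk into htmlparser2 so the full HTML never sits in memory.

import got from 'got';
import https from 'https';
import { Parser } from 'htmlparser2';

// One keep-alive agent shared by all requests, so TCP connections are reused.
const keepAliveAgent = new https.Agent({ keepAlive: true });

function extractLinks(url: string, onLink: (href: string) => void): void {
  // htmlparser2 is a streaming parser: feed it chunks as they arrive.
  const parser = new Parser({
    onopentag(name, attributes) {
      if (name === 'a' && attributes.href) onLink(attributes.href);
    },
  });
  const stream = got.stream(url, { agent: { https: keepAliveAgent } });
  stream.on('data', (chunk: Buffer) => parser.write(chunk.toString()));
  stream.on('end', () => parser.end());
}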

โšก๏ธ Limitations

When ramping up for scale, this design exposes a few limitations:

  1. No rate-limiting.

    Most modern and large websites have some sort of throttling set up to block bots. A production-grade crawler should implement a politeness policy to make sure it doesn't inadvertently bring down a website, and doesn't run into permanent bans or 429 error responses (a minimal sketch follows this list).

  2. In-memory state management.

    Sitemap().sitemap is an unbounded Map that can grow quickly and cause the runtime to run out of memory and crash when crawling very large websites. In a production-grade crawler, there should be an external scheduler that holds the URLs to crawl next.
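
To illustrate the first point, a politeness policy could be as simple as spacing out requests to the target host. This is a hypothetical sketch, not part of this project; politeFetch, the delay value, and the fetch parameter are illustrative:

// Minimum gap between two requests to the target host (value is arbitrary here).
const POLITENESS_DELAY_MS = 500;

let lastSlot = 0;

async function politeFetch<T>(url: string, fetch: (url: string) => Promise<T>): Promise<T> {
  // Reserve the next free time slot synchronously, so concurrent callers each
  // get a distinct slot, then sleep until that slot arrives.
  const slot = Math.max(Date.now(), lastSlot + POLITENESS_DELAY_MS);
  lastSlot = slot;
  const wait = slot - Date.now();
  if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
  return fetch(url);
}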
