
GoCrawler

A distributed web crawler implemented using Go, Postgres, RabbitMQ and Docker

Introduction

A web crawler traverses a given webpage and finds the links present in it. It then repeats the same process for each obtained link, recursively indexing a series of pages and thus crawling over the sites.
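The recursive crawl loop described above can be sketched in Go as follows. This is a minimal illustration, not the project's actual code: the function names are made up, link extraction uses a regexp for brevity (a real crawler would use an HTML parser), and the page fetcher is injected so the sketch runs without network access.

```go
package main

import (
	"fmt"
	"regexp"
)

// hrefRe naively matches href attribute values; a simplification for
// this sketch — a production crawler should use a proper HTML parser.
var hrefRe = regexp.MustCompile(`href="([^"]+)"`)

// extractLinks returns the link targets found in a page's HTML.
func extractLinks(html string) []string {
	var links []string
	for _, m := range hrefRe.FindAllStringSubmatch(html, -1) {
		links = append(links, m[1])
	}
	return links
}

// crawl visits a page, extracts its links, and recurses into each
// unvisited one. The visited set prevents revisiting the same URL.
func crawl(url string, fetch func(string) string, visited map[string]bool) {
	if visited[url] {
		return
	}
	visited[url] = true
	for _, link := range extractLinks(fetch(url)) {
		crawl(link, fetch, visited)
	}
}

func main() {
	// A tiny in-memory "site": / links to /about and /contact.
	pages := map[string]string{
		"/":      `<a href="/about"></a><a href="/contact"></a>`,
		"/about": `<a href="/"></a>`,
	}
	visited := map[string]bool{}
	crawl("/", func(u string) string { return pages[u] }, visited)
	fmt.Println(len(visited)) // visits /, /about, and /contact: 3
}
```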

Here, using a supervisor-worker model, we utilize a set of distributed worker nodes to process each page and a supervisor to communicate with the client and aggregate the results. We utilize RabbitMQ to send messages to the worker nodes and Postgres to persist the results.
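The supervisor-worker pattern can be sketched in plain Go, with channels standing in for the RabbitMQ queue; in the real cluster each worker runs as a separate node and the results are persisted to Postgres rather than collected in memory. The function name `runWorkers` is illustrative, not from the repository.

```go
package main

import (
	"fmt"
	"sync"
)

// runWorkers fans URLs out to a pool of workers and aggregates their
// results, mimicking the supervisor's role. Channels stand in for the
// RabbitMQ queue used by the actual cluster.
func runWorkers(urls []string, workers int, process func(string) string) []string {
	jobs := make(chan string)
	results := make(chan string)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- process(u) // worker processes one page
			}
		}()
	}

	go func() {
		for _, u := range urls {
			jobs <- u // supervisor publishes work to the queue
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	// Supervisor aggregates the workers' results.
	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	out := runWorkers([]string{"/a", "/b", "/c"}, 3, func(u string) string {
		return "processed " + u
	})
	fmt.Println(len(out)) // 3
}
```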

Crawler

Steps

To provision the cluster:

$ make provision

This creates a GoCrawler cluster of 3 worker nodes and 1 supervisor, established in its own Docker network.

To view the status of the cluster:

$ make info

Now we can send requests to crawl any page and to view its sitemap.

$ curl -X POST -d "{\"URL\": \"<URL>\"}" http://localhost:8050/spider/view
$ curl -X POST -d "{\"URL\": \"<URL>\"}" http://localhost:8050/spider/crawl
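The same requests can be built programmatically. Below is a minimal Go sketch that constructs (but does not send) the POST request, assuming only the endpoint and `{"URL": ...}` payload shape shown in the curl examples; the helper name `buildCrawlRequest` is illustrative.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// buildCrawlRequest builds the POST request the supervisor expects,
// matching the curl examples above. The request is only constructed
// here, not sent, so the sketch runs without a live cluster.
func buildCrawlRequest(endpoint, url string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{"URL": url})
	if err != nil {
		return nil, err
	}
	return http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
}

func main() {
	req, err := buildCrawlRequest("http://localhost:8050/spider/crawl", "http://www.google.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL) // POST http://localhost:8050/spider/crawl
}
```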

In the logs of each worker's Docker container, we can see the worker nodes parsing and processing each obtained URL.

To tear down the cluster and remove the built docker images:

$ make clean

Example

After building the GoCrawler nodes, we can crawl/view URLs via the GoSupervisor node:

$ curl -X POST -d "{\"URL\": \"http://www.google.com\"}" http://localhost:8050/spider/view
$ curl -X POST -d "{\"URL\": \"http://www.google.com\"}" http://localhost:8050/spider/crawl

The supervisor reads in the list of visited URLs, aggregates them, and responds with a JSON tree-like sitemap.

$ curl -X POST -d "{\"URL\": \"http://www.google.com\"}" http://localhost:8050/spider/view
{
  "url": "http://www.google.com//",
  "links": [
    {
      "url": "http://www.google.com/about",
      "links": null
    },
    {
      "url": "http://www.monzo.com/contact",
      "links": null
    },
    ...
  ]
}

We can then instantiate new spiders to crawl other pages:

$ curl -X POST -d "{\"URL\": \"http://www.google.com\"}" http://localhost:8050/spider/view
> 200 OK

