Web Crawler using Go

This is a simple implementation of a web crawler in Go. It scans for all the URLs in the given domain and returns a JSON response.

Features:

  • Scrapes all the URLs on the given domain
  • Asynchronous scraping
  • Uses the fast Gin framework to run an HTTP REST server
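The core idea behind the first feature — collect every same-domain link from a page — can be sketched with the standard library alone. This is a simplification for illustration (the helper name is ours, and a regexp stands in for a real HTML parser; the actual project uses a scraping library):

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// extractSameDomain returns absolute URLs found in html that share the
// host of base. Relative links are resolved against base; links to
// other domains are dropped (matching the single-domain limitation).
func extractSameDomain(baseRaw, html string) []string {
	base, err := url.Parse(baseRaw)
	if err != nil {
		return nil
	}
	re := regexp.MustCompile(`href="([^"]+)"`)
	var out []string
	for _, m := range re.FindAllStringSubmatch(html, -1) {
		u, err := base.Parse(m[1]) // resolves relative links like /about
		if err != nil || u.Host != base.Host {
			continue
		}
		out = append(out, u.String())
	}
	return out
}

func main() {
	page := `<a href="/about">About</a> <a href="https://other.com/x">X</a>`
	fmt.Println(extractSameDomain("https://example.com", page))
	// prints [https://example.com/about]
}
```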

Limitations:

  • No support for multi-domain scraping
  • No support for web pages behind authentication/forms
  • No support for dynamic/Ajax-based web pages
  • The scraper does not support invisible scraping
  • No caching support

Requirements

  • Go version 1.12 or above

Installation and Running

To build the source, Go must be installed with GOROOT and GOPATH set correctly; see the official Go documentation to set up your environment. Once the setup is complete, clone the repository or put the source directory into $GOPATH/src/.
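For reference, a typical pre-modules environment setup looks like this (the GOROOT path and repository location are illustrative assumptions; adjust them to your install):

```shell
# Typical Go environment setup (paths are illustrative)
export GOROOT=/usr/local/go          # where Go is installed
export GOPATH="$HOME/go"             # workspace; sources go under $GOPATH/src
export PATH="$PATH:$GOROOT/bin:$GOPATH/bin"

# Pre-modules layout: the project must live under $GOPATH/src
mkdir -p "$GOPATH/src"
# clone the repository here, e.g.:
# git clone <repo-url> "$GOPATH/src/web-crawler"
```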

Now run go get ./... inside the source directory (web-crawler). This will download all the dependencies into $GOPATH/pkg/. In the rare case that this does not work, install the dependencies individually:
go get -u github.com/gin-gonic/gin
go get -u github.com/gocolly/colly/...
go get github.com/gin-contrib/gzip

After the dependencies are installed, run go build from inside the source directory. This creates an executable for the host OS. To run the crawler service, use:
LINUX/MAC: $ ./web-crawler
WINDOWS: > web-crawler.exe

To build for a different target machine, use one of the following commands:
WINDOWS(64-bit): env GOOS=windows GOARCH=amd64 go build
LINUX(64-bit): env GOOS=linux GOARCH=amd64 go build
MAC(64-bit): env GOOS=darwin GOARCH=amd64 go build

By default, the Gin mode is set to DebugMode, so all the registered endpoints are listed when you run the executable. This can be changed via the Environment variable in config.go.

To run the tests, use go test.

Once the service is running, it exposes a GET /crawl endpoint. Create a request as follows:
REQUEST: GET localhost:8888/crawl
HEADER: "Scrape": "https://wiprodigital.com"
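The same request can be built from a Go client. Only the port, path, and the Scrape header come from this README; the helper name below is ours, not part of the project:

```go
package main

import (
	"fmt"
	"net/http"
)

// newCrawlRequest builds the GET /crawl request described above;
// the Scrape header carries the domain to crawl.
func newCrawlRequest(target string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, "http://localhost:8888/crawl", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Scrape", target)
	return req, nil
}

func main() {
	req, err := newCrawlRequest("https://wiprodigital.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String(), req.Header.Get("Scrape"))
	// Send with http.DefaultClient.Do(req) once the service is running.
}
```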

Performance

It takes about 8.6 seconds to crawl 226 URLs of https://wiprodigital.com and build the JSON response on an 8-core Windows 10 machine with 8 GB of memory, measured with the Postman client. There is no guarantee that the first run will return all 226 URLs; tests show it takes 2-4 initial runs to produce a consistent result, depending on parameters such as rate limits on the domain, name-server (NS) behaviour, etc.

Milestone

The limitations mentioned will be removed in the next release(s).
