
Scrape

Introduction

Scrape is a minimalistic, depth-controlled web scraping project. It can be used as a command-line tool or integrated into your own project. Scrape also supports generating a sitemap as output.

Scrape Response

Once scraping of the given URL is done, the API returns the following structure.

package scrape

import (
	"net/url"
	"regexp"
)

// Response holds the scraped response.
type Response struct {
	BaseURL      *url.URL            // starting url at maxDepth 0
	UniqueURLs   map[string]int      // map of unique urls crawled and how many times each url was seen
	URLsPerDepth map[int][]*url.URL  // urls found at each depth
	SkippedURLs  map[string][]string // urls extracted from source urls that failed DomainRegex (if given) or are invalid
	ErrorURLs    map[string]error    // reason each url could not be crawled
	DomainRegex  *regexp.Regexp      // restricts crawling to urls matching this domain
	MaxDepth     int                 // max depth of the crawl; -1 means no limit
	Interrupted  bool                // true if the scraping was interrupted
}

Command line:

Installation:

go get github.com/vedhavyas/scrape/cmd/scrape/

Available command line options:

Usage of ./scrape:
 -domain-regex string(optional)
        Domain regex to limit crawls to. Defaults to base url domain
 -max-depth int(optional)
        Max depth to Crawl (default -1)
 -sitemap string(optional)
        File location to write sitemap to
 -url string(required)
        Starting URL (default "https://vedhavyas.com")

Output

Scrape supports two types of output:

  1. Printing all of the collected data in Response to stdout.
  2. Generating a sitemap XML file (if a file path is passed) from the Response.

As a Package

Scrape can be integrated into any Go project through the given APIs. As a package, you have access to the above-mentioned Response and all the data in it. Currently, the following APIs are available.

Start

func Start(ctx context.Context, url string) (resp *Response, err error)

Start begins scraping with no depth limit (-1), restricted to the base url's domain.

StartWithDepth

func StartWithDepth(ctx context.Context, url string, maxDepth int) (resp *Response, err error)

StartWithDepth begins scraping with the given max depth, restricted to the base url's domain.

StartWithDepthAndDomainRegex

func StartWithDepthAndDomainRegex(ctx context.Context, url string, maxDepth int, domainRegex string) (resp *Response, err error) 

StartWithDepthAndDomainRegex begins scraping with the given max depth and domain regex.

StartWithDomainRegex

func StartWithDomainRegex(ctx context.Context, url, domainRegex string) (resp *Response, err error)

StartWithDomainRegex begins scraping with no depth limit (-1) and the given domain regex.

Sitemap

func Sitemap(resp *Response, file string) error 

Sitemap generates a sitemap XML file from the given response.

Feedback and Contributions

  1. If you think something is missing, please feel free to raise an issue.
  2. If you would like to work on an open issue, feel free to announce yourself in the issue's comments.


