Code Monkey home page Code Monkey logo

xcrawl3r's Introduction

xcrawl3r

made with go release license maintenance open issues closed issues contribution

xcrawl3r is a command-line interface (CLI) utility to recursively crawl webpages i.e systematically browse webpages' URLs and follow links to discover linked webpages' URLs.

Resources

Features

  • Recursively crawls webpages for URLs.
  • Parses URLs from files (.js, .json, .xml, .csv, .txt & .map).
  • Parses URLs from robots.txt.
  • Parses URLs from sitemaps.
  • Renders pages (including Single Page Applications such as Angular and React).
  • Cross-Platform (Windows, Linux & macOS)

Installation

Install release binaries (Without Go Installed)

Visit the releases page and find the appropriate archive for your operating system and architecture. Download the archive from your browser or copy its URL and retrieve it with wget or curl:

  • ...with wget:

     wget https://github.com/hueristiq/xcrawl3r/releases/download/v<version>/xcrawl3r-<version>-linux-amd64.tar.gz
  • ...or, with curl:

     curl -OL https://github.com/hueristiq/xcrawl3r/releases/download/v<version>/xcrawl3r-<version>-linux-amd64.tar.gz

...then, extract the binary:

tar xf xcrawl3r-<version>-linux-amd64.tar.gz

TIP: The above steps, download and extract, can be combined into a single step with this onliner

curl -sL https://github.com/hueristiq/xcrawl3r/releases/download/v<version>/xcrawl3r-<version>-linux-amd64.tar.gz | tar -xzv

NOTE: On Windows systems, you should be able to double-click the zip archive to extract the xcrawl3r executable.

...move the xcrawl3r binary to somewhere in your PATH. For example, on GNU/Linux and OS X systems:

sudo mv xcrawl3r /usr/local/bin/

NOTE: Windows users can follow How to: Add Tool Locations to the PATH Environment Variable in order to add xcrawl3r to their PATH.

Install source (With Go Installed)

Before you install from source, you need to make sure that Go is installed on your system. You can install Go by following the official instructions for your operating system. For this, we will assume that Go is already installed.

go install ...

go install -v github.com/hueristiq/xcrawl3r/cmd/xcrawl3r@latest

go build ... the development Version

  • Clone the repository

     git clone https://github.com/hueristiq/xcrawl3r.git 
  • Build the utility

     cd xcrawl3r/cmd/xcrawl3r && \
     go build .
  • Move the xcrawl3r binary to somewhere in your PATH. For example, on GNU/Linux and OS X systems:

     sudo mv xcrawl3r /usr/local/bin/

    NOTE: Windows users can follow How to: Add Tool Locations to the PATH Environment Variable in order to add xcrawl3r to their PATH.

NOTE: While the development version is a good way to take a peek at xcrawl3r's latest features before they get released, be aware that it may have bugs. Officially released versions will generally be more stable.

Usage

To display help message for xcrawl3r use the -h flag:

xcrawl3r -h

help message:

                             _ _____      
__  _____ _ __ __ ___      _| |___ / _ __ 
\ \/ / __| '__/ _` \ \ /\ / / | |_ \| '__|
 >  < (__| | | (_| |\ V  V /| |___) | |   
/_/\_\___|_|  \__,_| \_/\_/ |_|____/|_| v0.1.0

A CLI utility to recursively crawl webpages.

USAGE:
  xcrawl3r [OPTIONS]

INPUT:
  -d, --domain string               domain to match URLs
      --include-subdomains bool     match subdomains' URLs
  -s, --seeds string                seed URLs file (use `-` to get from stdin)
  -u, --url string                  URL to crawl

CONFIGURATION:
      --depth int                   maximum depth to crawl (default 3)
                                       TIP: set it to `0` for infinite recursion
      --headless bool               If true the browser will be displayed while crawling.
  -H, --headers string[]            custom header to include in requests
                                       e.g. -H 'Referer: http://example.com/'
                                       TIP: use multiple flag to set multiple headers
      --proxy string[]              Proxy URL (e.g: http://127.0.0.1:8080)
                                       TIP: use multiple flag to set multiple proxies
      --render bool                 utilize a headless chrome instance to render pages
      --timeout int                 time to wait for request in seconds (default: 10)
      --user-agent string           User Agent to use (default: web)
                                       TIP: use `web` for a random web user-agent,
                                       `mobile` for a random mobile user-agent,
                                        or you can set your specific user-agent.

RATE LIMIT:
  -c, --concurrency int             number of concurrent fetchers to use (default 10)
      --delay int                   delay between each request in seconds
      --max-random-delay int        maximux extra randomized delay added to `--dalay` (default: 1s)
  -p, --parallelism int             number of concurrent URLs to process (default: 10)

OUTPUT:
      --debug bool                  enable debug mode (default: false)
  -m, --monochrome bool             coloring: no colored output mode
  -o, --output string               output file to write found URLs
  -v, --verbosity string            debug, info, warning, error, fatal or silent (default: debug)

Contributing

Issues and Pull Requests are welcome! Check out the contribution guidelines.

Licensing

This utility is distributed under the MIT license.

Credits

xcrawl3r's People

Contributors

enenumxela avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.