URL Collector

The URL collector is an application suite that crawls the Common Crawl corpus for URLs with the specified file extensions.

Architecture

The application suite is a distributed system designed to increase processing speed. It can be deployed to a single machine, but to increase the crawling speed, the slave application can be deployed to multiple machines.

The url-collector suite is built from three applications.

URL Collector Master

Knows which WARC files (work units) should be crawled, stores this data in a database, and distributes the work to the slave applications.

URL Collector Slave

Gets a work unit (effectively a Common Crawl WARC file) from the Master application, parses it for URLs, then uploads the results to a warehouse.

URL Collector Merger

Merges the crawl results in a warehouse into one big file, removing the duplicate entries while doing so.

Installation

To start crawling, you need to have Java and MongoDB installed on your machine.

Installing Java

First, you need to download the Java 17 Runtime Environment. It’s available here. After the download is complete, run the installer and follow the directions it provides until the installation is complete.

Once it’s done, if you open a command line (type cmd into the Start menu’s search bar), you will be able to use the java command. Try running java -version. You should get something similar to this:

java version "15" 2020-09-15
Java(TM) SE Runtime Environment (build 15+36-1562)
Java HotSpot(TM) 64-Bit Server VM (build 15+36-1562, mixed mode, sharing)

Installing MongoDB

Download MongoDB 5.0 from here. After the download is complete, run the installer and follow the directions it provides. If possible, install the MongoDB Compass tool as well, because you might need it later for administrative tasks.
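
Similarly to Java, you can verify the installation from a command line (this assumes the MongoDB binaries were added to your PATH during installation; on Windows you may need to use the full path to mongod.exe):

mongod --version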

Running the Master Application

You can download the Master Application from our release page. Please take care to choose a version that is not marked as a "pre-release"!

After the download is complete, run the application via the following command:

java -jar url-collector-master-application-{release-number}.jar ...

In place of the ... you should write the various parameters, each starting with two dashes and with an equal sign between the parameter’s name and its value. For example:

java -jar url-collector-master-application-{release-number}.jar --example.parameter=value

Parameters

Table 1. Parameters

database.host

The host location of the MongoDB database server. (Default value: localhost)

database.port

The port open for the MongoDB database server. (Default value: 27017)

database.uri

If present and not empty, it overrides the host and port parameters and lets the user inject a MongoDB connection string directly. It should be used to define credentials and other custom connection parameters. (Default value: "")

server.port

The port the master server should listen on. (Default value: 8080)
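
For example, the following invocation sets the documented parameters explicitly to their default values (shown here only to illustrate the syntax):

java -jar url-collector-master-application-{release-number}.jar --database.host=localhost --database.port=27017 --server.port=8080

Alternatively, a standard MongoDB connection string can be injected directly; the credentials below are placeholders:

java -jar url-collector-master-application-{release-number}.jar --database.uri=mongodb://user:password@localhost:27017 --server.port=8080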

Running the Slave Application

You can download the Slave Application from our release page. Please take care to choose a version that is not marked as a "pre-release"!

After the download is complete, run the application via the following command:

java -jar url-collector-slave-application-{release-number}.jar ...

You can start the slave application on more than one machine if necessary, to improve the crawling speed.

In place of the ... you should write the various parameters, each starting with two dashes and with an equal sign between the parameter’s name and its value. For example:

java -jar url-collector-slave-application-{release-number}.jar --example.parameter=value

Parameters

Table 2. Parameters

execution.parallelism-target

How many work units should be processed at the same time. (Default value: the number of processor cores multiplied by two)

warehouse.type

The type of the warehouse. Can be either 'local' or 'aws'.

warehouse.local.target-directory

The target directory where the crawled URLs should be saved. Only used if the warehouse.type is 'local'.

warehouse.aws.region

The region of the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.bucket-name

The name of the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.access-key

The access key for the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.secret-key

The secret key for the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

validation.types

The list of file extensions that URLs should be saved for.

master.host

The host location of the Master Application. (Default value: localhost)

master.port

The port of the Master Application. (Default value: 8080)
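
For example, a hypothetical invocation using a local warehouse (the target directory is a placeholder, and the comma-separated format for validation.types is an assumption):

java -jar url-collector-slave-application-{release-number}.jar --master.host=localhost --master.port=8080 --warehouse.type=local --warehouse.local.target-directory=/data/crawl-results --validation.types=pdf,doc

When uploading to S3 instead, every value below is a placeholder that should be replaced with your own region, bucket, and credentials:

java -jar url-collector-slave-application-{release-number}.jar --master.host=localhost --master.port=8080 --warehouse.type=aws --warehouse.aws.region=eu-central-1 --warehouse.aws.bucket-name=my-crawl-results --warehouse.aws.access-key=AKIA... --warehouse.aws.secret-key=...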

Starting a crawl

To start a crawl, the Master application’s crawl initialization endpoint should be called. The request should be a POST request to the /crawl endpoint on the master, with a body that contains the crawlId of the Common Crawl dataset that should be processed.

For example:

curl --location --request POST 'http://185.191.228.214:8081/crawl' \
--header 'Content-Type: application/json' \
--data-raw '{
"crawlId": "CC-MAIN-2021-31"
}'

Once it is done, the Slave application should automatically pick up the new work units in a matter of minutes.

Starting the Merger Application

You can download the Merger Application from our release page. Please take care to choose a version that is not marked as a "pre-release"!

After the download is complete, run the application via the following command:

java -jar url-collector-merger-application-{release-number}.jar ...

In place of the ... you should write the various parameters, each starting with two dashes and with an equal sign between the parameter’s name and its value. For example:

java -jar url-collector-merger-application-{release-number}.jar --example.parameter=value

Parameters

Table 3. Parameters

database.host

The host location of the MongoDB database server. (Default value: localhost)

database.port

The port open for the MongoDB database server. (Default value: 27017)

database.uri

If present and not empty, it overrides the host and port parameters and lets the user inject a MongoDB connection string directly. It should be used to define credentials and other custom connection parameters. (Default value: "")

warehouse.type

The type of the warehouse. Can be either 'local' or 'aws'.

warehouse.local.target-directory

The target directory where the crawled URLs should be saved. Only used if the warehouse.type is 'local'.

warehouse.aws.region

The region of the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.bucket-name

The name of the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.access-key

The access key for the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.secret-key

The secret key for the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

result.path

The directory where the result of the merge should be saved. The result file will be saved there with the filename 'result.ubds'.
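
For example, a hypothetical invocation that merges the results of a local warehouse (both directories are placeholders):

java -jar url-collector-merger-application-{release-number}.jar --database.host=localhost --database.port=27017 --warehouse.type=local --warehouse.local.target-directory=/data/crawl-results --result.path=/data/merged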

Peeking into the results

The individual result files are LZMA encoded. If you want to peek into them, first decompress the files (7-Zip can help you with this on Windows). The URLs sit in a JSON array inside.
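
On Linux or macOS, the xz tool can decompress the legacy LZMA format; the file name below is only a hypothetical example:

xz --decompress some-result-file.lzma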

The merged result file is NOT compressed. In it, the URLs are separated by newline characters.
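
Because the merged file is plain text with one URL per line, standard tools can be used to inspect it; the path below assumes the placeholder result directory from the merger example above (the second command counts the collected URLs):

head /data/merged/result.ubds
wc -l /data/merged/result.ubds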
