wikireverse's Introduction

wikireverse

Hadoop jobs for the WikiReverse project. The jobs parse Common Crawl data for links to Wikipedia articles and are launched using the elasticrawl CLI tool.
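
To illustrate the kind of parsing involved, here is a minimal Python sketch of pulling Wikipedia links out of a single Common Crawl WAT metadata record. It is not the project's Hadoop code. The function name `wikipedia_links` is hypothetical, and the JSON layout (`Envelope` → `Payload-Metadata` → `HTTP-Response-Metadata` → `HTML-Metadata` → `Links`) is an assumption about the WAT format that should be checked against real files.

```python
import json
from urllib.parse import urlparse


def wikipedia_links(wat_json: str) -> list[str]:
    """Return the URLs in one WAT metadata record that point at Wikipedia.

    Hypothetical helper: the nested key names below are an assumed WAT layout.
    """
    record = json.loads(wat_json)
    links = (
        record.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    found = []
    for link in links:
        url = link.get("url", "")
        try:
            host = urlparse(url).hostname or ""
        except ValueError:
            # Malformed URLs are skipped rather than failing the whole record.
            continue
        # Match wikipedia.org itself and any language subdomain.
        if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
            found.append(url)
    return found
```

Relative links and links to other hosts are ignored because only the parsed hostname is compared.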

Running using Elasticrawl

The steps below configure Elasticrawl and parse a small sample of data. Running this example requires an AWS account and costs between 40 and 80 cents.

  • Install elasticrawl using the deploy packages for OS X or Linux (see the deploy steps).
  • Run the init command and choose an S3 bucket for storing your data and logs.
$ ./elasticrawl init wikireverse-2014-52
Enter AWS Access Key ID:
Enter AWS Secret Access Key: 
…

Bucket s3://wikireverse-2014-52 created
Config dir /home/vagrant/.elasticrawl created
Config complete
  • Edit ~/.elasticrawl/jobs.yml and replace the steps section with the WikiReverse settings.
steps:
  parse:
    jar: 's3://wikireverse/jar/wikireverse-0.0.1.jar'
    class: 'org.wikireverse.commoncrawl.WikiReverse'
    input_filter: 'wat/*.warc.wat.gz'
    emr_config: #'s3://wikireverse/jar/parse-mapred-site.xml'
  combine:
    jar: 's3://wikireverse/jar/wikireverse-0.0.1.jar'
    class: 'org.wikireverse.commoncrawl.SegmentCombiner'
    input_filter: 'part-*'
    emr_config: #'s3://wikireverse/jar/combine-mapred-site.xml'
  • Run a parse job to process the first 2 files in the first 2 segments.
$ ./elasticrawl parse CC-MAIN-2014-52 --max-segments 2 --max-files 2
Segments
Segment: 1418802765002.8 Files: 176
Segment: 1418802765093.40 Files: 176

Job configuration
Crawl: CC-MAIN-2014-52 Segments: 2 Parsing: 2 files per segment

Cluster configuration
Master: 1 m1.medium  (Spot: 0.12)
Core:   2 m1.medium  (Spot: 0.12)
Task:   --
Launch job? (y/n)
y

Job: 1422436508058 Job Flow ID: j-2KMT57YJN4EJA
  • Run a combine job to combine the results from the parse job.
$ ./elasticrawl combine --input-jobs 1422436508058
No entry for terminal type "xterm";
using dumb terminal settings.
Job configuration
Combining: 2 segments

Cluster configuration
Master: 1 m1.medium  (Spot: 0.12)
Core:   2 m1.medium  (Spot: 0.12)
Task:   --
Launch job? (y/n)
y

Job: 1422438064880 Job Flow ID: j-1A6Q7LJ1G9TX
  • Finally run an output job to produce the results. This job is launched manually in the EMR section of the AWS Console. The job step takes 3 arguments:
      • Class: org.wikireverse.commoncrawl.OutputToText
      • Input Location: e.g. s3://wikireverse-2014-52/data/2-combine/1422438064880/part-*
      • Output Location: e.g. s3://wikireverse-2014-52/data/3-output/2014-52-test/
JAR Location: s3://wikireverse/jar/wikireverse-0.0.1.jar
Arguments: org.wikireverse.commoncrawl.OutputToText s3://wikireverse-2014-52/data/2-combine/1422438064880/part-* s3://wikireverse-2014-52/data/3-output/2014-52-test/
  • The final results can be downloaded from the S3 section of the AWS Console.
  • Running the destroy command deletes your S3 bucket and the local config directory. If you skip this step, you will continue to be charged for any data stored in the bucket.
$ ./elasticrawl destroy
WARNING:
Bucket s3://wikireverse-2014-52 and its data will be deleted
Config dir /Users/ross/.elasticrawl will be deleted
Delete? (y/n)
y

Bucket s3://wikireverse-2014-52 deleted
Config dir /Users/ross/.elasticrawl deleted
Config deleted
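
The Arguments line for the manual output job step follows a fixed pattern: the class name, then the combine job's `part-*` files, then an output folder. As a rough sketch, it can be assembled from the bucket name and the combine job ID. The helper name `output_step_arguments` is hypothetical, and the `data/2-combine` and `data/3-output` prefixes are taken from the example paths in this walkthrough rather than from any documented contract.

```python
def output_step_arguments(bucket: str, combine_job: str, output_name: str) -> str:
    """Build the Arguments string for the EMR output step (hypothetical helper)."""
    main_class = "org.wikireverse.commoncrawl.OutputToText"
    # Input: every part file written by the combine job.
    input_location = f"s3://{bucket}/data/2-combine/{combine_job}/part-*"
    # Output: a folder name of your choosing under 3-output.
    output_location = f"s3://{bucket}/data/3-output/{output_name}/"
    return " ".join([main_class, input_location, output_location])


print(output_step_arguments("wikireverse-2014-52", "1422438064880", "2014-52-test"))
```

With the values from the walkthrough this prints the same string as the Arguments line shown for the output job.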

Quickstart Build Instructions

  • Build the package
$ mvn clean package
  • Upload to S3
$ aws s3 cp ./target/wikireverse-0.0.1.jar s3://yourbucket/jar/new-1.0.jar
  • Update the jar paths in ~/.elasticrawl/jobs.yml to point at the uploaded JAR and continue as above.
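
Updating jobs.yml after a custom build amounts to swapping the published JAR location for your own in both the parse and combine steps. A small sketch of that substitution, operating on the file's text, is below. The function name `point_jobs_at_jar` is hypothetical, and the sketch assumes the file still contains the published JAR path from the steps section above.

```python
PUBLISHED_JAR = "s3://wikireverse/jar/wikireverse-0.0.1.jar"


def point_jobs_at_jar(jobs_yml: str, new_jar: str) -> str:
    """Return jobs.yml text with every published JAR path replaced by new_jar."""
    return jobs_yml.replace(PUBLISHED_JAR, new_jar)


# Example on a fragment of the steps section.
fragment = "parse:\n  jar: 's3://wikireverse/jar/wikireverse-0.0.1.jar'\n"
print(point_jobs_at_jar(fragment, "s3://yourbucket/jar/new-1.0.jar"))
```

A plain text replacement is used so that the comments and quoting in the file are left untouched. Editing the two jar lines by hand works just as well.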

TODO

  • Add even more detail on building the code using Maven and Eclipse.

Thanks

  • Thanks to everyone at Common Crawl for making this awesome dataset available.
  • Thanks to Wikipedia for having such interesting data.

License

This code is licensed under the MIT license.

wikireverse's People

Contributors

benmccann, e271828-, rossf7

wikireverse's Issues

Other sites

Hi Ross,

Excellent work on an excellent data set and use of it.

Is it a simple task to adapt this codebase to extract the data for sites other than Wikipedia? If so, how would I go about making it happen?

Thanks

Jason

Can't access data from S3 Bucket

I am trying to download the wikireverse dataset from the S3 bucket, but the data is not accessible. I am running a machine in the us-east region. This looks like a possible global permission issue.

root@ip:/raid/wikireverse# aws s3 ls s3://wikireverse/
An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied

root@ip:/raid/wikireverse# aws s3 cp s3://wikireverse/data/2014-23/part-00000.gz .
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden
