hadoop2solr's Introduction

This project contains the code and data for the DZone article on using Solr as a NoSQL endpoint for a Hadoop workflow.

The steps below should get you up and running, assuming you have an AWS account set up as described at http://openbixo.org/documentation/running-bixo-in-ec2/ and a git client installed.

====================================================
Building the Hadoop job jar
====================================================

% mkdir hadoop2solr-home
% cd hadoop2solr-home
% git clone git://github.com/bixolabs/hadoop2solr.git 
% cd hadoop2solr
% ant job
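
Before moving on, it can help to confirm that the build actually produced the job jar. The following is a minimal sketch, assuming `ant job` writes the jar into `build/` (as the push step in the next section suggests). The function name `find_job_jar` is hypothetical and not part of the project.

```shell
# Print the path of the job jar in the given build directory.
# Returns non-zero (with a message on stderr) if no jar is found.
# Assumption: the jar is named hadoop2solr-job-<version>.jar.
find_job_jar() {
    build_dir="${1:-build}"
    for jar in "$build_dir"/hadoop2solr-job-*.jar; do
        # If the glob matched nothing, $jar is the literal pattern
        # and the -f test fails.
        if [ -f "$jar" ]; then
            printf '%s\n' "$jar"
            return 0
        fi
    done
    echo "no job jar found in $build_dir; did 'ant job' succeed?" >&2
    return 1
}
```

Run from the hadoop2solr directory, `find_job_jar build` prints the jar path that the later push step needs.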

====================================================
Setting up the Hadoop cluster
====================================================

% cd hadoop2solr-home
% git clone git://github.com/bixo/bixo.git bixo
% cd bixo/ec2
% . setenv.sh
% hadoop-ec2 launch-cluster hadoop2solr 1 m1.large
% hadoop-ec2 push hadoop2solr ../../hadoop2solr/build/hadoop2solr-job-1.0-SNAPSHOT.jar
% hadoop-ec2 screen hadoop2solr

====================================================
Running the Hadoop job (on the hadoop2solr cluster)
====================================================

% hadoop fs -mkdir /user/root/working
% hadoop distcp s3n://bixolabs-dzone/urls /user/root/working/crawldb

Monitor the copy job's progress in your browser until it completes successfully.

% hadoop jar hadoop2solr-job-1.0-SNAPSHOT.jar com.bixolabs.tools.IndexTool -input working/crawldb -output working/solr-index

Monitor the indexing job's progress in your browser until it completes successfully.

The output directory will contain a single 'part-00000' directory, which contains a set of Lucene files for a single index.

% hadoop fs -copyToLocal /user/root/working/solr-index /mnt/solr-index
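
A quick sanity check on the copied index can catch a failed or partial copy before Solr is pointed at it. This is a minimal sketch, assuming a Lucene index directory always holds at least one file whose name starts with `segments`. The function name `looks_like_lucene_index` is hypothetical.

```shell
# Return 0 if the directory looks like a Lucene index, i.e. it exists
# and contains at least one file whose name starts with "segments".
looks_like_lucene_index() {
    index_dir="$1"
    [ -d "$index_dir" ] || return 1
    for f in "$index_dir"/segments*; do
        # An unmatched glob leaves the literal pattern, which is not a file.
        if [ -f "$f" ]; then
            return 0
        fi
    done
    return 1
}
```

For example, `looks_like_lucene_index /mnt/solr-index/part-00000 && echo "index ready"` prints the message only when the directory passes the check.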

====================================================
Setting up Solr (on the hadoop2solr master)
====================================================

% wget --no-check-certificate https://github.com/downloads/bixolabs/hadoop2solr/solr.zip
% wget --no-check-certificate https://github.com/downloads/bixolabs/hadoop2solr/solr-conf.zip
% unzip solr.zip
% unzip solr-conf.zip
% mkdir solr-data
% cd solr-data
% ln -s /mnt/solr-index/part-00000 index
% cd ../solr
% java -Dsolr.solr.home=../solr-conf -Dsolr.data.dir=../solr-data -jar start.jar > jetty.log 2>&1 &
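
Because Solr is started in the background, it may take a little while before it accepts requests. The sketch below is one way to wait for it: a generic retry helper (the name `retry_until_ok` is hypothetical) that reruns any command until it succeeds or the attempts run out.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: retry_until_ok ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 on the first success, 1 if every attempt failed.
retry_until_ok() {
    attempts="$1"
    delay="$2"
    shift 2
    n=0
    while [ "$n" -lt "$attempts" ]; do
        # Sleep between attempts, but not before the first one.
        if [ "$n" -gt 0 ]; then
            sleep "$delay"
        fi
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        n=$((n + 1))
    done
    return 1
}
```

Assuming curl is available on the master, `retry_until_ok 30 2 curl -sf http://localhost:8983/solr/admin/` would poll the admin page for roughly a minute. If it never succeeds, jetty.log is the place to look.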

====================================================
Cleaning up/testing
====================================================

Use the AWS Console to kill off the slave server.

Use the AWS Console to open up TCP port 8983 on the master server.

Open a browser window on http://<ec2 server public name>:8983/solr/admin
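
The index can also be smoke-tested from the command line. This is a minimal sketch, assuming the bundled Solr configuration exposes the standard `/solr/select` request handler. The function name `solr_select_url` is hypothetical; it only builds a match-all query URL for a given host.

```shell
# Build a match-all Solr query URL for the given host.
# Usage: solr_select_url HOST [ROWS]
# ROWS defaults to 0, which returns only the hit count and no documents.
# Assumption: Solr listens on port 8983 and serves /solr/select.
solr_select_url() {
    host="$1"
    rows="${2:-0}"
    printf 'http://%s:8983/solr/select?q=*:*&rows=%s\n' "$host" "$rows"
}
```

Passing the result to curl, for example `curl -s "$(solr_select_url <ec2 server public name>)"`, should return a response whose `numFound` value is the number of indexed documents.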

Contributors: schmed