Code and data for dzone article
This project contains the code and data for the dzone article on using Solr as a NoSQL endpoint for a Hadoop workflow. The steps below will get you up and running, assuming you have an AWS account set up as per http://openbixo.org/documentation/running-bixo-in-ec2/ and a git client installed.

====================================================
Building the Hadoop job jar
====================================================

% mkdir hadoop2solr-home
% cd hadoop2solr-home
% git clone git://github.com/bixolabs/hadoop2solr.git
% cd hadoop2solr
% ant job

====================================================
Setting up the Hadoop cluster
====================================================

% cd hadoop2solr-home
% git clone git://github.com/bixo/bixo.git bixo
% cd bixo/ec2
% . setenv.sh
% hadoop-ec2 launch-cluster hadoop2solr 1 m1.large
% hadoop-ec2 push hadoop2solr ../hadoop2solr/build/hadoop2solr-job-1.0-SNAPSHOT.jar
% hadoop-ec2 screen hadoop2solr

====================================================
Running the Hadoop job (on the hadoop2solr cluster)
====================================================

% hadoop fs -mkdir /user/root/working
% hadoop distcp s3n://bixolabs-dzone/urls /user/root/working/crawldb

Monitor the job progress using your browser until the job completes successfully.

% hadoop jar hadoop2solr-job-1.0-SNAPSHOT.jar com.bixolabs.tools.IndexTool -input working/crawldb -output working/solr-index

Monitor the job progress using your browser until the job completes successfully.

The output directory will contain a single 'part-00000' directory, which holds the Lucene files for a single index.

% hadoop fs -copyToLocal /user/root/working/solr-index /mnt/solr-index

====================================================
Setting up Solr (on the hadoop2solr master)
====================================================

% wget --no-check-certificate https://github.com/downloads/bixolabs/hadoop2solr/solr.zip
% wget --no-check-certificate https://github.com/downloads/bixolabs/hadoop2solr/solr-conf.zip
% unzip solr.zip
% unzip solr-conf.zip
% mkdir solr-data
% cd solr-data
% ln -s /mnt/solr-index/part-00000 index
% cd ../solr
% java -Dsolr.solr.home=../solr-conf -Dsolr.data.dir=../solr-data -jar start.jar > jetty.log 2>&1 &

====================================================
Cleaning up/testing
====================================================

Use the AWS Console to kill off the slave server.
Use the AWS Console to open up TCP port 8983 on the master server.
Open a browser window on http://<ec2 server public name>:8983/solr/admin
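
Besides the admin page, a quick way to check that the index actually loaded is to run a match-all query and look at the hit count. The following is a minimal sketch, not part of the original project. It assumes the Solr configuration from solr-conf.zip exposes the stock /solr/select handler on port 8983. The function names solr_count_url and solr_count, and the host name ec2-host.example.com, are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only: assumes the stock /solr/select handler is enabled (an assumption
# about the solr-conf.zip configuration, not confirmed by the README).

# Build the URL for a match-all query that returns no rows, only the hit count.
# $1 = public host name of the master server
solr_count_url() {
    printf 'http://%s:8983/solr/select?q=*:*&rows=0\n' "$1"
}

# Hypothetical helper: fetch the response with curl. The numFound attribute
# in the returned XML is the number of indexed documents.
solr_count() {
    curl -s "$(solr_count_url "$1")"
}

# Print the URL for a placeholder host; substitute the real EC2 public name.
solr_count_url "ec2-host.example.com"
```

Once port 8983 is open, calling solr_count with the master's public name should return a small XML response. A numFound value of zero would suggest the index symlink under solr-data is not pointing at the part-00000 directory.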