## Getting Started on Hadoop
## Paco Nathan <[email protected]>
##
## Silicon Valley Cloud Computing Meetup
## http://www.meetup.com/cloudcomputing/calendar/13911740/
## Mountain View, 2010-07-19

GitHub src repo:
    http://github.com/ceteri/ceteri-mapred

Presentation slides are available in this repo in Keynote format,
or online at SlideShare:
    doc/enron.key
    http://www.slideshare.net/pacoid/getting-started-on-hadoop


See the "WordCount" example at:
    bin/run_wc.sh

See the "Enron Email Dataset" demo at:
    bin/run_enron.sh

R statistics demo:
    thresh.R, thresh.tsv

Gephi graph demo:
    graph.gephi


## to run your own code on Elastic MapReduce

    1. create a bucket in S3
    2. copy the Python scripts into a "src" folder there
    3. select a subset of the email message input, e.g. the first 1000 lines:
        head -n 1000 msgs.tsv > input
    4. copy "input" to your S3 "src" folder
    5. follow examples in slide deck, based on params below
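
Step 5 mostly comes down to substituting your own bucket and folder names into the job flow parameters listed further down. A minimal sketch of that substitution follows. The helper `streaming_args` and the bucket name `my-bucket` are hypothetical and not part of this repo; the mapper and reducer command lines are passed through unchanged.

```python
def streaming_args(bucket, src, input_name, output_name, mapper, reducer, scripts):
    """Build streaming parameters in the same shape as the job flows in this README.

    bucket      -- S3 bucket name (hypothetical here; substitute your own)
    src         -- folder in the bucket that holds the input and the scripts
    input_name  -- input file or folder under src
    output_name -- output folder under src
    mapper, reducer -- command lines, copied through verbatim
    scripts     -- file names shipped to each node with -cacheFile
    """
    base = "s3n://%s/%s" % (bucket, src)
    args = [
        "-input %s/%s" % (base, input_name),
        "-output %s/%s" % (base, output_name),
        "-mapper '\"%s\"'" % mapper,
        "-reducer '\"%s\"'" % reducer,
    ]
    for name in scripts:
        # "#name" is the link name the script is given on each node
        args.append("-cacheFile %s/%s#%s" % (base, name, name))
    return args


if __name__ == "__main__":
    # Assumed layout: s3n://my-bucket/src/ holds "input" and both scripts.
    for line in streaming_args(
        "my-bucket", "src", "input", "output",
        "python map_filter.py", "python red_filter.py 0.0633",
        ["map_filter.py", "red_filter.py"],
    ):
        print(line)
```

Called with "ceteri-mapred" and "enron/src", the helper reproduces the parameter lines shown for job flow 2.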


## Hadoop job flow 1 on Elastic MapReduce

-input s3n://ceteri-mapred/enron/src/input 
-output s3n://ceteri-mapred/enron/src/output 
-mapper '"python map_parse.py http://ceteri-mapred.s3.amazonaws.com/ stopwords"'
-reducer '"python red_idf.py 2500"'
-cacheFile s3n://ceteri-mapred/enron/src/map_parse.py#map_parse.py
-cacheFile s3n://ceteri-mapred/enron/src/red_idf.py#red_idf.py
-cacheFile s3n://ceteri-mapred/enron/src/stopwords#stopwords
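
For readers new to Hadoop streaming, the -mapper and -reducer commands exchange tab-separated "key<TAB>value" lines, and Hadoop sorts the mapper output by key before the reducer sees it. The sketch below imitates that flow in one local process with a simple word count. The functions `map_lines`, `reduce_lines` and `run_local` are hypothetical illustrations, and this is not the logic of map_parse.py or red_idf.py.

```python
from itertools import groupby


def map_lines(lines):
    """Hypothetical mapper: emit "word<TAB>1" for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()


def reduce_lines(sorted_lines):
    """Hypothetical reducer: sum the values for each run of identical keys.

    Assumes the input is already sorted by key, which is what the
    shuffle/sort phase provides between the map and reduce steps.
    """
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(value) for _, value in group))


def run_local(lines):
    """Simulate map -> sort -> reduce without a cluster."""
    return list(reduce_lines(sorted(map_lines(lines))))


if __name__ == "__main__":
    for out in run_local(["the cat", "The dog"]):
        print(out)
```

Running a mapper and reducer this way on a small sample is a cheap check before paying for a cluster run.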

## Hadoop job flow 2 on Elastic MapReduce

-input s3n://ceteri-mapred/enron/src/output 
-output s3n://ceteri-mapred/enron/src/filter 
-mapper '"python map_filter.py"'
-reducer '"python red_filter.py 0.0633"'
-cacheFile s3n://ceteri-mapred/enron/src/map_filter.py#map_filter.py
-cacheFile s3n://ceteri-mapred/enron/src/red_filter.py#red_filter.py

## after downloading the partition files (part-*) in the "filter"
## output folder from S3, run the following command to build a lexicon:

cat filter/part-* | sort -k1 -k4 -nr > lexicon
