hadoop_playground's Introduction

Hadoop_playground

General Hadoop Tips on EOS

Debugging Map and Reduce Separately

By default, the EOS system automatically discards the Map-stage outputs after a Map-Reduce job completes, which makes Hadoop Streaming programs difficult to debug. An easy workaround is to run the job three times: once with map and no reduce, once with reduce and no map, and once with both map and reduce.

For example, to run streaming with no reduce:

hadoop jar $HADOOP_STREAMING \
	-mapper "./map.out" \
	-reducer NONE \
	-file map.out \
	-verbose \
	-input $TAMUSC_USER_WORK_DIR/inputs/ \
	-output $TAMUSC_USER_WORK_DIR/outputs/maponly 

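Note: Running this command with a C-based word count program, I found that Map generally outputs 8 files, labeled Part-0000 through Part-0007, while Reduce generally outputs only a single file. This is consistent with Hadoop writing one output file per task, so the counts likely reflect eight map tasks and a single reduce task. I wonder how this holds for larger files.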
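The reduce-only pass can be sketched in the same way. This is a minimal sketch under stated assumptions, not a tested recipe for EOS: it reuses the map-only output directory from the example above as input, uses /bin/cat as an identity mapper so that only the reducer does real work, and assumes the reducer binary is called reduce.out (a hypothetical name). HADOOP_CMD is also a hypothetical variable, added only so the command can be printed instead of executed.

```shell
# Reduce-only streaming run: the mapper just passes lines through unchanged.
# Assumptions: reduce.out is a hypothetical reducer binary, and the
# outputs/maponly directory was produced by an earlier map-only run.
# HADOOP_CMD is a hypothetical override; set it to "echo" for a dry run.
run_reduce_only() {
    work_dir=$1
    ${HADOOP_CMD:-hadoop} jar "$HADOOP_STREAMING" \
        -mapper /bin/cat \
        -reducer "./reduce.out" \
        -file reduce.out \
        -verbose \
        -input "$work_dir/outputs/maponly/" \
        -output "$work_dir/outputs/reduceonly"
}
```

Inside a batch job this would be called as run_reduce_only "$TAMUSC_USER_WORK_DIR". Setting HADOOP_CMD=echo first prints the full command line without submitting anything, which is a cheap way to check the paths.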

Important Caveat

I found that hadoop fs -get does not overwrite files that already exist in the target directory, nor does it report when this happens. I ran into a situation where, no matter what changes I made to my reduce program, the reduce-output files came back empty. Verifying that mapping succeeded was simple enough: I used the method described in Debugging Map and Reduce Separately and looked at the map outputs directly.

The quick fix for this issue with hadoop fs -get/-put is to always target a unique directory. For example, I used target folders with some form of $PBS_JOBID in their names (since the EOS PBS system gives us unique job IDs).
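One way to apply this is sketched below. The function names, the run_ prefix, and the HADOOP_CMD override are hypothetical and exist only for illustration. The fallback used when $PBS_JOBID is unset (an interactive shell, for instance) is likewise an assumption.

```shell
# Build a per-job local directory name so repeated runs never collide.
# PBS sets PBS_JOBID inside a batch job; the fallback to the shell's PID
# for interactive runs is an assumption made for this sketch.
job_output_dir() {
    base=$1
    printf '%s/run_%s\n' "$base" "${PBS_JOBID:-interactive_$$}"
}

# Copy results out of HDFS into the per-job directory.
# HADOOP_CMD is a hypothetical override; set it to "echo" for a dry run.
fetch_results() {
    hdfs_src=$1
    local_dir=$(job_output_dir "$2")
    # Refuse to reuse an existing target, so stale files are never mistaken
    # for fresh output.
    if [ -e "$local_dir" ]; then
        echo "refusing to reuse existing $local_dir" >&2
        return 1
    fi
    mkdir -p "$(dirname "$local_dir")" &&
        ${HADOOP_CMD:-hadoop} fs -get "$hdfs_src" "$local_dir"
}
```

A call such as fetch_results "$TAMUSC_USER_WORK_DIR/outputs/reduceonly" "$HOME/results" would then place each job's output in its own folder.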

http://sc.tamu.edu/help/eos/batch/

More Advanced Debugging

According to the Hadoop Wiki, it should be possible to run clean-up scripts when map or reduce fails. More information can be found at http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms.

http://srinathsview.blogspot.com/2012/05/debugging-hadoop-task-tracker-job.html

I will play around with some of these scripts and post my findings soon.

Something interesting I found (but have yet to try) is that the EOS PBS system lets us choose where a job's standard streams go. I might be able to use this to create a named stderr file for my C programs to write logs to.
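A job script along these lines might look like the sketch below. The -N, -o, and -e directives are standard PBS options. The job name, file names, and the log_stage helper are hypothetical. Like the idea itself, this is untested on EOS: whether stderr from tasks launched by Hadoop actually ends up in the PBS stderr file depends on how the cluster runs them.

```shell
#!/bin/sh
# PBS reads the "#PBS" comment lines below as job options.
# Job name and file names are hypothetical placeholders.
#PBS -N hadoop_debug
#PBS -o hadoop_debug.stdout
#PBS -e hadoop_debug.stderr

# Hypothetical helper: write a tagged log line to stderr, which PBS
# redirects to the file named by the -e directive above.
log_stage() {
    printf '[%s] %s\n' "$1" "$2" >&2
}
```

For example, log_stage map "starting map-only run" placed before the hadoop command would leave a marker in the stderr file.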

hadoop_playground's People

Contributors

gwood, mbartling

Forkers

gwood

hadoop_playground's Issues

Is Map Efficient?

Right now I have the map set up to process each line from STDIN (required when running Hadoop with C) and compute the local min and max for each line. Each line is given a random number from 1 to N as its key, since Hadoop will automagically order it for me. The collect stage will combine keys before sending them to the reduce stage as new <min/max, value> pairs.

Should I make the key tags beforehand and remove the combine stage?

I wonder how Hadoop will handle this pipeline.
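To make the second option concrete, here is a minimal sketch of a mapper that tags its output with fixed min and max keys directly, so no separate combine stage is needed to re-key the data. It is written with awk rather than C purely for brevity, and it assumes each input line holds whitespace-separated numbers, which the issue does not state. The function name map_minmax is hypothetical.

```shell
# Streaming-style mapper sketch: for each non-empty input line, emit the
# line's local minimum under the key "min" and its local maximum under the
# key "max", as tab-separated key/value pairs.
# Assumption: each line contains whitespace-separated numbers.
map_minmax() {
    awk '
    NF > 0 {
        lo = hi = $1 + 0
        for (i = 2; i <= NF; i++) {
            v = $i + 0
            if (v < lo) lo = v
            if (v > hi) hi = v
        }
        printf "min\t%s\n", lo
        printf "max\t%s\n", hi
    }'
}
```

With only two distinct keys, every value for min goes to one reducer and every value for max goes to one reducer, so this trades the random-key spreading described above for a simpler pipeline.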
