emr-s3-io's People

Contributors

samiraga, samireljazovic

emr-s3-io's Issues

InputFormat Interface

I am using AWS EMR and I can't seem to get Hadoop streaming to run with the emr-s3-io input formats. My custom output formats work fine, but the input format blows up in Configuration.setClass, where it tests isAssignableFrom. I have bundled the emr-s3-io jar into my customer-profiles-1.0.jar, which includes all my custom extensions.

I am running:
AMI version: 2.4.6
Hadoop distribution: 1.0.3

I think there must be something wrong with how I am running the job, since I can't seem to specify any input format class from the command line. Can you help me? Even TextInputFormat, which should be the default, blows up in the same place when specified via -inputformat.

Here is the command I am running.
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar -libjars /home/hadoop/customer-profiles-1.0.jar -input 's3:////////' -output s3:/// -inputformat com.atlantbh.hadoop.s3.io.S3ObjectSummaryInputFormat -mapper s3n:///map_keys.rb -reducer aggregate

Gives this exception.

Exception in thread "main" java.lang.RuntimeException: class com.atlantbh.hadoop.s3.io.S3ObjectSummaryInputFormat not org.apache.hadoop.mapred.InputFormat
at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:982)
at org.apache.hadoop.mapred.JobConf.setInputFormat(JobConf.java:600)
at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:798)
at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:122)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
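One reading of this stack trace: streaming's -inputformat goes through JobConf.setInputFormat, which requires an old-API org.apache.hadoop.mapred.InputFormat, while emr-s3-io's classes extend the new-API org.apache.hadoop.mapreduce.InputFormat, so the isAssignableFrom check in Configuration.setClass fails. A toy reproduction of that check (the two types below are illustrative stand-ins, not the real Hadoop classes):

```java
// Toy reproduction of the isAssignableFrom test that Configuration.setClass
// performs. OldMapredInputFormat and NewMapreduceInputFormat are stand-ins
// for org.apache.hadoop.mapred.InputFormat and
// org.apache.hadoop.mapreduce.InputFormat respectively.
public class ApiCheck {
    interface OldMapredInputFormat {}
    static abstract class NewMapreduceInputFormat {}
    static class S3ObjectSummaryInputFormat extends NewMapreduceInputFormat {}

    public static void main(String[] args) {
        Class<?> supplied = S3ObjectSummaryInputFormat.class;
        // The check that fails just before the RuntimeException is thrown:
        if (!OldMapredInputFormat.class.isAssignableFrom(supplied)) {
            System.out.println("class " + supplied.getName()
                    + " not OldMapredInputFormat"); // mirrors the error text
        }
    }
}
```

If that reading is right, no command-line flag will fix it: a new-API input format simply cannot be handed to streaming's old-API JobConf on this Hadoop version.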

Mapper is not getting called

I added a bunch of log messages and noticed that S3ObjectSummaryInputFormat's createRecordReader is never called. This may be why my map function never runs.

2015-12-24 18:22:34,502 INFO com.atlantbh.hadoop.s3.io.S3InputFormat (main): create S3InputFormat
2015-12-24 18:22:34,502 INFO com.atlantbh.hadoop.s3.io.S3ObjectSummaryInputFormat (main): Constructed S3ObjectSummaryInputFormat
2015-12-24 18:22:34,502 WARN com.atlantbh.hadoop.s3.io.S3InputFormat (main): Using s3.input.numOfKeys value to determine input splits
2015-12-24 18:22:34,503 INFO com.atlantbh.hadoop.s3.io.S3BucketReader (main): Initializing ListObjectsRequest and s3Client
2015-12-24 18:22:35,508 INFO com.atlantbh.hadoop.s3.io.S3InputFormat (main): currentKeyIDX: 0 nextKeyIDX: 53 maxKeyIDX: 53
2015-12-24 18:22:35,509 INFO com.atlantbh.hadoop.s3.io.S3InputFormat (main): Number of input splits=1
2015-12-24 18:22:35,560 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): number of splits:1
2015-12-24 18:22:35,814 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): Submitting tokens for job: job_1450981249790_0001
2015-12-24 18:22:36,147 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (main): Submitted application application_1450981249790_0001
2015-12-24 18:22:36,183 INFO org.apache.hadoop.mapreduce.Job (main): The url to track the job: http://ip-10-182-59-27.ec2.internal:20888/proxy/application_1450981249790_0001/
2015-12-24 18:22:36,184 INFO org.apache.hadoop.mapreduce.Job (main): Running job: job_1450981249790_0001
2015-12-24 18:22:44,305 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1450981249790_0001 running in uber mode : false
2015-12-24 18:22:44,306 INFO org.apache.hadoop.mapreduce.Job (main): map 0% reduce 0%
2015-12-24 18:22:51,375 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 0%
2015-12-24 18:22:57,410 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 14%
2015-12-24 18:22:59,421 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 29%
2015-12-24 18:23:04,453 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 43%
2015-12-24 18:23:05,459 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 57%
2015-12-24 18:23:09,487 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 71%
2015-12-24 18:23:10,492 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 86%
2015-12-24 18:23:14,516 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 100%

public static class S3KeyMap extends
        Mapper<Text, S3ObjectSummaryWritable, Text, S3ObjectSummaryWritable> {

    private static final transient Logger log = Logger
            .getLogger(S3KeyMap.class.getName());

    @Override
    protected void map(Text key, S3ObjectSummaryWritable value,
            Context context) throws IOException, InterruptedException {
        log.info("Map processing " + key.toString());
        context.write(key, value); // casts were redundant; types already match
    }
}

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set(S3InputFormat.S3_BUCKET_NAME, "mybucket");
    conf.set(S3InputFormat.S3_KEY_PREFIX, "folder/with/small/files/");
    conf.set(S3InputFormat.S3_NUM_OF_KEYS_PER_MAPPER, "500");
    conf.set(S3InputFormat.S3_NUM_OF_MAPPERS, "-1");

    Job job = Job.getInstance(conf);
    job.setJarByClass(S3ReorgEMR.class); // set it to our jar
    job.setJobName("S3ReorgEMR");
    job.setInputFormatClass(S3ObjectSummaryInputFormat.class);
    job.setMapperClass(S3KeyMap.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(S3ObjectSummaryWritable.class);
    job.setReducerClass(S3DJReducer.class);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Backport emr-s3-io to old Hadoop API

emr-s3-io uses the new Hadoop API, which does not provide the same features as the old Hadoop API. The old API was deprecated in version 0.20.x, but in the latest Hadoop releases it has been un-deprecated.

MultipleOutputs support?

MultipleOutputs would be hugely beneficial in conjunction with emr-s3-io. Unfortunately they appear to be incompatible on the current EMR hadoop version.

Perhaps this is hard to solve... but EMR runs on Hadoop 0.20.205. This project uses the "new" mapreduce API rather than the mapred API. Unfortunately, MultipleOutputs in 0.20.205 only works with the "old" mapred API:
http://stackoverflow.com/questions/9876456/hadoop-multipleoutputs-addnamedoutput-throws-cannot-find-symbol

(Later solved in MAPREDUCE-370)

This creates an incompatibility between emr-s3-io and MultipleOutputs.

Could emr-s3-io be backported to the old mapred API? Or do you have a fix for MultipleOutputs that would be compatible?
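For reference, MAPREDUCE-370 (mentioned above) added a new-API MultipleOutputs (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs) with the addNamedOutput/write pattern, so on Hadoop versions that carry it the incompatibility goes away. A toy, in-memory stand-in of that pattern — method names mirror the real API, but this routes records to lists instead of output files and is not Hadoop code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for the new-API MultipleOutputs added by MAPREDUCE-370.
// addNamedOutput mirrors MultipleOutputs.addNamedOutput(job, name, ...);
// write mirrors mos.write(namedOutput, key, value) from inside a reducer.
public class ToyMultipleOutputs {
    private final Map<String, List<String>> named =
            new HashMap<String, List<String>>();

    // Declare a named output before the job runs
    public void addNamedOutput(String name) {
        named.put(name, new ArrayList<String>());
    }

    // Route one record to the given named output
    public void write(String namedOutput, String key, String value) {
        named.get(namedOutput).add(key + "\t" + value);
    }

    public List<String> records(String name) {
        return named.get(name);
    }

    public static void main(String[] args) {
        ToyMultipleOutputs mos = new ToyMultipleOutputs();
        mos.addNamedOutput("errors");
        mos.write("errors", "s3://bucket/key", "404");
        System.out.println(mos.records("errors").size()); // 1
    }
}
```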

License?

Hi, can you attach a license for this repo?

Thanks in advance.
