emr-s3-io's People

Contributors

samiraga, samireljazovic

emr-s3-io's Issues

InputFormat Interface

I am using AWS EMR and I can't seem to get Hadoop streaming to run with the emr-s3-io input formats. My custom output formats work fine, but the input format blows up in Configuration.setClass, where it tests isAssignableFrom. I have bundled the emr-s3-io jar into my customer-profiles-1.0.jar, which includes all my custom extensions.

I am running:
AMI version: 2.4.6
Hadoop distribution: 1.0.3

I think there must be something wrong with how I am running the job, since I can't seem to specify any input format class from the command line. Can you help me? Even TextInputFormat, which should be the default, blows up in the same place when specified via -inputformat.

Here is the command I am running.
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar -libjars /home/hadoop/customer-profiles-1.0.jar -input 's3:////////' -output s3:/// -inputformat com.atlantbh.hadoop.s3.io.S3ObjectSummaryInputFormat -mapper s3n:///map_keys.rb -reducer aggregate

Gives this exception.

Exception in thread "main" java.lang.RuntimeException: class com.atlantbh.hadoop.s3.io.S3ObjectSummaryInputFormat not org.apache.hadoop.mapred.InputFormat
at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:982)
at org.apache.hadoop.mapred.JobConf.setInputFormat(JobConf.java:600)
at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:798)
at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:122)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
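One reading of this stack trace: streaming's -inputformat goes through JobConf.setInputFormat, which requires an old-API org.apache.hadoop.mapred.InputFormat, while emr-s3-io's classes extend the new-API org.apache.hadoop.mapreduce.InputFormat, so the isAssignableFrom check in Configuration.setClass fails. A toy reproduction of that check (the two types below are illustrative stand-ins, not the real Hadoop classes):

```java
// Toy reproduction of the isAssignableFrom test that Configuration.setClass
// performs. OldMapredInputFormat and NewMapreduceInputFormat are stand-ins
// for org.apache.hadoop.mapred.InputFormat and
// org.apache.hadoop.mapreduce.InputFormat respectively.
public class ApiCheck {
    interface OldMapredInputFormat {}
    static abstract class NewMapreduceInputFormat {}
    static class S3ObjectSummaryInputFormat extends NewMapreduceInputFormat {}

    public static void main(String[] args) {
        Class<?> supplied = S3ObjectSummaryInputFormat.class;
        // The check that fails just before the RuntimeException is thrown:
        if (!OldMapredInputFormat.class.isAssignableFrom(supplied)) {
            System.out.println("class " + supplied.getName()
                    + " not OldMapredInputFormat"); // mirrors the error text
        }
    }
}
```

If that reading is right, no command-line flag will fix it: a new-API input format simply cannot be handed to streaming's old-API JobConf on this Hadoop version.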

Mapper is not getting called

I added a bunch of log messages and noticed that S3ObjectSummaryInputFormat's createRecordReader is never called. This may be why my map function never runs.

2015-12-24 18:22:34,502 INFO com.atlantbh.hadoop.s3.io.S3InputFormat (main): create S3InputFormat
2015-12-24 18:22:34,502 INFO com.atlantbh.hadoop.s3.io.S3ObjectSummaryInputFormat (main): Constructed S3ObjectSummaryInputFormat
2015-12-24 18:22:34,502 WARN com.atlantbh.hadoop.s3.io.S3InputFormat (main): Using s3.input.numOfKeys value to determine input splits
2015-12-24 18:22:34,503 INFO com.atlantbh.hadoop.s3.io.S3BucketReader (main): Initializing ListObjectsRequest and s3Client
2015-12-24 18:22:35,508 INFO com.atlantbh.hadoop.s3.io.S3InputFormat (main): currentKeyIDX: 0 nextKeyIDX: 53 maxKeyIDX: 53
2015-12-24 18:22:35,509 INFO com.atlantbh.hadoop.s3.io.S3InputFormat (main): Number of input splits=1
2015-12-24 18:22:35,560 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): number of splits:1
2015-12-24 18:22:35,814 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): Submitting tokens for job: job_1450981249790_0001
2015-12-24 18:22:36,147 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (main): Submitted application application_1450981249790_0001
2015-12-24 18:22:36,183 INFO org.apache.hadoop.mapreduce.Job (main): The url to track the job: http://ip-10-182-59-27.ec2.internal:20888/proxy/application_1450981249790_0001/
2015-12-24 18:22:36,184 INFO org.apache.hadoop.mapreduce.Job (main): Running job: job_1450981249790_0001
2015-12-24 18:22:44,305 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1450981249790_0001 running in uber mode : false
2015-12-24 18:22:44,306 INFO org.apache.hadoop.mapreduce.Job (main): map 0% reduce 0%
2015-12-24 18:22:51,375 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 0%
2015-12-24 18:22:57,410 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 14%
2015-12-24 18:22:59,421 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 29%
2015-12-24 18:23:04,453 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 43%
2015-12-24 18:23:05,459 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 57%
2015-12-24 18:23:09,487 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 71%
2015-12-24 18:23:10,492 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 86%
2015-12-24 18:23:14,516 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 100%

public static class S3KeyMap extends
        Mapper<Text, S3ObjectSummaryWritable, Text, S3ObjectSummaryWritable> {

    private static final transient Logger log = Logger
            .getLogger(S3KeyMap.class.getName());

    @Override
    protected void map(Text key, S3ObjectSummaryWritable value,
            Context context) throws IOException, InterruptedException {
        log.info("Map processing " + key.toString());
        context.write(key, value); // casts were redundant; types already match
    }
}

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set(S3InputFormat.S3_BUCKET_NAME, "mybucket");
    conf.set(S3InputFormat.S3_KEY_PREFIX, "folder/with/small/files/");
    conf.set(S3InputFormat.S3_NUM_OF_KEYS_PER_MAPPER, "500");
    conf.set(S3InputFormat.S3_NUM_OF_MAPPERS, "-1");

    Job job = Job.getInstance(conf);
    job.setJarByClass(S3ReorgEMR.class); // set it to our jar
    job.setJobName("S3ReorgEMR");
    job.setInputFormatClass(S3ObjectSummaryInputFormat.class);
    job.setMapperClass(S3KeyMap.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(S3ObjectSummaryWritable.class);
    job.setReducerClass(S3DJReducer.class);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Backport emr-s3-io to old Hadoop API

emr-s3-io uses the new Hadoop API, which does not provide the same features as the old Hadoop API. The old API was deprecated in version 0.20.x, but in the latest Hadoop releases it has been un-deprecated.

MultipleOutputs support?

MultipleOutputs would be hugely beneficial in conjunction with emr-s3-io. Unfortunately they appear to be incompatible on the current EMR hadoop version.

Perhaps this is hard to solve... but EMR runs on Hadoop 0.20.205. This project uses the "new" mapreduce API rather than the mapred API. Unfortunately, MultipleOutputs in 0.20.205 only works with the "old" mapred API:
http://stackoverflow.com/questions/9876456/hadoop-multipleoutputs-addnamedoutput-throws-cannot-find-symbol

(Later solved in MAPREDUCE-370)

This creates an incompatibility between emr-s3-io and MultipleOutputs.

Could emr-s3-io be backported to the old mapred API? Or do you have a fix for MultipleOutputs that would be compatible?
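For reference, MAPREDUCE-370 (mentioned above) added a new-API MultipleOutputs (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs) with the addNamedOutput/write pattern, so on Hadoop versions that carry it the incompatibility goes away. A toy, in-memory stand-in of that pattern — method names mirror the real API, but this routes records to lists instead of output files and is not Hadoop code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for the new-API MultipleOutputs added by MAPREDUCE-370.
// addNamedOutput mirrors MultipleOutputs.addNamedOutput(job, name, ...);
// write mirrors mos.write(namedOutput, key, value) from inside a reducer.
public class ToyMultipleOutputs {
    private final Map<String, List<String>> named =
            new HashMap<String, List<String>>();

    // Declare a named output before the job runs
    public void addNamedOutput(String name) {
        named.put(name, new ArrayList<String>());
    }

    // Route one record to the given named output
    public void write(String namedOutput, String key, String value) {
        named.get(namedOutput).add(key + "\t" + value);
    }

    public List<String> records(String name) {
        return named.get(name);
    }

    public static void main(String[] args) {
        ToyMultipleOutputs mos = new ToyMultipleOutputs();
        mos.addNamedOutput("errors");
        mos.write("errors", "s3://bucket/key", "404");
        System.out.println(mos.records("errors").size()); // 1
    }
}
```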

License?

Hi, can you attach a license for this repo?

Thanks in advance.
