druid-spark-batch's Introduction

druid-spark-batch

Druid indexing plugin for using Spark in batch jobs

This repository holds a Druid extension for using Spark as the engine for running batch jobs

To build, issue the command sbt clean test publish-local publish-m2

Default Properties

The default properties injected into Spark are as follows:

    .set("spark.executor.memory", "7G")
    .set("spark.executor.cores", "1")
    .set("spark.kryo.referenceTracking", "false")
    .set("user.timezone", "UTC")
    .set("file.encoding", "UTF-8")
    .set("java.util.logging.manager", "org.apache.logging.log4j.jul.LogManager")
    .set("org.jboss.logging.provider", "slf4j")

How to use

There are four key things that need to be configured to use this extension:

  1. Overlord needs the druid-spark-batch extension added.
  2. MiddleManager (if present) needs the druid-spark-batch extension added.
  3. A task JSON needs to be configured.
  4. Spark must be included in the default hadoop coordinates, similar to druid.indexer.task.defaultHadoopCoordinates=["org.apache.spark:spark-core_2.10:1.5.2-mmx1"]

To load the extension, use the appropriate coordinates (for Druid 0.8.x, add io.druid.extensions:druid-spark-batch_2.10:jar:assembly:0.0.13 to druid.extensions.coordinates), or make certain the extension jars are located in the proper directories (Druid 0.9.0 with version 0.9.0.x of this library, Druid 0.9.1 with the 0.9.1.x version).
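
As a sketch, the relevant Overlord/MiddleManager runtime properties might look like the following (property names depend on your Druid version; paths and versions here are placeholders to adapt):

    # Druid 0.8.x: coordinate-based extension loading
    druid.extensions.coordinates=["io.druid.extensions:druid-spark-batch_2.10:jar:assembly:0.0.13"]

    # Druid 0.9.x: directory-based extension loading (extension jars placed under the extensions directory)
    druid.extensions.directory=/opt/druid/extensions
    druid.extensions.loadList=["druid-spark-batch"]

    # Either way, make Spark available as the "hadoop" dependency
    druid.indexer.task.defaultHadoopCoordinates=["org.apache.spark:spark-core_2.10:1.5.2-mmx1"]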

The recommended way to fetch the extensions is to use pull-deps with the versions of interest. A Hadoop coordinate and an extension coordinate should be specified, e.g. -h org.apache.spark:spark-core_2.10:1.5.2-mmx4 and -c io.druid.extensions:druid-spark-batch_2.10:0.9.1-0 (with the appropriate versions, of course).
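
A hedged example invocation (the classpath and install layout are assumptions about a typical Druid install; substitute the versions you need):

    java -classpath "lib/*:config/_common" io.druid.cli.Main tools pull-deps \
        -h org.apache.spark:spark-core_2.10:1.5.2-mmx4 \
        -c io.druid.extensions:druid-spark-batch_2.10:0.9.1-0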

Task JSON

The following is an example spark batch task for the indexing service:

{
    "paths":["/<your-druid-spark-batch-dir>/src/test/resources/lineitem.small.tbl"],
    "dataSchema": {
        "dataSource": "sparkTest",
        "granularitySpec": {
            "intervals": [
                "1992-01-01T00:00:00.000Z/1999-01-01T00:00:00.000Z"
            ],
            "queryGranularity": {
                "type": "all"
            },
            "segmentGranularity": "YEAR",
            "type": "uniform"
        },
        "metricsSpec": [
            {
                "name": "count",
                "type": "count"
            },
            {
                "fieldName": "l_quantity",
                "name": "L_QUANTITY_longSum",
                "type": "longSum"
            },
            {
                "fieldName": "l_extendedprice",
                "name": "L_EXTENDEDPRICE_doubleSum",
                "type": "doubleSum"
            },
            {
                "fieldName": "l_discount",
                "name": "L_DISCOUNT_doubleSum",
                "type": "doubleSum"
            },
            {
                "fieldName": "l_tax",
                "name": "L_TAX_doubleSum",
                "type": "doubleSum"
            }
        ],
        "parser": {
            "encoding": "UTF-8",
            "parseSpec": {
                "columns": [
                    "l_orderkey",
                    "l_partkey",
                    "l_suppkey",
                    "l_linenumber",
                    "l_quantity",
                    "l_extendedprice",
                    "l_discount",
                    "l_tax",
                    "l_returnflag",
                    "l_linestatus",
                    "l_shipdate",
                    "l_commitdate",
                    "l_receiptdate",
                    "l_shipinstruct",
                    "l_shipmode",
                    "l_comment"
                ],
                "delimiter": "|",
                "dimensionsSpec": {
                    "dimensionExclusions": [
                        "l_tax",
                        "l_quantity",
                        "count",
                        "l_extendedprice",
                        "l_shipdate",
                        "l_discount"
                    ],
                    "dimensions": [
                        "l_comment",
                        "l_commitdate",
                        "l_linenumber",
                        "l_linestatus",
                        "l_orderkey",
                        "l_receiptdate",
                        "l_returnflag",
                        "l_shipinstruct",
                        "l_shipmode",
                        "l_suppkey"
                    ],
                    "spatialDimensions": []
                },
                "format": "tsv",
                "listDelimiter": ",",
                "timestampSpec": {
                    "column": "l_shipdate",
                    "format": "yyyy-MM-dd",
                    "missingValue": null
                }
            },
            "type": "string"
        }
    },
    "indexSpec": {
        "bitmap": {
            "type": "concise"
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4"
    },
    "intervals": ["1992-01-01T00:00:00.000Z/1999-01-01T00:00:00.000Z"],
    "master": "local[1]",
    "properties": {
        "some.property": "someValue",
        "spark.io.compression.codec":"org.apache.spark.io.LZ4CompressionCodec"
    },
    "targetPartitionSize": 10000000,
    "type": "index_spark_2.11"
}
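
Once the extension is loaded, the task is submitted to the Overlord like any other indexing-service task. A sketch, assuming the task JSON above is saved as spark_index_task.json and the Overlord listens on its default port:

    curl -X POST -H 'Content-Type: application/json' \
        -d @spark_index_task.json \
        http://<overlord-host>:8090/druid/indexer/v1/task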

The JSON keys accepted by the Spark batch indexer are described below.

Batch indexer json fields

Field | Type | Required | Default | Description
type | String | Yes | N/A | Must be index_spark
paths | List of strings | Yes | N/A | A list of hadoop-readable input files. The values are joined with a "," and used as a SparkContext.textFile
dataSchema | DataSchema | Yes | N/A | The data schema to use
intervals | List of strings | Yes | N/A | A list of ISO intervals to be indexed. ALL data for these intervals MUST be present in paths
maxRowsInMemory | Positive integer | No | 75000 | Maximum number of rows to store in memory before an intermediate flush to disk
targetPartitionSize | Positive integer | No | 5000000 | The target number of rows per partition per segment granularity
master | String | No | local[1] | The Spark master URI
properties | Map | No | none | A map of string key/value pairs to inject into the SparkContext properties, overriding any previously set values
id | String | No | Assigned based on dataSource, intervals, and DateTime.now() | The ID for the task. If not provided it will be assigned
indexSpec | IndexSpec | No | concise, lz4, lz4 | The IndexSpec containing the various compressions to be used
context | Map | No | none | The task context
hadoopDependencyCoordinates | List of strings | No | null (use default set by Druid config) | The Spark dependency coordinates to load in the ClassLoader when launching the task
buildV9Directly | Boolean | No | false | Build the v9 index directly instead of building a v8 index and converting it to v9 format

Deploying this project

This project uses cross-building in sbt. Both the Scala 2.10 and 2.11 versions can be built and deployed with sbt release

For setting repository credentials to be able to publish a release, refer to https://stackoverflow.com/a/19598435
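
A minimal sketch of such a credentials setting (the realm, host, and user below are placeholders; see the linked answer for where to place it, e.g. a global ~/.sbt file):

    credentials += Credentials(
      "Sonatype Nexus Repository Manager",  // realm reported by the repository
      "oss.sonatype.org",                   // repository host
      "<username>",
      "<password>")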

Upgrading to 0.9.2

There are now versions for Scala 2.10 and Scala 2.11, only ONE of which may be used at any given time.

druid-spark-batch's Issues

What is spark mmx version

Hi

I'm using Spark, but I can't build or use this extension because I can't find any info on the Spark mmx version. Please help.
I'm using the Spark 1.6.1 bundle with the EC2 script.
I'm using Imply 1.3.0, based on Druid 0.9.1.1.

Thanks

Can't make Spark indexer run with spark-1.5.2-mmx0-hadoop2.4_2.10 because of classpath issues

I've compiled the Spark indexer against Druid 0.8.2 and a patched Spark 1.5.2 (mmx0 patch).

When running a task, I'm getting this exception:

2015-12-24T10:49:44,546 ERROR [task-runner-0] io.druid.indexer.spark.SparkBatchIndexTask - Error running task [index_spark_installs_2015-12-01T00:00:00.000Z_2015-12-20T00:00:00.000Z_2015-12-24T10:49:08.874Z]
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:0.0.25-SNAPSHOT]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:138) ~[druid-indexing-service-0.8.2.jar:0.8.2]
    at io.druid.indexer.spark.SparkBatchIndexTask.run(SparkBatchIndexTask.scala:165) [druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:221) [druid-indexing-service-0.8.2.jar:0.8.2]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:200) [druid-indexing-service-0.8.2.jar:0.8.2]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_66]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_66]
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_66]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_66]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_66]
    at java.lang.reflect.Method.invoke(Method.java:497) ~[?:1.8.0_66]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:135) ~[druid-indexing-service-0.8.2.jar:0.8.2]
    ... 7 more
Caused by: java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 5 times, most recent failure: Lost task 4.4 in stage 0.0 (TID 52, sparkspot-20039-003-prod.eu1.appsflyer.com): java.lang.IllegalAccessError: tried to access method com.fasterxml.jackson.databind.introspect.AnnotatedMember.getAllAnnotations()Lcom/fasterxml/jackson/databind/introspect/AnnotationMap; from class com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector
    at com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector.findInjectableValueId(GuiceAnnotationIntrospector.java:39)
    at com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findInjectableValueId(AnnotationIntrospectorPair.java:269)
    at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:433)
    at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._constructDefaultValueInstantiator(BasicDeserializerFactory.java:325)
    at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.findValueInstantiator(BasicDeserializerFactory.java:266)
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.buildBeanDeserializer(BeanDeserializerFactory.java:266)
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:168)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:403)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:352)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:264)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:244)
    at com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:142)
    at com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:394)
    at com.fasterxml.jackson.databind.ObjectMapper._findRootDeserializer(ObjectMapper.java:3169)
    at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3062)
    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161)
    at io.druid.indexer.spark.SerializedJson.fillFromMap(SparkDruidIndexer.scala:548)
    at io.druid.indexer.spark.SerializedJson.readObject(SparkDruidIndexer.scala:518)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:64)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at com.google.common.io.Closer.rethrow(Closer.java:149) ~[guava-16.0.1.jar:?]
    at io.druid.indexer.spark.SparkBatchIndexTask$.runTask(SparkBatchIndexTask.scala:381) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at io.druid.indexer.spark.SparkBatchIndexTask.runTask(SparkBatchIndexTask.scala) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at io.druid.indexer.spark.Runner.runTask(Runner.java:29) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_66]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_66]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_66]
    at java.lang.reflect.Method.invoke(Method.java:497) ~[?:1.8.0_66]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:135) ~[druid-indexing-service-0.8.2.jar:0.8.2]
    ... 7 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 5 times, most recent failure: Lost task 4.4 in stage 0.0 (TID 52, sparkspot-20039-003-prod.eu1.appsflyer.com): java.lang.IllegalAccessError: tried to access method com.fasterxml.jackson.databind.introspect.AnnotatedMember.getAllAnnotations()Lcom/fasterxml/jackson/databind/introspect/AnnotationMap; from class com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector
    at com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector.findInjectableValueId(GuiceAnnotationIntrospector.java:39)
    at com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findInjectableValueId(AnnotationIntrospectorPair.java:269)
    at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:433)
    at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._constructDefaultValueInstantiator(BasicDeserializerFactory.java:325)
    at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.findValueInstantiator(BasicDeserializerFactory.java:266)
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.buildBeanDeserializer(BeanDeserializerFactory.java:266)
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:168)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:403)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:352)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:264)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:244)
    at com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:142)
    at com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:394)
    at com.fasterxml.jackson.databind.ObjectMapper._findRootDeserializer(ObjectMapper.java:3169)
    at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3062)
    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161)
    at io.druid.indexer.spark.SerializedJson.fillFromMap(SparkDruidIndexer.scala:548)
    at io.druid.indexer.spark.SerializedJson.readObject(SparkDruidIndexer.scala:518)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:64)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) ~[scala-library-2.10.5.jar:?]
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) ~[scala-library-2.10.5.jar:?]
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at scala.Option.foreach(Option.scala:236) ~[scala-library-2.10.5.jar:?]
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:310) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.rdd.RDD.collect(RDD.scala:908) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at io.druid.indexer.spark.SparkDruidIndexer$.loadData(SparkDruidIndexer.scala:163) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at io.druid.indexer.spark.SparkBatchIndexTask$.runTask(SparkBatchIndexTask.scala:448) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at io.druid.indexer.spark.SparkBatchIndexTask.runTask(SparkBatchIndexTask.scala) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at io.druid.indexer.spark.Runner.runTask(Runner.java:29) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_66]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_66]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_66]
    at java.lang.reflect.Method.invoke(Method.java:497) ~[?:1.8.0_66]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:135) ~[druid-indexing-service-0.8.2.jar:0.8.2]
    ... 7 more
Caused by: java.lang.IllegalAccessError: tried to access method com.fasterxml.jackson.databind.introspect.AnnotatedMember.getAllAnnotations()Lcom/fasterxml/jackson/databind/introspect/AnnotationMap; from class com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector
    at com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector.findInjectableValueId(GuiceAnnotationIntrospector.java:39) ~[druid-common-0.8.2.jar:0.8.2]
    at com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findInjectableValueId(AnnotationIntrospectorPair.java:269) ~[jackson-databind-2.4.6.jar:0.8.2]
    at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:433) ~[jackson-databind-2.4.6.jar:2.4.6]
    at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._constructDefaultValueInstantiator(BasicDeserializerFactory.java:325) ~[jackson-databind-2.4.6.jar:2.4.6]
    at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.findValueInstantiator(BasicDeserializerFactory.java:266) ~[jackson-databind-2.4.6.jar:2.4.6]
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.buildBeanDeserializer(BeanDeserializerFactory.java:266) ~[jackson-databind-2.4.6.jar:2.4.6]
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:168) ~[jackson-databind-2.4.6.jar:2.4.6]
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:403) ~[jackson-databind-2.4.6.jar:2.4.6]
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:352) ~[jackson-databind-2.4.6.jar:2.4.6]
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:264) ~[jackson-databind-2.4.6.jar:2.4.6]
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:244) ~[jackson-databind-2.4.6.jar:2.4.6]
    at com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:142) ~[jackson-databind-2.4.6.jar:2.4.6]
    at com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:394) ~[jackson-databind-2.4.6.jar:2.4.6]
    at com.fasterxml.jackson.databind.ObjectMapper._findRootDeserializer(ObjectMapper.java:3169) ~[jackson-databind-2.4.6.jar:2.4.6]
    at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3062) ~[jackson-databind-2.4.6.jar:2.4.6]
    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161) ~[jackson-databind-2.4.6.jar:2.4.6]
    at io.druid.indexer.spark.SerializedJson.fillFromMap(SparkDruidIndexer.scala:548) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at io.druid.indexer.spark.SerializedJson.readObject(SparkDruidIndexer.scala:518) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_66]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_66]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_66]
    at java.lang.reflect.Method.invoke(Method.java:497) ~[?:1.8.0_66]
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) ~[?:1.8.0_66]
    at scala.collection.immutable.$colon$colon.readObject(List.scala:362) ~[scala-library-2.10.5.jar:?]
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) ~[?:?]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_66]
    at java.lang.reflect.Method.invoke(Method.java:497) ~[?:1.8.0_66]
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) ~[?:1.8.0_66]
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) ~[?:1.8.0_66]
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:64) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.scheduler.Task.run(Task.scala:88) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) ~[druid-spark-batch-assembly-0.0.25-SNAPSHOT.jar:0.0.25-SNAPSHOT]
    ... 3 more

org.apache.spark#spark-core_2.10;1.5.2-mmx4: not found

What's the problem?

I am trying to use druid-spark-batch, and got the following error:

[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::          UNRESOLVED DEPENDENCIES         ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  :: org.apache.spark#spark-core_2.10;1.5.2-mmx4: not found
[warn]  :: io.druid#druid-processing;0.9.1-SNAPSHOT: not found
[warn]  :: io.druid#druid-server;0.9.1-SNAPSHOT: not found
[warn]  :: io.druid#druid-indexing-service;0.9.1-SNAPSHOT: not found
[warn]  :: io.druid#druid-indexing-hadoop;0.9.1-SNAPSHOT: not found
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
.
.
.
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[error] (*:update) sbt.ResolveException: unresolved dependency: org.apache.spark#spark-core_2.10;1.5.2-mmx4: not found
[error] unresolved dependency: io.druid#druid-processing;0.9.1-SNAPSHOT: not found
[error] unresolved dependency: io.druid#druid-server;0.9.1-SNAPSHOT: not found
[error] unresolved dependency: io.druid#druid-indexing-service;0.9.1-SNAPSHOT: not found
[error] unresolved dependency: io.druid#druid-indexing-hadoop;0.9.1-SNAPSHOT: not found
[error] Total time: 79 s, completed May 11, 2016 11:07:41 AM

How to reproduce the issue?

  1. I downloaded the druid-spark-batch project.
  2. Issued the command sbt clean test publish-local publish-m2.

unknown bucket for datetime

I am having trouble indexing using druid-spark-batch and am getting the following error. Has anybody encountered this error?

(TID 2, 192.168.0.11): com.metamx.common.ISE: unknown bucket for datetime 1375329600000. Known values are Set(1376784000000, 1376092800000, 1377216000000, 1376697600000, 1376352000000, 1377388800000, 1375833600000, 1375747200000, 1375315200000, 1375660800000, 1377820800000, 1377734400000, 1375488000000, 1376265600000, 1376611200000, 1377129600000, 1375401600000, 1376870400000, 1377043200000, 1377561600000, 1376524800000, 1376179200000, 1377302400000, 1377475200000, 1377648000000, 1375574400000, 1375920000000, 1376438400000, 1376006400000, 1376956800000, 1377907200000)
    at io.druid.indexer.spark.DateBucketPartitioner.getPartition(SparkDruidIndexer.scala:625)
    at org.apache.spark.util.collection.ExternalSorter.getPartition(ExternalSorter.scala:104)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:212)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Can't use spark-batch-indexer

Hey guys,

I don't want to spam the druid-development group thread, so I'm posting here.
I built the jar on my own with Spark 1.6.1 and added it to /druidpath/extensions/druid-spark-batch/ on both the Overlord and MiddleManager. I added druid.indexer.task.defaultHadoopCoordinates=["org.apache.spark:spark-core_2.10:1.6.1"]
to the two nodes' runtime properties, restarted the nodes, then submitted the job with the JSON file.
Still get: "error": "Could not resolve type id 'index_spark' into a subtype of [simple type, class io.druid.indexing.common.task.Task]\n at [Source: HttpInputOverHTTP@2cecd2f2; line: 54, column: 38]"
Any ideas, or is there more documentation available somewhere?

Thanks,
Ben

build is failing

I ran the command from the documentation

sbt clean test publish-local publish-m2

and got

[error] (*:update) sbt.ResolveException: unresolved dependency: io.druid#druid-processing;0.11.0-SNAPSHOT: not found
[error] unresolved dependency: io.druid#druid-server;0.11.0-SNAPSHOT: not found
[error] unresolved dependency: io.druid#druid-indexing-service;0.11.0-SNAPSHOT: not found
[error] unresolved dependency: io.druid#druid-indexing-hadoop;0.11.0-SNAPSHOT: not found

I also see that the documentation is not updated, as it uses "spark-core_2.10:1.5.2-mmx4" instead of Spark 2.x.

Spark indexer fails with spark coordinates with "old" hadoops

When spark is built to be compatible with very old versions of hadoop, jets3t 0.7.x is set as the jets3t version. This version breaks horribly with many other things.

Setting the spark coordinates to something which has a more recent version of hadoop (2.x or higher) allows jets3t version 0.9.x to be brought in instead. These versions seem to work better with druid.
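
A hedged example of doing this on a per-task basis via hadoopDependencyCoordinates (the coordinate below is only illustrative of a Spark build targeting Hadoop 2.x or higher):

    "hadoopDependencyCoordinates": [
        "org.apache.spark:spark-core_2.10:1.5.2-mmx4"
    ]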

How to set configuration to access files on S3 bucket

Hi
I succeeded in starting the job, but my files are located in an S3 bucket, so I always get:

Caused by: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
    at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:70) ~[?:?]

I've tried to set the keys every way I know; nothing helps.
I tried to put them in properties like this:

"properties": {
        "fs.s3n.awsAccessKeyId": "xx",
        "fs.s3n.awsSecretAccessKey": "xx"}

or in context like this:

"context": {
    "fs.s3n.awsAccessKeyId": "xx",
    "fs.s3n.awsSecretAccessKey": "xx"
}

I've also tried to set the S3 keys directly on the Spark cluster, but it seems that's not working anymore.
Do you have any idea what I can do?

Thanks

Fix task specification to only have interval once

See #70 for more info.

Currently the task specification requires intervals to be entered twice in the JSON. This improvement would add unit tests to verify that entering them only once is sufficient, and documentation updates clarifying which one is actually required.
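
For reference, a sketch of the duplication as it appears in the example task earlier in this document, where the same interval list is given both inside granularitySpec and at the top level:

    "dataSchema": {
        "granularitySpec": {
            "intervals": ["1992-01-01T00:00:00.000Z/1999-01-01T00:00:00.000Z"],
            ...
        }
    },
    "intervals": ["1992-01-01T00:00:00.000Z/1999-01-01T00:00:00.000Z"]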

Latest release does not work on spark2

See https://issues.apache.org/jira/browse/SPARK-16798 for more info

16/07/29 18:19:21 ERROR Executor: Exception in task 9.0 in stage 0.0 (TID 9)
java.lang.IllegalArgumentException: bound must be positive
    at java.util.Random.nextInt(Random.java:388)
    at org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
    at org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

making indices and merging are slower than MapReduce ingestion

Hi,

I've read the post about stopping support for this project, but could you please answer this question if you have some time?

My team has implemented support for Parquet files. It's almost done! But the problem is that the ingest speed is 3 times slower than MapReduce ingestion.

It turned out that the slow point is creating IncrementalIndexes and merging them.

https://github.com/metamx/druid-spark-batch/blob/master/src/main/scala/io/druid/indexer/spark/SparkDruidIndexer.scala#L293

    ...
    val indices: util.List[IndexableAdapter] = groupedRows.map(rs => {
      // TODO: rewrite this without using IncrementalIndex, because IncrementalIndex bears a lot of overhead
      // to support concurrent querying, that is not needed in Spark
      val incrementalIndex = new IncrementalIndex.Builder()

      ...
    }

    val file = finalStaticIndexer.merge(indices, ...)

There is a comment like this:

  // TODO: rewrite this without using IncrementalIndex, because IncrementalIndex bears a lot of overhead
  • Question 1: Can I know what this TODO means?
  • Question 2: IndexGeneratorJob.reduce also uses IncrementalIndex. I checked the MapReduce and Spark ingest implementations, and they're the same. Could you advise what I should check?

Thanks.

Job pulls in spark jars

Ideally the spark job should be able to pull spark from SPARK_HOME and not require the spark jars and all the extra junk that gets pulled in just to make spark work on the Peon.

segment file size much larger: 4GB instead of 900MB

Trying to use this as a library from the 0.10.6 commit for a Spark Druid re-processor, using dataframes and the segment pusher to make this an independent process. For some reason, when I use the default map-reduce task on one day's worth of data at "fifteen_minute" granularity I get a 700MB file, but if I use this library I get a 4GB file for the same spec. Is there some compression or configuration I am missing?
Verified the dimensions & metrics are the same.
Verified it's the same data that I am processing.
Verified that the data is actually at fifteen_minute granularity by looking at the footer of the smoosh file.

My guess is that it is missing dimension compression, but I'm unsure how to figure that out by looking at the smoosh file.
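
For reference, the compression settings in the example task earlier in this document are specified through indexSpec; a sketch of the same block, assuming the library honors it the same way when used as an embedded process:

    "indexSpec": {
        "bitmap": {"type": "concise"},
        "dimensionCompression": "lz4",
        "metricCompression": "lz4"
    }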

using druid 0.10.1

Size exceeds Integer.MAX_VALUE

Hi, I have this error when running a spark task.

Here is the task JSON.

{
  "type": "index_spark",
  "dataSchema": {
    "dataSource": "ad_statistics_history",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "tsv",
        "timestampSpec": {
          "column": "time",
          "format": "auto",
          "missingValue": null
        },
        "dimensionsSpec": {
          "dimensions": [
            "time_frame",
            "group_id",
            "network_id",
            "advertiser_id",
            "lineitem_id",
            "campaign_id",
            "creative_id",
            "payment_model",
            "is_ad_default",
            "website_id",
            "section_id",
            "channel_id",
            "zone_id",
            "placement_id",
            "template_id",
            "zone_format",
            "topic_id",
            "interest_id",
            "inmarker_id",
            "topic_curr_id",
            "interest_curr_id",
            "inmarket_curr_id",
            "audience_id",
            "location_id",
            "os_id",
            "os_version_id",
            "browser_id",
            "device_type",
            "device_id",
            "carrier_id",
            "age_range_id",
            "gender_id"
          ],
          "dimensionExclusions": [
          ],
          "spatialDimensions": []
        },
        "delimiter": ";",
        "columns": [
          "time",
          "time_frame",
          "group_id",
          "network_id",
          "advertiser_id",
          "lineitem_id",
          "campaign_id",
          "creative_id",
          "payment_model",
          "is_ad_default",
          "website_id",
          "section_id",
          "channel_id",
          "zone_id",
          "placement_id",
          "template_id",
          "zone_format",
          "topic_id",
          "interest_id",
          "inmarker_id",
          "topic_curr_id",
          "interest_curr_id",
          "inmarket_curr_id",
          "audience_id",
          "location_id",
          "os_id",
          "os_version_id",
          "browser_id",
          "device_type",
          "device_id",
          "carrier_id",
          "age_range_id",
          "gender_id",
          "impression",
          "viewable",
          "click",
          "click_fraud",
          "revenue",
          "proceeds",
          "spent",
          "user_sketches",
          "ad_request",
          "true_ad_request",
          "pageview",
          "converted_click",
          "conversion",
          "conv_value",
          "f_converted_click",
          "f_conversion",
          "f_conv_value",
          "a_converted_click",
          "a_conversion",
          "a_conv_value",
          "l_converted_click",
          "l_conversion",
          "l_conv_value"
        ]
      },
      "encoding": "UTF-8"
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      },
      {
        "type": "hyperUnique",
        "name": "user_sketches",
        "fieldName": "user_sketches"
      },
      {
        "type": "longSum",
        "name": "pageview",
        "fieldName": "pageview"
      },
      {
        "type": "longSum",
        "name": "impression",
        "fieldName": "impression"
      },
      {
        "type": "longSum",
        "name": "viewable",
        "fieldName": "viewable"
      },
      {
        "type": "longSum",
        "name": "click",
        "fieldName": "click"
      },
      {
        "type": "longSum",
        "name": "click_fraud",
        "fieldName": "click_fraud"
      },
      {
        "type": "doubleSum",
        "name": "revenue",
        "fieldName": "revenue"
      },
      {
        "type": "doubleSum",
        "name": "proceeds",
        "fieldName": "proceeds"
      },
      {
        "type": "doubleSum",
        "name": "spent",
        "fieldName": "spent"
      },
      {
        "type": "longSum",
        "name": "ad_request",
        "fieldName": "ad_request"
      },
      {
        "type": "longSum",
        "name": "converted_click",
        "fieldName": "converted_click"
      },
      {
        "type": "longSum",
        "name": "conversion",
        "fieldName": "conversion"
      },
      {
        "type": "doubleSum",
        "name": "conv_value",
        "fieldName": "conv_value"
      },
      {
        "type": "longSum",
        "name": "f_converted_click",
        "fieldName": "f_converted_click"
      },
      {
        "type": "longSum",
        "name": "f_conversion",
        "fieldName": "f_conversion"
      },
      {
        "type": "doubleSum",
        "name": "f_conv_value",
        "fieldName": "f_conv_value"
      },
      {
        "type": "longSum",
        "name": "a_converted_click",
        "fieldName": "a_converted_click"
      },
      {
        "type": "longSum",
        "name": "a_conversion",
        "fieldName": "f_conversion"
      },
      {
        "type": "doubleSum",
        "name": "a_conv_value",
        "fieldName": "f_conv_value"
      },
      {
        "type": "longSum",
        "name": "l_converted_click",
        "fieldName": "l_converted_click"
      },
      {
        "type": "longSum",
        "name": "l_conversion",
        "fieldName": "l_conversion"
      },
      {
        "type": "doubleSum",
        "name": "l_conv_value",
        "fieldName": "l_conv_value"
      },
      {
        "type": "longSum",
        "name": "true_ad_request",
        "fieldName": "true_ad_request"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "HOUR",
      "intervals": [
        "2016-10-19T01:00:00.000Z/2016-10-19T02:00:00.000Z"
      ]
    }
  },
  "intervals": [
    "2016-10-19T01:00:00.000Z/2016-10-19T02:00:00.000Z"
  ],
  "paths": [
    "/ad-statistic-hourly-utc/pageview/2016-10-19-01",
    "/ad-statistic-hourly-utc/pageview-20/2016-10-19-01",
    "/ad-statistic-hourly-utc/viewable/2016-10-19-01",
    "/ad-statistic-hourly-utc/click/2016-10-19-01",
    "/ad-statistic-hourly-utc/conversion/2016-10-19-01"
  ],
  "targetPartitionSize": 500000000,
  "properties": {
    "java.util.logging.manager": "org.apache.logging.log4j.jul.LogManager",
    "user.timezone": "UTC",
    "org.jboss.logging.provider": "log4j2",
    "file.encoding": "UTF-8"
  },
  "master": "local[8]",
  "context": {},
  "indexSpec": {
    "bitmap": {
      "type": "concise"
    },
    "dimensionCompression": "lz4",
    "metricCompression": "lz4"
  },
  "hadoopDependencyCoordinates": [
    "org.apache.spark:spark-core_2.10:1.6.1"
  ],
  "dataSource": "ad_statistics_history"
}

and the log:

Total time for which application threads were stopped: 0.0004104 seconds, Stopping threads took: 0.0000443 seconds
2016-10-19T04:59:24,669 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
2016-10-19T04:59:24,669 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
2016-10-19T04:59:24,669 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/api,null}
2016-10-19T04:59:24,669 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/,null}
2016-10-19T04:59:24,669 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/static,null}
2016-10-19T04:59:24,669 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/executors/json,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/executors,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/environment/json,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/environment,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/storage/json,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/storage,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/stages/json,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/stages,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
2016-10-19T04:59:24,670 INFO [task-runner-0-priority-0] org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/jobs,null}
2016-10-19T04:59:24,726 INFO [task-runner-0-priority-0] org.apache.spark.ui.SparkUI - Stopped Spark web UI at http://10.199.0.20:4040
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
538.289: RevokeBias                       [     290          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0005247 seconds, Stopping threads took: 0.0000820 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
538.291: RevokeBias                       [     289          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0005893 seconds, Stopping threads took: 0.0000842 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
538.297: RevokeBias                       [     283          0              2    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0008298 seconds, Stopping threads took: 0.0002542 seconds
2016-10-19T04:59:24,745 INFO [dispatcher-event-loop-17] org.apache.spark.MapOutputTrackerMasterEndpoint - MapOutputTrackerMasterEndpoint stopped!
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
538.309: RevokeBias                       [     277          0              2    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0006842 seconds, Stopping threads took: 0.0001366 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
539.310: no vm operation                  [     283          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0011107 seconds, Stopping threads took: 0.0001984 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
540.513: RevokeBias                       [     283          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0010052 seconds, Stopping threads took: 0.0001971 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
540.514: RevokeBias                       [     283          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0005140 seconds, Stopping threads took: 0.0001181 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
540.515: RevokeBias                       [     283          2              1    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0006320 seconds, Stopping threads took: 0.0001637 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
540.516: RevokeBias                       [     283          1              3    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0005673 seconds, Stopping threads took: 0.0002083 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
540.516: RevokeBias                       [     283          1              5    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0005311 seconds, Stopping threads took: 0.0001341 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
540.519: RevokeBias                       [     276          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0005611 seconds, Stopping threads took: 0.0000896 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
541.520: no vm operation                  [     275          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0010081 seconds, Stopping threads took: 0.0002003 seconds
2016-10-19T04:59:29,120 INFO [task-runner-0-priority-0] org.apache.spark.storage.MemoryStore - MemoryStore cleared
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
542.681: RevokeBias                       [     275          0              1    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0009716 seconds, Stopping threads took: 0.0002078 seconds
2016-10-19T04:59:29,121 INFO [task-runner-0-priority-0] org.apache.spark.storage.BlockManager - BlockManager stopped
2016-10-19T04:59:29,122 INFO [task-runner-0-priority-0] org.apache.spark.storage.BlockManagerMaster - BlockManagerMaster stopped
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
542.684: RevokeBias                       [     273          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0003565 seconds, Stopping threads took: 0.0000569 seconds
2016-10-19T04:59:29,128 INFO [dispatcher-event-loop-22] org.apache.spark.scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint - OutputCommitCoordinator stopped!
2016-10-19T04:59:29,139 INFO [sparkDriverActorSystem-akka.actor.default-dispatcher-14] akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2016-10-19T04:59:29,142 INFO [sparkDriverActorSystem-akka.actor.default-dispatcher-14] akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
542.714: RevokeBias                       [     233          0              1    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0008804 seconds, Stopping threads took: 0.0001897 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
542.720: RevokeBias                       [     233          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0003667 seconds, Stopping threads took: 0.0000499 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
542.721: RevokeBias                       [     233          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0002772 seconds, Stopping threads took: 0.0000408 seconds
2016-10-19T04:59:29,161 INFO [task-runner-0-priority-0] org.apache.spark.SparkContext - Successfully stopped SparkContext
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
542.747: RevokeBias                       [     228          0              1    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0007918 seconds, Stopping threads took: 0.0001704 seconds
2016-10-19T04:59:29,188 INFO [sparkDriverActorSystem-akka.actor.default-dispatcher-14] akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
2016-10-19T04:59:29,187 ERROR [task-runner-0-priority-0] io.druid.indexer.spark.SparkBatchIndexTask - Error running task [index_spark_ad_statistics_history_2016-10-19T01:00:00.000Z_2016-10-19T02:00:00.000Z_2016-10-19T04:50:26.408Z]
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:204) ~[druid-indexing-service-0.9.2-rc2-SNAPSHOT.jar:0.9.2-rc2-SNAPSHOT]
    at io.druid.indexer.spark.SparkBatchIndexTask.run(SparkBatchIndexTask.scala:152) [druid-spark-batch_2.10-0.9.1-3-SNAPSHOT.jar:0.9.1-3-SNAPSHOT]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.2-rc2-SNAPSHOT.jar:0.9.2-rc2-SNAPSHOT]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.2-rc2-SNAPSHOT.jar:0.9.2-rc2-SNAPSHOT]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_77]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_77]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_77]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_77]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_77]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_77]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_77]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.2-rc2-SNAPSHOT.jar:0.9.2-rc2-SNAPSHOT]
    ... 7 more
Caused by: java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 80, localhost): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
    at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
    at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at com.google.common.io.Closer.rethrow(Closer.java:149) ~[guava-16.0.1.jar:?]
    at io.druid.indexer.spark.SparkBatchIndexTask$.runTask(SparkBatchIndexTask.scala:369) ~[?:?]
    at io.druid.indexer.spark.SparkBatchIndexTask.runTask(SparkBatchIndexTask.scala) ~[?:?]
    at io.druid.indexer.spark.Runner.runTask(Runner.java:29) ~[?:?]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_77]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_77]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_77]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_77]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.2-rc2-SNAPSHOT.jar:0.9.2-rc2-SNAPSHOT]
    ... 7 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 80, localhost): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
    at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
    at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) ~[?:?]
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) ~[?:?]
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) ~[?:?]
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) ~[?:?]
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) ~[?:?]
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) ~[?:?]
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) ~[?:?]
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) ~[?:?]
    at scala.Option.foreach(Option.scala:236) ~[?:?]
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) ~[?:?]
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) ~[?:?]
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) ~[?:?]
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) ~[?:?]
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) ~[?:?]
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) ~[?:?]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) ~[?:?]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) ~[?:?]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858) ~[?:?]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929) ~[?:?]
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927) ~[?:?]
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) ~[?:?]
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) ~[?:?]
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) ~[?:?]
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926) ~[?:?]
    at io.druid.indexer.spark.SparkDruidIndexer$.loadData(SparkDruidIndexer.scala:173) ~[?:?]
    at io.druid.indexer.spark.SparkBatchIndexTask$.runTask(SparkBatchIndexTask.scala:418) ~[?:?]
    at io.druid.indexer.spark.SparkBatchIndexTask.runTask(SparkBatchIndexTask.scala) ~[?:?]
    at io.druid.indexer.spark.Runner.runTask(Runner.java:29) ~[?:?]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_77]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_77]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_77]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_77]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.2-rc2-SNAPSHOT.jar:0.9.2-rc2-SNAPSHOT]
    ... 7 more
Caused by: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869) ~[?:1.8.0_77]
    at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127) ~[?:?]
    at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115) ~[?:?]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250) ~[?:?]
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129) ~[?:?]
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136) ~[?:?]
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503) ~[?:?]
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420) ~[?:?]
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625) ~[?:?]
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154) ~[?:?]
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) ~[?:?]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) ~[?:?]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) ~[?:?]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) ~[?:?]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) ~[?:?]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) ~[?:?]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) ~[?:?]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) ~[?:?]
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) ~[?:?]
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) ~[?:?]
    at org.apache.spark.scheduler.Task.run(Task.scala:89) ~[?:?]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) ~[?:?]
    ... 3 more
2016-10-19T04:59:29,221 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_spark_ad_statistics_history_2016-10-19T01:00:00.000Z_2016-10-19T02:00:00.000Z_2016-10-19T04:50:26.408Z] status changed to [FAILED].
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
542.783: RevokeBias                       [     229          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0007305 seconds, Stopping threads took: 0.0001496 seconds
2016-10-19T04:59:29,224 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_spark_ad_statistics_history_2016-10-19T01:00:00.000Z_2016-10-19T02:00:00.000Z_2016-10-19T04:50:26.408Z",
  "status" : "FAILED",
  "duration" : 536981
}

Needs more docs

Before any public release, this needs more documentation.

Input file strings must conform to hadoop expectations.

The input file paths must conform to Hadoop's expectations so that the SparkContext's textFile can understand them. The file names are joined with a "," delimiter and passed to textFile, so any name that violates Hadoop's expectations will cause problems, as sketched below.
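
A minimal sketch of what this means in practice, assuming a local Spark setup; the paths and the join-then-textFile behavior shown here are illustrative, not the extension's actual code:

    import org.apache.spark.{SparkConf, SparkContext}

    object PathsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("paths-sketch").setMaster("local[2]"))
        // Hypothetical inputs: each entry must be something Hadoop's path parsing
        // (and therefore textFile) can understand -- a comma inside a name, for
        // example, would silently split it into two bogus paths.
        val paths = Seq(
          "hdfs:///ad-statistic-hourly-utc/pageview/2016-10-19-01",
          "hdfs:///ad-statistic-hourly-utc/click/2016-10-19-01"
        )
        // textFile accepts a comma-delimited list of paths/globs.
        val lines = sc.textFile(paths.mkString(","))
        println(s"lines read: ${lines.count()}")
        sc.stop()
      }
    }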

Unable to build the package

Hi,
I can't build the project because I can't find the spark-1.5.2-mmx package, and when I try to use the existing 1.5.2 artifact the compilation fails.
1. Where is the source for this artifact, or how can I build it?
2. If I use this artifact, will it work with a standard Spark cluster that doesn't include it?

Thanks

Make indexer handle more RDD types

In general, it would be preferable if the RDD were passed as a parameter to the indexer; that way the indexing process could be kept separate from the initial formatting of the RDD, allowing this code to be used more easily as a library.
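
Purely as an illustration of the suggestion, one possible shape for such an entry point might look like the following; the trait name, row type, and return value are hypothetical and not part of the current API:

    import org.apache.spark.rdd.RDD

    // Hypothetical interface: the caller builds and formats the RDD however it
    // likes (textFile, Parquet, a join, ...) and the indexer only handles
    // partitioning by interval and persisting segments.
    trait RddDruidIndexer {
      def index(rows: RDD[Map[String, Any]]): Seq[String] // returns segment identifiers
    }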

How to configure re-indexing a datasource?

Hi,
I'm wondering whether re-indexing an existing datasource is supported. I configured the job as follows, but it doesn't seem to work.

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "ad_statistics_hourly",
          "intervals": [
            "2016-10-17T04:00:00.000Z/2016-10-17T05:00:00.000Z"
          ]
        }
      }
    },
    "dataSchema": {
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "longSum",
          "name": "true_ad_request",
          "fieldName": "true_ad_request"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "HOUR",
        "intervals": [
          "2016-10-17T04:00:00.000Z/2016-10-17T05:00:00.000Z"
        ]
      },
      "parser": {
        "parseSpec": {
          "timestampSpec": {
            "column": "time",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "time_frame",
              "group_id"
            ]
          },
          "columns": [
            "time",
            "time_frame",
            "group_id",
            "true_ad_request"
          ]
        },
        "type": "string"
      },
      "dataSource": "ad_statistics_hourly"
    },
    "intervals": [
      "2016-10-17T04:00:00.000Z/2016-10-17T05:00:00.000Z"
    ],
    "indexSpec": {
      "bitmap": {
        "type": "concise"
      },
      "dimensionCompression": "lz4",
      "metricCompression": "lz4"
    },
    "master": "local[4]",
    "properties": {
      "spark.io.compression.codec": "org.apache.spark.io.LZ4CompressionCodec"
    },
    "targetPartitionSize": 4000000,
    "type": "index_spark"
  }
}

Spark indexer fails with self-contained jar in Druid 0.9.0

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[druid-selfcontained-0.9.0-rc2-mmx1.jar:0.9.0-rc2-mmx1]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:171) ~[druid-selfcontained-0.9.0-rc2-mmx1.jar:0.9.0-rc2-mmx1]
    at io.druid.indexer.spark.SparkBatchIndexTask.run(SparkBatchIndexTask.scala:155) [druid-spark-batch_2.10-0.0.30.jar:0.0.30]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:338) [druid-selfcontained-0.9.0-rc2-mmx1.jar:0.9.0-rc2-mmx1]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:318) [druid-selfcontained-0.9.0-rc2-mmx1.jar:0.9.0-rc2-mmx1]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_72]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_72]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_72]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_72]
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_72]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_72]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_72]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_72]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:168) ~[druid-selfcontained-0.9.0-rc2-mmx1.jar:0.9.0-rc2-mmx1]
    ... 7 more
Caused by: java.lang.NoSuchMethodError: org.jets3t.service.impl.rest.httpclient.RestS3Service.<init>(Lorg/jets3t/service/security/AWSCredentials;)V
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:60) ~[?:?]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_72]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_72]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_72]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_72]
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) ~[?:?]
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) ~[?:?]
    at org.apache.hadoop.fs.s3native.$Proxy194.initialize(Unknown Source) ~[?:?]
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:272) ~[?:?]
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433) ~[?:?]
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88) ~[?:?]
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467) ~[?:?]
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449) ~[?:?]
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367) ~[?:?]
    at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1614) ~[?:?]
    at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:66) ~[?:?]
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:539) ~[?:?]
    at io.druid.indexer.spark.SparkBatchIndexTask$.runTask(SparkBatchIndexTask.scala:359) ~[?:?]
    at io.druid.indexer.spark.SparkBatchIndexTask.runTask(SparkBatchIndexTask.scala) ~[?:?]
    at io.druid.indexer.spark.Runner.runTask(Runner.java:29) ~[?:?]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_72]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_72]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_72]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_72]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:168) ~[druid-selfcontained-0.9.0-rc2-mmx1.jar:0.9.0-rc2-mmx1]
    ... 7 more

latest fails in spark 2.0

16/09/27 22:20:37 ERROR SparkDruidIndexer$: Error in partition [1]
java.lang.NoSuchMethodError: com.google.common.io.ByteSource.concat(Ljava/lang/Iterable;)Lcom/google/common/io/ByteSource;
    at io.druid.segment.data.GenericIndexedWriter.combineStreams(GenericIndexedWriter.java:139)
    at io.druid.segment.StringDimensionMergerLegacy.writeValueMetadataToFile(StringDimensionMergerLegacy.java:206)
    at io.druid.segment.IndexMerger.makeIndexFiles(IndexMerger.java:696)
    at io.druid.segment.IndexMerger.merge(IndexMerger.java:438)
    at io.druid.segment.IndexMerger.persist(IndexMerger.java:186)
    at io.druid.segment.IndexMerger.persist(IndexMerger.java:152)
    at io.druid.indexer.spark.SparkDruidIndexer$$anonfun$13$$anonfun$21.apply(SparkDruidIndexer.scala:293)
    at io.druid.indexer.spark.SparkDruidIndexer$$anonfun$13$$anonfun$21.apply(SparkDruidIndexer.scala:288)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
    at scala.collection.AbstractIterator.toList(Iterator.scala:1336)
    at io.druid.indexer.spark.SparkDruidIndexer$$anonfun$13.apply(SparkDruidIndexer.scala:309)
    at io.druid.indexer.spark.SparkDruidIndexer$$anonfun$13.apply(SparkDruidIndexer.scala:205)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Enhance input datasource capabilities

Right now druid-spark-batch reads data from the given locations using sc.textFile. This is an important limitation when the data is stored in a format like Parquet (or any other data format supported by Spark).

Would you consider enhancing this tool to accept an arbitrary Spark SQL expression as the input definition? You would get for free (see the sketch after this list):

  • support for any data format supported by Spark
  • support for any UDF supported by Spark for data pre-processing
  • support for joins before ingestion
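
A minimal sketch of the idea using the Spark 2.x SparkSession API; the Parquet locations, table names, and query are hypothetical, and the resulting rows would still need to be handed off to the indexer:

    import org.apache.spark.sql.SparkSession

    object SqlInputSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("sql-input-sketch")
          .master("local[2]")
          .getOrCreate()

        // Any format Spark can read becomes available, plus UDFs and joins.
        spark.read.parquet("/warehouse/pageview").createOrReplaceTempView("pageview")
        spark.read.parquet("/warehouse/campaign").createOrReplaceTempView("campaign")

        val rows = spark.sql(
          """SELECT p.time, p.group_id, c.campaign_name, p.true_ad_request
            |FROM pageview p
            |JOIN campaign c ON p.group_id = c.group_id""".stripMargin)

        rows.show(5) // in a real pipeline these rows would feed the indexer
        spark.stop()
      }
    }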

Job fails due to NoSuchMethodError exception

I'm trying to execute the index_spark job on a Spark 1.6.2 cluster pre-built for Hadoop 2.4.0.
Each time I submit the job, I get this exception:

2017-04-10T16:17:58,925 WARN [task-result-getter-0] org.apache.spark.scheduler.TaskSetManager - Lost task 2.0 in stage 0.0 (TID 2, 172.20.0.1): java.lang.NoSuchMethodError: com.google.inject.util.Types.collectionOf(Ljava/lang/reflect/Type;)Ljava/lang/reflect/ParameterizedType;
	at com.google.inject.multibindings.Multibinder.collectionOfProvidersOf(Multibinder.java:202)
	at com.google.inject.multibindings.Multibinder$RealMultibinder.<init>(Multibinder.java:283)
	at com.google.inject.multibindings.Multibinder$RealMultibinder.<init>(Multibinder.java:258)
	at com.google.inject.multibindings.Multibinder.newRealSetBinder(Multibinder.java:178)
	at com.google.inject.multibindings.Multibinder.newSetBinder(Multibinder.java:150)
	at io.druid.guice.LifecycleModule.getEagerBinder(LifecycleModule.java:130)
	at io.druid.guice.LifecycleModule.configure(LifecycleModule.java:136)
	at com.google.inject.spi.Elements$RecordingBinder.install(Elements.java:223)
	at com.google.inject.spi.Elements.getElements(Elements.java:101)
	at com.google.inject.spi.Elements.getElements(Elements.java:92)
	at com.google.inject.util.Modules$RealOverriddenModuleBuilder$1.configure(Modules.java:152)
	at com.google.inject.AbstractModule.configure(AbstractModule.java:59)
	at com.google.inject.spi.Elements$RecordingBinder.install(Elements.java:223)
	at com.google.inject.spi.Elements.getElements(Elements.java:101)
	at com.google.inject.spi.Elements.getElements(Elements.java:92)
	at com.google.inject.util.Modules$RealOverriddenModuleBuilder$1.configure(Modules.java:152)
	at com.google.inject.AbstractModule.configure(AbstractModule.java:59)
	at com.google.inject.spi.Elements$RecordingBinder.install(Elements.java:223)
	at com.google.inject.spi.Elements.getElements(Elements.java:101)
	at com.google.inject.internal.InjectorShell$Builder.build(InjectorShell.java:133)
	at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:103)
	at com.google.inject.Guice.createInjector(Guice.java:95)
	at com.google.inject.Guice.createInjector(Guice.java:72)
	at com.google.inject.Guice.createInjector(Guice.java:62)
	at io.druid.initialization.Initialization.makeInjectorWithModules(Initialization.java:366)
	at io.druid.indexer.spark.SerializedJsonStatic$.liftedTree1$1(SparkDruidIndexer.scala:438)
	at io.druid.indexer.spark.SerializedJsonStatic$.injector$lzycompute(SparkDruidIndexer.scala:437)
	at io.druid.indexer.spark.SerializedJsonStatic$.injector(SparkDruidIndexer.scala:436)
	at io.druid.indexer.spark.SerializedJsonStatic$.liftedTree2$1(SparkDruidIndexer.scala:465)
	at io.druid.indexer.spark.SerializedJsonStatic$.mapper$lzycompute(SparkDruidIndexer.scala:464)
	at io.druid.indexer.spark.SerializedJsonStatic$.mapper(SparkDruidIndexer.scala:463)
	at io.druid.indexer.spark.SerializedJson.getMap(SparkDruidIndexer.scala:520)
	at io.druid.indexer.spark.SerializedJson.readObject(SparkDruidIndexer.scala:534)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2122)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
	at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2122)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:64)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

which causes my job to fail.

I installed the package with the command java -classpath "lib/*" io.druid.cli.Main tools pull-deps -c io.druid.extensions:druid-spark-batch_2.10:0.9.2.14 -h org.apache.spark:spark-core_2.10:1.6.2.

I tried manually adding the Guice jar to my Spark classpath, but it didn't help. I also noticed that the job works with a local Spark master (local[*]). I read this page because I found similar errors for the index_hadoop job, but I couldn't really apply those tips to my case.

Any help would be really appreciated.

Update: I'm using Imply 2.0.0.
