
rovio-ingest's People

Contributors

jorgeramirezcarrasco-rovio, juhoautio, juhoautio-rovio, omega359, rocel, vivek-balakrishnan-rovio


rovio-ingest's Issues

Support for Multi-Value Dimensions

Is there support, or will there be future support, for dimensions that have multiple values (e.g. arrays of strings)? I ran the no-code wrapper script provided by the rovio-ingest repository and hit an error when it tried to ingest a column whose values are string arrays.

Error Message:

  • pyspark.errors.exceptions.captured.IllegalArgumentException: Dimensions with unsupported data types, set excludeColumnsWithUnknownTypes to true to exclude: StructField(colName,ArrayType(StringType,true)

Example of the data I am trying to work with:

  • "tags": ["t1","t2","t3"]
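
Until array columns are handled, one possible stopgap is to find them before writing and either drop them or join each array into a single delimited string. The sketch below is not part of rovio-ingest: `array_columns` is a hypothetical helper that inspects the dictionary returned by PySpark's `df.schema.jsonValue()`. Joining loses Druid's multi-value semantics, so this only avoids the error and does not reproduce a multi-value dimension.

```python
def array_columns(schema_json):
    """Return names of top-level columns whose Spark type is an array.

    `schema_json` is the dict produced by `df.schema.jsonValue()`.
    Hypothetical helper, not part of rovio-ingest.
    """
    names = []
    for field in schema_json.get("fields", []):
        field_type = field["type"]
        # Complex types are nested dicts; primitive types are plain strings.
        if isinstance(field_type, dict) and field_type.get("type") == "array":
            names.append(field["name"])
    return names


# Schema equivalent to a DataFrame with a string "id" and a string-array "tags" column.
example_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "string", "nullable": True, "metadata": {}},
        {
            "name": "tags",
            "type": {"type": "array", "elementType": "string", "containsNull": True},
            "nullable": True,
            "metadata": {},
        },
    ],
}

print(array_columns(example_schema))

# With a real DataFrame (requires pyspark), the arrays could be joined like this:
#   from pyspark.sql import functions as F
#   for name in array_columns(df.schema.jsonValue()):
#       df = df.withColumn(name, F.array_join(F.col(name), ","))
```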

Need suggestion on reading Druid from Spark

I'm starting to use Spark to read from and write to Druid, and I've come across this library for ingestion. Can you please suggest the best way to read records from Druid into a Spark Dataset? Some forums mention using the [Avatica JDBC driver]. I'd like your opinion on that.

Apologies for raising a Q&A-type question as an issue. I couldn't find a Discussions tab to use instead.

Thanks in advance.
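
For context, the Avatica route goes through Spark's generic JDBC source and is independent of rovio-ingest. The sketch below only builds the reader options. `druid_jdbc_options`, the broker address, and the `wikipedia` table with its `page` column are hypothetical. The URL path and driver class follow the Druid SQL JDBC documentation, and the Avatica client jar is assumed to be on the Spark classpath. Queries issued this way pass through the Broker, so this may not suit bulk extraction.

```python
def druid_jdbc_options(broker_url, query):
    """Build options for Spark's JDBC reader against Druid SQL via Avatica.

    Hypothetical helper; `broker_url` is the base HTTP address of a Druid Broker.
    """
    base = broker_url.rstrip("/")
    return {
        "url": f"jdbc:avatica:remote:url={base}/druid/v2/sql/avatica/",
        "driver": "org.apache.calcite.avatica.remote.Driver",
        "query": query,
    }


opts = druid_jdbc_options(
    "http://localhost:8082",
    "SELECT __time, page FROM wikipedia",
)
print(opts["url"])

# With an active SparkSession (requires pyspark and the Avatica jar):
#   df = spark.read.format("jdbc").options(**opts).load()
```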

Support for Azure Storage

I am trying to implement support for Azure Storage.
I am having difficulty creating an AzureDataSegmentKiller, particularly creating an instance of AzureCloudBlobIterableFactory.
Do you have any idea how to do that?

Support PostgreSQL as Metadata Storage

Hello folks, first of all, what a great project. I think this has the potential to replace native index_parallel.

We are currently evaluating this project and wondering whether it can connect to a PostgreSQL metadata store.

Avoid creating too many string objects in TaskDataWriter

In a recent release of the library (1.0.5), we introduced a change to read values as Java String for all String columns. This was introduced because Spark's internal UTF8String is not compatible with DataSketches.

However, this degraded performance because too many objects are created. We noticed the problem while re-ingesting a big dataset with over 10 years of data and lots of String dimensions.

We are working on a fix to coerce values to Java String only for sketch columns.

Potential security vulnerabilities in the shared libraries that rovio-ingest depends on

Hi @juhoautio, @jorgeramirezcarrasco-rovio, I'd like to report a vulnerability issue in com.rovio.ingest:rovio-ingest:1.0.0_spark_3.0.1.

Issue Description

com.rovio.ingest:rovio-ingest:1.0.0_spark_3.0.1 directly or transitively depends on 45 C libraries (.so) across many platforms (such as x86-64, x86, arm64, armhf). However, I noticed that some of these C libraries are vulnerable and contain the following CVEs:

libzstd-jni.so from the C project zstd (version 1.4.4) exposes 1 vulnerability:
CVE-2021-24032
liblz4-java.so from the C project lz4 (version 1.9.1) exposes 1 vulnerability:
CVE-2019-17543

Suggested Vulnerability Patch Versions

zstd has fixed the vulnerability in versions >=1.4.9
lz4 has fixed the vulnerability in versions >=1.9.2
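
The comparison behind this report can be checked mechanically. The sketch below uses only the versions quoted in this issue. `parse_version` and `is_patched` are hypothetical helpers and assume plain dotted numeric versions.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '1.4.9' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def is_patched(bundled, first_fixed):
    """True when the bundled version is at or above the first fixed version."""
    return parse_version(bundled) >= parse_version(first_fixed)


# (bundled version, first fixed version) as reported in this issue
reported = {
    "zstd": ("1.4.4", "1.4.9"),
    "lz4": ("1.9.1", "1.9.2"),
}

for name, (bundled, fixed) in reported.items():
    status = "ok" if is_patched(bundled, fixed) else "needs upgrade"
    print(f"{name} {bundled}: {status} (fixed in >= {fixed})")
```

Comparing tuples of integers avoids the string-comparison trap where "1.4.10" would sort below "1.4.9".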

Java build tools cannot report vulnerable C libraries, which may introduce security issues in many downstream Java projects.
Could you please upgrade the above shared libraries to their patched versions?

Thanks for your help~
Best regards,
Helen Parr

ThetaSketch Ingestion

Is ThetaSketch ingestion supported?

If the ThetaSketch objects are already created in the Spark DataFrame using the DataSketches libraries and I try to ingest them, I get the following error.
Are there any dependencies I need to add?

Job aborted due to stage failure: Task 0 in stage 14.0 failed 4 times, most recent failure: Lost task 0.3 in stage 14.0 (TID 127905) (ip-10-232-14-117.ec2.internal executor 127): java.lang.IllegalArgumentException: Failed to deserialize from metricsSpec=[
        {
          "type": "thetaSketch",
          "name": "user_id_sketch",
          "fieldName": "uid",
          "size": 4096,
          "shouldFinalize": true,
          "isInputThetaSketch": true,
          "errorBoundsStdDev": null
        },
        {
          "type": "longMin",
          "name": "min_timestamp",
          "fieldName": "unix_timestamp_min",
          "expression": null
        },
        {
          "type": "longMax",
          "name": "max_timestamp",
          "fieldName": "unix_timestamp_max",
          "expression": null
        }
      ]
	at com.rovio.ingest.model.SegmentSpec.getAggregators(SegmentSpec.java:240)
	at com.rovio.ingest.model.SegmentSpec.getDataSchema(SegmentSpec.java:158)
	at com.rovio.ingest.TaskDataWriter.<init>(TaskDataWriter.java:97)
	at com.rovio.ingest.DruidDataSourceWriter$TaskWriterFactory.createWriter(DruidDataSourceWriter.java:106)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:430)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:381)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:138)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: com.fasterxml.jackson.databind.exc.InvalidTypeIdException: Please make sure to load all the necessary extensions and jars with type 'thetaSketch'. Could not resolve type id 'thetaSketch' as a subtype of `org.apache.druid.query.aggregation.AggregatorFactory` known type ids = [cardinality, count, doubleAny, doubleFirst, doubleLast, doubleMax, doubleMean, doubleMin, doubleSum, expression, filtered, floatAny, floatFirst, floatLast, floatMax, floatMin, floatSum, grouping, histogram, hyperUnique, javascript, longAny, longFirst, longLast, longMax, longMin, longSum, stringAny, stringFirst, stringFirstFold, stringLast, stringLastFold]
 at [Source: (String)"[
        {
          "type": "thetaSketch",
          "name": "user_id_sketch",
          "fieldName": "uid",
          "size": 4096,
          "shouldFinalize": true,
          "isInputThetaSketch": true,
          "errorBoundsStdDev": null
        },
        {
          "type": "longMin",
          "name": "min_timestamp",
          "fieldName": "unix_timestamp_min",
          "expression": null
        },
        {
          "type": "longMax",
          "name": "max_timestamp",
          "fi"[truncated 78 chars]; line: 3, column: 19] (through reference chain: java.util.ArrayList[0])
	at com.fasterxml.jackson.databind.exc.InvalidTypeIdException.from(InvalidTypeIdException.java:43)
	at org.apache.druid.jackson.DefaultObjectMapper$DefaultDeserializationProblemHandler.handleUnknownTypeId(DefaultObjectMapper.java:124)
	at com.fasterxml.jackson.databind.DeserializationContext.handleUnknownTypeId(DeserializationContext.java:1545)
	at com.fasterxml.jackson.databind.jsontype.impl.TypeDeserializerBase._handleUnknownTypeId(TypeDeserializerBase.java:298)
	at com.fasterxml.jackson.databind.jsontype.impl.TypeDeserializerBase._findDeserializer(TypeDeserializerBase.java:165)
	at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer._deserializeTypedForId(AsPropertyTypeDeserializer.java:125)
	at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer.deserializeTypedFromObject(AsPropertyTypeDeserializer.java:110)
	at com.fasterxml.jackson.databind.deser.AbstractDeserializer.deserializeWithType(AbstractDeserializer.java:263)
	at com.fasterxml.jackson.databind.deser.std.CollectionDeserializer._deserializeFromArray(CollectionDeserializer.java:357)
	at com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:244)
	at com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:28)
	at com.fasterxml.jackson.databind.deser.DefaultDeserializationContext.readRootValue(DefaultDeserializationContext.java:323)
	at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4674)
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3629)
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3612)
	at com.rovio.ingest.model.SegmentSpec.getAggregators(SegmentSpec.java:238)
	... 13 more
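
The `Caused by` line is the key part of this trace: Jackson could only resolve the aggregator types listed under `known type ids`, and `thetaSketch` is not among them, which suggests the DataSketches aggregator module was not registered with the ObjectMapper on the executor. The sketch below only reproduces that lookup with values copied from the trace. It is illustrative and is not how rovio-ingest or Druid implement the check.

```python
import json

# Type ids copied from the "known type ids" list in the exception message.
known_type_ids = {
    "cardinality", "count", "doubleAny", "doubleFirst", "doubleLast",
    "doubleMax", "doubleMean", "doubleMin", "doubleSum", "expression",
    "filtered", "floatAny", "floatFirst", "floatLast", "floatMax",
    "floatMin", "floatSum", "grouping", "histogram", "hyperUnique",
    "javascript", "longAny", "longFirst", "longLast", "longMax", "longMin",
    "longSum", "stringAny", "stringFirst", "stringFirstFold", "stringLast",
    "stringLastFold",
}

# Reduced metricsSpec from the issue (only the fields needed for the lookup).
metrics_spec = json.loads("""
[
  {"type": "thetaSketch", "name": "user_id_sketch", "fieldName": "uid"},
  {"type": "longMin", "name": "min_timestamp", "fieldName": "unix_timestamp_min"},
  {"type": "longMax", "name": "max_timestamp", "fieldName": "unix_timestamp_max"}
]
""")

# Aggregator types that the deserializer would fail to resolve.
unresolved = [m["type"] for m in metrics_spec if m["type"] not in known_type_ids]
print(unresolved)
```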
