paypal / dione
Dione - a Spark and HDFS indexing library
License: Apache License 2.0
Exception in thread "main" java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "_1")
- root class: "scala.Tuple2"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3391)
	at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3388)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3388)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3369)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3368)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2557)
	at com.paypal.dione.spark.index.IndexManagerUtils$$anon$1.<init>(IndexManagerUtils.scala:161)
	at com.paypal.dione.spark.index.IndexManagerUtils$.inferSize(IndexManagerUtils.scala:157)
	at com.paypal.dione.spark.index.IndexManagerUtils$.sampleFilesAndInferSize(IndexManagerUtils.scala:134)
	at com.paypal.dione.spark.index.IndexManagerUtils$.calcBtreeProperties(IndexManagerUtils.scala:108)
	at com.paypal.dione.spark.index.IndexManager.appendNewPartitions(IndexManager.scala:122)
	at com.paypal.risk.gds.index.IndexBuilder$$anonfun$buildIndex$3$$anonfun$apply$2.apply(IndexBuilder.scala:94)
	at com.paypal.risk.gds.index.IndexBuilder$$anonfun$buildIndex$3$$anonfun$apply$2.apply(IndexBuilder.scala:92)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2021-11-18 05:14:14 ERROR FileFormatWriter - Aborting job 4815b2e2-e798-4a64-96f5-9930468a5c32.
Links to #PR/#Issue
The data_size column may have an issue when reading the first record of a data source file, as in the stack trace above.
what is the problem?
data_size may be null when building the index.
how can we solve it?
Maybe we can skip fetching the inputSourceBytes.
https://github.com/paypal/dione/blob/main/dione-spark/src/main/scala/com/paypal/dione/spark/index/IndexManagerUtils.scala#L164
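The NPE above comes from unboxing a null value into a non-nullable scala.Long. The following is a minimal sketch of the failure mode and of the Option-based fix the Spark error message itself suggests (e.g. decoding to (Option[Long], String) instead of (Long, String)); NullableDataSize and its methods are illustrative names, not Dione API:

```scala
// Sketch only: why a null data_size blows up when treated as scala.Long,
// and how Option keeps the null representable.
object NullableDataSize {
  // Implicit unboxing of a null java.lang.Long throws the NullPointerException seen above.
  def unsafe(raw: java.lang.Long): Long = raw

  // Wrapping in Option makes the missing value explicit instead of throwing.
  def safe(raw: java.lang.Long): Option[Long] = Option(raw).map(_.longValue)
}
```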
We could implement a simple, Java-native, standalone HTTP server that receives requests for key(s) and returns the payload (as JSON, etc.). Users could then implement their own web server to visualize the data.
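As a sketch of the idea, the JDK's built-in com.sun.net.httpserver is enough for such a server; the lookup stub, the /get route, and the JSON shape below are assumptions for illustration, not anything Dione provides:

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress

// Hypothetical sketch: a JDK-only HTTP server answering key lookups as JSON.
object KeyValueHttpServer {
  // Stub standing in for the real index fetch.
  def lookup(key: String): String = s"""{"key":"$key","payload":null}"""

  def start(port: Int): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(port), 0)
    server.createContext("/get", new HttpHandler {
      override def handle(ex: HttpExchange): Unit = {
        // Expect queries of the form /get?key=<key>
        val key = Option(ex.getRequestURI.getQuery).getOrElse("").stripPrefix("key=")
        val body = lookup(key).getBytes("UTF-8")
        ex.getResponseHeaders.add("Content-Type", "application/json")
        ex.sendResponseHeaders(200, body.length.toLong)
        ex.getResponseBody.write(body)
        ex.close()
      }
    })
    server.start()
    server
  }
}
```

Binding port 0 lets the OS pick a free port, which also makes the sketch easy to smoke-test locally.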
Support time range query
Links to #PR/#Issue
what is the problem?
how can we solve it?
Spark 2.4+ DataSource v2 is much more powerful than in Spark 2.3.
The main issue in Spark 2.3 is that you basically need to implement everything yourself,
and dealing with partitioning and bucketing is a headache. This is why we reverted to DataSource v1.
Spark 2.4 adds lots of goodies; we should consider moving when we upgrade.
we'll have:
Add an IT test script that generates a big table, indexes it, and asserts the basic functionality. Such a Spark script should work smoothly on any Hadoop cluster (with HDFS).
The current filesDF is both ugly and inefficient in terms of data locality.
We should try to switch to something like HadoopRDD/NewHadoopRDD, or something more natural, to leverage the preferred-locations functionality.
We should have all the metadata we need on HDFS already, so we can enable reading the index without Hive. This applies when we just want a "key-value" table to be available for ad-hoc queries.
Currently we don't support file splits in all formats. Relates to #10
To support ignoring "old" partitions. Alternatively, we can add a partition "blacklist" or a minimum value to the index metadata.
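The alternative could be sketched as simple partition pruning against values stored in the index metadata; PartitionFilter and its parameters are hypothetical names, not existing Dione code:

```scala
// Hypothetical sketch: prune "old" partitions via a minimum value and/or a blacklist
// kept in the index metadata. Lexicographic comparison works for dt=YYYY-MM-DD partitions.
object PartitionFilter {
  def keep(partitions: Seq[String],
           minValue: Option[String],
           blacklist: Set[String]): Seq[String] =
    partitions.filter(p => minValue.forall(p >= _) && !blacklist.contains(p))
}
```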
Try to recognize common path prefixes at runtime, and trim them.
For example, files in a standard table might look like:
hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0000.parquet
hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0001.parquet
...
On read, before the shuffle, we can trim the common prefix to reduce the shuffle size.
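The trimming itself is pure string handling and could be sketched as follows (PathPrefixTrimmer is a hypothetical name for illustration):

```scala
// Sketch: find the longest common directory prefix of the file paths and strip it,
// so only the short suffixes need to travel through the shuffle.
object PathPrefixTrimmer {
  def commonPrefix(paths: Seq[String]): String = {
    if (paths.isEmpty) return ""
    // Pairwise longest common character prefix over all paths.
    val prefix = paths.reduce { (a, b) =>
      a.zip(b).takeWhile { case (x, y) => x == y }.map(_._1).mkString
    }
    // Cut back to the last '/' so we only trim whole path components.
    prefix.take(prefix.lastIndexOf('/') + 1)
  }

  // Returns the shared prefix plus the trimmed suffixes.
  def trim(paths: Seq[String]): (String, Seq[String]) = {
    val p = commonPrefix(paths)
    (p, paths.map(_.stripPrefix(p)))
  }
}
```

On the read side, the suffixes would be re-joined with the stored prefix before opening the files.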
We currently fail to fetch a value of type Array if the data is in Parquet format.
We need to look into this. Also, loadByIndex fails if the requested columns contain complex types (array/map), at least in Parquet.
Sometimes, when indexing an additional partition on an existing index over an Avro table, some index files become corrupt.
Reading the index produces the following error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Aborting TaskSet 10.0 because task 0 (partition 0)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Lost task 0.0 in stage 10.0 (TID 2167, lvshdc5dn2191.lvs.paypalinc.com, executor 177): org.apache.avro.AvroRuntimeException: java.io.IOException: Block read partially, the data may be corrupt
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:149)
at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:52)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:277)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:214)
This happens with Dione 0.5.x.
As ORC files are very common, we should add indexing support for them.
Does this library support Delta files (https://github.com/delta-io/delta/)?
Links to #PR/#Issue
what is the problem?
how can we solve it?
As we're going to migrate to GCP Dataproc 2.1, it would be awesome to have the Dione upgrade land first.
Links to #PR/#Issue
what is the problem?
how can we solve it?
We can also use Spark's PartitionedFile.
Currently we hard-code parts of createIndex when we create the index table, for example the "partitioned by" clause.