paypal / dione
Dione - a Spark and HDFS indexing library
License: Apache License 2.0
Exception in thread "main" java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "_1")
- root class: "scala.Tuple2"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3391)
	at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3388)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3388)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3369)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3368)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2557)
	at com.paypal.dione.spark.index.IndexManagerUtils$$anon$1.<init>(IndexManagerUtils.scala:161)
	at com.paypal.dione.spark.index.IndexManagerUtils$.inferSize(IndexManagerUtils.scala:157)
	at com.paypal.dione.spark.index.IndexManagerUtils$.sampleFilesAndInferSize(IndexManagerUtils.scala:134)
	at com.paypal.dione.spark.index.IndexManagerUtils$.calcBtreeProperties(IndexManagerUtils.scala:108)
	at com.paypal.dione.spark.index.IndexManager.appendNewPartitions(IndexManager.scala:122)
	at com.paypal.risk.gds.index.IndexBuilder$$anonfun$buildIndex$3$$anonfun$apply$2.apply(IndexBuilder.scala:94)
	at com.paypal.risk.gds.index.IndexBuilder$$anonfun$buildIndex$3$$anonfun$apply$2.apply(IndexBuilder.scala:92)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2021-11-18 05:14:14 ERROR FileFormatWriter - Aborting job 4815b2e2-e798-4a64-96f5-9930468a5c32.
Links to #PR/#Issue
The data_size column may have an issue when reading the first record of a data source file, as in the stack trace above.
what is the problem?
data_size may be null when building the index.
how can we solve it?
Maybe we can skip fetching the inputSourceBytes.
https://github.com/paypal/dione/blob/main/dione-spark/src/main/scala/com/paypal/dione/spark/index/IndexManagerUtils.scala#L164
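The NPE above comes from unboxing a null value into a non-nullable scala.Long. The following is a minimal sketch of the failure mode and of the Option-based fix the Spark error message itself suggests (e.g. decoding to (Option[Long], String) instead of (Long, String)); NullableDataSize and its methods are illustrative names, not Dione API:

```scala
// Sketch only: why a null data_size blows up when treated as scala.Long,
// and how Option keeps the null representable.
object NullableDataSize {
  // Implicit unboxing of a null java.lang.Long throws the NullPointerException seen above.
  def unsafe(raw: java.lang.Long): Long = raw

  // Wrapping in Option makes the missing value explicit instead of throwing.
  def safe(raw: java.lang.Long): Option[Long] = Option(raw).map(_.longValue)
}
```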
We could implement a simple, Java-native, standalone HTTP server that receives requests for key(s) and returns the payload (as JSON, etc.). Users could then implement their own web server to visualize the data.
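As a sketch of the idea, the JDK's built-in com.sun.net.httpserver is enough for such a server; the lookup stub, the /get route, and the JSON shape below are assumptions for illustration, not anything Dione provides:

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress

// Hypothetical sketch: a JDK-only HTTP server answering key lookups as JSON.
object KeyValueHttpServer {
  // Stub standing in for the real index fetch.
  def lookup(key: String): String = s"""{"key":"$key","payload":null}"""

  def start(port: Int): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(port), 0)
    server.createContext("/get", new HttpHandler {
      override def handle(ex: HttpExchange): Unit = {
        // Expect queries of the form /get?key=<key>
        val key = Option(ex.getRequestURI.getQuery).getOrElse("").stripPrefix("key=")
        val body = lookup(key).getBytes("UTF-8")
        ex.getResponseHeaders.add("Content-Type", "application/json")
        ex.sendResponseHeaders(200, body.length.toLong)
        ex.getResponseBody.write(body)
        ex.close()
      }
    })
    server.start()
    server
  }
}
```

Binding port 0 lets the OS pick a free port, which also makes the sketch easy to smoke-test locally.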
Support time range query
Links to #PR/#Issue
what is the problem?
how can we solve it?
Spark 2.4+ DataSource v2 is much more powerful than in Spark 2.3.
The main issue in Spark 2.3 is that you basically need to implement everything yourself,
and dealing with partitioning and bucketing is a headache. This is why we reverted to DataSource v1.
Spark 2.4 adds lots of goodies; we should consider moving when we upgrade.
we'll have:
Add an IT test script that generates a big table, indexes it, and asserts the basic functionality. Such a Spark script should work smoothly on any Hadoop cluster (with HDFS).
The current filesDF is both ugly and inefficient in terms of data locality.
We should try to switch to something like HadoopRDD/NewHadoopRDD, or something more natural, to leverage the preferred-locations functionality.
We should have all the metadata we need on HDFS already, so we can enable reading the index without Hive. This applies when we just want a "key-value" table to be available for ad-hoc queries.
Currently we don't support file splits in all formats. Relates to #10
To support ignoring "old" partitions. Alternatively, we can add a partition "blacklist" or a minimum value to the index metadata.
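The alternative could be sketched as simple partition pruning against values stored in the index metadata; PartitionFilter and its parameters are hypothetical names, not existing Dione code:

```scala
// Hypothetical sketch: prune "old" partitions via a minimum value and/or a blacklist
// kept in the index metadata. Lexicographic comparison works for dt=YYYY-MM-DD partitions.
object PartitionFilter {
  def keep(partitions: Seq[String],
           minValue: Option[String],
           blacklist: Set[String]): Seq[String] =
    partitions.filter(p => minValue.forall(p >= _) && !blacklist.contains(p))
}
```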
Try to recognize common path prefixes at runtime, and trim them.
For example, files in a standard table might look like:
hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0000.parquet
hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0001.parquet
...
On read, before the shuffle, we can trim the common prefix to reduce the shuffle size.
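The trimming itself is pure string handling and could be sketched as follows (PathPrefixTrimmer is a hypothetical name for illustration):

```scala
// Sketch: find the longest common directory prefix of the file paths and strip it,
// so only the short suffixes need to travel through the shuffle.
object PathPrefixTrimmer {
  def commonPrefix(paths: Seq[String]): String = {
    if (paths.isEmpty) return ""
    // Pairwise longest common character prefix over all paths.
    val prefix = paths.reduce { (a, b) =>
      a.zip(b).takeWhile { case (x, y) => x == y }.map(_._1).mkString
    }
    // Cut back to the last '/' so we only trim whole path components.
    prefix.take(prefix.lastIndexOf('/') + 1)
  }

  // Returns the shared prefix plus the trimmed suffixes.
  def trim(paths: Seq[String]): (String, Seq[String]) = {
    val p = commonPrefix(paths)
    (p, paths.map(_.stripPrefix(p)))
  }
}
```

On the read side, the suffixes would be re-joined with the stored prefix before opening the files.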
We currently fail to fetch a value of type Array if the data is in Parquet format.
We need to look into this. Also, loadByIndex fails if the requested columns contain complex types (array/map), at least in Parquet.
Sometimes, when indexing an additional partition on an existing index over an Avro table, some index files become corrupt.
Reading the index produces the following error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Aborting TaskSet 10.0 because task 0 (partition 0)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Lost task 0.0 in stage 10.0 (TID 2167, lvshdc5dn2191.lvs.paypalinc.com, executor 177): org.apache.avro.AvroRuntimeException: java.io.IOException: Block read partially, the data may be corrupt
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:149)
at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:52)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:277)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:214)
This happens with Dione 0.5.x.
As ORC files are very common, we should add indexing support for them.
Does this library support Delta files (https://github.com/delta-io/delta/)?
Links to #PR/#Issue
what is the problem?
how can we solve it?
As we're going to migrate to GCP Dataproc 2.1, it would be awesome to have the Dione upgrade land first.
Links to #PR/#Issue
what is the problem?
how can we solve it?
We can also use Spark's PartitionedFile.
Currently we hard-code parts of createIndex when we create the index table, for example the "partitioned by" clause.