
dione's Issues

Record data_size column may be null

Summary

Exception in thread "main" java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "_1")
- root class: "scala.Tuple2"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3391)
	at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3388)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3388)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3369)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3368)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2557)
	at com.paypal.dione.spark.index.IndexManagerUtils$$anon$1.<init>(IndexManagerUtils.scala:161)
	at com.paypal.dione.spark.index.IndexManagerUtils$.inferSize(IndexManagerUtils.scala:157)
	at com.paypal.dione.spark.index.IndexManagerUtils$.sampleFilesAndInferSize(IndexManagerUtils.scala:134)
	at com.paypal.dione.spark.index.IndexManagerUtils$.calcBtreeProperties(IndexManagerUtils.scala:108)
	at com.paypal.dione.spark.index.IndexManager.appendNewPartitions(IndexManager.scala:122)
	at com.paypal.risk.gds.index.IndexBuilder$$anonfun$buildIndex$3$$anonfun$apply$2.apply(IndexBuilder.scala:94)
	at com.paypal.risk.gds.index.IndexBuilder$$anonfun$buildIndex$3$$anonfun$apply$2.apply(IndexBuilder.scala:92)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2021-11-18 05:14:14 ERROR FileFormatWriter - Aborting job 4815b2e2-e798-4a64-96f5-9930468a5c32.

Links to #PR/#Issue

Detailed Description

The data_size column may be null when reading the first record of a data source file, as shown in the stack trace above.


What is the problem?
data_size may be null when building the index.

How can we solve it?
Maybe we can skip fetching inputSourceBytes for such records:
https://github.com/paypal/dione/blob/main/dione-spark/src/main/scala/com/paypal/dione/spark/index/IndexManagerUtils.scala#L164
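
A minimal sketch of that workaround, assuming the sampling boils down to collecting data_size values with Spark SQL (the column name comes from this issue; the function and everything around it are illustrative, not the project's actual code):

	import org.apache.spark.sql.DataFrame

	// Hedged sketch: skip records whose data_size is null instead of decoding
	// them into a non-nullable scala.Long, which is what triggers the NPE above.
	def inferAvgSize(sampledDF: DataFrame): Option[Long] = {
	  import sampledDF.sparkSession.implicits._
	  val sizes = sampledDF
	    .filter($"data_size".isNotNull)          // drop the problematic null rows
	    .select($"data_size".cast("long"))
	    .as[Long]
	    .collect()
	  if (sizes.isEmpty) None else Some(sizes.sum / sizes.length)
	}

Alternatively, decoding into (Option[Long], ...) instead of (scala.Long, ...) would let the caller decide how to handle nulls, as the Spark error message itself suggests.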

Add simple HTTP server implementation for fetches

We can implement a simple, Java-native, standalone HTTP server that receives requests for key(s) and returns the payload (as JSON, etc.). Then, users can implement their own web server on top of it to visualize the data. A rough sketch is below.
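
A minimal sketch using the JDK's built-in com.sun.net.httpserver, so no external dependency is needed; the lookup function and the /fetch route are hypothetical placeholders for the actual index fetch:

	import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
	import java.net.InetSocketAddress

	object FetchServer {
	  // placeholder: a real server would fetch the payload via the index
	  def lookup(key: String): String = s"""{"key": "$key", "payload": null}"""

	  def main(args: Array[String]): Unit = {
	    val server = HttpServer.create(new InetSocketAddress(8080), 0)
	    server.createContext("/fetch", new HttpHandler {
	      override def handle(exchange: HttpExchange): Unit = {
	        // the key arrives as the query string, e.g. GET /fetch?key=abc
	        val key = Option(exchange.getRequestURI.getQuery).getOrElse("")
	        val body = lookup(key).getBytes("UTF-8")
	        exchange.getResponseHeaders.add("Content-Type", "application/json")
	        exchange.sendResponseHeaders(200, body.length)
	        exchange.getResponseBody.write(body)
	        exchange.close()
	      }
	    })
	    server.start()
	  }
	}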

Support time range query

Summary

Support time range query

Links to #PR/#Issue

Detailed Description

What is the problem?
How can we solve it?

Consider datasource v2

Spark 2.4+ DataSource v2 is much more powerful than in Spark 2.3.
The main issue in Spark 2.3 is that you basically need to implement everything yourself, and dealing with partitioning and bucketing is a headache; this is why we reverted to DataSource v1.
Spark 2.4 adds lots of goodies, so we should consider moving when we upgrade. We'll get:

  • native support for saveAsTable (as opposed to now, when we do everything manually)
  • easy support for dynamic partitioning; today we rely on the user providing a static list of partitions to index
  • maybe even real bucketing, instead of today's games of imitating this mechanism

A rough sketch of the v2 read path is shown after this list.
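
For reference, a bare-bones skeleton of the Spark 2.4 v2 read-side API (the interface names are from org.apache.spark.sql.sources.v2 in Spark 2.4; the DioneSource class and its contents are hypothetical):

	import java.util.{List => JList}

	import org.apache.spark.sql.catalyst.InternalRow
	import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition}
	import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
	import org.apache.spark.sql.types.StructType

	// hypothetical skeleton of a v2 source for dione
	class DioneSource extends DataSourceV2 with ReadSupport {
	  override def createReader(options: DataSourceOptions): DataSourceReader =
	    new DataSourceReader {
	      // schema of the index table
	      override def readSchema(): StructType = ???
	      // one InputPartition per index file/split, with preferred locations
	      override def planInputPartitions(): JList[InputPartition[InternalRow]] = ???
	    }
	}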

Add IT script

Add an IT (integration test) script that generates a big table, indexes it, and asserts the basic functionality. Such a Spark script should run smoothly on any Hadoop cluster (with HDFS). A sketch follows.
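
A hedged sketch of what such a script could look like. IndexManager.appendNewPartitions appears in this repo's stack traces, but IndexSpec, createNew, load and the exact signatures used here are assumptions for illustration:

	import com.paypal.dione.spark.index.{IndexManager, IndexSpec}
	import org.apache.spark.sql.SparkSession

	val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
	import spark.implicits._

	// 1. generate a reasonably big partitioned table
	(1 to 1000000).toDF("id")
	  .selectExpr("cast(id as string) as key", "repeat('x', 100) as payload", "'2020-01-01' as dt")
	  .write.partitionBy("dt").format("parquet").saveAsTable("it_db.big_table")

	// 2. index it (hypothetical usage of the IndexManager API)
	IndexManager.createNew(IndexSpec("it_db.big_table", "it_db.big_table_index", Seq("key")))(spark)
	val manager = IndexManager.load("it_db.big_table_index")(spark)
	manager.appendNewPartitions(Seq(Seq("dt" -> "2020-01-01")))  // signature assumed

	// 3. assert the basics: every generated row ended up in the index
	assert(spark.table("it_db.big_table_index").count() == 1000000)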

Switch filesDF to HadoopRDD

The current filesDF is both ugly and inefficient in terms of data locality.
We should try to switch to something like HadoopRDD/NewHadoopRDD, or something more natural, to leverage the preferred-locations functionality, as sketched below.
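
An illustrative sketch of the direction (the path and input format are examples, not the project's actual values):

	import org.apache.hadoop.io.{LongWritable, Text}
	import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

	// newAPIHadoopFile yields a NewHadoopRDD whose getPreferredLocations is
	// derived from the input splits' HDFS block locations, so Spark can
	// schedule each task on a node that already holds the data, unlike a
	// plain filesDF built from a collected file listing
	val filesRDD = spark.sparkContext.newAPIHadoopFile(
	  "hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/*",
	  classOf[TextInputFormat],
	  classOf[LongWritable],
	  classOf[Text])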

Trim common hdfs prefix in index DF

Try to recognize common path prefixes at runtime and trim them.
For example, files in a standard table might look like:

hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0000.parquet
hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0001.parquet
...

On read, before the shuffle, we can trim the common prefix to reduce the shuffle size.
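
A minimal sketch of the prefix computation (an illustrative helper, not existing code):

	// longest common prefix of all paths, cut back to a whole path component
	def commonPrefix(paths: Seq[String]): String =
	  if (paths.isEmpty) ""
	  else {
	    val lcp = paths.reduce { (a, b) =>
	      a.zip(b).takeWhile { case (x, y) => x == y }.map(_._1).mkString
	    }
	    lcp.take(lcp.lastIndexOf('/') + 1)  // only trim whole directories
	  }

	val prefix = commonPrefix(paths)  // e.g. "hdfs://my_cluster/foo/bar/my_table/"
	val trimmed = paths.map(_.stripPrefix(prefix))

Storing only the suffix in the index, and restoring the prefix on fetch, shrinks every shuffled row by the prefix length.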

Fix Array type value when fetching from Parquet data

Summary

We currently fail to fetch a value of type Array if the data is in Parquet format.
We need to look into this.
Also, loadByIndex fails if the requested columns contain complex types (array/map), at least in Parquet.

Occasional corruption of index

Sometimes, when indexing an additional partition on an existing index over an Avro table, some index files become corrupt.

Reading the index produces the following error:

org.apache.spark.SparkException: Job aborted due to stage failure:
Aborting TaskSet 10.0 because task 0 (partition 0)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Lost task 0.0 in stage 10.0 (TID 2167, lvshdc5dn2191.lvs.paypalinc.com, executor 177): org.apache.avro.AvroRuntimeException: java.io.IOException: Block read partially, the data may be corrupt
	at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
	at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:149)
	at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:52)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:277)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:214)

This happens with Dione 0.5.x.
