osopardo1 commented on July 18, 2024

Here's the summary of the initial experiment results.

WITH CDF (histogram scenario):

  • 9 files
  • Average file size: 296.8 MB
  • More cubes, but spread across fewer files.
OTree Index Metrics:
dimensionCount: 4
elementCount: 77966321
depth: 3
cubeCount: 145
desiredCubeSize: 3248596
indexingColumns: DOLocationID_cdf,PULocationID_cdf,tpep_pickup_datetime_cdf,tpep_dropoff_datetime_cdf
avgFanout: 16.0
depthOnBalance: 1.6931305699633146

Stats on cube sizes:
Quartiles:
- min: 3247255
- 1stQ: 3248326
- 2ndQ: 3249473
- 3rdQ: 3249924
- max: 3250627
Stats:
- count: 9
- l1_dev: 3.389840479463196E-4
- l2_dev: 1.2293769288666004E-4
Level-wise stats:
level avgCubeSize stdCubeSize cubeCount avgWeight
    0     3249473           0         1   0.04166
    1     3249074        1135         8    0.3757

WITHOUT CDF

- 15 files
- Average file size: 180.42 MB
- Fewer cubes, but more fragmented (spread across more files).

OTree Index Metrics:                                                            
dimensionCount: 4
elementCount: 77966321
depth: 5
cubeCount: 79
desiredCubeSize: 3248596
indexingColumns: DOLocationID,PULocationID,tpep_pickup_datetime,tpep_dropoff_datetime
avgFanout: 5.2
depthOnBalance: 3.2196329121951317

Stats on cube sizes:
Quartiles:
- min: 3247408
- 1stQ: 3248302
- 2ndQ: 3249966
- 3rdQ: 3260554
- max: 3344342
Stats:
- count: 15
- l1_dev: 0.004943222651672702
- l2_dev: 0.00278163396661163
Level-wise stats:
level avgCubeSize stdCubeSize cubeCount avgWeight
    0     3249966           0         1   0.04166
    1     3249943        1184         4    0.2768
    2     3268407       32146         7    0.6381
    3     3278608       43492         3   0.90334

Min-Max File ranges

I've analyzed the per-file min-max ranges of the column tpep_pickup_datetime, which is the column with the most outliers in the dataset. The min-max ranges of the files are more stable when the CDF columns are used for indexing (see the image below).

(Image: tpep_pickup_datetime-dist-white)
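
One way such per-file ranges could be computed is sketched below. This is only an illustration, not necessarily how the analysis above was done: it assumes the table can be read as a plain Spark DataFrame whose rows map back to their data files through `input_file_name()`, and the names `FileRanges` and `minMaxPerFile` are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, input_file_name, max, min}

object FileRanges {
  // Returns one row per underlying data file with the min and max of `column`.
  // Assumption: input_file_name() resolves to the physical file of each row.
  def minMaxPerFile(df: DataFrame, column: String): DataFrame =
    df.withColumn("file", input_file_name())
      .groupBy("file")
      .agg(min(col(column)).as("min"), max(col(column)).as("max"))
}
```

For example, `FileRanges.minMaxPerFile(qbeastTable, "tpep_pickup_datetime").show(false)` would list the range covered by each file, which can then be compared between the two indexes.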

from qbeast-spark.

osopardo1 commented on July 18, 2024

First analysis of the following queries:

import org.apache.spark.sql.functions.col

// QUERY 1: count the rows inside a pickup-location and pickup-time range
qbeastTable
  .filter(
    col("PULocationID") >= 180 && col("PULocationID") <= 220 &&
      col("tpep_pickup_datetime") > 1600995211 && col("tpep_pickup_datetime") <= 1676000000)
  .count


// QUERY 2: same filter, sorted by pickup location, keeping the first 20 rows
qbeastTable
  .filter(
    col("PULocationID") >= 180 && col("PULocationID") <= 220 &&
      col("tpep_pickup_datetime") > 1600995211 && col("tpep_pickup_datetime") <= 1676000000)
  .sort("PULocationID")
  .head(20)

With CDF

  • QUERY 1: Time taken: 2912 ms
  • QUERY 2: Time taken: 9181 ms

NO CDF

  • QUERY 1: Time taken: 3262 ms
  • QUERY 2: Time taken: 11493 ms

Queries on the table indexed with CDF take about 11% (QUERY 1) and 20% (QUERY 2) less time.
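
The relative gain follows directly from the timings listed above. The short sketch below works through the arithmetic; the object and method names are made up for illustration, and only the four timings come from the measurements.

```scala
object QueryTimings {
  // Fraction of the baseline time that the candidate run saves.
  def timeSaved(baselineMs: Long, candidateMs: Long): Double =
    (baselineMs - candidateMs).toDouble / baselineMs

  def main(args: Array[String]): Unit = {
    // (query, NO CDF ms, With CDF ms), as reported above
    val runs = Seq(("QUERY 1", 3262L, 2912L), ("QUERY 2", 11493L, 9181L))
    runs.foreach { case (name, noCdf, withCdf) =>
      println(f"$name: ${timeSaved(noCdf, withCdf) * 100}%.1f%% less time with CDF")
    }
  }
}
```

Note that these are single runs, so the percentages are indicative rather than statistically robust.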

osopardo1 commented on July 18, 2024

Next steps:

  • Compare the metrics with the dataset indexed without outliers to check the similarity with the CDF approach.
  • Evaluate how much time is spent on the computation of the histogram columns. Since the methods to compute the statistics of the table can be expensive, we would need to understand if the tradeoff is worth it to keep developing in that direction.
  • Test performance with bigger skewed datasets.

osopardo1 commented on July 18, 2024

IndexMetrics comparison

IndexMetrics with outliers + CDF

OTree Index Metrics:
dimensionCount: 4
elementCount: 77966321
depth: 3
cubeCount: 145
desiredCubeSize: 3248596
indexingColumns: DOLocationID_cdf,PULocationID_cdf,tpep_pickup_datetime_cdf,tpep_dropoff_datetime_cdf
avgFanout: 16.0
depthOnBalance: 1.6931305699633146

Stats on cube sizes:
Quartiles:
- min: 3247255
- 1stQ: 3248326
- 2ndQ: 3249473
- 3rdQ: 3249924
- max: 3250627
Stats:
- count: 9
- l1_dev: 3.389840479463196E-4
- l2_dev: 1.2293769288666004E-4
Level-wise stats:
level avgCubeSize stdCubeSize cubeCount avgWeight
    0     3249473           0         1   0.04166
    1     3249074        1135         8    0.3757

IndexMetrics without outliers

OTree Index Metrics:                                                            
dimensionCount: 4
elementCount: 77965705
depth: 3
cubeCount: 74
desiredCubeSize: 3248596
indexingColumns: tpep_pickup_datetime,tpep_dropoff_datetime,PULocationID,DOLocationID
avgFanout: 8.11111111111111
depthOnBalance: 1.9615397304356053

Stats on cube sizes:
Quartiles:
- min: 3246724
- 1stQ: 3248079
- 2ndQ: 3248494
- 3rdQ: 3249954
- max: 3250816
Stats:
- count: 9
- l1_dev: 3.079963022658267E-4
- l2_dev: 1.2662589262706753E-4
Level-wise stats:
level avgCubeSize stdCubeSize cubeCount avgWeight
    0     3250816           0         1   0.04166
    1     3248510        1044         8   0.51244

We can conclude that both indexes look very similar, which is a very good outcome: with the CDF columns, outliers no longer distort the index distribution, and values are spread more evenly among the child cubes.
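
To see why a CDF column is insensitive to outliers, here is a toy sketch in plain Scala (no Spark). It compares a linear min-max scaling with an empirical CDF on a made-up column containing one extreme value. The data and names are hypothetical, and the comparison assumes the index otherwise places values linearly between the column minimum and maximum.

```scala
object CdfOutlierDemo {
  // Empirical CDF: fraction of values less than or equal to x.
  def empiricalCdf(values: Seq[Double])(x: Double): Double =
    values.count(_ <= x).toDouble / values.size

  // Linear min-max scaling into [0, 1].
  def minMax(values: Seq[Double])(x: Double): Double =
    (x - values.min) / (values.max - values.min)

  def main(args: Array[String]): Unit = {
    // Hypothetical column: four ordinary values and one extreme outlier.
    val column = Seq(10.0, 20.0, 30.0, 40.0, 1000000.0)
    println(column.map(minMax(column)))       // ordinary values crowd near 0
    println(column.map(empiricalCdf(column))) // values are spaced evenly in (0, 1]
  }
}
```

With min-max scaling the four ordinary values all land in a tiny slice of the space, so they would fall into the same region; with the CDF they are spread uniformly regardless of how extreme the outlier is.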

Write Performance

What is the overhead of computing the CDF over a column?

170389 ms vs 83130 ms. It is 2x slower to generate the dataset with a single CDF column.
The execution does a sort + window operation in a single partition, which results in heavy computation for even a small dataset.

What is the overhead of computing the CDF over multiple columns?

429183 ms vs 83130 ms. It is 5x slower to generate the dataset with 4 CDF columns.

Can the performance be improved?

This overhead would only be noticed during the first write.
Also, note that the cumulative distribution function is executed over a window without partitionBy.
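
For reference, a window-based CDF column of the kind described above might look like the following sketch. It is an assumption about the shape of the computation rather than the exact code used in the experiment, and `CdfColumns` and `withCdf` are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, cume_dist}

object CdfColumns {
  // Adds a "<column>_cdf" column with the cumulative distribution of `column`.
  // Window.orderBy without partitionBy moves all rows into a single partition,
  // which is the sort + window cost discussed above.
  def withCdf(df: DataFrame, column: String): DataFrame =
    df.withColumn(s"${column}_cdf", cume_dist().over(Window.orderBy(col(column))))
}
```

Applying it to several columns, for example `Seq("PULocationID", "DOLocationID").foldLeft(df)(CdfColumns.withCdf)`, repeats the global sort once per column, which is consistent with the overhead growing with the number of CDF columns.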