Comments (5)
Here's the summary of the initial experiment results.
WITH CDF (histogram scenario):
- 9 files
- avg file size: 296.8 MB
- More cubes, but packed into fewer files.
OTree Index Metrics:
dimensionCount: 4
elementCount: 77966321
depth: 3
cubeCount: 145
desiredCubeSize: 3248596
indexingColumns: DOLocationID_cdf,PULocationID_cdf,tpep_pickup_datetime_cdf,tpep_dropoff_datetime_cdf
avgFanout: 16.0
depthOnBalance: 1.6931305699633146
Stats on cube sizes:
Quartiles:
- min: 3247255
- 1stQ: 3248326
- 2ndQ: 3249473
- 3rdQ: 3249924
- max: 3250627
Stats:
- count: 9
- l1_dev: 3.389840479463196E-4
- l2_dev: 1.2293769288666004E-4
Level-wise stats:
level avgCubeSize stdCubeSize cubeCount avgWeight
0 3249473 0 1 0.04166
1 3249074 1135 8 0.3757
WITHOUT CDF:
- 15 files
- avg file size: 180.42 MB
- Fewer cubes, but fragmented across more files.
OTree Index Metrics:
dimensionCount: 4
elementCount: 77966321
depth: 5
cubeCount: 79
desiredCubeSize: 3248596
indexingColumns: DOLocationID,PULocationID,tpep_pickup_datetime,tpep_dropoff_datetime
avgFanout: 5.2
depthOnBalance: 3.2196329121951317
Stats on cube sizes:
Quartiles:
- min: 3247408
- 1stQ: 3248302
- 2ndQ: 3249966
- 3rdQ: 3260554
- max: 3344342
Stats:
- count: 15
- l1_dev: 0.004943222651672702
- l2_dev: 0.00278163396661163
Level-wise stats:
level avgCubeSize stdCubeSize cubeCount avgWeight
0 3249966 0 1 0.04166
1 3249943 1184 4 0.2768
2 3268407 32146 7 0.6381
3 3278608 43492 3 0.90334
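The l1_dev / l2_dev figures above are the main signal that the CDF index packs cubes closer to the target size. The exact formula used by qbeast-spark is not shown in this thread, so the sketch below ASSUMES they are the L1/L2 deviations of cube element counts from desiredCubeSize, normalized by the desired size and the cube count:

```python
# ASSUMED definition for illustration; the actual qbeast-spark
# formula for l1_dev / l2_dev may differ.
def l1_dev(sizes, desired):
    # mean absolute deviation from the desired cube size, relative to it
    return sum(abs(s - desired) for s in sizes) / (desired * len(sizes))

def l2_dev(sizes, desired):
    # Euclidean deviation from the desired cube size, relative to it
    return (sum((s - desired) ** 2 for s in sizes) ** 0.5) / (desired * len(sizes))

desired = 3248596
with_cdf = [3247255, 3248326, 3249473, 3249924, 3250627]  # WITH CDF quartile values
print(l1_dev(with_cdf, desired))  # on the order of 1e-4: sizes hug the target
```

Under this assumption the WITH CDF quartile values give deviations on the order of 1e-4, the same magnitude reported above, while the WITHOUT CDF run is an order of magnitude worse.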
Min-Max File ranges
I've analyzed the file ranges for the column tpep_pickup_datetime, which is the one that presented the most outliers in the dataset. We can see that the min-max ranges of the files are more stable when using the CDF columns to index. (See image below.)
from qbeast-spark.
First analysis of the following queries:
// QUERY 1
qbeastTable
  .filter(
    col("PULocationID") >= 180 && col("PULocationID") <= 220 &&
      col("tpep_pickup_datetime") > 1600995211 &&
      col("tpep_pickup_datetime") <= 1676000000)
  .count
// QUERY 2
qbeastTable
  .filter(
    col("PULocationID") >= 180 && col("PULocationID") <= 220 &&
      col("tpep_pickup_datetime") > 1600995211 &&
      col("tpep_pickup_datetime") <= 1676000000)
  .sort("PULocationID")
  .head(20)
With CDF
- QUERY 1: Time taken: 2912 ms
- QUERY 2: Time taken: 9181 ms
NO CDF
- QUERY 1: Time taken: 3262 ms
- QUERY 2: Time taken: 11493 ms
Queries on the table indexed with CDF are about 11% (QUERY 1) to 20% (QUERY 2) faster.
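The speedup can be derived directly from the timings above; QUERY 1 is closer to an 11% gain, while QUERY 2 reaches about 20%:

```python
# Relative speedup of the CDF-indexed table vs the plain index,
# computed from the reported wall-clock times.
def speedup(with_cdf_ms, without_cdf_ms):
    return 1 - with_cdf_ms / without_cdf_ms

q1 = speedup(2912, 3262)   # QUERY 1
q2 = speedup(9181, 11493)  # QUERY 2
print(round(q1 * 100, 1), round(q2 * 100, 1))  # 10.7 20.1
```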
Next steps:
- Compare the metrics with the dataset indexed without outliers to check the similarity with the CDF approach.
- Evaluate how much time is spent on the computation of the histogram columns. Since the methods to compute the statistics of the table can be expensive, we would need to understand if the tradeoff is worth it to keep developing in that direction.
- Test performance with bigger skewed datasets.
IndexMetrics comparison
IndexMetrics with outliers + CDF
OTree Index Metrics:
dimensionCount: 4
elementCount: 77966321
depth: 3
cubeCount: 145
desiredCubeSize: 3248596
indexingColumns: DOLocationID_cdf,PULocationID_cdf,tpep_pickup_datetime_cdf,tpep_dropoff_datetime_cdf
avgFanout: 16.0
depthOnBalance: 1.6931305699633146
Stats on cube sizes:
Quartiles:
- min: 3247255
- 1stQ: 3248326
- 2ndQ: 3249473
- 3rdQ: 3249924
- max: 3250627
Stats:
- count: 9
- l1_dev: 3.389840479463196E-4
- l2_dev: 1.2293769288666004E-4
Level-wise stats:
level avgCubeSize stdCubeSize cubeCount avgWeight
0 3249473 0 1 0.04166
1 3249074 1135 8 0.3757
IndexMetrics without outliers
OTree Index Metrics:
dimensionCount: 4
elementCount: 77965705
depth: 3
cubeCount: 74
desiredCubeSize: 3248596
indexingColumns: tpep_pickup_datetime,tpep_dropoff_datetime,PULocationID,DOLocationID
avgFanout: 8.11111111111111
depthOnBalance: 1.9615397304356053
Stats on cube sizes:
Quartiles:
- min: 3246724
- 1stQ: 3248079
- 2ndQ: 3248494
- 3rdQ: 3249954
- max: 3250816
Stats:
- count: 9
- l1_dev: 3.079963022658267E-4
- l2_dev: 1.2662589262706753E-4
Level-wise stats:
level avgCubeSize stdCubeSize cubeCount avgWeight
0 3250816 0 1 0.04166
1 3248510 1044 8 0.51244
We can conclude that both indexes look very similar, which is a very good outcome: outliers no longer impact the index distribution, and values are spread more evenly among the child cubes.
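The robustness to outliers follows from the CDF transform itself: each value is replaced by its rank fraction, so even an extreme value only maps to the top of the [0, 1] range instead of stretching the indexed space. A toy sketch in plain Python (not the qbeast-spark implementation; ties are ignored for brevity):

```python
# Empirical CDF: map each value to its rank fraction in (0, 1].
# The extreme outlier (1_000_000) lands at 1.0 instead of stretching
# the min-max range of the indexing space.
def empirical_cdf(values):
    order = sorted(values)
    n = len(values)
    return [(order.index(v) + 1) / n for v in values]  # rank / n

column = [10, 12, 11, 13, 1_000_000]  # skewed column with one outlier
print(empirical_cdf(column))  # [0.2, 0.6, 0.4, 0.8, 1.0]
```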
Write Performance
What is the overhead of computing the CDF over a column?
170389 ms vs 83130 ms: generating the dataset with a single CDF column is roughly 2x slower.
The execution performs a sort + window operation in a single partition, which results in heavy computation even for a small dataset.
What is the overhead of computing the CDF over multiple columns?
429183 ms vs 83130 ms: generating the dataset with 4 CDF columns is roughly 5x slower.
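The 2x and 5x factors follow directly from the reported timings:

```python
# Write-time overhead of computing CDF columns, from the timings above.
baseline = 83130    # no CDF columns
one_col = 170389    # 1 CDF column
four_cols = 429183  # 4 CDF columns
print(round(one_col / baseline, 2), round(four_cols / baseline, 2))  # 2.05 5.16
```

Interestingly, the per-column cost is close to additive: each extra CDF column adds roughly one baseline's worth of work.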
Can the performance be improved?
This overhead would only be noticed during the first write.
Also, note that the cumulative distribution function is executed over a window without partitionBy.
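In Spark terms, that corresponds to a window function such as cume_dist() over Window.orderBy(column) with no partitionBy, which requires a global ordering of the column. A plain-Python sketch of the same semantics (assumed, since the actual snippet is not shown in this thread):

```python
# cume_dist semantics: fraction of rows with a value <= the current one.
# Without a partition key, every row's result depends on the full column,
# which is why Spark evaluates such a window in a single partition/task.
def cume_dist(values):
    n = len(values)
    return [sum(1 for w in values if w <= v) / n for v in values]

print(cume_dist([5, 1, 3]))  # [1.0, 0.3333333333333333, 0.6666666666666666]
```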