Comments (5)
Here's the summary of the initial experiment results.
WITH CDF (histogram scenario):
- 9 files
- avg file size: 296.8 MB
- More cubes, but packed into fewer files.
OTree Index Metrics:
dimensionCount: 4
elementCount: 77966321
depth: 3
cubeCount: 145
desiredCubeSize: 3248596
indexingColumns: DOLocationID_cdf,PULocationID_cdf,tpep_pickup_datetime_cdf,tpep_dropoff_datetime_cdf
avgFanout: 16.0
depthOnBalance: 1.6931305699633146
Stats on cube sizes:
Quartiles:
- min: 3247255
- 1stQ: 3248326
- 2ndQ: 3249473
- 3rdQ: 3249924
- max: 3250627
Stats:
- count: 9
- l1_dev: 3.389840479463196E-4
- l2_dev: 1.2293769288666004E-4
Level-wise stats:
level avgCubeSize stdCubeSize cubeCount avgWeight
0 3249473 0 1 0.04166
1 3249074 1135 8 0.3757
WITHOUT CDF:
- 15 files
- avg file size: 180.42 MB
- Fewer cubes, but fragmented across more files.
OTree Index Metrics:
dimensionCount: 4
elementCount: 77966321
depth: 5
cubeCount: 79
desiredCubeSize: 3248596
indexingColumns: DOLocationID,PULocationID,tpep_pickup_datetime,tpep_dropoff_datetime
avgFanout: 5.2
depthOnBalance: 3.2196329121951317
Stats on cube sizes:
Quartiles:
- min: 3247408
- 1stQ: 3248302
- 2ndQ: 3249966
- 3rdQ: 3260554
- max: 3344342
Stats:
- count: 15
- l1_dev: 0.004943222651672702
- l2_dev: 0.00278163396661163
Level-wise stats:
level avgCubeSize stdCubeSize cubeCount avgWeight
0 3249966 0 1 0.04166
1 3249943 1184 4 0.2768
2 3268407 32146 7 0.6381
3 3278608 43492 3 0.90334
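The l1_dev / l2_dev figures above are the main signal that the CDF index packs cubes closer to the target size. The exact formula used by qbeast-spark is not shown in this thread, so the sketch below ASSUMES they are the L1/L2 deviations of cube element counts from desiredCubeSize, normalized by the desired size and the cube count:

```python
# ASSUMED definition for illustration; the actual qbeast-spark
# formula for l1_dev / l2_dev may differ.
def l1_dev(sizes, desired):
    # mean absolute deviation from the desired cube size, relative to it
    return sum(abs(s - desired) for s in sizes) / (desired * len(sizes))

def l2_dev(sizes, desired):
    # Euclidean deviation from the desired cube size, relative to it
    return (sum((s - desired) ** 2 for s in sizes) ** 0.5) / (desired * len(sizes))

desired = 3248596
with_cdf = [3247255, 3248326, 3249473, 3249924, 3250627]  # WITH CDF quartile values
print(l1_dev(with_cdf, desired))  # on the order of 1e-4: sizes hug the target
```

Under this assumption the WITH CDF quartile values give deviations on the order of 1e-4, the same magnitude reported above, while the WITHOUT CDF run is an order of magnitude worse.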
Min-Max File ranges
I've analyzed the file ranges for the column tpep_pickup_datetime, which is the one that presented the most outliers in the dataset. We can see that the min-max ranges of the files are more stable when using the CDF columns to index. (See image below.)
from qbeast-spark.
First analysis of the following queries:
// QUERY 1
qbeastTable
  .filter(
    col("PULocationID") >= 180 && col("PULocationID") <= 220 &&
      col("tpep_pickup_datetime") > 1600995211 &&
      col("tpep_pickup_datetime") <= 1676000000)
  .count
// QUERY 2
qbeastTable
  .filter(
    col("PULocationID") >= 180 && col("PULocationID") <= 220 &&
      col("tpep_pickup_datetime") > 1600995211 &&
      col("tpep_pickup_datetime") <= 1676000000)
  .sort("PULocationID")
  .head(20)
With CDF
- QUERY 1: Time taken: 2912 ms
- QUERY 2: Time taken: 9181 ms
NO CDF
- QUERY 1: Time taken: 3262 ms
- QUERY 2: Time taken: 11493 ms
Queries on the table indexed with CDF are about 11% (QUERY 1) to 20% (QUERY 2) faster.
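The speedup can be derived directly from the timings above; QUERY 1 is closer to an 11% gain, while QUERY 2 reaches about 20%:

```python
# Relative speedup of the CDF-indexed table vs the plain index,
# computed from the reported wall-clock times.
def speedup(with_cdf_ms, without_cdf_ms):
    return 1 - with_cdf_ms / without_cdf_ms

q1 = speedup(2912, 3262)   # QUERY 1
q2 = speedup(9181, 11493)  # QUERY 2
print(round(q1 * 100, 1), round(q2 * 100, 1))  # 10.7 20.1
```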
Next steps:
- Compare the metrics with the dataset indexed without outliers to check the similarity with the CDF approach.
- Evaluate how much time is spent on the computation of the histogram columns. Since the methods to compute the statistics of the table can be expensive, we would need to understand if the tradeoff is worth it to keep developing in that direction.
- Test performance with bigger skewed datasets.
IndexMetrics comparison
IndexMetrics with outliers + CDF
OTree Index Metrics:
dimensionCount: 4
elementCount: 77966321
depth: 3
cubeCount: 145
desiredCubeSize: 3248596
indexingColumns: DOLocationID_cdf,PULocationID_cdf,tpep_pickup_datetime_cdf,tpep_dropoff_datetime_cdf
avgFanout: 16.0
depthOnBalance: 1.6931305699633146
Stats on cube sizes:
Quartiles:
- min: 3247255
- 1stQ: 3248326
- 2ndQ: 3249473
- 3rdQ: 3249924
- max: 3250627
Stats:
- count: 9
- l1_dev: 3.389840479463196E-4
- l2_dev: 1.2293769288666004E-4
Level-wise stats:
level avgCubeSize stdCubeSize cubeCount avgWeight
0 3249473 0 1 0.04166
1 3249074 1135 8 0.3757
IndexMetrics without outliers
OTree Index Metrics:
dimensionCount: 4
elementCount: 77965705
depth: 3
cubeCount: 74
desiredCubeSize: 3248596
indexingColumns: tpep_pickup_datetime,tpep_dropoff_datetime,PULocationID,DOLocationID
avgFanout: 8.11111111111111
depthOnBalance: 1.9615397304356053
Stats on cube sizes:
Quartiles:
- min: 3246724
- 1stQ: 3248079
- 2ndQ: 3248494
- 3rdQ: 3249954
- max: 3250816
Stats:
- count: 9
- l1_dev: 3.079963022658267E-4
- l2_dev: 1.2662589262706753E-4
Level-wise stats:
level avgCubeSize stdCubeSize cubeCount avgWeight
0 3250816 0 1 0.04166
1 3248510 1044 8 0.51244
We can conclude that both indexes look very similar, which is a very good outcome: outliers no longer impact the index distribution, and values are spread more evenly among the child cubes.
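The robustness to outliers follows from the CDF transform itself: each value is replaced by its rank fraction, so even an extreme value only maps to the top of the [0, 1] range instead of stretching the indexed space. A toy sketch in plain Python (not the qbeast-spark implementation; ties are ignored for brevity):

```python
# Empirical CDF: map each value to its rank fraction in (0, 1].
# The extreme outlier (1_000_000) lands at 1.0 instead of stretching
# the min-max range of the indexing space.
def empirical_cdf(values):
    order = sorted(values)
    n = len(values)
    return [(order.index(v) + 1) / n for v in values]  # rank / n

column = [10, 12, 11, 13, 1_000_000]  # skewed column with one outlier
print(empirical_cdf(column))  # [0.2, 0.6, 0.4, 0.8, 1.0]
```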
Write Performance
What is the overhead of computing the CDF over a column?
170389 ms vs 83130 ms: generating the dataset with a single CDF column is roughly 2x slower.
The execution performs a sort + window operation in a single partition, which results in heavy computation even for a small dataset.
What is the overhead of computing the CDF over multiple columns?
429183 ms vs 83130 ms: generating the dataset with 4 CDF columns is roughly 5x slower.
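The 2x and 5x factors follow directly from the reported timings:

```python
# Write-time overhead of computing CDF columns, from the timings above.
baseline = 83130    # no CDF columns
one_col = 170389    # 1 CDF column
four_cols = 429183  # 4 CDF columns
print(round(one_col / baseline, 2), round(four_cols / baseline, 2))  # 2.05 5.16
```

Interestingly, the per-column cost is close to additive: each extra CDF column adds roughly one baseline's worth of work.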
Can the performance be improved?
This overhead would only be noticed during the first write.
Also, note that the cumulative distribution function is executed over a window without partitionBy.
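In Spark terms, that corresponds to a window function such as cume_dist() over Window.orderBy(column) with no partitionBy, which requires a global ordering of the column. A plain-Python sketch of the same semantics (assumed, since the actual snippet is not shown in this thread):

```python
# cume_dist semantics: fraction of rows with a value <= the current one.
# Without a partition key, every row's result depends on the full column,
# which is why Spark evaluates such a window in a single partition/task.
def cume_dist(values):
    n = len(values)
    return [sum(1 for w in values if w <= v) / n for v in values]

print(cume_dist([5, 1, 3]))  # [1.0, 0.3333333333333333, 0.6666666666666666]
```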