Code Monkey home page Code Monkey logo

Comments (6)

samuelcolvin avatar samuelcolvin commented on July 18, 2024

Well that's embarrassing 🤦, I forgot --release.

In release mode:

  • unsorted string: 82ms
  • unsorted FWB: 82ms
  • sorted string: 11ms
  • sorted FWB: 82ms

Kind of even more surprising - Surely FWB should be faster as you don't need to calculate offsets?

from arrow-datafusion.

samuelcolvin avatar samuelcolvin commented on July 18, 2024

Well it keeps getting weirder.

In release mode:

  • unsorted string: 82ms
  • unsorted FWB: 82ms
  • unsorted UInt64: 51ms
  • sorted string: 11ms
  • sorted FWB: 82ms
  • sorted UInt64: 39ms

(Trying UInt64 was the first step towards using a struct of two UInt64, but it seems unlike that would be as fast as a string right now)

This is all very confusing.

TL;DR; - @alamb if you were storing

A 16-byte array with at least one non-zero byte. ref

In parquet to query with datafusion, and wanted it to be fast long term, what would you use?

from arrow-datafusion.

samuelcolvin avatar samuelcolvin commented on July 18, 2024

Okay last comment here (for now), I'll stop talking to myself.

It seems that Decimal128 is the best option for our case (we can rewrite it to look like hex and be queried with hex):

Times (unsorted, sorted):

  • DataType::FixedSizeBinary(16) - (51, 57)
  • DataType::LargeUtf8 - (81, 10)
  • DataType::UInt64 - (52, 36)
  • DataType::Decimal128(38, 10) - (57, 7)

from arrow-datafusion.

alamb avatar alamb commented on July 18, 2024

In parquet to query with datafusion, and wanted it to be fast long term, what would you use?

I would have recommended using FixedSizeBinary as you have done (and in fact I believe @hiltontj is doing something like this internall at InfluxData at the moment).

However I got broadly similar numbers to you in with the different types (and I agree Decimal128 looks quite good)

I checked out https://github.com/samuelcolvin/datafusion-id-experiment and got an explain plan with metrics (ran EXPLAIN ANALYZE {sql}):

FixedSizeBinary

select * from simple_fixed_sorted where id=arrow_cast(decode('57f16cbaf865bcd9adcc71c03200fd60', 'hex'),
 'FixedSizeBinary(16)')
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=0, elapsed_compute=3.172µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|                   |   FilterExec: id@0 = 87,241,108,186,248,101,188,217,173,204,113,192,50,0,253,96, metrics=[output_rows=0, elapsed_compute=2.207689ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                   |     ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Downloads/datafusion-id-experiment/simple_fixed.parquet:0..2375388], [Users/andrewlamb/Downloads/datafusion-id-experiment/simple_fixed.parquet:2375388..4750776], [Users/andrewlamb/Downloads/datafusion-id-experiment/simple_fixed.parquet:4750776..7126164], [Users/andrewlamb/Downloads/datafusion-id-experiment/simple_fixed.parquet:7126164..9501552], [Users/andrewlamb/Downloads/datafusion-id-experiment/simple_fixed.parquet:9501552..11876940], ...]}, projection=[id, name], predicate=id@0 = 87,241,108,186,248,101,188,217,173,204,113,192,50,0,253,96, pruning_predicate=CASE WHEN id_null_count@2 = id_row_count@3 THEN false ELSE id_min@0 <= 87,241,108,186,248,101,188,217,173,204,113,192,50,0,253,96 AND 87,241,108,186,248,101,188,217,173,204,113,192,50,0,253,96 <= id_max@1 END, required_guarantees=[id in (87,241,108,186,248,101,188,217,173,204,113,192,50,0,253,96)], metrics=[output_rows=1000000, elapsed_compute=16ns, bytes_scanned=38037954, file_open_errors=0, row_groups_matched_statistics=1, row_groups_pruned_statistics=0, num_predicate_creation_errors=0, file_scan_errors=0, predicate_evaluation_errors=0, row_groups_pruned_bloom_filter=0, page_index_rows_filtered=0, row_groups_matched_bloom_filter=0, pushdown_rows_filtered=0, time_elapsed_opening=24.008333ms, time_elapsed_scanning_total=68.525251ms, time_elapsed_processing=75.557418ms, pushdown_eval_time=32ns, time_elapsed_scanning_until_data=10.007876ms, page_index_eval_time=218.687µs] |

Decimal

select * from decimal where id=arrow_cast('5714204269946304998258834512.6198419457', 'Decimal128(38, 10)')
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=0, elapsed_compute=2.492µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|                   |   FilterExec: id@0 = Some(57142042699463049982588345126198419457),38,10, metrics=[output_rows=0, elapsed_compute=422.896µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                   |     ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Downloads/datafusion-id-experiment/decimal.parquet:0..2376546], [Users/andrewlamb/Downloads/datafusion-id-experiment/decimal.parquet:2376546..4753092], [Users/andrewlamb/Downloads/datafusion-id-experiment/decimal.parquet:4753092..7129638], [Users/andrewlamb/Downloads/datafusion-id-experiment/decimal.parquet:7129638..9506184], [Users/andrewlamb/Downloads/datafusion-id-experiment/decimal.parquet:9506184..11882730], ...]}, projection=[id, name], predicate=id@0 = Some(57142042699463049982588345126198419457),38,10, pruning_predicate=CASE WHEN id_null_count@2 = id_row_count@3 THEN false ELSE id_min@0 <= Some(57142042699463049982588345126198419457),38,10 AND Some(57142042699463049982588345126198419457),38,10 <= id_max@1 END, required_guarantees=[id in (Some(57142042699463049982588345126198419457),38,10)], metrics=[output_rows=1000000, elapsed_compute=16ns, bytes_scanned=38054970, file_open_errors=0, row_groups_matched_statistics=1, row_groups_pruned_statistics=0, num_predicate_creation_errors=0, file_scan_errors=0, predicate_evaluation_errors=0, row_groups_pruned_bloom_filter=0, page_index_rows_filtered=0, row_groups_matched_bloom_filter=0, pushdown_rows_filtered=0, time_elapsed_opening=4.448333ms, time_elapsed_scanning_total=60.640037ms, time_elapsed_processing=56.689831ms, pushdown_eval_time=32ns, time_elapsed_scanning_until_data=7.254625ms, page_index_eval_time=17.975µs] |
|                   |

I don't really have great insight as to why Decimal was better -- it may be because it is stored inline as i128 values (rather than out of line).

from arrow-datafusion.

samuelcolvin avatar samuelcolvin commented on July 18, 2024

What's weird is the behaviour with a decimal 128 is better than a uint64 when sorted. Is that that a fundamental side affect of the type, or some missing logic/optimisation?

from arrow-datafusion.

alamb avatar alamb commented on July 18, 2024

What's weird is the behaviour with a decimal 128 is better than a uint64 when sorted. Is that that a fundamental side affect of the type, or some missing logic/optimisation?

I suspect it is some missing optimization -- I don't know of any reason that fixed size binary would be less efficient than decimal.

I double checked that FixedSizeBinary is also stored inline

from arrow-datafusion.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.