Implement a way to preserve partitioning through `UnionExec` without losing ordering

alamb commented on September 13, 2024


Comments (3)

alamb commented on September 13, 2024

> Hi @alamb, I am trying to work on this.
>
> I am not very familiar with how InterleaveExec is used in the optimizer. As an initial thought, InterleaveExec acts like a repartition with an equal number of input and output partitions, so a natural idea is to reuse streaming_merge over the inputs of each partition. Wdyt?

Hi @xinlifoobar -- this sounds like it is on the right track


xinlifoobar commented on September 13, 2024

Hi @alamb, I am trying to work on this.

I am not very familiar with how InterleaveExec is used in the optimizer. As an initial thought, InterleaveExec acts like a repartition with an equal number of input and output partitions, so a natural idea is to reuse streaming_merge over the inputs of each partition. Wdyt?
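
To make the idea concrete, here is a minimal sketch in plain Rust (standard library only, not DataFusion code). It assumes every input has the same number of partitions and that each partition is already sorted. The names `merge_sorted` and `interleave_sorted` are hypothetical and only stand in for what a `streaming_merge`-based operator might do: output partition `p` is the sorted merge of partition `p` from every input, so the partition count stays the same and the ordering survives.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge already-sorted runs into one sorted run (k-way merge).
/// Hypothetical stand-in for a streaming merge over record batches.
fn merge_sorted(runs: &[&[i64]]) -> Vec<i64> {
    // Heap entries are (next value, run index, offset in run);
    // `Reverse` turns the max-heap into a min-heap.
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }

    let total = runs.iter().map(|run| run.len()).sum::<usize>();
    let mut out = Vec::with_capacity(total);
    while let Some(Reverse((value, run_idx, offset))) = heap.pop() {
        out.push(value);
        // Refill the heap from the run we just consumed from.
        if let Some(&next) = runs[run_idx].get(offset + 1) {
            heap.push(Reverse((next, run_idx, offset + 1)));
        }
    }
    out
}

/// Hypothetical order-preserving interleave: `inputs[i][p]` holds the sorted
/// rows of partition `p` of input `i`. Output partition `p` merges partition
/// `p` of every input, so the number of partitions is unchanged.
fn interleave_sorted(inputs: &[Vec<Vec<i64>>]) -> Vec<Vec<i64>> {
    let partition_count = inputs.first().map_or(0, |input| input.len());
    assert!(
        inputs.iter().all(|input| input.len() == partition_count),
        "all inputs must have the same number of partitions"
    );

    (0..partition_count)
        .map(|p| {
            let runs: Vec<&[i64]> =
                inputs.iter().map(|input| input[p].as_slice()).collect();
            merge_sorted(&runs)
        })
        .collect()
}
```

With two inputs of two partitions each, the result still has two partitions, each sorted. A plain union would instead expose four partitions, and an interleave without the merge step would keep two partitions but lose the order within them.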


xinlifoobar commented on September 13, 2024

Hi @alamb, I found another interesting case while testing. I am not sure about this one: do you think InterleaveExec could be applied here, given that both inputs have the same ORDER BY?

 explain select count(*) from ((select distinct c1, c2 from t3 order by c1 ) union all (select distinct c1, c2 from t4 order by c1)) group by cube(c1,c2);
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                   |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: COUNT(*)                                                                                                                                                   |
|               |   Aggregate: groupBy=[[CUBE (t3.c1, t3.c2)]], aggr=[[COUNT(Int64(1)) AS COUNT(*)]]                                                                                     |
|               |     Union                                                                                                                                                              |
|               |       Sort: t3.c1 ASC NULLS LAST                                                                                                                                       |
|               |         Aggregate: groupBy=[[t3.c1, t3.c2]], aggr=[[]]                                                                                                                 |
|               |           TableScan: t3 projection=[c1, c2]                                                                                                                            |
|               |       Sort: t4.c1 ASC NULLS LAST                                                                                                                                       |
|               |         Aggregate: groupBy=[[t4.c1, t4.c2]], aggr=[[]]                                                                                                                 |
|               |           TableScan: t4 projection=[c1, c2]                                                                                                                            |
| physical_plan | ProjectionExec: expr=[COUNT(*)@2 as COUNT(*)]                                                                                                                          |
|               |   AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1, c2@1 as c2], aggr=[COUNT(*)], ordering_mode=PartiallySorted([0])                                              |
|               |     SortExec: expr=[c1@0 ASC NULLS LAST], preserve_partitioning=[true]                                                                                                 |
|               |       CoalesceBatchesExec: target_batch_size=8192                                                                                                                      |
|               |         RepartitionExec: partitioning=Hash([c1@0, c2@1], 14), input_partitions=14                                                                                      |
|               |           RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=2                                                                                        |
|               |             AggregateExec: mode=Partial, gby=[(c1@0 as c1, c2@1 as c2), (NULL as c1, c2@1 as c2), (c1@0 as c1, NULL as c2), (NULL as c1, NULL as c2)], aggr=[COUNT(*)] |
|               |               UnionExec                                                                                                                                                |
|               |                 CoalescePartitionsExec                                                                                                                                 |
|               |                   AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1, c2@1 as c2], aggr=[]                                                                          |
|               |                     CoalesceBatchesExec: target_batch_size=8192                                                                                                        |
|               |                       RepartitionExec: partitioning=Hash([c1@0, c2@1], 14), input_partitions=1                                                                         |
|               |                         AggregateExec: mode=Partial, gby=[c1@0 as c1, c2@1 as c2], aggr=[]                                                                             |
|               |                           MemoryExec: partitions=1, partition_sizes=[0]                                                                                                |
|               |                 CoalescePartitionsExec                                                                                                                                 |
|               |                   AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1, c2@1 as c2], aggr=[]                                                                          |
|               |                     CoalesceBatchesExec: target_batch_size=8192                                                                                                        |
|               |                       RepartitionExec: partitioning=Hash([c1@0, c2@1], 14), input_partitions=1                                                                         |
|               |                         AggregateExec: mode=Partial, gby=[c1@0 as c1, c2@1 as c2], aggr=[]                                                                             |
|               |                           MemoryExec: partitions=1, partition_sizes=[0]                                                                                                |
|               |                                                                                                                                                                        |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.

With InterleaveExec:

 ProjectionExec:
   AggregateExec:
     InterleaveExec:
       SortExec:
         AggregateExec:
       SortExec:
         AggregateExec:
