Describe the bug When the matching condition is on columns and com

Yea, just verified with Spark: <div class="snippet-clipboard-content notranslate p

Incorrect LEFT JOIN evaluation result on OR conditions about arrow-datafusion HOT 6 OPEN

appletreeisyellow commented on June 24, 2024 3

Incorrect LEFT JOIN evaluation result on OR conditions

from arrow-datafusion.

Comments (6)

viirya commented on June 24, 2024

SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'Bob');

Hmm, as the join filter is only on employees, I think it is pushdown through the Join. The results seem correct to me. I will verify it on Spark later.

from arrow-datafusion.

redbaron commented on June 24, 2024

The results seem correct to me

No, what essentially this query is:

SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (FALSE);

which by definition of LEFT JOIN should not reduce number of rows on the LEFT

from arrow-datafusion.

viirya commented on June 24, 2024

Yea, just verified with Spark:

scala> df.show
+------+-----+-----------+
|emp_id| name|  dept_name|
+------+-----+-----------+
|     1|Alice|      Sales|
|     1|Alice|         HR|
|     1|Alice|Engineering|
|     2|  Bob|      Sales|
|     2|  Bob|         HR|
|     2|  Bob|Engineering|
|     3|Carol|       null|
+------+-----+-----------+

from arrow-datafusion.

viirya commented on June 24, 2024

This will cause correctness issue on Comet too. I will take a look.

from arrow-datafusion.

ozankabak commented on June 24, 2024

Could this be related to a bug in the optimizer (pushing down filter to the left where it shouldn't)?

I tried adding a trivially false term to the OR clause that references the department table. This shouldn't change the result but would inhibit pushdown to the employees table. I got the expected results:

> SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'Bob' OR d.department = 'FOOBAR');
+--------+-------+-------------+
| emp_id | name  | department  |
+--------+-------+-------------+
| 1      | Alice | HR          |
| 2      | Bob   | HR          |
| 1      | Alice | Engineering |
| 2      | Bob   | Engineering |
| 1      | Alice | Sales       |
| 2      | Bob   | Sales       |
| 3      | Carol |             |
+--------+-------+-------------+
7 row(s) fetched.
Elapsed 0.014 seconds.

from arrow-datafusion.

ozankabak commented on June 24, 2024

If we look at the following plan:

> EXPLAIN SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'Bob');
+---------------+----------------------------------------------------------------------------+
| plan_type     | plan                                                                       |
+---------------+----------------------------------------------------------------------------+
| logical_plan  | Left Join:  Filter: e.name = Utf8("Alice") OR e.name = Utf8("Bob")         |
|               |   SubqueryAlias: e                                                         |
|               |     Filter: employees.name = Utf8("Alice") OR employees.name = Utf8("Bob") |
|               |       TableScan: employees projection=[emp_id, name]                       |
|               |   SubqueryAlias: d                                                         |
|               |     TableScan: department projection=[department]                          |
| physical_plan | NestedLoopJoinExec: join_type=Left, filter=name@0 = Alice OR name@0 = Bob  |
|               |   CoalesceBatchesExec: target_batch_size=8192                              |
|               |     FilterExec: name@1 = Alice OR name@1 = Bob                             |
|               |       MemoryExec: partitions=1, partition_sizes=[1]                        |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                            |
|               |                                                                            |
+---------------+----------------------------------------------------------------------------+

The filters in italics seem wrong. They indeed disappear in the example I gave above:

> EXPLAIN SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'Bob' OR d.department = 'FOOBAR');
+---------------+-------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                  |
+---------------+-------------------------------------------------------------------------------------------------------+
| logical_plan  | Left Join:  Filter: e.name = Utf8("Alice") OR e.name = Utf8("Bob") OR d.department = Utf8("FOOBAR")   |
|               |   SubqueryAlias: e                                                                                    |
|               |     TableScan: employees projection=[emp_id, name]                                                    |
|               |   SubqueryAlias: d                                                                                    |
|               |     TableScan: department projection=[department]                                                     |
| physical_plan | ProjectionExec: expr=[emp_id@1 as emp_id, name@2 as name, department@0 as department]                 |
|               |   NestedLoopJoinExec: join_type=Right, filter=name@0 = Alice OR name@0 = Bob OR department@1 = FOOBAR |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                                                     |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                                                     |
|               |                                                                                                       |
+---------------+-------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.016 seconds.

These filters are also not present if have an AND clause instead:

> EXPLAIN SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' AND e.name = 'Bob');
+---------------+---------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                  |
+---------------+---------------------------------------------------------------------------------------+
| logical_plan  | Left Join:  Filter: e.name = Utf8("Alice") AND e.name = Utf8("Bob")                   |
|               |   SubqueryAlias: e                                                                    |
|               |     TableScan: employees projection=[emp_id, name]                                    |
|               |   SubqueryAlias: d                                                                    |
|               |     TableScan: department projection=[department]                                     |
| physical_plan | ProjectionExec: expr=[emp_id@1 as emp_id, name@2 as name, department@0 as department] |
|               |   NestedLoopJoinExec: join_type=Right, filter=name@0 = Alice AND name@0 = Bob         |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                                     |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                                     |
|               |                                                                                       |
+---------------+---------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.013 seconds.

from arrow-datafusion.

Incorrect LEFT JOIN evaluation result on OR conditions about arrow-datafusion HOT 6 OPEN

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent