Code Monkey home page Code Monkey logo

Comments (6)

viirya avatar viirya commented on June 24, 2024

SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'Bob');

Hmm, as the join filter is only on employees, I think it is pushdown through the Join. The results seem correct to me. I will verify it on Spark later.

from arrow-datafusion.

redbaron avatar redbaron commented on June 24, 2024

The results seem correct to me

No, what essentially this query is:

SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (FALSE);

which by definition of LEFT JOIN should not reduce number of rows on the LEFT

from arrow-datafusion.

viirya avatar viirya commented on June 24, 2024

Yea, just verified with Spark:

scala> df.show
+------+-----+-----------+
|emp_id| name|  dept_name|
+------+-----+-----------+
|     1|Alice|      Sales|
|     1|Alice|         HR|
|     1|Alice|Engineering|
|     2|  Bob|      Sales|
|     2|  Bob|         HR|
|     2|  Bob|Engineering|
|     3|Carol|       null|
+------+-----+-----------+

from arrow-datafusion.

viirya avatar viirya commented on June 24, 2024

This will cause correctness issue on Comet too. I will take a look.

from arrow-datafusion.

ozankabak avatar ozankabak commented on June 24, 2024

Could this be related to a bug in the optimizer (pushing down filter to the left where it shouldn't)?

I tried adding a trivially false term to the OR clause that references the department table. This shouldn't change the result but would inhibit pushdown to the employees table. I got the expected results:

> SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'Bob' OR d.department = 'FOOBAR');
+--------+-------+-------------+
| emp_id | name  | department  |
+--------+-------+-------------+
| 1      | Alice | HR          |
| 2      | Bob   | HR          |
| 1      | Alice | Engineering |
| 2      | Bob   | Engineering |
| 1      | Alice | Sales       |
| 2      | Bob   | Sales       |
| 3      | Carol |             |
+--------+-------+-------------+
7 row(s) fetched.
Elapsed 0.014 seconds.

from arrow-datafusion.

ozankabak avatar ozankabak commented on June 24, 2024

If we look at the following plan:

> EXPLAIN SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'Bob');
+---------------+----------------------------------------------------------------------------+
| plan_type     | plan                                                                       |
+---------------+----------------------------------------------------------------------------+
| logical_plan  | Left Join:  Filter: e.name = Utf8("Alice") OR e.name = Utf8("Bob")         |
|               |   SubqueryAlias: e                                                         |
|               |     Filter: employees.name = Utf8("Alice") OR employees.name = Utf8("Bob") |
|               |       TableScan: employees projection=[emp_id, name]                       |
|               |   SubqueryAlias: d                                                         |
|               |     TableScan: department projection=[department]                          |
| physical_plan | NestedLoopJoinExec: join_type=Left, filter=name@0 = Alice OR name@0 = Bob  |
|               |   CoalesceBatchesExec: target_batch_size=8192                              |
|               |     FilterExec: name@1 = Alice OR name@1 = Bob                             |
|               |       MemoryExec: partitions=1, partition_sizes=[1]                        |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                            |
|               |                                                                            |
+---------------+----------------------------------------------------------------------------+

The filters in italics seem wrong. They indeed disappear in the example I gave above:

> EXPLAIN SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'Bob' OR d.department = 'FOOBAR');
+---------------+-------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                  |
+---------------+-------------------------------------------------------------------------------------------------------+
| logical_plan  | Left Join:  Filter: e.name = Utf8("Alice") OR e.name = Utf8("Bob") OR d.department = Utf8("FOOBAR")   |
|               |   SubqueryAlias: e                                                                                    |
|               |     TableScan: employees projection=[emp_id, name]                                                    |
|               |   SubqueryAlias: d                                                                                    |
|               |     TableScan: department projection=[department]                                                     |
| physical_plan | ProjectionExec: expr=[emp_id@1 as emp_id, name@2 as name, department@0 as department]                 |
|               |   NestedLoopJoinExec: join_type=Right, filter=name@0 = Alice OR name@0 = Bob OR department@1 = FOOBAR |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                                                     |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                                                     |
|               |                                                                                                       |
+---------------+-------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.016 seconds.

These filters are also not present if have an AND clause instead:

> EXPLAIN SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' AND e.name = 'Bob');
+---------------+---------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                  |
+---------------+---------------------------------------------------------------------------------------+
| logical_plan  | Left Join:  Filter: e.name = Utf8("Alice") AND e.name = Utf8("Bob")                   |
|               |   SubqueryAlias: e                                                                    |
|               |     TableScan: employees projection=[emp_id, name]                                    |
|               |   SubqueryAlias: d                                                                    |
|               |     TableScan: department projection=[department]                                     |
| physical_plan | ProjectionExec: expr=[emp_id@1 as emp_id, name@2 as name, department@0 as department] |
|               |   NestedLoopJoinExec: join_type=Right, filter=name@0 = Alice AND name@0 = Bob         |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                                     |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                                     |
|               |                                                                                       |
+---------------+---------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.013 seconds.

from arrow-datafusion.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.