Comments (6)
SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'Bob');
Hmm, as the join filter is only on employees
, I think it is pushdown through the Join. The results seem correct to me. I will verify it on Spark later.
from arrow-datafusion.
The results seem correct to me
No, what essentially this query is:
SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (FALSE);
which by definition of LEFT JOIN
should not reduce number of rows on the LEFT
from arrow-datafusion.
Yea, just verified with Spark:
scala> df.show
+------+-----+-----------+
|emp_id| name| dept_name|
+------+-----+-----------+
| 1|Alice| Sales|
| 1|Alice| HR|
| 1|Alice|Engineering|
| 2| Bob| Sales|
| 2| Bob| HR|
| 2| Bob|Engineering|
| 3|Carol| null|
+------+-----+-----------+
from arrow-datafusion.
This will cause correctness issue on Comet too. I will take a look.
from arrow-datafusion.
Could this be related to a bug in the optimizer (pushing down filter to the left where it shouldn't)?
I tried adding a trivially false term to the OR clause that references the department
table. This shouldn't change the result but would inhibit pushdown to the employees
table. I got the expected results:
> SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'Bob' OR d.department = 'FOOBAR');
+--------+-------+-------------+
| emp_id | name | department |
+--------+-------+-------------+
| 1 | Alice | HR |
| 2 | Bob | HR |
| 1 | Alice | Engineering |
| 2 | Bob | Engineering |
| 1 | Alice | Sales |
| 2 | Bob | Sales |
| 3 | Carol | |
+--------+-------+-------------+
7 row(s) fetched.
Elapsed 0.014 seconds.
from arrow-datafusion.
If we look at the following plan:
> EXPLAIN SELECT e.emp_id, e.name, d.department FROM employees e LEFT JOIN department d ON (e.name = 'Alice' OR e.name = 'Bob'); +---------------+----------------------------------------------------------------------------+ | plan_type | plan | +---------------+----------------------------------------------------------------------------+ | logical_plan | Left Join: Filter: e.name = Utf8("Alice") OR e.name = Utf8("Bob") | | | SubqueryAlias: e | | | Filter: employees.name = Utf8("Alice") OR employees.name = Utf8("Bob") | | | TableScan: employees projection=[emp_id, name] | | | SubqueryAlias: d | | | TableScan: department projection=[department] | | physical_plan | NestedLoopJoinExec: join_type=Left, filter=name@0 = Alice OR name@0 = Bob | | | CoalesceBatchesExec: target_batch_size=8192 | | | FilterExec: name@1 = Alice OR name@1 = Bob | | | MemoryExec: partitions=1, partition_sizes=[1] | | | MemoryExec: partitions=1, partition_sizes=[1] | | | | +---------------+----------------------------------------------------------------------------+
The filters in italics seem wrong. They indeed disappear in the example I gave above:
> EXPLAIN SELECT e.emp_id, e.name, d.department FROM employees e LEFT JOIN department d ON (e.name = 'Alice' OR e.name = 'Bob' OR d.department = 'FOOBAR'); +---------------+-------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+-------------------------------------------------------------------------------------------------------+ | logical_plan | Left Join: Filter: e.name = Utf8("Alice") OR e.name = Utf8("Bob") OR d.department = Utf8("FOOBAR") | | | SubqueryAlias: e | | | TableScan: employees projection=[emp_id, name] | | | SubqueryAlias: d | | | TableScan: department projection=[department] | | physical_plan | ProjectionExec: expr=[emp_id@1 as emp_id, name@2 as name, department@0 as department] | | | NestedLoopJoinExec: join_type=Right, filter=name@0 = Alice OR name@0 = Bob OR department@1 = FOOBAR | | | MemoryExec: partitions=1, partition_sizes=[1] | | | MemoryExec: partitions=1, partition_sizes=[1] | | | | +---------------+-------------------------------------------------------------------------------------------------------+ 2 row(s) fetched. Elapsed 0.016 seconds.
These filters are also not present if have an AND clause instead:
> EXPLAIN SELECT e.emp_id, e.name, d.department FROM employees e LEFT JOIN department d ON (e.name = 'Alice' AND e.name = 'Bob'); +---------------+---------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+---------------------------------------------------------------------------------------+ | logical_plan | Left Join: Filter: e.name = Utf8("Alice") AND e.name = Utf8("Bob") | | | SubqueryAlias: e | | | TableScan: employees projection=[emp_id, name] | | | SubqueryAlias: d | | | TableScan: department projection=[department] | | physical_plan | ProjectionExec: expr=[emp_id@1 as emp_id, name@2 as name, department@0 as department] | | | NestedLoopJoinExec: join_type=Right, filter=name@0 = Alice AND name@0 = Bob | | | MemoryExec: partitions=1, partition_sizes=[1] | | | MemoryExec: partitions=1, partition_sizes=[1] | | | | +---------------+---------------------------------------------------------------------------------------+ 2 row(s) fetched. Elapsed 0.013 seconds.
from arrow-datafusion.
Related Issues (20)
- Implement SQLancer (a end-to-end SQL fuzz testing library) HOT 1
- Bugs in LCM/GCD scalar functions (found by SQLancer) HOT 1
- Improve `LIKE` performance for Dictionary arrays HOT 1
- to_timestamp functions should preserve timezone from inputs
- Int64 should be coercible to timestamp types
- Potential memory issue when using COPY with PARTITIONED BY HOT 2
- Make modulos with negative float zero compat with other engines HOT 8
- Verify if DISTINCT supports all types, incl binary, complex, etc
- Handle overflow in GCD scalar function HOT 1
- Order of Interval Addition Should Affect Final Output HOT 6
- `Expr::Wildcard` should use `TableReference` as the qualifier type HOT 1
- Overflow bug in FACTORIAL scalar function (found by SQLancer) HOT 1
- Overflow bug in POW scalar function (found by SQLancer) HOT 4
- Overflow bug in negate arithmetic operator (found by SQLancer) HOT 1
- Release DataFusion `40.0.0`
- Tracking issue: Overflow bugs in scalar math functions found by SQLancer
- Add example for writing a `FileFormat` HOT 3
- Add `union_tag` function HOT 1
- Add `union_extract` function HOT 1
- Add more support for ScalarValue::Float16 where Float32 and Float64 are supported
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from arrow-datafusion.