Comments (7)
Is it common to find joins which aren't equi-joins in the workloads you are running?
I'm not sure we ran into this before, which I guess is because most of the joins in our workloads are equi-joins? Cc: @mbasmanova @Yuhta
from velox.
@pedroerp : Join planning for correlated queries often uses non-equi joins. It is important for us to plan this efficiently.
I've heard in passing that most Meta queries use equi-joins. Though I remember Amit pushing for some NestedLoopJoin fixes in the past. So Meta workloads do have non-equi joins as well I think.
But the original NestedLoopJoin operators are a non-Meta contribution AFAIK.
I looked at LookupJoin in Presto a little and found that it can first do a hash join on some keys, then do a range query on other keys of the matched rows to filter them down further (like the join filter we have in the Velox hash join). Do we want to support this use case? Or do we just want range query join as the main way to join?
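A minimal sketch of the "hash on some keys, then range query on the matched rows" pattern described above. This is not Presto's or Velox's actual code: `KeyedRow`, `HashThenRangeIndex`, and `lookup` are hypothetical names, and the per-bucket sorted vector is an assumption made only to keep the example small.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical build-side row: an equality key, a range key, and a row id.
struct KeyedRow {
  int64_t eqKey;
  int64_t rangeKey;
  int32_t rowId;
};

// Hypothetical index: hash on eqKey, then keep each bucket sorted by rangeKey.
class HashThenRangeIndex {
 public:
  explicit HashThenRangeIndex(const std::vector<KeyedRow>& rows) {
    for (const auto& row : rows) {
      buckets_[row.eqKey].push_back(row);
    }
    for (auto& entry : buckets_) {
      std::stable_sort(
          entry.second.begin(), entry.second.end(),
          [](const KeyedRow& a, const KeyedRow& b) {
            return a.rangeKey < b.rangeKey;
          });
    }
  }

  // Returns ids of build rows with the given eqKey and lo <= rangeKey <= hi.
  std::vector<int32_t> lookup(int64_t eqKey, int64_t lo, int64_t hi) const {
    std::vector<int32_t> out;
    auto bucket = buckets_.find(eqKey);
    if (bucket == buckets_.end()) {
      return out;
    }
    const auto& rows = bucket->second;
    // Binary search to the first row in range, then scan forward.
    auto it = std::lower_bound(
        rows.begin(), rows.end(), lo,
        [](const KeyedRow& row, int64_t value) { return row.rangeKey < value; });
    for (; it != rows.end() && it->rangeKey <= hi; ++it) {
      out.push_back(it->rowId);
    }
    return out;
  }

 private:
  std::unordered_map<int64_t, std::vector<KeyedRow>> buckets_;
};
```

The equality part narrows the candidates with one hash lookup; the range part then costs a binary search plus a scan over the matching rows only, instead of evaluating a filter on every row in the bucket.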
CC: @usurai
@Yuhta : My use case is for range query join (without any equality condition). It becomes a NestedLoopJoin today.
For hash join followed by range query we are fine with the HashJoin(Build/Probe) we have in Velox already.
Presto has a join where it builds a tree and then does lookups in that. Usually, databases make durable indices and then maintain these. The Presto range lookup is not like that; it builds on demand, just like a hash join. There is, I recall, some class called IndexBuilder or such. Should there be that in Velox? How would this spill? Would this be like an index in a traditional database, with a buffer pool and all? How would this be partitioned? Or broadcast? If it were partitioned, it would have to have a broadcast on the probe side or a range partition. The latter would have to be adaptive, since we are talking about a query time artifact, not an index that somebody creates and that is then maintained by the system.
Imagine a minimal implementation: Build it like a hash join but make a B tree over it instead of a hash table. If there is a leading equality, the build can be partitioned on that. Otherwise there is a broadcast from build if the build is small and a broadcast on probe if it is large. Then there is a representation of a range condition.
We have found no use for this at Meta. But somebody could build this of course. I can specify how this is done if somebody wants to write this.
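A minimal sketch of the build-then-probe idea above, for a pure range condition with no leading equality. Everything here is an assumption for illustration, not Velox code: `BuildRow`, `ProbeRow`, `SortedBuildIndex`, and `rangeJoin` are hypothetical names, and a sorted vector stands in for the B tree (both give a logarithmic seek followed by a sequential scan). Spilling, partitioning, and broadcast are left out.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical build-side row: a join key plus a row id.
struct BuildRow {
  int64_t key;
  int32_t rowId;
};

// Hypothetical probe-side row carrying the range condition lo <= key <= hi.
struct ProbeRow {
  int64_t lo;
  int64_t hi;
  int32_t rowId;
};

// Query-time ordered index over the build side, built once like a hash table.
class SortedBuildIndex {
 public:
  explicit SortedBuildIndex(std::vector<BuildRow> rows)
      : rows_(std::move(rows)) {
    std::stable_sort(
        rows_.begin(), rows_.end(),
        [](const BuildRow& a, const BuildRow& b) { return a.key < b.key; });
  }

  // Returns ids of build rows with lo <= key <= hi, in key order.
  std::vector<int32_t> lookupRange(int64_t lo, int64_t hi) const {
    std::vector<int32_t> out;
    auto it = std::lower_bound(
        rows_.begin(), rows_.end(), lo,
        [](const BuildRow& row, int64_t value) { return row.key < value; });
    for (; it != rows_.end() && it->key <= hi; ++it) {
      out.push_back(it->rowId);
    }
    return out;
  }

 private:
  std::vector<BuildRow> rows_;
};

// Emits (probeRowId, buildRowId) pairs for every match.
std::vector<std::pair<int32_t, int32_t>> rangeJoin(
    const SortedBuildIndex& index, const std::vector<ProbeRow>& probe) {
  std::vector<std::pair<int32_t, int32_t>> out;
  for (const auto& p : probe) {
    for (int32_t buildId : index.lookupRange(p.lo, p.hi)) {
      out.emplace_back(p.rowId, buildId);
    }
  }
  return out;
}
```

Compared with a nested loop join, each probe row pays for a binary search plus the rows it actually matches, rather than a pass over the whole build side.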
@oerling : Agree with the idea of exploring a query time artifact instead of an index as in a traditional database with a buffer pool.
Your suggestion to build it like a hash join, but as a B-tree with links to follow for a range search, is a good starting point.
There isn't a pressing need for this from our side right away. But I'm intrigued to hear more about your ideas for this.
We might find interested folks to write this.