Comments (8)
An update here on the plan from @XiangpengHao:
- He has local changes (that also require some additional features that will be released in arrow 52.2.0) that show significant performance improvements for TPCH and ClickBench queries
- The hope is that these proposals are all up for review by the end of this week
- So by the time arrow 52.2.0 is released (early August 2024) we'll be able to add an option that makes DataFusion use StringView when reading from Parquet / filtering
- We also plan to write a blog post about this work / adventure
I for one am very excited
from arrow-datafusion.
Now that we have upgraded to arrow 52.1.0, I think we could merge the string-view branch to main. I'll try and make a PR if no one beats me to it.
I think we should aim for a first "milestone" of showing improvements for some ClickBench queries.
Will we completely change StringArray to StringViewArray in DataFusion?
While trying to use StringViewArray in #10976, I found a schema mismatch between Utf8 and Utf8View. To avoid converting StringViewArray back to StringArray, we might need to change the schema to Utf8View throughout, from the logical plan to the physical plan. If we need to keep both String and StringView, then we need to think about how to handle the conversion between the two types.
A more concrete example is:
statement ok
create table t(a int, b varchar, c int) as values (1, 'a', 3), (2, 'c', 1), (1, 'c', 2), (1, 'a', 4);
We have the string column b as StringArray and DataType::Utf8 now. Should we convert it to StringViewArray and DataType::Utf8View?
If not, and we still want to use StringViewArray somewhere, how do we minimize the cost of conversion between String and StringView?
It seems Polars completely refactored their String to StringView 🤔
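To make the conversion cost question a bit more concrete, here is a minimal sketch of what turning an offset-based string array into views involves. It uses only the standard library and toy types (`ToyStringArray`, `View`, `to_views` and `view_bytes` are hypothetical names, not arrow-rs APIs). The only thing taken from Arrow is the view layout: 16 bytes per value, with strings of 12 bytes or fewer stored inline and longer ones stored as a 4-byte prefix plus a buffer index and offset.

```rust
/// One 16-byte view, modeled after the Arrow Utf8View layout:
/// a 4-byte length, then either 12 inline bytes (len <= 12) or
/// a 4-byte prefix + 4-byte buffer index + 4-byte offset.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum View {
    Inline { len: u32, data: [u8; 12] },
    Ref { len: u32, prefix: [u8; 4], buffer: u32, offset: u32 },
}

/// Toy stand-in for an offset-based string array (hypothetical, not arrow-rs).
pub struct ToyStringArray {
    pub offsets: Vec<u32>, // n + 1 entries for n strings
    pub values: Vec<u8>,   // all string bytes, back to back
}

impl ToyStringArray {
    pub fn from_strs(items: &[&str]) -> Self {
        let mut offsets = vec![0u32];
        let mut values = Vec::new();
        for s in items {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len() as u32);
        }
        Self { offsets, values }
    }
}

/// Build one view per string. The existing `values` buffer is treated as
/// buffer 0, so long strings are referenced rather than copied; only the
/// views themselves are newly allocated.
pub fn to_views(arr: &ToyStringArray) -> Vec<View> {
    arr.offsets
        .windows(2)
        .map(|w| {
            let bytes = &arr.values[w[0] as usize..w[1] as usize];
            let len = bytes.len() as u32;
            if bytes.len() <= 12 {
                // Short string: copy the bytes into the view itself.
                let mut data = [0u8; 12];
                data[..bytes.len()].copy_from_slice(bytes);
                View::Inline { len, data }
            } else {
                // Long string: keep a prefix and point into the old buffer.
                let mut prefix = [0u8; 4];
                prefix.copy_from_slice(&bytes[..4]);
                View::Ref { len, prefix, buffer: 0, offset: w[0] }
            }
        })
        .collect()
}

/// Resolve a view back to its bytes, given the single data buffer.
pub fn view_bytes<'a>(view: &'a View, buffer: &'a [u8]) -> &'a [u8] {
    match view {
        View::Inline { len, data } => &data[..*len as usize],
        View::Ref { len, offset, .. } => {
            let start = *offset as usize;
            &buffer[start..start + *len as usize]
        }
    }
}
```

Under these assumptions the String to StringView direction costs one pass over the offsets plus 16 bytes per row, while the reverse direction has to copy every string back into one contiguous buffer.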
> Will we completely change StringArray to StringViewArray in Datafusion?
Since they are two separate types in Arrow, I don't think we could fully switch to StringView the way Polars could, as it controls the whole stack. Users could still feed DataFusion StringViewArray from custom TableProviders and would expect StringView at the output.
However, what I think we could do internally to DataFusion (e.g. within the plan, before the final output) is use StringView in the batches that flow through intermediate nodes in the plan.
> I found there is schema mismatched issue UTF8 vs UTF8View. To avoid converting StringViewArray to StringArray, we might need to change the schema to UTF8View
Indeed, as you point out, I don't think we can transparently switch to using StringView -- instead we would have to start encoding information in the plans about the new types.
I wonder if we could have a new logical optimizer pass that tried to annotate operations that support it to use StringView in their schema rather than String. Then the ExecutionPlans would know if they were supposed to generate StringView as output or the more traditional StringArray 🤔
Here is an idea of one place to start: #9403 (comment)
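As a rough illustration of the optimizer pass idea, here is a toy sketch using only the standard library. Everything in it is hypothetical (`ToyType`, `ToyNode`, `annotate`, `cast_boundaries`); it is not DataFusion's `LogicalPlan` or optimizer API. It assumes each node knows whether it supports StringView, rewrites the schema of the nodes that do, and then reports where a cast back to Utf8 would be needed.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ToyType {
    Int32,
    Utf8,
    Utf8View,
}

/// A toy single-input plan node (hypothetical, not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
pub struct ToyNode {
    pub name: &'static str,
    /// Assumption: each operator can say whether it handles StringView.
    pub supports_view: bool,
    pub output: Vec<ToyType>,
    pub input: Option<Box<ToyNode>>,
}

/// Walk the plan bottom-up and switch Utf8 columns to Utf8View in the
/// output schema of every node that supports it.
pub fn annotate(node: &mut ToyNode) {
    if let Some(child) = node.input.as_mut() {
        annotate(child);
    }
    if node.supports_view {
        for ty in node.output.iter_mut() {
            if *ty == ToyType::Utf8 {
                *ty = ToyType::Utf8View;
            }
        }
    }
}

/// Names of nodes that do not support views but whose input now emits
/// Utf8View, i.e. the places where a cast back to Utf8 would be inserted.
pub fn cast_boundaries(root: &ToyNode) -> Vec<&'static str> {
    let mut out = Vec::new();
    let mut cur = root;
    while let Some(child) = cur.input.as_deref() {
        if !cur.supports_view && child.output.contains(&ToyType::Utf8View) {
            out.push(cur.name);
        }
        cur = child;
    }
    out
}
```

For a plan shaped like Output <- Filter <- ParquetScan where only the scan and filter support views, the pass would leave the final output as Utf8 and flag a single cast at the Output node, which matches the "StringView internally, StringArray at the edges" idea above.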
I think @XiangpengHao is looking into another place to use StringView which is #10921 -- where we have a similar idea to use StringView in some sub portion of the plan. Here is more info about the optimizer pass idea: #10921 (comment)
I think this project is going pretty well.
We are at the point of starting to implement some basic functions using StringView.
Update: we merged the basic support for StringView in #11402.
I have created a branch to collect any changes that rely on the pre-release parquet/arrow 52.2.0 version here: https://github.com/apache/datafusion/tree/string-view2