
Comments (8)

alamb avatar alamb commented on July 22, 2024 3

An update here on the plan from @XiangpengHao:

  • He has local changes (that also require some additional features that will be released in arrow 52.2.0) that show significant performance improvements for TPCH and ClickBench queries
  • The hope is that these proposals are all up for review by the end of this week
  • So by the time arrow 52.2.0 is released (early August 2024) we'll be able to add an option that makes DataFusion use StringView when reading from Parquet / filtering
  • We also plan to write a blog post about this work / adventure

I for one am very excited

from arrow-datafusion.

alamb avatar alamb commented on July 22, 2024 1

Now that we have upgraded to arrow 52.1.0, I think we could merge the string-view branch to main. I'll try and make a PR if no one beats me to it


alamb avatar alamb commented on July 22, 2024

I think we should aim for a first "milestone" of showing improvements for some clickbench queries


jayzhan211 avatar jayzhan211 commented on July 22, 2024

Will we completely change StringArray to StringViewArray in DataFusion?
While trying to use StringViewArray in #10976, I found a schema mismatch issue between Utf8 and Utf8View. To avoid converting StringViewArray to StringArray, we might need to change the schema to Utf8View throughout, from the logical plan to the physical plan. If we need to keep both String and StringView, then we need to think about how to handle the conversion between these two types.

A more concrete example:

statement ok
create table t(a int, b varchar, c int) as values (1, 'a', 3), (2, 'c', 1), (1, 'c', 2), (1, 'a', 4);

We have the string column b as StringArray and DataType::Utf8 now. Should we convert it to StringViewArray and DataType::Utf8View?

If not, but we still want to use StringViewArray in some places, how do we minimize the cost of converting between String and StringView?
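To make the conversion cost concrete, here is a minimal std-only sketch of the two layouts. It is not the arrow-rs implementation: `View`, `OffsetStrings`, `to_views`, and `to_offsets` are hypothetical names used only for illustration. It assumes the Arrow Utf8View format as I understand it, where each value is a 16-byte view that either holds up to 12 bytes inline or holds a 4-byte prefix plus a buffer index and offset.

```rust
/// One 16-byte view, modelled after the Arrow Utf8View layout (illustrative only).
/// Strings of at most 12 bytes live inline; longer ones point into a data buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum View {
    Inline { len: u32, data: [u8; 12] },
    Ref { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Offsets-based layout (as in StringArray): one contiguous buffer plus n + 1 offsets.
pub struct OffsetStrings {
    pub offsets: Vec<i32>,
    pub data: Vec<u8>,
}

impl OffsetStrings {
    pub fn from_strs(values: &[&str]) -> Self {
        let mut offsets = vec![0i32];
        let mut data = Vec::new();
        for v in values {
            data.extend_from_slice(v.as_bytes());
            offsets.push(data.len() as i32);
        }
        Self { offsets, data }
    }
}

/// String -> StringView: build views over the existing data buffer (treated as
/// buffer 0). Only the views are written; long strings are referenced, not copied.
pub fn to_views(src: &OffsetStrings) -> Vec<View> {
    src.offsets
        .windows(2)
        .map(|w| {
            let (start, end) = (w[0] as usize, w[1] as usize);
            let bytes = &src.data[start..end];
            let len = bytes.len() as u32;
            if bytes.len() <= 12 {
                let mut data = [0u8; 12];
                data[..bytes.len()].copy_from_slice(bytes);
                View::Inline { len, data }
            } else {
                let mut prefix = [0u8; 4];
                prefix.copy_from_slice(&bytes[..4]);
                View::Ref { len, prefix, buffer_index: 0, offset: start as u32 }
            }
        })
        .collect()
}

/// StringView -> String: every value must be copied into a new contiguous buffer.
pub fn to_offsets(views: &[View], buffers: &[&[u8]]) -> OffsetStrings {
    let mut offsets = vec![0i32];
    let mut data = Vec::new();
    for v in views {
        match *v {
            View::Inline { len, data: inline } => {
                data.extend_from_slice(&inline[..len as usize]);
            }
            View::Ref { len, buffer_index, offset, .. } => {
                let start = offset as usize;
                let buf = buffers[buffer_index as usize];
                data.extend_from_slice(&buf[start..start + len as usize]);
            }
        }
        offsets.push(data.len() as i32);
    }
    OffsetStrings { offsets, data }
}
```

Under these assumptions the two directions are asymmetric: going to views costs 16 bytes per value and can keep the original data buffer, while going back copies all string bytes. That suggests keeping StringView for as long a stretch of the plan as possible rather than converting back and forth at every operator.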

It seems Polars completely refactored its String type to StringView 🤔


alamb avatar alamb commented on July 22, 2024

> Will we completely change StringArray to StringViewArray in DataFusion?

Since they are two separate types in Arrow, I don't think we could fully switch to StringView the way Polars could, as Polars controls the whole stack. Users could still feed DataFusion StringViewArray from custom TableProviders and would expect StringView at the output.

However, what I think we could do is use StringView internally to DataFusion (e.g. within the plan, before the final output), in the batches that flow through intermediate nodes in the plan.

> I found a schema mismatch issue between Utf8 and Utf8View. To avoid converting StringViewArray to StringArray, we might need to change the schema to Utf8View

Indeed. As you point out, I don't think we can transparently switch to using StringView; instead we would have to start encoding information about the new types in the plans.

I wonder if we could have a new logical optimizer pass that tries to annotate operations that support it so they use StringView in their schema rather than String. Then the ExecutionPlans would know whether they are supposed to generate StringView as output or the more traditional StringArray 🤔
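As a rough illustration of the annotation idea, here is a toy sketch over a made-up plan tree. None of this is DataFusion's API: `PlanNode`, `StrType`, `supports_view`, `annotate`, and `finalize` are hypothetical names, and the rule used (a node emits Utf8View only if it supports it and all of its inputs already emit Utf8View) is just one possible policy. The cast added at the root is likewise an assumption about how the final output could be kept as the traditional type.

```rust
/// String type a node produces (toy stand-in for DataType::Utf8 / DataType::Utf8View).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StrType {
    Utf8,
    Utf8View,
}

/// Hypothetical plan node; not a DataFusion type.
#[derive(Debug)]
pub struct PlanNode {
    pub name: &'static str,
    /// Assumed flag: the operator has a StringView-aware implementation.
    pub supports_view: bool,
    pub output: StrType,
    pub children: Vec<PlanNode>,
}

/// Bottom-up pass: a node is annotated with Utf8View only when it supports it
/// and every input already produces Utf8View (leaves decide on support alone).
pub fn annotate(node: &mut PlanNode) {
    for child in &mut node.children {
        annotate(child);
    }
    let inputs_ok = node.children.iter().all(|c| c.output == StrType::Utf8View);
    node.output = if node.supports_view && inputs_ok {
        StrType::Utf8View
    } else {
        StrType::Utf8
    };
}

/// Annotate the plan, then add a cast at the root if needed so the final
/// output is still Utf8 (an assumption made for this sketch).
pub fn finalize(mut root: PlanNode) -> PlanNode {
    annotate(&mut root);
    if root.output == StrType::Utf8View {
        PlanNode {
            name: "CastToUtf8",
            supports_view: true,
            output: StrType::Utf8,
            children: vec![root],
        }
    } else {
        root
    }
}
```

With a scan and filter that support views under a sort that does not, this pass would mark the scan and filter as Utf8View and leave the sort at Utf8. A real pass would also need to place the conversion between the filter and the sort, which this sketch leaves out.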

Here is an idea of one place to start: #9403 (comment)


alamb avatar alamb commented on July 22, 2024

I think @XiangpengHao is looking into another place to use StringView, which is #10921, where we have a similar idea to use StringView in some sub-portion of the plan. Here is more info about the optimizer pass idea: #10921 (comment)


alamb avatar alamb commented on July 22, 2024

I think this project is going pretty well :bowtie:

We are at the point of starting to implement some basic functions using StringView.


alamb avatar alamb commented on July 22, 2024

Update: we merged the basic support for StringView in #11402

I have created a branch to collect any changes that rely on the pre-release parquet/arrow 52.2.0 version here: https://github.com/apache/datafusion/tree/string-view2

