
Comments (8)

2010YOUY01 commented on August 17, 2024

Do you think it would still be desirable to be able to turn off file-level partitioning for CSVs only (and/or to vary it on a per-type basis)?

I'm not familiar enough with Parquet to understand if there would be a functional reason to disable it for that format?

Yes, using the global option is only a quick fix.

I think it's a good idea to add an extra option field for individual files; later, the planner can make better decisions based on each file's newlines_in_values property (for now, disabling parallel execution).

And it can be CSV-specific; Parquet doesn't seem relevant here.

from arrow-datafusion.

2010YOUY01 commented on August 17, 2024

I haven't dug into the implementation, but I imagine it becomes harder to find the right split point for multi-threaded reading

That is correct, DataFusion parallel CSV scan does:

  1. For example, it gets a 1000-byte CSV file and wants to read it with 2 threads.
  2. The file is split into two ranges, [0, 500) and [500, 1000); each tentative boundary is then probed for the actual line boundary and the ranges are adjusted.
  3. arrow-rs then handles reading each valid byte range of the file.
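The split-and-adjust step above can be sketched in plain Rust. This is illustrative only (names and details are mine, not DataFusion's actual implementation): split the file into even byte ranges, then nudge each internal boundary forward to the next line start.

```rust
/// Split `len` bytes into up to `partitions` even [start, end) ranges.
fn naive_ranges(len: usize, partitions: usize) -> Vec<(usize, usize)> {
    let chunk = (len + partitions - 1) / partitions;
    (0..partitions)
        .map(|i| (i * chunk, ((i + 1) * chunk).min(len)))
        .filter(|&(s, e)| s < e)
        .collect()
}

/// Nudge a tentative boundary forward to the next line start, so each
/// partition begins on a fresh record (quoting is ignored here, which is
/// exactly why quoted newlines break this scheme).
fn adjust_to_line_boundary(data: &[u8], pos: usize) -> usize {
    if pos == 0 || data[pos - 1] == b'\n' {
        return pos; // already at a line start
    }
    match data[pos..].iter().position(|&b| b == b'\n') {
        Some(off) => pos + off + 1,
        None => data.len(),
    }
}

fn main() {
    let csv = b"a,b\n1,22\n333,4\n";
    let mut ranges = naive_ranges(csv.len(), 2);
    // Fix up every internal boundary so no record is cut in half.
    for i in 1..ranges.len() {
        let fixed = adjust_to_line_boundary(csv, ranges[i].0);
        ranges[i - 1].1 = fixed;
        ranges[i].0 = fixed;
    }
    println!("{:?}", ranges); // each range now starts at a line boundary
}
```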

So I think the first step will be to add arrow support.


connec commented on August 17, 2024

I see, so an approach could be to:

  1. Add newlines_in_values: bool to arrow_csv::reader::Format. The implementation could use this to consume terminators within quotes. arrow-csv operates in this way by default.

  2. Add newlines_in_values: bool to datafusion::config::CsvOptions (and datafusion::datasource::file_format::csv::CsvFormat). The implementation could use this in two ways:

    1. Pass the newlines_in_values flag on to arrow-csv.
    2. Disable parallelism in CSV scanning (I don't think it would be possible to reliably determine if you're inside a quoted string without seeing the whole file).
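The parenthetical in point 2 can be demonstrated in a few lines of plain Rust (illustrative, not arrow-csv's API): a reader starting mid-file cannot know whether it is inside quotes, because quote state depends on every byte seen so far. Splitting at raw newlines therefore mis-counts records whenever a quoted field contains a line break.

```rust
/// Count "rows" by splitting at every raw b'\n' (what a naive parallel
/// boundary probe effectively assumes).
fn raw_newline_rows(data: &str) -> usize {
    data.split('\n').filter(|l| !l.is_empty()).count()
}

/// Quote-aware record count: a newline inside double quotes does not
/// terminate a record (simplified; no escaped-quote handling).
fn quoted_records(data: &str) -> usize {
    let mut in_quotes = false;
    let mut records = 0;
    for c in data.chars() {
        match c {
            '"' => in_quotes = !in_quotes,
            '\n' if !in_quotes => records += 1,
            _ => {}
        }
    }
    records
}

fn main() {
    let csv = "a,b\n\"line1\nline2\",2\n";
    // The naive split sees 3 rows; quote-aware parsing sees 2 records.
    println!("{} vs {}", raw_newline_rows(csv), quoted_records(csv));
}
```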


connec commented on August 17, 2024

Turns out arrow-csv does correctly parse newlines in quoted values, so the issue comes from reading CSVs in parallel. Limiting the target number of partitions to 1 solves the issue.

For example, if I add this to my code before I run ctx.sql("SELECT * FROM <csv including linebreaks>"), then the query executes successfully (without it, there are errors about wrong numbers of columns):

// Blunt workaround: force the entire session down to a single partition
// so the CSV is scanned by one thread.
ctx.state_weak_ref()
    .upgrade()
    .unwrap()
    .write_arc()
    .config_mut()
    .options_mut()
    .execution
    .target_partitions = 1;

From a UX perspective, this setting is quite disconnected from the intent. It also requires changing the session-level target_partitions setting, which (I assume, untested) would adversely affect reads of CSVs that aren't expected to contain linebreaks. A per-CSV newlines_in_values setting would help to signpost this situation and allow more efficient plans when only some of the CSVs being queried have this requirement.


2010YOUY01 commented on August 17, 2024

Turns out arrow-csv does correctly parse newlines in quoted values, so the issue comes from reading CSVs in parallel. Limiting the target number of partitions to 1 solves the issue.

Thank you for the investigation @connec. If that's the case, setting datafusion.optimizer.repartition_file_scans to false might help: https://datafusion.apache.org/user-guide/configs.html
IIRC this option will only limit the file scan to one thread; other parts of the plan will still be parallelized using target_partitions.
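For reference, that option can also be toggled at runtime through DataFusion's SQL interface. A sketch, assuming an async context and a SessionContext named ctx as in the snippet above; the option name is taken from the linked configuration docs:

```rust
// Disable file-scan repartitioning for this session only; other plan
// stages still parallelize up to target_partitions.
ctx.sql("SET datafusion.optimizer.repartition_file_scans = false").await?;
```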


connec commented on August 17, 2024

Thanks for the pointer! That would certainly be better than disabling parallelization throughout.

Do you think it would still be desirable to be able to turn off file-level partitioning for CSVs only (and/or to vary it on a per-type basis)?

I'm not familiar enough with Parquet to understand if there would be a functional reason to disable it for that format?


connec commented on August 17, 2024

Furthermore, I think there could be value in controlling this on a per-file basis. E.g. if you're working with multiple CSVs, one of which you know contains newlines in values and others that you know do not (and so can benefit from parallelisation).


connec commented on August 17, 2024

I've opened a PR with the most straightforward implementation of this I could think of. I'd be glad to receive any feedback on that approach.

