Comments (5)

mjakubowski84 commented on June 16, 2024

from parquet4s.

kazaferovic commented on June 16, 2024

Thank you.
This doesn't work if the repetition_type is required.
But I found another possibility: read the Parquet file as generic RowParquetRecord and then get the field as Long. The drawback is that projection is not possible, but in my use case there is little or no performance impact.
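
The workaround described above could look roughly like the sketch below. It assumes Parquet4s 2.x and a hypothetical INT64 column named `id`; the object and method names are placeholders that do not come from this thread. The field is decoded as `Option[Long]` so that a null in an optional column does not break decoding. Since no projection is applied, whole records are read.

```scala
import com.github.mjakubowski84.parquet4s.{ParquetReader, Path, ValueCodecConfiguration}

object GenericReadSketch {
  private val vcc = ValueCodecConfiguration.Default

  // Reads whole records without projection and decodes a single field.
  // "id" is a hypothetical column name. The result is None when the
  // field is missing from the record or when its value is null.
  def readIds(path: Path): List[Option[Long]] = {
    val records = ParquetReader.generic.read(path)
    try records.map(_.get[Option[Long]]("id", vcc).flatten).toList
    finally records.close()
  }
}
```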

mjakubowski84 commented on June 16, 2024

Hmm, I assume that we misunderstood each other and you mean that you have files with different schemas. One set of files has required fields and the second set has optional fields. And you try to read those files at once in a single job. Am I right?
In such a situation you cannot use projection at all, because by defining a projection you enforce your own schema on the file. If your schema does not match the schema of the column that you read, you will get an error.
However, if you read the data without projection, either as generic records (RowParquetRecord) or as a case class (use Options), then all should be fine.
Another way is to group your files so that each group is homogeneous and read the groups separately with projection. Then use Akka Stream or the experimental ETL of Parquet4s to concatenate the streams. Check the example: https://github.com/mjakubowski84/parquet4s/blob/master/examples/src/main/scala/com/github/mjakubowski84/parquet4s/core/ColumnProjectionAndDataConcatenationApp.scala
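
A minimal sketch of the grouping idea, assuming Parquet4s 2.x. The case classes, the `id` column and the two group paths are hypothetical. Instead of Akka Stream or the ETL module, it simply concatenates the two results in memory, which is only reasonable for small data; the linked example shows the library's own approach.

```scala
import com.github.mjakubowski84.parquet4s.{ParquetReader, Path}

object GroupedProjectionSketch {
  // Hypothetical schemas: one group of files stores "id" as required,
  // the other group stores it as optional.
  case class RequiredId(id: Long)
  case class OptionalId(id: Option[Long])

  // Reads the group with the required column using a matching projection.
  private def readRequired(group: Path): List[Option[Long]] = {
    val rows = ParquetReader.projectedAs[RequiredId].read(group)
    try rows.map(r => Option(r.id)).toList
    finally rows.close()
  }

  // Reads the group with the optional column using a matching projection.
  private def readOptional(group: Path): List[Option[Long]] = {
    val rows = ParquetReader.projectedAs[OptionalId].read(group)
    try rows.map(_.id).toList
    finally rows.close()
  }

  // Each homogeneous group is read on its own, then both results are
  // joined after being mapped to the common type Option[Long].
  def readAllIds(requiredGroup: Path, optionalGroup: Path): List[Option[Long]] =
    readRequired(requiredGroup) ++ readOptional(optionalGroup)
}
```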

kazaferovic commented on June 16, 2024

You are right in what I am trying to do.
ParquetReader.as[CaseClass] works even without wrapping the Long in an Option, and also with a case class that contains only a subset of the available fields. But I assume that it is not as efficient as projection (at least theoretically)?
However, that is a good enough solution. Thanks again.

mjakubowski84 commented on June 16, 2024

> But I assume that it is not as efficient as projection

Yes, it does no projection; it just decodes the whole set of data into a smaller class.
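
To make the difference concrete, here is a short sketch that puts the two read styles side by side, assuming Parquet4s 2.x. `IdOnly` and the `id` column are hypothetical. On a file whose schema matches the case class, both calls should return the same rows; only the projected variant passes the reduced schema to the reader.

```scala
import com.github.mjakubowski84.parquet4s.{ParquetReader, Path}

object SubsetReadSketch {
  // Hypothetical case class naming only a subset of the file's columns.
  case class IdOnly(id: Long)

  // No projection: whole records are read and then decoded
  // into the smaller class.
  def readWithoutProjection(path: Path): List[IdOnly] = {
    val rows = ParquetReader.as[IdOnly].read(path)
    try rows.toList
    finally rows.close()
  }

  // Projection: the schema derived from IdOnly is enforced on the file,
  // so it must match the schema of the column that is read.
  def readWithProjection(path: Path): List[IdOnly] = {
    val rows = ParquetReader.projectedAs[IdOnly].read(path)
    try rows.toList
    finally rows.close()
  }
}
```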
