Comments (8)
Hi!
Parquet itself does not support fields of type `Any`. You need to specify a fixed type, so I suggest you change the model of `DataMap`. For example, you can have two maps: `stringIds: Map[String, String]` and `decimalIds: Map[String, BigDecimal]`.
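For illustration, a minimal sketch of that reshaped model (names taken from this thread; the rest of `DataMap` is assumed):

```scala
// Two homogeneous maps instead of a single Map[String, Any]
case class DataMap(
  stringIds: Map[String, String],
  decimalIds: Map[String, BigDecimal]
)
```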
Unfortunately, there's way too much legacy code that depends on this. Can I dynamically generate the `TypedSchemaDef` for `Ref[A]` somehow? How is it that the `SchemaDef` I wrote actually works? I didn't reason it out so much as I tried different things.
The schema is for the whole Parquet file, not for a single row. So, if you write all decimals to one file and all strings to another (with another schema), then it will work.
However, you can expect problems later when reading files with conflicting schemas.
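For illustration, a minimal sketch of the two-file approach (hypothetical row models; parquet4s 2.x writer API):

```scala
import com.github.mjakubowski84.parquet4s.{ParquetWriter, Path}

// Hypothetical models: each file gets its own, internally consistent schema
case class StringRow(id: String, value: String)
case class DecimalRow(id: String, value: BigDecimal)

ParquetWriter.of[StringRow].writeAndClose(Path("strings.parquet"), Seq(StringRow("a", "x")))
ParquetWriter.of[DecimalRow].writeAndClose(Path("decimals.parquet"), Seq(DecimalRow("a", BigDecimal(1))))
```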
The thing is that I wrote the encoder to always write strings, but it seems the type of the input data is checked against the output schema, rather than the encoder output being validated against the schema. So if I change that map to use `stringSchema` instead of `decimalSchema`, it fails to compile.
If the value of the `Map[String, Any]` can be one of a finite set of possibilities (i.e. the value is either a string or a long), then I think the structure could feasibly be described as an Either.
Is there support for an Either structure (i.e. a map described as `Map[String, Either[String, Long]]`)?
Of course, there is :)
As I said before, do not insist on saving heterogeneous values of a map in a single collection. Partition your map into two: one for strings and the second for decimals. E.g. you can encode `Ref` directly as a `RowParquetRecord` if creating an intermediary case class is such a problem:
```scala
implicit def myEncoder[T]: OptionalValueEncoder[Ref[T]] = new OptionalValueEncoder[Ref[T]] {
  // Assumes Ref exposes `tpe: String`, `stringIds: Map[String, String]` and
  // `decimalIds: Map[String, BigDecimal]`, and that the built-in String/BigDecimal encoders are in scope.
  override def encodeNonNull(ref: Ref[T], configuration: ValueCodecConfiguration): Value = {
    def enc[V](v: V)(implicit e: ValueEncoder[V]): Value = e.encode(v, configuration)
    RowParquetRecord(
      "type" -> enc(ref.tpe),
      "stringIds" -> MapParquetRecord(ref.stringIds.toSeq.map { case (k, v) => enc(k) -> enc(v) }: _*),
      "decimalIds" -> MapParquetRecord(ref.decimalIds.toSeq.map { case (k, v) => enc(k) -> enc(v) }: _*)
    )
  }
}
```
And define a corresponding `groupSchema`.
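A sketch of what that corresponding schema could look like, mirroring the three fields written by the encoder above (the built-in `String` and `BigDecimal` schema instances are assumed to be in scope):

```scala
// Sketch only: field names must match what the encoder writes
implicit def refSchema[T](implicit
    stringSchema: TypedSchemaDef[String],
    decimalSchema: TypedSchemaDef[BigDecimal]
): TypedSchemaDef[Ref[T]] =
  SchemaDef.group(
    stringSchema("type"),
    SchemaDef.map(stringSchema, stringSchema)("stringIds"),
    SchemaDef.map(stringSchema, decimalSchema)("decimalIds")
  ).typed[Ref[T]]
```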
There's another low-level option: you can implement a custom version of `MapParquetRecord` that writes several types of map entries (not strictly one type, as it is done now): https://github.com/mjakubowski84/parquet4s/blob/master/core/src/main/scala/com/github/mjakubowski84/parquet4s/ParquetRecord.scala#L814
However, I do not recommend it, because it would be a non-standard approach to a map, and reading such a map would be a challenge in any existing application or framework.
My map seems to write okay; however, when I attempt to read it in parquet tools, I get `ArrowInvalid: Map keys must be provided`. Is there something I need to do explicitly to add the annotation here?
```scala
implicit def refSchema[A <: MyObject[_]](implicit stringSchema: TypedSchemaDef[String]): TypedSchemaDef[Ref[A]] = {
  SchemaDef
    .group(
      stringSchema("type"),
      SchemaDef.map(stringSchema, stringSchema)("ids")
    )
    .typed[Ref[A]]
}
```