
sparkdantic's Introduction

Hi there 👋

My name is Mitchell. I am a Data & Privacy Engineer based in Sydney, Australia 🇦🇺

  • 🔭 I’m a big fan of anything in the data privacy & security space, including anonymisation, technical data governance & cryptography
  • 🏛️ I also enjoy architecting, working with, and developing large-scale processing systems
  • 🌱 I very occasionally post articles here
  • 💭 I primarily work in Python, Go & Scala

sparkdantic's People

Contributors

chidifrank, dan1elt0m, dependabot[bot], jbpdl22, mitchelllisle, mitchstockdale


sparkdantic's Issues

Proposal: Default Casting of Integers to Long Types

Feature Description

I propose that Sparkdantic should by default cast integers to Spark's LongType instead of IntegerType. This change would help prevent issues with integer overflow and make the library more robust for handling large numbers.

Proposed Solution

Introduce an option called safe_casting that, when enabled, would cast integers to LongType. The default value of this option could be True to ensure safety by default.

Example

Here's an example of how this feature could be used:

from sparkdantic import SparkModel

class MyModel(SparkModel):
    my_field: int

# With safe_casting enabled, 'my_field' would map to LongType instead of IntegerType
schema = MyModel.model_spark_schema(safe_casting=True)

I look forward to hearing your thoughts. I am also willing to create a PR for this feature.

Support for built-in collection types

Hey,
using python3.10, sparkdantic==0.17.0, pydantic==2.5.3, pydantic_core==2.14.6

from sparkdantic import SparkModel


class FooBar(SparkModel):
    id: int
    name: str
    words: list[str]



print(FooBar.model_spark_schema())

I get the following error:

Traceback (most recent call last):
  File "/home/alsh/tmp/code.py", line 11, in <module>
    print(FooBar.model_spark_schema())
  File "/home/alsh/Envs/sparkdantic-reproduce/lib/python3.10/site-packages/sparkdantic/model.py", line 238, in model_spark_schema
    if cls._is_spark_model_subclass(t):
  File "/home/alsh/Envs/sparkdantic-reproduce/lib/python3.10/site-packages/sparkdantic/model.py", line 273, in _is_spark_model_subclass
    return (inspect.isclass(value)) and (issubclass(value, SparkModel))
  File "/usr/lib/python3.10/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
TypeError: issubclass() arg 1 must be a class

The code works if I import typing.List and use it instead. Would it be possible to add support for built-in collection types?
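For reference, this is the variant that works for me today (same model, only the annotation changed):

from typing import List

from sparkdantic import SparkModel


class FooBar(SparkModel):
    id: int
    name: str
    words: List[str]  # typing.List works where the built-in list[str] fails


print(FooBar.model_spark_schema())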

Allow lists of SparkModel objects

Nice project! It looks like SparkModel doesn't currently allow lists of SparkModel objects, but this seems like a common case (trying this on one of my pydantic models, I ran into this error right off the bat).
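A minimal sketch of the shape I mean (the model names here are just illustrative):

from typing import List

from sparkdantic import SparkModel


class Item(SparkModel):
    name: str


class Container(SparkModel):
    items: List[Item]  # a list of SparkModel objects


Container.model_spark_schema()  # this is where the error shows up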

I have a small change for this but looks like I can't push to your repo and create a PR. Let me know if you'd like it.

Cheers!

License problems?

Thank you again for that nice library! The overall idea looks very promising!

I was checking the code and found that sparkdantic depends on dbldatagen, but that library is distributed under a commercial Databricks license. Do the terms of the dbldatagen license apply to sparkdantic too? In other words, can I use sparkdantic outside of the Databricks platform?

Better handling of typing.Literal

I'm trying to derive the Spark schema of a BaseModel that I don't want to turn into a SparkModel itself, so I'm creating a new model that subclasses both my original BaseModel and SparkModel.

class MyClass(BaseModel):
    is_this_a_field: Literal["yes", "no"]


class SparkMyClass(SparkModel, MyClass):
    pass

When trying to build a schema from SparkMyClass I get:

>       if issubclass(t, Enum):
E       TypeError: issubclass() arg 1 must be a class

To fix this I need to redefine the Literal field in the base class.

My suggestion would be to add a check for Literal types and add a Spark field corresponding to the type of all the elements in the Literal, or fail if there's more than one type.
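To make the suggestion concrete, here is a rough standalone sketch (literal_to_python_type is just an illustrative name, not sparkdantic code):

from typing import Literal, get_args, get_origin


def literal_to_python_type(annotation):
    """Map a Literal to the single Python type of its values, or fail on mixed types."""
    if get_origin(annotation) is Literal:
        value_types = {type(value) for value in get_args(annotation)}
        if len(value_types) != 1:
            raise TypeError(f'Literal with mixed value types is not supported: {annotation}')
        return value_types.pop()
    return annotation


literal_to_python_type(Literal['yes', 'no'])  # -> str, which would map to StringType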

Model conversion from pydantic

Hi :)

I have some existing pydantic models in my codebase and I want to convert these to Spark models automatically, e.g. like this:

from pydantic import BaseModel
from sparkdantic import SparkModel
from typing import List

class A(BaseModel):
    a: str
 
class B(BaseModel):
    b: List[A]
 
class NaiveSparkB(SparkModel):
    b: List[A]
 
NaiveSparkB.model_spark_schema()  # <-- fails here with the following error

# suggested solution: something like
SparkB = SparkModel.convert_pydantic(B)

Error

TypeError: Type <class '__main__.A'> not recognized

I don't have the option to change A and B to SparkModels.
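For what it's worth, a rough sketch of the kind of helper I mean (convert_pydantic/to_spark_model do not exist in sparkdantic; this just rebuilds the fields recursively and assumes lists of SparkModels are supported):

from typing import List, get_args, get_origin

from pydantic import BaseModel, create_model
from sparkdantic import SparkModel


def to_spark_model(model: type[BaseModel]) -> type[SparkModel]:
    # Rebuild each field, swapping nested BaseModel classes for SparkModel equivalents
    fields = {}
    for name, info in model.model_fields.items():
        annotation = info.annotation
        if isinstance(annotation, type) and issubclass(annotation, BaseModel):
            annotation = to_spark_model(annotation)
        elif get_origin(annotation) is list:
            (inner,) = get_args(annotation)
            if isinstance(inner, type) and issubclass(inner, BaseModel):
                annotation = List[to_spark_model(inner)]
        fields[name] = (annotation, ... if info.is_required() else info.default)
    return create_model(f'Spark{model.__name__}', __base__=SparkModel, **fields)


SparkB = to_spark_model(B)  # reusing B from above
SparkB.model_spark_schema()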

MapType and ArrayType

Thanks for sharing this wonderful library.

I had a question regarding how pydantic type hints are interpreted when generating the Spark schema for dict and list.

I would expect the following:

  • for a: dict[str, str], I would expect it to generate MapType(StringType(), StringType(), False)
  • for a: dict[str, str | None], I would expect it to generate MapType(StringType(), StringType(), True)
  • for a: list[str], I would expect ArrayType(StringType(), False)
  • for a: list[str | None], I would expect ArrayType(StringType(), True)

But this is not what we see: the nullable flag for the values is ignored. Is there a good reason for ignoring this?

From reading the code, the fix to do the "right thing" is very simple:

            value_type, nullable_ = cls._type_to_spark(args[1], [])
            return MapType(key_type, value_type, nullable_), nullable

and

            inner_type = args[0]
            if cls._is_spark_model_subclass(inner_type):
                array_type = inner_type.model_spark_schema()
                nullable_ = nullable
            else:
                # Check if it's an accepted Enum
                array_type, nullable_ = cls._type_to_spark(inner_type, [])
            return ArrayType(array_type, nullable_), nullable

What was the motivation behind ignoring it? Thanks!

Nullability Parent of Nested Models

Good afternoon!

I am trying to implement the following models

from sparkdantic import SparkModel

class ChildSchema(SparkModel):
    ncol1: str
    ncol2: str

class ParentSchema(SparkModel):
    col1: ChildSchema

I would assume that because col1 is not defined as optional, the Spark schema would reflect this and the ChildSchema struct's nullability would be False.

However, for the ParentSchema defined above, this is the output of ParentSchema.model_spark_schema().jsonValue():

{
    'type': 'struct',
    'fields': [
        {
            'name': 'col1',
            'type': {
                'type': 'struct',
                'fields': [
                    {'name': 'ncol1', 'type': 'string', 'nullable': False, 'metadata': {}},
                    {'name': 'ncol2', 'type': 'string', 'nullable': False, 'metadata': {}}
                ]
            },
            'nullable': True,
            'metadata': {}
        }
    ]
}

Should 'nullable' be False for col1 in the case above, or is this behaviour on purpose?

UUID type

Great looking library. I would love to use it in my projects, but I am missing the UUID type. Would you be open to supporting the UUID type (which could simply map to a Spark StringType())?
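For illustration, the mapping I have in mind (Event is just an example model, and the expected schema is what I'd like, not current behaviour):

from uuid import UUID

from sparkdantic import SparkModel


class Event(SparkModel):
    event_id: UUID  # would simply map to StringType() in the generated schema


# Desired: Event.model_spark_schema() returns
# StructType([StructField('event_id', StringType(), False)])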

Decimal precision support

Hey,
using python3.10, sparkdantic==0.18.0, pydantic==2.5.3, pydantic_core==2.14.6

I'm trying to use a Decimal field. When sparkdantic generates the schema it ignores decimal_places and uses the default values for Spark's DecimalType, which leaves the scale set to 0. Later, when I generate a DataFrame, I get wrong values.

from decimal import Decimal

from pydantic import Field
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, IntegerType, StringType, DecimalType, StructType
from pyspark.testing import assertDataFrameEqual
from sparkdantic import SparkModel


class FooBar(SparkModel):
    id: int
    name: str
    threshold: Decimal = Field(decimal_places=2)


def test_decimal():
    spark = SparkSession.builder.master('local[*]').getOrCreate()

    expected_schema = StructType([StructField('id', IntegerType(), False),
                                  StructField('name', StringType(), False),
                                  StructField('threshold', DecimalType(10, 2), False)])

    foobar = FooBar(id=95, name='test name', threshold=Decimal('0.33'))
    df_expected = spark.createDataFrame([(95, 'test name', Decimal('0.33'))],
                                        expected_schema)

    df = spark.createDataFrame([foobar.model_dump()], FooBar.model_spark_schema())

    df.show(truncate=False)
    df_expected.show(truncate=False)
    print(df.schema)
    print(df_expected.schema)
    assertDataFrameEqual(df, df_expected)
    spark.stop()

I get the following:

+---+---------+---------+
|id |name     |threshold|
+---+---------+---------+
|95 |test name|0        |
+---+---------+---------+

+---+---------+---------+
|id |name     |threshold|
+---+---------+---------+
|95 |test name|0.33     |
+---+---------+---------+

StructType([StructField('id', IntegerType(), False), StructField('name', StringType(), False), StructField('threshold', DecimalType(10,0), False)])
StructType([StructField('id', IntegerType(), False), StructField('name', StringType(), False), StructField('threshold', DecimalType(10,2), False)])

E           pyspark.errors.exceptions.base.PySparkAssertionError: [DIFFERENT_ROWS] Results do not match: ( 100.00000 % )
E           *** actual ***
E           ! Row(id=95, name='test name', threshold=Decimal('0'))
E           
E           
E           *** expected ***
E           ! Row(id=95, name='test name', threshold=Decimal('0.33'))

I'm not sure if I'm doing something wrong or if there is a bug here.
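For what it's worth, the mapping I would expect is roughly this (get_decimal_type is just an illustrative helper, not sparkdantic code):

from pyspark.sql.types import DecimalType


def get_decimal_type(max_digits=None, decimal_places=None):
    # Fall back to Spark's DecimalType defaults only when the field metadata gives nothing
    precision = max_digits if max_digits is not None else 10
    scale = decimal_places if decimal_places is not None else 0
    return DecimalType(precision, scale)


get_decimal_type(decimal_places=2)  # -> DecimalType(10,2), matching Field(decimal_places=2)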

License?

Thank you for such a nice library! Could you please clarify under which license it is distributed?

Leading underscores

Currently Pydantic does not allow leading underscores for fields.

They state we should use Field(alias='_id') (for an _id field).

But at the moment, if I do this with sparkdantic, the field is not generated by model_spark_schema().
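A minimal example of what I mean (Record is just an illustrative name):

from pydantic import Field

from sparkdantic import SparkModel


class Record(SparkModel):
    # Pydantic rejects a literal `_id` field name, so the alias route is used
    id: str = Field(alias='_id')


schema = Record.model_spark_schema()  # as described above, the aliased field does not come through here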

Any thoughts on how to do this?

Unable to represent BIGINT

This is such a great project idea, just what I needed. Unfortunately, since Python unified int and long (PEP 237), the language no longer makes a type distinction between 4-byte and 8-byte integers. That makes it impossible to add a field to SparkModel whose Spark type is LongType. I'm not sure what the best way forward is.

The LongType has at least one very critical use case: Unix timestamps. If timestamps are stored as 4-byte signed integers (which is what IntegerType is), then the maximum date they can represent is 2038-01-19T03:14:07Z, which is not as far away as it seems :)
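To make the timestamp point concrete, a quick check of the 4-byte signed limit:

from datetime import datetime, timezone

int32_max = 2**31 - 1  # the largest value a 4-byte signed IntegerType can hold
print(int32_max)                                           # 2147483647
print(datetime.fromtimestamp(int32_max, tz=timezone.utc))  # 2038-01-19 03:14:07+00:00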

Feature request: Add Support for UUID Type

It would be great if Sparkdantic could support the UUID type in Pydantic models.

Why?

  • Useful: Lots of applications use UUIDs. Having this support would save time and effort.
  • Consistent: Pydantic supports UUIDs, so Sparkdantic should too.
  • In Demand: Many users probably need this feature.

TypeError: issubclass() arg 1 must be a class

Following is my code:

from enum import Enum
from typing import Optional

from pydantic import Field, conint
from sparkdantic import SparkModel


class FooBar(SparkModel):
    count: int = Field(..., title='Count')
    size: Optional[float] = Field(None, title='Size')


class Gender(str, Enum):
    male = 'male'
    female = 'female'
    other = 'other'
    not_given = 'not_given'


class Main(SparkModel):
    foo_bar: FooBar
    Gender: Optional[Gender]
    snap: Optional[conint(lt=50, gt=30)] = Field(
        42, description='this is the value of snap', title='The Snap'
    )

print(Main.model_spark_schema())

When I execute it, I get the following error:

Traceback (most recent call last):
  File "model.py", line 27, in <module>
    print(Main.model_spark_schema())
  File "\env\lib\site-packages\sparkdantic\model.py", line 241, in model_spark_schema
    t, nullable = cls._type_to_spark(v.annotation)
  File "\env\lib\site-packages\sparkdantic\model.py", line 355, in _type_to_spark
    if issubclass(t, Enum):
TypeError: issubclass() arg 1 must be a class

Is there something that I am missing or doing wrong? I am not able to debug or fix this issue.
