Using pl.write_parquet() gives wrong results for values inside lists. (pola-rs/polars)

Comments (12)

Matthias-Warlop commented on August 27, 2024

Minimal reproducible code:

import polars as pl

df_silver = pl.DataFrame(
    {
        "description": ["", "hello", "hi", ""],
    }
)

df_gold = df_silver.select(
    [
        pl.when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.struct(
                    pl.col("description").alias("translation"),
                )
            )
        )
        .alias("description"),
    ]
)

print(df_gold)
df_gold.write_parquet("test.parquet")
df_gold_read = pl.read_parquet("test.parquet")
print(df_gold_read)

shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[1]] │
╞═════════════════╡
│ null            │
│ [{"hello"}]     │
│ [{"hi"}]        │
│ null            │
└─────────────────┘
shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[1]] │
╞═════════════════╡
│ null            │
│ [{""}]          │
│ [{"hello"}]     │
│ null            │
└─────────────────┘

I left out the None value and the check for None because they do not seem to influence the test. I also removed the locale field.

cmdlineluser commented on August 27, 2024

So it seems to end up being a problem with write_parquet specifically?

If we switch to sink_parquet, the bug does not reproduce:

df_gold.lazy().sink_parquet("test.parquet")
pl.read_parquet("test.parquet")

# shape: (4, 1)
# ┌─────────────────┐
# │ description_new │
# │ ---             │
# │ list[str]       │
# ╞═════════════════╡
# │ null            │
# │ ["hello"]       │
# │ ["hi"]          │
# │ null            │
# └─────────────────┘

Matthias-Warlop commented on August 27, 2024

lazy().sink_parquet does not seem to fix the issue. The bug is extremely specific as to when it occurs, but when it does, it occurs consistently.
Take the following code:

import polars as pl

df_silver = pl.DataFrame(
    {
        "description": [
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            None,
            "Hello",
            None,
            None,
            "Hello",
            None,
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
        ],
    }
)


df_gold = df_silver.select(
    [
        pl.when(pl.col("description").is_null())
        .then(pl.lit(None))
        .when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold[10:])
df_gold.lazy().sink_parquet("test.parquet")
df_gold_read = pl.read_parquet("test.parquet")
print(df_gold_read[10:])

shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘
shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ [""]            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘

Now the same code with the "" value placed one row higher when creating the dataframe:

import polars as pl

df_silver = pl.DataFrame(
    {
        "description": [
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            None,
            "Hello",
            None,
            None,
            "Hello",
            None,
            "Hello",
            "Hello",
            "Hello",
            "",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
        ],
    }
)


df_gold = df_silver.select(
    [
        pl.when(pl.col("description").is_null())
        .then(pl.lit(None))
        .when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold[10:])
df_gold.lazy().sink_parquet("test.parquet")
df_gold_read = pl.read_parquet("test.parquet")
print(df_gold_read[10:])

shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘
shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘

In this case the bug no longer appears. lazy().sink_parquet seems to change when the bug occurs, but the bug is still very much there. Again, when reading the parquet file I can see the wrong values too, so the bug is not caused by pl.read_parquet.

cmdlineluser commented on August 27, 2024

Trying to find out when this changed:

pl.__version__
df_gold[10:].equals(df_gold_read[10:])

It seems the behaviour changed after 0.20.15:

'0.20.16'
False

'0.20.15'
True

Looking at the release notes for 0.20.16:

https://github.com/pola-rs/polars/releases/tag/py-0.20.16

add new when-then-otherwise kernels #15089

As the bug seems to depend on when-then, this may be a starting point for further investigation.

cmdlineluser commented on August 27, 2024

Can you show an example of the problem?

e.g.

df = pl.DataFrame({
    "description": ["", None, "hi", ""],
    "locale": ["a", "b", "c", "d"],
})

df.select(
    pl.when(pl.col("description").is_null()).then(pl.lit(None))
      .when(pl.col("description") == "")
      .then(pl.lit(None))
      .otherwise(
          pl.concat_list(
              pl.struct(
                  pl.lit("en").alias("locale"),
                  pl.col("description").alias("translation"),
              )
          )
      )
      .alias("description")
)

shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[2]] │
╞═════════════════╡
│ null            │
│ null            │
│ [{"en","hi"}]   │
│ null            │
└─────────────────┘

Matthias-Warlop commented on August 27, 2024

I have been investigating further. The issue seems to be more complex.

df_silver = pl.read_parquet("data/silver/locations.parquet")
df_gold = df_silver.select(
    [
        pl.when(pl.col("description").is_null())
        .then(pl.lit(None))
        .when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.struct(
                    pl.lit("en").alias("locale"),
                    pl.col("description").alias("translation"),
                )
            )
        )
        .alias("description"),
    ]
)
# Print the range of rows where the issue occurs
print(df_gold[60:70])

shape: (10, 1)
┌─────────────────────────────────┐
│ description                     │
│ ---                             │
│ list[struct[2]]                 │
╞═════════════════════════════════╡
│ null                            │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
└─────────────────────────────────┘

As you can see, there is no issue here.
If I then do:

df_gold.write_parquet("data/gold/locations.parquet")
df_gold_read = pl.read_parquet("data/gold/locations.parquet")
print(df_gold_read[60:70])

shape: (10, 1)
┌─────────────────────────────────┐
│ description                     │
│ ---                             │
│ list[struct[2]]                 │
╞═════════════════════════════════╡
│ null                            │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en",""}]                     │
│ [{"en","Parking working hours … │
└─────────────────────────────────┘

I see the empty translation field again, just like I saw when exploring the parquet file in Data Wrangler.
Even stranger is what happens when I do this:

df_gold.write_json("data/gold/eldrive_locations.csv")
df_gold_read = pl.read_json("data/gold/eldrive_locations.csv")
print(df_gold_read[60:70])

shape: (10, 1)
┌─────────────────────────────────┐
│ description                     │
│ ---                             │
│ list[struct[2]]                 │
╞═════════════════════════════════╡
│ null                            │
│ [{"en",null}]                   │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en",null}]                   │
│ [{"en",null}]                   │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en",null}]                   │
│ [{"en","Parking working hours … │
└─────────────────────────────────┘

To show the complete values in the dataframe I ran:

for row in df_gold[60:70].iter_rows(named=True):
    print(row["description"])

None
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
None
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms.'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
None
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]

ritchie46 commented on August 27, 2024

Can you update to latest Polars (1.2.1) and confirm it still occurs?

Matthias-Warlop commented on August 27, 2024

Sorry, I thought that I was already on the latest version. The output of my tests is identical with the new version.

pl.show_versions()

--------Version info---------
Polars:               1.2.1
Index type:           UInt32
Platform:             Linux-6.8.0-38-generic-x86_64-with-glibc2.39
Python:               3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                2.0.0
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

I also tried using the dataframe you used in your example. Saving as JSON and as parquet gives the same result here. Printing the dataframe that was read back:

shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[2]] │
╞═════════════════╡
│ null            │
│ null            │
│ [{"en",""}]     │
│ null            │
└─────────────────┘

ritchie46 commented on August 27, 2024

And have you got a minimal repro? I don't see any way to reproduce your query. Ideally on synthetic data in memory, otherwise from the file. Cut out everything that isn't involved.

Matthias-Warlop commented on August 27, 2024

The struct also does not seem to influence the bug

import polars as pl

df_silver = pl.DataFrame({"description": ["", "hello", "hi", ""],})

df_gold = df_silver.select(
    [
        pl.when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold)
df_gold.write_parquet("test.parquet")
df_gold_read = pl.read_parquet("test.parquet")
print(df_gold_read)

shape: (4, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["hello"]       │
│ ["hi"]          │
│ null            │
└─────────────────┘
shape: (4, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ [""]            │
│ ["hello"]       │
│ null            │
└─────────────────┘

Matthias-Warlop commented on August 27, 2024

This indeed fixes the issue. Am I right to think this should be considered a major issue, since it negatively affects data quality?
I also did some further testing, and it does have something to do with how the df_gold dataframe is created (the .select part). When I create a dataframe similar to df_gold directly, like this:

df_silver = pl.DataFrame({"description": [None, ["hello"], ["hi"], None]})

print(df_silver)
df_silver.write_parquet("test.parquet")
df_silver_read = pl.read_parquet("test.parquet")
print(df_silver_read)

┌─────────────┐
│ description │
│ ---         │
│ list[str]   │
╞═════════════╡
│ null        │
│ ["hello"]   │
│ ["hi"]      │
│ null        │
└─────────────┘
shape: (4, 1)
┌─────────────┐
│ description │
│ ---         │
│ list[str]   │
╞═════════════╡
│ null        │
│ ["hello"]   │
│ ["hi"]      │
│ null        │
└─────────────┘

Both the hello and hi appear correctly in the output.

ritchie46 commented on August 27, 2024

import polars as pl

df_silver = pl.DataFrame({"description": ["", "hello", "hi", ""],})

df_gold = df_silver.select(
    [
        pl.when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold)
df_gold.write_parquet("test.parquet")
df_gold_read = pl.read_parquet("test.parquet")
print(df_gold_read)

@coastalwhite can you take a look?
