Comments (12)

Matthias-Warlop commented on August 27, 2024

Minimal reproducible code:

import polars as pl

df_silver = pl.DataFrame(
    {
        "description": ["", "hello", "hi", ""],
    }
)

df_gold = df_silver.select(
    [
        pl.when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.struct(
                    pl.col("description").alias("translation"),
                )
            )
        )
        .alias("description"),
    ]
)

print(df_gold)
df_gold.write_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read)
shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[1]] │
╞═════════════════╡
│ null            │
│ [{"hello"}]     │
│ [{"hi"}]        │
│ null            │
└─────────────────┘
shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[1]] │
╞═════════════════╡
│ null            │
│ [{""}]          │
│ [{"hello"}]     │
│ null            │
└─────────────────┘

I left out the None value and the check for None, because they do not seem to influence the test. I also removed the locale field.

from polars.

cmdlineluser commented on August 27, 2024

So it seems to end up being a problem with write_parquet specifically?

If we switch to sink_parquet, the bug does not reproduce:

df_gold.lazy().sink_parquet(f"test.parquet")
pl.read_parquet(f"test.parquet")

# shape: (4, 1)
# ┌─────────────────┐
# │ description_new │
# │ ---             │
# │ list[str]       │
# ╞═════════════════╡
# │ null            │
# │ ["hello"]       │
# │ ["hi"]          │
# │ null            │
# └─────────────────┘

Matthias-Warlop commented on August 27, 2024

Using lazy().sink_parquet does not seem to fix the issue. The bug is extremely specific as to when it occurs, but when it does, it occurs consistently.
Take the following code:

import polars as pl

df_silver = pl.DataFrame(
    {
        "description": [
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            None,
            "Hello",
            None,
            None,
            "Hello",
            None,
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
        ],
    }
)


df_gold = df_silver.select(
    [
        pl.when(pl.col("description").is_null())
        .then(pl.lit(None))
        .when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold[10:])
df_gold.lazy().sink_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read[10:])
shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘
shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ [""]            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘

When the "" value is placed one row higher when creating the dataframe:

import polars as pl

df_silver = pl.DataFrame(
    {
        "description": [
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            None,
            "Hello",
            None,
            None,
            "Hello",
            None,
            "Hello",
            "Hello",
            "Hello",
            "",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
        ],
    }
)


df_gold = df_silver.select(
    [
        pl.when(pl.col("description").is_null())
        .then(pl.lit(None))
        .when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold[10:])
df_gold.lazy().sink_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read[10:])
shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘
shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘

The bug no longer appears. lazy().sink_parquet seems to alter the behaviour of the bug, but the bug is still very much there. Again, the wrong values are also visible when reading the Parquet file, so the bug is not caused by pl.read_parquet.
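
To pinpoint which rows change across the round trip, the two printed outputs can be compared position by position. The snippet below is only a sketch: `diff_rows` is a hypothetical helper, not part of Polars, and the lists are typed in by hand from rows 10 to 19 of the first example above. With real dataframes, the same lists could be obtained with `df_gold["description_new"].to_list()` and `df_gold_read["description_new"].to_list()`.

```python
from typing import Any, List, Tuple


def diff_rows(before: List[Any], after: List[Any]) -> List[Tuple[int, Any, Any]]:
    """Return (index, before, after) for every position where the values differ."""
    if len(before) != len(after):
        raise ValueError("row counts differ; compare shapes first")
    return [(i, b, a) for i, (b, a) in enumerate(zip(before, after)) if b != a]


# Rows 10..19 of description_new from the first example above,
# before writing the file and after reading it back.
before = [None] + [["Hello"]] * 4 + [None] + [["Hello"]] * 4
after = [None] + [["Hello"]] * 4 + [None, [""]] + [["Hello"]] * 3

mismatches = diff_rows(before, after)
print(mismatches)  # [(6, ['Hello'], [''])]
```

Only position 6 of the slice differs: a ["Hello"] row comes back as [""], directly after the row that the when/then turned into null.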

cmdlineluser commented on August 27, 2024

Trying to find out when this changed:

pl.__version__
df_gold[10:].equals(df_gold_read[10:])

It seems it happened after 0.20.15

'0.20.16'
False
'0.20.15'
True

Looking at the release notes for 0.20.16:

https://github.com/pola-rs/polars/releases/tag/py-0.20.16

  • add new when-then-otherwise kernels #15089

As the bug seems to depend on when-then, this may be a starting point for further investigation.

cmdlineluser commented on August 27, 2024

Can you show an example of the problem?

e.g.

df = pl.DataFrame({
    "description": ["", None, "hi", ""],
    "locale": ["a", "b", "c", "d"],
})

df.select(
    pl.when(pl.col("description").is_null()).then(pl.lit(None))
      .when(pl.col("description") == "")
      .then(pl.lit(None))
      .otherwise(
          pl.concat_list(
              pl.struct(
                  pl.lit("en").alias("locale"),
                  pl.col("description").alias("translation"),
              )
          )
      )
      .alias("description")
)
shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[2]] │
╞═════════════════╡
│ null            │
│ null            │
│ [{"en","hi"}]   │
│ null            │
└─────────────────┘

Matthias-Warlop commented on August 27, 2024

I have been investigating further. The issue seems to be more complex.

df_silver = pl.read_parquet(f"data/silver/locations.parquet")
df_gold = df_silver.select(
    [
        pl.when(pl.col("description").is_null())
        .then(pl.lit(None))
        .when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.struct(
                    pl.lit("en").alias("locale"),
                    pl.col("description").alias("translation"),
                )
            )
        )
        .alias("description"),
    ]
)
# print the range of rows where the issue occurs
print(df_gold[60:70])
shape: (10, 1)
┌─────────────────────────────────┐
│ description                     │
│ ---                             │
│ list[struct[2]]                 │
╞═════════════════════════════════╡
│ null                            │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
└─────────────────────────────────┘

As you can see, no issue.
If I then do

df_gold.write_parquet(f"data/gold/locations.parquet")
df_gold_read = pl.read_parquet(f"data/gold/locations.parquet")
print(df_gold_read[60:70])
shape: (10, 1)
┌─────────────────────────────────┐
│ description                     │
│ ---                             │
│ list[struct[2]]                 │
╞═════════════════════════════════╡
│ null                            │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en",""}]                     │
│ [{"en","Parking working hours … │
└─────────────────────────────────┘

I see the empty translation field again, just as I did when exploring the Parquet file in Data Wrangler.
Even stranger is what happens when I do this:

df_gold.write_json(f"data/gold/eldrive_locations.csv")
df_gold_read = pl.read_json(f"data/gold/eldrive_locations.csv")
print(df_gold_read[60:70])
shape: (10, 1)
┌─────────────────────────────────┐
│ description                     │
│ ---                             │
│ list[struct[2]]                 │
╞═════════════════════════════════╡
│ null                            │
│ [{"en",null}]                   │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en",null}]                   │
│ [{"en",null}]                   │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en",null}]                   │
│ [{"en","Parking working hours … │
└─────────────────────────────────┘

To show the complete values in the dataframe I ran:

for row in df_gold[60:70].iter_rows(named=True):
    print(row["description"])
None
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
None
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms.'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
None
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]

ritchie46 commented on August 27, 2024

Can you update to latest Polars (1.2.1) and confirm it still occurs?

Matthias-Warlop commented on August 27, 2024

Sorry, I thought I was already on the latest version. The output of my tests is identical with the new version.

pl.show_versions()
--------Version info---------
Polars:               1.2.1
Index type:           UInt32
Platform:             Linux-6.8.0-38-generic-x86_64-with-glibc2.39
Python:               3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                2.0.0
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

I also tried using the dataframe from your example. Saving as JSON and as Parquet gives the same result here. Printing the dataframe after reading it back:

shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[2]] │
╞═════════════════╡
│ null            │
│ null            │
│ [{"en",""}]     │
│ null            │
└─────────────────┘

ritchie46 commented on August 27, 2024

And have you got a minimal repro? I don't see any way to reproduce your query. Ideally on synthetic data in memory, otherwise from the file. Cut out everything that isn't involved.

Matthias-Warlop commented on August 27, 2024

The struct does not seem to influence the bug either:

import polars as pl

df_silver = pl.DataFrame({"description": ["", "hello", "hi", ""],})

df_gold = df_silver.select(
    [
        pl.when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold)
df_gold.write_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read)
shape: (4, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["hello"]       │
│ ["hi"]          │
│ null            │
└─────────────────┘
shape: (4, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ [""]            │
│ ["hello"]       │
│ null            │
└─────────────────┘

Matthias-Warlop commented on August 27, 2024

This indeed fixes the issue. Am I right to think this should be considered a major issue, since it negatively affects data quality?
I also did some further testing, and it does have something to do with the creation of the df_gold dataframe (so the .select part). When I create a dataframe similar to df_gold directly like this:

df_silver = pl.DataFrame({"description": [None, ["hello"], ["hi"], None]})

print(df_silver)
df_silver.write_parquet(f"test.parquet")
df_silver_read = pl.read_parquet(f"test.parquet")
print(df_silver_read)
shape: (4, 1)
┌─────────────┐
│ description │
│ ---         │
│ list[str]   │
╞═════════════╡
│ null        │
│ ["hello"]   │
│ ["hi"]      │
│ null        │
└─────────────┘
shape: (4, 1)
┌─────────────┐
│ description │
│ ---         │
│ list[str]   │
╞═════════════╡
│ null        │
│ ["hello"]   │
│ ["hi"]      │
│ null        │
└─────────────┘

Both the hello and hi appear correctly in the output.

ritchie46 commented on August 27, 2024

import polars as pl

df_silver = pl.DataFrame({"description": ["", "hello", "hi", ""],})

df_gold = df_silver.select(
    [
        pl.when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold)
df_gold.write_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read)

@coastalwhite can you take a look?
