Comments (3)
The issue is that null values are output as `"geometry":{"type":null,"coordinates":null,"features":null}` instead of just `null`.
Well, this is one of the oldest feature requests in the polars project, #3462. Your geometry column has dtype `Struct[Utf8, Utf8]`, and currently the polars `Struct` dtype has no validity buffer, so any `null` in `Struct[Utf8, Utf8]` is represented as `{null, null}`.
For a workaround, one must venture outside `polars` (e.g. move to `pyarrow` and manually replace `{null, null}` with `null`), which is precisely what the GeoJSON spec requires...
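The replacement step can be sketched with Python's standard library alone. This is not the `pyarrow` route mentioned above; it is a stand-in that post-processes NDJSON text after it has been written. The helper names `collapse_null_struct` and `fix_ndjson_line` are hypothetical, and the sketch assumes that a struct whose fields are all null stands for a missing value, which is only safe when a genuine all-null object cannot occur in the data.

```python
import json


def collapse_null_struct(value):
    """Return None for a non-empty dict whose values are all None; else return value unchanged."""
    if isinstance(value, dict) and value and all(v is None for v in value.values()):
        return None
    return value


def fix_ndjson_line(line, column):
    """Parse one NDJSON line and collapse an all-null struct in `column` to null."""
    row = json.loads(line)
    if column in row:
        row[column] = collapse_null_struct(row[column])
    # Compact separators mirror the spacing of the NDJSON shown in this thread.
    return json.dumps(row, separators=(",", ":"))


# Assumption: this line mimics what write_ndjson emits for a null struct row.
line = '{"id":4,"name":"LocationWithNullGeom","geometry":{"type":null,"coordinates":null,"features":null}}'
print(fix_ndjson_line(line, "geometry"))
# {"id":4,"name":"LocationWithNullGeom","geometry":null}
```

Rows with a real geometry pass through unchanged apart from re-serialization, so null fields inside them (such as `"features":null`) are still present and would need separate handling.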
My personal view is that #3462 should be marked P-high and put on the 1.0 to-do list: the current implementation of the `Struct` dtype in `polars` makes it hard to interoperate with other dataframe libraries because it is inconsistent with the standard Arrow spec.
from polars.
Is the problem that you have something like:
df = pl.select(a=1,b=2,c=pl.lit('{"foo":[1, 2, 3],"bar":[4,5,6]}'))
df.select(pl.col.c.str.json_decode()).write_ndjson()
# '{"c":{"foo":[1,2,3],"bar":[4,5,6]}}\n'
But you need the data written without the outer column name label?
# '{"foo":[1,2,3],"bar":[4,5,6]}\n'
from polars.
Here are some examples that hopefully help better explain my issue with that.
The issue is that null values are output as `"geometry":{"type":null,"coordinates":null,"features":null}` instead of just `null`.
# Issue #17054
import polars as pl
import json
import tempfile
big_geojson_obj = {"type":"FeatureCollection","features":[{"id":"baddba6f1276e861263d05d9cbecff74","type":"Feature","properties":{"lineColor":"#ffa000","lineWidth":2,"fillColor":"#ffe082","fillOpacity":0.1},"geometry":{"coordinates":[[[-114.42286807,55.199035275],[-118.90384586,53.413681626],[-115.7853142,51.95024781],[-111.63559015,53.23660491],[-114.42286807,55.199035275]]],"type":"Polygon"}}]}
df = pl.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Location1", "Location2", "LocationWithLongGeom", "LocationWithNullGeom"],
    "geometry": [
        '{"type":"Point","coordinates":[102.0,0.5]}',
        '{"type":"Point","coordinates":[103.0,1.0]}',
        json.dumps(big_geojson_obj),
        None
    ]
})
print("================ START BASIC WAY ================")
# Just output the column as a String (not what I want)
with tempfile.NamedTemporaryFile(suffix=".ndjson") as f:
    df.write_ndjson(f.name)
    f.seek(0)
    print(f.read().decode())
print("================ END BASIC WAY ================")
print("================ START DEMO GOAL ================")
# Obviously this way is very slow
for row in df.iter_rows(named=True):
    row_out = row.copy()
    row_out["geometry"] = json.loads(row_out["geometry"]) if row_out["geometry"] is not None else None
    print(json.dumps(row_out))  # file write would happen here
print("================ END DEMO GOAL ================")
print("================ START SUGGESTION 1 ================")
# This is the previous suggestion.
# The issue is that null values are output as `"geometry":{"type":null,"coordinates":null,"features":null}` instead of just `null`.
df1 = df.with_columns(pl.col('geometry').str.json_decode())
with tempfile.NamedTemporaryFile(suffix=".ndjson") as f:
    df1.write_ndjson(f.name)
    f.seek(0)
    print(f.read().decode())
print("================ END SUGGESTION 1 ================")
================ START BASIC WAY ================
{"id":1,"name":"Location1","geometry":"{\"type\":\"Point\",\"coordinates\":[102.0,0.5]}"}
{"id":2,"name":"Location2","geometry":"{\"type\":\"Point\",\"coordinates\":[103.0,1.0]}"}
{"id":3,"name":"LocationWithLongGeom","geometry":"{\"type\": \"FeatureCollection\", \"features\": [{\"id\": \"baddba6f1276e861263d05d9cbecff74\", \"type\": \"Feature\", \"properties\": {\"lineColor\": \"#ffa000\", \"lineWidth\": 2, \"fillColor\": \"#ffe082\", \"fillOpacity\": 0.1}, \"geometry\": {\"coordinates\": [[[-114.42286807, 55.199035275], [-118.90384586, 53.413681626], [-115.7853142, 51.95024781], [-111.63559015, 53.23660491], [-114.42286807, 55.199035275]]], \"type\": \"Polygon\"}}]}"}
{"id":4,"name":"LocationWithNullGeom","geometry":null}
================ END BASIC WAY ================
================ START DEMO GOAL ================
{"id": 1, "name": "Location1", "geometry": {"type": "Point", "coordinates": [102.0, 0.5]}}
{"id": 2, "name": "Location2", "geometry": {"type": "Point", "coordinates": [103.0, 1.0]}}
{"id": 3, "name": "LocationWithLongGeom", "geometry": {"type": "FeatureCollection", "features": [{"id": "baddba6f1276e861263d05d9cbecff74", "type": "Feature", "properties": {"lineColor": "#ffa000", "lineWidth": 2, "fillColor": "#ffe082", "fillOpacity": 0.1}, "geometry": {"coordinates": [[[-114.42286807, 55.199035275], [-118.90384586, 53.413681626], [-115.7853142, 51.95024781], [-111.63559015, 53.23660491], [-114.42286807, 55.199035275]]], "type": "Polygon"}}]}}
{"id": 4, "name": "LocationWithNullGeom", "geometry": null}
================ END DEMO GOAL ================
================ START SUGGESTION 1 ================
{"id":1,"name":"Location1","geometry":{"type":"Point","coordinates":[102.0,0.5],"features":null}}
{"id":2,"name":"Location2","geometry":{"type":"Point","coordinates":[103.0,1.0],"features":null}}
{"id":3,"name":"LocationWithLongGeom","geometry":{"type":"FeatureCollection","coordinates":null,"features":[{"id":"baddba6f1276e861263d05d9cbecff74","type":"Feature","properties":{"lineColor":"#ffa000","lineWidth":2,"fillColor":"#ffe082","fillOpacity":0.1},"geometry":{"coordinates":[[[-114.42286807,55.199035275],[-118.90384586,53.413681626],[-115.7853142,51.95024781],[-111.63559015,53.23660491],[-114.42286807,55.199035275]]],"type":"Polygon"}}]}}
{"id":4,"name":"LocationWithNullGeom","geometry":{"type":null,"coordinates":null,"features":null}}
================ END SUGGESTION 1 ================
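Because the `geometry` column already holds JSON text, one way to reach the DEMO GOAL output without calling `json.loads` on every geometry is to splice the raw string into each output line. A minimal sketch, assuming every non-null geometry string is valid JSON (nothing here checks that). `row_to_ndjson` is a hypothetical helper, and it still loops over rows in Python, so it is not a substitute for native support in `write_ndjson`.

```python
import json


def row_to_ndjson(row, raw_json_column):
    """Serialize one row dict, embedding raw_json_column's string verbatim as JSON."""
    parts = []
    for key, value in row.items():
        if key == raw_json_column:
            # Already JSON text: embed as-is, or emit null for a missing value.
            encoded = value if value is not None else "null"
        else:
            encoded = json.dumps(value)
        parts.append(f"{json.dumps(key)}:{encoded}")
    return "{" + ",".join(parts) + "}"


# Plain dicts stand in for the rows of the example DataFrame.
rows = [
    {"id": 1, "name": "Location1", "geometry": '{"type":"Point","coordinates":[102.0,0.5]}'},
    {"id": 4, "name": "LocationWithNullGeom", "geometry": None},
]
for row in rows:
    print(row_to_ndjson(row, "geometry"))
```

With the DataFrame from the script, the rows could come from `df.iter_rows(named=True)`. No extra `features` or `coordinates` keys appear because the geometry is never decoded into a struct.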
from polars.