Comments (13)
Yes, but if you're implementing the GeozeroDatasource trait directly on Arrow structs, you won't be able to do that from a third party crate. If you're creating your own structs... then you need to copy the memory from Arrow into your structs, which kinda defeats the purpose.
Do you need to actually copy the memory though? Things like moves into a new-type struct are optimized away:
https://rust.godbolt.org/z/8GabGrhbT
from geozero.
A GeoPartquet reader would be welcome for sure. It shouldn't be very difficult to implement, since there are Rust implementations for reading Parquet (https://github.com/apache/arrow-rs) and AFAIK geometries are stored as WKB, which is already supported in geozero.
from geozero.
I'm new to geospatial, but I'm gonna take a look at trying this out! (assuming that pull requests are welcome)
from geozero.
Just a quick note that in versions of geoparquet so far WKB is the standard, but in future versions we hope to also enable an Arrow-native geometry encoding, see https://github.com/geopandas/geo-arrow-spec/blob/main/format.md. The difference between the two is that a binary buffer of WKB in memory still needs to be parsed to be used. In contrast, an Arrow-native implementation would store geometries in a layout standardized by the Arrow project. This means that it's O(1)
to access any individual coordinate.
What this means is that using geozero on a GeoParquet dataset with geometries stored as WKB is essentially already supported by geozero
, you just iterate over the items in the array and use the existing geozero WKB reader. Note here that an improvement would be to figure out how to let the WKB reader read directly from the Arrow buffer. I seemed to have to copy the section of the arrow buffer into a Vec<u8>
to get geozero's WKB reader to accept it.
A separate geozero-geoarrow
implementation would be to add support for the Arrow-native encoding. Note that https://github.com/geopandas/geo-arrow-spec/blob/main/format.md is a provisional, unreleased spec (though it was implemented by GDAL in 3.5.0). So if we have feedback while implementing a Rust reader, we should comment on the project. For prior work, there are WIP native arrow implementations in Python, C++, R, and a data repo.
from geozero.
I am starting work on a geoparquet crate that will use geozero and could potentially be merged into geozero at some point.
FYI: I also will be working on a geoarrow crate and geodatafusion crate, stay tuned for developments....
from geozero.
A few notes... I've already started planning how to implement geoparquet
and geoarrow
and I'd love to discuss each of these so that we can align our work. It might also be worthwhile to discuss this offline, e.g. in the georust discord: https://discord.gg/63ueuxHu.
On GeoParquet:
- If you're ok with loading data into Arrow, in many cases you don't need a
geoparquet
crate at all! If you load a geoparquet file into arrow, you automatically have geoarrow. - If you want to support loading data from Parquet into
geo-types
structs, then I think usinggeozero
's setup will likely make things much faster. But you're going to hit the orphan rule, which says that you can't implement an external trait onto an external type. Becausegeozero
owns the geozero traits andarrow(2)/parquet(2)
own the structs, you have to create your own traits, which are then less interoperable with the rest of the system. - Are you planning to use parquet and arrow or parquet2 and arrow2?
On GeoArrow:
- I've already been prototyping how to design geoarrow (around the proposed struct approach: geoarrow/geoarrow#26) to make it fast and need fewer casts, so I'd love to chat more about your thoughts on this.
- Similarly, are you planning to use arrow or arrow2? I'm locked into the arrow2 ecosystem, because I'm building geopolars around polars, and polars uses arrow2. The crates really aren't compatible with each other
- Again, if you want to define traits onto arrow structs, you're going to hit the orphan rule again. Or at least, long term, we need
geo
to support reading data directly from arrow memory, and the way that will happen is by having traits defined in thegeo
crate, and then you won't be able to implement external traits fromgeo
onto external structs fromarrow/arrow2
from geozero.
I dont quite understand the problem you are referring to. I have written a
struct that has the following:
impl GeozeroDatasource for GeoParquetReader { .... }
and there are other crates eg geozero-shp that implement readers for geozero
The implementation I have currently uses parquet, and I was planning on having a feature to be able to select between parquet/parquet2.
I do want to use parquet rather than loading it into memory as I am targeting the use case of files that will not fit into memory. (Although I know that the arrow file could be memory mapped)
from geozero.
It's still worthwhile to have a geoparquet/geoarrow implementation in geozero
, so I don't want to dissuade you from that. My point is that today the georust/geo
algorithms all use geo-specific structs. It's useful to be able to load geoparquet/geoarrow into those structs, but it also misses out on a huge reason for Arrow, being easy to transfer memory across languages without serialization.
I think my discussion comment on geo
here as well as this blog post serve as useful reasons for why I see support for geoparquet/geoarrow in geozero
as only a short-term stepping stone.
As mentioned in that discussion, I have a WIP implementation of geoarrow here.
and there are other crates eg geozero-shp that implement readers for geozero
Yes, but if you're implementing the GeozeroDatasource
trait directly on Arrow structs, you won't be able to do that from a third party crate. If you're creating your own structs... then you need to copy the memory from Arrow into your structs, which kinda defeats the purpose.
Although I know that the arrow file could be memory mapped
This is true... but with the geozero approach you need to first load a chunk of data into memory and then copy all of it into geo
structs, which is not very memory efficient. This is why I'm excited about the idea of geometry access traits, as mentioned in this discussion, so that you could actually memory map Arrow directly without a full copy.
from geozero.
Things like moves into a new-type struct are optimized away
That's super interesting. I've never looked into low-level compiled output before.
@apps4uco Re-reading my comment I think my tone might've come off too harsh... I'm fully supportive of both approaches! I've tried to closely follow spec discussions on both GeoParquet and GeoArrow and I'd be happy to review your PRs if you'd like 🙂. Keep in mind that neither spec has reached 1.0 yet. WKB-encoded GeoParquet is the most stable (native encodings are still up in the air) so I'd start with that one. You might even be able to leverage my previous geozero PR: #39
from geozero.
Geoparquet is getting ready for a 1.0.0-beta.1 soon opengeospatial/geoparquet#147, and I'd love to see support for this in geozero. @apps4uco do you have any existing work to share? Just want to avoid duplication of work
from geozero.
Would love some feedback on #80
from geozero.
Geoparquet is getting ready for a 1.0.0-beta.1 soon opengeospatial/geoparquet#147, and I'd love to see support for this in geozero. @apps4uco do you have any existing work to share? Just want to avoid duplication of work
Hi, I have been busy on other things, and so havent really advanced on the implementation, For what its work here is the git repo with my latest changes: github.com:apps4uco/geoparquet.git
from geozero.
The geoarrow
crate now includes a geozero integration for all GeoArrow arrays as well as a GeoParquet reader. The upcoming 0.2 release will include a GeoParquet writer (it was merged yesterday).
In #186 we removed the minimal, outdated geoarrow integration from the geozero crate and are now pointing users to the geoarrow
crate. Therefore, I'll close this issue.
Feel free to create issues on the geoarrow
crate for any bugs with the GeoParquet reader or writer.
from geozero.
Related Issues (20)
- [request] make WkbInfo and header readers public HOT 1
- Docs: Add example how to use with Serde HOT 1
- Implement an sqlx (sqlite) driver HOT 3
- Consider using pure rust protobuf codegen HOT 3
- Make readme examples part of `cargo test`
- Create a sample MVT file with all types of geometries/properties for testing
- Error refactoring HOT 2
- Make srid & dim fields of all writers private, and add ctor
- Introduce global dimension constants
- Better support for sparse properties by declaring schema when available HOT 1
- Clarification of `GeomProcessor` methods HOT 1
- `point_begin`/`point_end` should be called around `xy` for each Point in a MultiPoint HOT 1
- GeoPackage reader (GeozeroDatasource impl) HOT 3
- Move geozero-shp into geozero? HOT 3
- Impl `GeozeroGeometry` for concrate type should impls srid() and dims() HOT 1
- `BufferedGeozeroDatasource` for spatial filtering on reading
- Documentation for parsing GeoJSON from a file HOT 6
- Support for writing WKT with precision
- `geos::Geometry` to `ewkt` panic
- Add "Geospatial" category to `Cargo.toml` HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from geozero.