Code Monkey home page Code Monkey logo

Comments (13)

michaelkirk avatar michaelkirk commented on June 10, 2024 1

Yes, but if you're implementing the GeozeroDatasource trait directly on Arrow structs, you won't be able to do that from a third party crate. If you're creating your own structs... then you need to copy the memory from Arrow into your structs, which kinda defeats the purpose.

Do you need to actually copy the memory though? Things like moves into a new-type struct are optimized away:

https://rust.godbolt.org/z/8GabGrhbT

from geozero.

pka avatar pka commented on June 10, 2024

A GeoPartquet reader would be welcome for sure. It shouldn't be very difficult to implement, since there are Rust implementations for reading Parquet (https://github.com/apache/arrow-rs) and AFAIK geometries are stored as WKB, which is already supported in geozero.

from geozero.

BKDaugherty avatar BKDaugherty commented on June 10, 2024

I'm new to geospatial, but I'm gonna take a look at trying this out! (assuming that pull requests are welcome)

from geozero.

kylebarron avatar kylebarron commented on June 10, 2024

Just a quick note that in versions of geoparquet so far WKB is the standard, but in future versions we hope to also enable an Arrow-native geometry encoding, see https://github.com/geopandas/geo-arrow-spec/blob/main/format.md. The difference between the two is that a binary buffer of WKB in memory still needs to be parsed to be used. In contrast, an Arrow-native implementation would store geometries in a layout standardized by the Arrow project. This means that it's O(1) to access any individual coordinate.

What this means is that using geozero on a GeoParquet dataset with geometries stored as WKB is essentially already supported by geozero, you just iterate over the items in the array and use the existing geozero WKB reader. Note here that an improvement would be to figure out how to let the WKB reader read directly from the Arrow buffer. I seemed to have to copy the section of the arrow buffer into a Vec<u8> to get geozero's WKB reader to accept it.

A separate geozero-geoarrow implementation would be to add support for the Arrow-native encoding. Note that https://github.com/geopandas/geo-arrow-spec/blob/main/format.md is a provisional, unreleased spec (though it was implemented by GDAL in 3.5.0). So if we have feedback while implementing a Rust reader, we should comment on the project. For prior work, there are WIP native arrow implementations in Python, C++, R, and a data repo.

from geozero.

apps4uco avatar apps4uco commented on June 10, 2024

I am starting work on a geoparquet crate that will use geozero and could potentially be merged into geozero at some point.

FYI: I also will be working on a geoarrow crate and geodatafusion crate, stay tuned for developments....

from geozero.

kylebarron avatar kylebarron commented on June 10, 2024

A few notes... I've already started planning how to implement geoparquet and geoarrow and I'd love to discuss each of these so that we can align our work. It might also be worthwhile to discuss this offline, e.g. in the georust discord: https://discord.gg/63ueuxHu.

On GeoParquet:

  • If you're ok with loading data into Arrow, in many cases you don't need a geoparquet crate at all! If you load a geoparquet file into arrow, you automatically have geoarrow.
  • If you want to support loading data from Parquet into geo-types structs, then I think using geozero's setup will likely make things much faster. But you're going to hit the orphan rule, which says that you can't implement an external trait onto an external type. Because geozero owns the geozero traits and arrow(2)/parquet(2) own the structs, you have to create your own traits, which are then less interoperable with the rest of the system.
  • Are you planning to use parquet and arrow or parquet2 and arrow2?

On GeoArrow:

  • I've already been prototyping how to design geoarrow (around the proposed struct approach: geoarrow/geoarrow#26) to make it fast and need fewer casts, so I'd love to chat more about your thoughts on this.
  • Similarly, are you planning to use arrow or arrow2? I'm locked into the arrow2 ecosystem, because I'm building geopolars around polars, and polars uses arrow2. The crates really aren't compatible with each other
  • Again, if you want to define traits onto arrow structs, you're going to hit the orphan rule again. Or at least, long term, we need geo to support reading data directly from arrow memory, and the way that will happen is by having traits defined in the geo crate, and then you won't be able to implement external traits from geo onto external structs from arrow/arrow2

from geozero.

apps4uco avatar apps4uco commented on June 10, 2024

I dont quite understand the problem you are referring to. I have written a
struct that has the following:
impl GeozeroDatasource for GeoParquetReader { .... }
and there are other crates eg geozero-shp that implement readers for geozero

The implementation I have currently uses parquet, and I was planning on having a feature to be able to select between parquet/parquet2.

I do want to use parquet rather than loading it into memory as I am targeting the use case of files that will not fit into memory. (Although I know that the arrow file could be memory mapped)

from geozero.

kylebarron avatar kylebarron commented on June 10, 2024

It's still worthwhile to have a geoparquet/geoarrow implementation in geozero, so I don't want to dissuade you from that. My point is that today the georust/geo algorithms all use geo-specific structs. It's useful to be able to load geoparquet/geoarrow into those structs, but it also misses out on a huge reason for Arrow, being easy to transfer memory across languages without serialization.

I think my discussion comment on geo here as well as this blog post serve as useful reasons for why I see support for geoparquet/geoarrow in geozero as only a short-term stepping stone.

As mentioned in that discussion, I have a WIP implementation of geoarrow here.

and there are other crates eg geozero-shp that implement readers for geozero

Yes, but if you're implementing the GeozeroDatasource trait directly on Arrow structs, you won't be able to do that from a third party crate. If you're creating your own structs... then you need to copy the memory from Arrow into your structs, which kinda defeats the purpose.

Although I know that the arrow file could be memory mapped

This is true... but with the geozero approach you need to first load a chunk of data into memory and then copy all of it into geo structs, which is not very memory efficient. This is why I'm excited about the idea of geometry access traits, as mentioned in this discussion, so that you could actually memory map Arrow directly without a full copy.

from geozero.

kylebarron avatar kylebarron commented on June 10, 2024

Things like moves into a new-type struct are optimized away

That's super interesting. I've never looked into low-level compiled output before.

@apps4uco Re-reading my comment I think my tone might've come off too harsh... I'm fully supportive of both approaches! I've tried to closely follow spec discussions on both GeoParquet and GeoArrow and I'd be happy to review your PRs if you'd like 🙂. Keep in mind that neither spec has reached 1.0 yet. WKB-encoded GeoParquet is the most stable (native encodings are still up in the air) so I'd start with that one. You might even be able to leverage my previous geozero PR: #39

from geozero.

kylebarron avatar kylebarron commented on June 10, 2024

Geoparquet is getting ready for a 1.0.0-beta.1 soon opengeospatial/geoparquet#147, and I'd love to see support for this in geozero. @apps4uco do you have any existing work to share? Just want to avoid duplication of work

from geozero.

kylebarron avatar kylebarron commented on June 10, 2024

Would love some feedback on #80

from geozero.

apps4uco avatar apps4uco commented on June 10, 2024

Geoparquet is getting ready for a 1.0.0-beta.1 soon opengeospatial/geoparquet#147, and I'd love to see support for this in geozero. @apps4uco do you have any existing work to share? Just want to avoid duplication of work

Hi, I have been busy on other things, and so havent really advanced on the implementation, For what its work here is the git repo with my latest changes: github.com:apps4uco/geoparquet.git

from geozero.

kylebarron avatar kylebarron commented on June 10, 2024

The geoarrow crate now includes a geozero integration for all GeoArrow arrays as well as a GeoParquet reader. The upcoming 0.2 release will include a GeoParquet writer (it was merged yesterday).

In #186 we removed the minimal, outdated geoarrow integration from the geozero crate and are now pointing users to the geoarrow crate. Therefore, I'll close this issue.

Feel free to create issues on the geoarrow crate for any bugs with the GeoParquet reader or writer.

from geozero.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.