Comments (10)

mgdenno commented on June 26, 2024

I have a few thoughts to contribute to the conversation.

  1. We have some tools in the TEEHR library that use Dask to parallelize the building of the Kerchunk headers (JSON files). We have been building them and storing them locally as needed and then using them to access the data with XArray (see the sketch after this list). This seems to work pretty well. It is obviously not as fast as it would be if the files were already generated, but it provides a significant speed-up compared to downloading an entire NetCDF file to pull out one variable. So far we are only doing this for the data on GCP, as we are using the Zarr files in AWS for the retrospective, but it could be pretty easily extended to the NetCDF data in AWS. This may not be necessary though - see item 3 below.
  2. As far as the gridded data is concerned, our tools aggregate the gridded data values to polygons (think basin-average precipitation) but could pretty easily be refactored to provide a tool that returns an XArray object (this may already be possible, I'd have to look to see).
  3. I think that James Halgren's group at AWI just recently built the Kerchunk headers for the 2.1 retrospective NetCDFs on AWS. I think they are currently in an un-advertised AWS S3 bucket. @jameshalgren
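
Roughly, the pattern in item 1 looks like this (a minimal sketch, not our actual code; the S3 key and the output naming are illustrative):

# Build Kerchunk reference JSONs in parallel with Dask, store them locally,
# then they can be opened lazily with xarray. Key below is illustrative.
import json

import dask
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

@dask.delayed
def build_reference(url: str) -> dict:
    # Read just enough of the remote NetCDF header to index its chunks.
    with fsspec.open(url, mode="rb", anon=True) as f:
        return SingleHdf5ToZarr(f, url).translate()

urls = [
    # Illustrative key; the actual bucket layout may differ.
    "s3://noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp",
]

references = dask.compute(*[build_reference(u) for u in urls])
for url, ref in zip(urls, references):
    with open(url.rsplit("/", 1)[-1] + ".json", "w") as f:
        json.dump(ref, f)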

Regardless, we are certainly interested in collaborating on common tooling so we can avoid reinventing the wheel.

FYSA @samlamont

jameshalgren commented on June 26, 2024

> I think that James Halgren's group at AWI just recently built the Kerchunk headers for the 2.1 retrospective NetCDFs on AWS. I think they are currently in an un-advertised AWS S3 bucket. @jameshalgren

@fernando-aristizabal Please take a look.
https://ciroh-nwm-zarr-retrospective-data-copy.s3.amazonaws.com/index.html#noaa-nwm-retrospective-2-1-zarr-pds/
(Everyone is welcome to explore; only forcing data are complete there now, but we're working on a complete archive of materials.)

@igarousi, we should connect about this and add some material to the comment thread here.

jarq6c commented on June 26, 2024

@fernando-aristizabal I would never discourage the development of new tools. :) There's actually a stale issue about building a retrospective client here: #157. We never actually got around to building the tool, but you may find some of the discussion useful. We and others have encountered some difficulty reliably retrieving and validating the zarr data.

fernando-aristizabal commented on June 26, 2024

@samlamont Thanks for jumping in with interesting input.

> Regarding TEEHR, yes we create the single file jsons and then, in some cases, the combined json using MultiZarrToZarr, although we found that for this use case, there is no real performance gain provided by the MultiZarrToZarr step and are considering removing it. In general, I think combining the single file jsons is helpful when you're dealing with many contiguous files (i.e., the entire NWM retrospective dataset) since it allows you to read the file metadata only once across the entire dataset. So far with TEEHR, we've been focusing on subsets of operational NWM forecasts (~monthly) and have not seen much advantage from including the MultiZarrToZarr step. This is very much a work in progress however, and we welcome any feedback!

This is my general understanding as well, since kerchunk doesn't actually rechunk the files; it just builds an index around them, allowing access to the metadata and lazy loading. What would you say the advantages of single file jsons are without aggregating them?

> If so, I'm curious what the advantages are in accessing the data using the Kerchunk reference files vs. optimizing the chunking scheme of the provided Zarr dataset. As I understand, with Kerchunk we're tied to the original chunking scheme (or multiples of it) of the .nc files, and the original NWM retrospective files consist of only one chunk across all features? Although I could be wrong here? (fsspec/kerchunk#124)

Building on the previous comment, it's my understanding that the value of kerchunk comes when building a multi-file json: you get the advantages we've previously mentioned. Zarr offers the same benefits while also rechunking and recompressing in cloud-optimized formats. The link I shared previously demonstrates how this can speed up access if done properly for the right applications.
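
For concreteness, the aggregation step under discussion looks roughly like this (a sketch; file names, dimension choices, and options are placeholders):

# Combine per-file Kerchunk references into one index so dataset-wide
# metadata is read only once. Inputs are single-file reference JSONs.
from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    ["chrtout_1979_01.json", "chrtout_1979_02.json"],
    remote_protocol="s3",
    remote_options={"anon": True},
    concat_dims=["time"],           # stitch the files along time
    identical_dims=["feature_id"],  # shared identically across files
)
mzz.translate("combined.json")      # write the merged reference set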

> Also just to clarify, is the overall discussion here around how to best support a variety of querying schemes for the NWM retrospective (and forecast?) dataset(s) (for instance fetching the entire time series for one feature vs. the partial time series of many features)?

This discussion started with wanting to add some of the AWS retro Zarr references to the hydrotools repo to supplement the repo's existing NWM data access tools. @jarq6c brought up some concerns with the Zarr rechunking there, and the conversation expanded to various indexing/chunking efforts. It's apparent there are many efforts here across groups without a clear, consistent solution to gather around.

Over the course of this thread, I learned that @GautamSood-NOAA and @sudhirshrestha-noaa will also be doing some rechunking. They want to solicit feedback from SMEs on which variables, in addition to streamflow, qSfcLatRunoff, and qBucket, might be useful. I suggested the forcing data as well as the lake variables, as they may all be relevant for FIM eventually. They are eager for people's opinions, so feel free to communicate your needs to them.

> I'm also curious if sharding could be helpful here? I believe this capability allows for a sort of nested chunking scheme and has been released as experimental (zarr-developers/zarr-python#1111). Could be something to investigate?

Lastly, sharding seems to amount to a partial chunk read? It's hard to tell because some of their links appear to be down. If so, I'm sure this would add value when we have large chunks paired with narrow queries.
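
For anyone curious, a sketch of that nested scheme (sharding was still experimental at the time of this thread; this uses the later zarr-python 3 API, so names may differ from the experimental release linked above):

# One shard = one stored object holding many small inner chunks, so a
# reader can byte-range into a shard and fetch only the chunks it needs,
# i.e. a partial read of a large stored object. Sizes are illustrative.
import numpy as np
import zarr

arr = zarr.create_array(
    store="retro_sharded.zarr",
    shape=(8_760, 100_000),   # (time, feature_id)
    shards=(8_760, 10_000),   # size of each stored object
    chunks=(8_760, 100),      # individually addressable inner chunks
    dtype="f4",
)
arr[:, :100] = np.zeros((8_760, 100), dtype="f4")  # touches one inner chunk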

aaraney commented on June 26, 2024

@fernando-aristizabal, thanks for opening this! I share @jarq6c's sentiment.

What do you envision the api(s) would return? An xarray.Dataset or some flavor of dataframe (pandas, dask, etc.)?

fernando-aristizabal commented on June 26, 2024

Hey @aaraney, my initial thought was to keep it to xarray since that's what natively works best for these zarr/netcdf files. It would also keep data lazily loaded and leave it up to the user to slice or convert to a desired object type.

Given some of the issues with Zarr, has anyone produced a kerchunk index of the NetCDF retro data that we can use? It would load in a similar fashion and likely avoid some of the problems introduced by the Zarr rechunking.
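
For example, a kerchunk index of the retro NetCDFs would load along these lines (a sketch; "combined.json" is a hypothetical reference file, since no official index is confirmed to exist yet):

# Open the reference set as a lazy xarray.Dataset; the underlying bytes
# stay in the original .nc objects on S3 and are only fetched on demand.
import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.json",        # kerchunk reference set
            "remote_protocol": "s3",      # where the .nc bytes live
            "remote_options": {"anon": True},
        },
    },
)
streamflow = ds["streamflow"].sel(feature_id=101)  # still lazy until computed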

jarq6c commented on June 26, 2024

We might ask @mgdenno to contribute to this conversation. The TEEHR project (https://github.com/RTIInternational/teehr) has a system in place to retrieve these data (time series, point, and gridded) for exploratory evaluations. There may be an opportunity to collaborate with CIROH.

fernando-aristizabal commented on June 26, 2024

Hey everyone!

Thanks for contributing to this! It seems like a great survey of the various efforts to better access NWM data.
I'm going to rope in @GautamSood-NOAA and @sudhirshrestha-noaa, who also have an interest in this, specifically in what other variables might be useful to have rechunked or indexed.

I'll start off commenting on @mgdenno's insightful points.

  1. Thanks for sharing some of TEEHR's data access methods. It seems as if some of the low-level functionality is located here and here. My understanding is that this tooling creates single file jsons for the GCP data? Also, it seems to call MultiZarrToZarr for the purpose of creating local parquet file(s)?
  2. This sort of tooling looks useful to add on!
  3. 👍

Moving on to @jameshalgren's info on some of the work that CIROH has been doing on this. This seems very helpful, as a few CIROH people have reached out to me or mentioned questions about NWM data access.

I took a look at the README.md but wasn't able to get a successful request:

>>> import requests
>>> d = requests.get('https://ciroh-nwm-zarr-retrospective-data-copy.s3.amazonaws.com/noaa-nwm-retrospective-2-1-zarr-pds/README.md')
>>> d.status_code
200
>>> d.text
'404: Not Found'
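
(Note the website endpoint returns a 200 status with a '404: Not Found' body for missing keys, so listing the bucket directly with s3fs may be more reliable; anonymous access assumed.)

>>> import s3fs
>>> fs = s3fs.S3FileSystem(anon=True)
>>> fs.ls("ciroh-nwm-zarr-retrospective-data-copy/noaa-nwm-retrospective-2-1-zarr-pds")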

My understanding is that these are single file jsons for the forcing data? The forcing data seems of interest to people based on feedback. Is there a single multi-file json, or a plan to create one?

I'd like to share that there is some work here showing how Zarr rechunked across time instead of features yielded significant improvement in time-series queries. The repo for this is available here, as well as more specifically here and here. This work was influenced by @jarq6c, @sudhirshrestha-noaa, and @AustinJordan-NOAA.
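
For context, that time-oriented rechunking can be reproduced with the rechunker library; a sketch (the store key and chunk sizes are illustrative, not the settings used in that work):

# Rewrite a feature-contiguous Zarr array with time-contiguous chunks,
# which favors pulling a full time series for a handful of reaches.
import zarr
from rechunker import rechunk

source = zarr.open(
    "s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr",
    mode="r",
    storage_options={"anon": True},
)["streamflow"]

plan = rechunk(
    source,
    target_chunks=(367_000, 100),  # long in time, narrow across features
    max_mem="2GB",
    target_store="streamflow_time_chunked.zarr",
    temp_store="rechunk_tmp.zarr",
)
plan.execute()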

Hopefully this adds to the various efforts at improving NWM data access and builds towards generating a comprehensive solution for research and dissemination applications.

samlamont commented on June 26, 2024

Hi all, this is Sam. I'm working with @mgdenno on the TEEHR tooling and have a few points/questions to add.

Regarding TEEHR, yes we create the single file jsons and then, in some cases, the combined json using MultiZarrToZarr, although we found that for this use case, there is no real performance gain provided by the MultiZarrToZarr step and are considering removing it. In general, I think combining the single file jsons is helpful when you're dealing with many contiguous files (i.e., the entire NWM retrospective dataset) since it allows you to read the file metadata only once across the entire dataset. So far with TEEHR, we've been focusing on subsets of operational NWM forecasts (~monthly) and have not seen much advantage from including the MultiZarrToZarr step. This is very much a work in progress however, and we welcome any feedback!

Also just to clarify, is the overall discussion here around how to best support a variety of querying schemes for the NWM retrospective (and forecast?) dataset(s) (for instance fetching the entire time series for one feature vs. the partial time series of many features)?

If so, I'm curious what the advantages are in accessing the data using the Kerchunk reference files vs. optimizing the chunking scheme of the provided Zarr dataset. As I understand, with Kerchunk we're tied to the original chunking scheme (or multiples of it) of the .nc files, and the original NWM retrospective files consist of only one chunk across all features? Although I could be wrong here?

I'm also curious if sharding could be helpful here? I believe this capability allows for a sort of nested chunking scheme and has been released as experimental. Could be something to investigate?

I hope these comments are helpful, happy to discuss further if not!

samlamont commented on June 26, 2024

Hi @fernando-aristizabal, thanks for the feedback. On the single json vs. aggregated approach for NWM forecasts, we noticed a much smoother Dask task stream when using the single file jsons as opposed to aggregating with MultiZarrToZarr. The overall performance/run time was about the same, however, so I'm not sure I can say there was a huge advantage. I did notice that when concatenating forecasts, MultiZarrToZarr will append nan values to the individual forecasts in order to build a contiguous array over the requested time period. Again, I'm not sure what the overall impact of this behavior is (if any), so I might be over-complicating things here; apologies if I'm taking this thread down a technical rabbit hole! 😃

Thanks for the additional clarification; I'm happy to contribute to this effort in any way. I'll post back here if I learn of any benefits to sharding.
