
Comments (6)

jarq6c commented on September 24, 2024

Looping in @christophertubbs


aaraney commented on September 24, 2024

Thanks for writing up this issue. I agree with you that it's worth starting this likely tough-to-accomplish task. I have a few things I'd like to weigh in on, so I'll try to order them from what I think is easiest to address to what likely needs further discussion.

HydroTools Canonical DataFrame Column Definitions

value [float32]: Indicates the real value of an individual or aggregated measured, simulated, or forecasted quantity.

Just nit-picky: I can see us supporting either temporally or spatially aggregated metrics in the future. Unless we want to add a field like aggregated_value to be explicit, I think noting that the value can be either an individual or an aggregated quantity is important. I'd love to hear thoughts.

IMO adding forecasted in this context is helpful. I get the argument that a forecasted value is a simulated value, but in the context of hydrology specifically I think adding the distinction is helpful, given how split this field is between forecast hydrology and regular old hydrological model runs.

valid_time

More on that later...

In general I think these fields and definitions are great! One question I have is whether location metadata columns (usgs_site_code, nwm_feature_id, etc.) should be included in the canonical format. Personally, since these columns will likely vary the most between subpackages, I wonder if we could instead create a standard way to communicate additional non-canonical location metadata fields in documentation and/or a standard method across subpackages that lists the available or supported location metadata providers. To me, the "canonical" columns are value, variable_name, valid_time, reference_time, measurement_unit, longitude, latitude, and geometry. From my perspective, I can see the supported location metadata provider list growing as we support other evaluations and data sources (e.g. precipitation data).

geometry

I think we should include an expected CRS in the definition; EPSG:4326 is likely the most appropriate. Additionally, specifying a projection on which any computations are done is probably a good idea, given that EPSG:4326 is in lat-lon degrees rather than meters.
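For illustration, a minimal geopandas sketch of that pattern (the choice of EPSG:5070 as the working projection here is my assumption, not part of the proposal):

```python
import geopandas as gpd
from shapely.geometry import Point

# Store coordinates in EPSG:4326, but reproject to a meter-based
# CRS before doing distance or area computations.
gdf = gpd.GeoDataFrame(
    {"value": [1.2]},
    geometry=[Point(-68.58, 47.23)],
    crs="EPSG:4326",
)
projected = gdf.to_crs("EPSG:5070")  # CONUS Albers, units in meters
```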

More on that later...

valid_time

Personally, I think valid_time is the toughest and a compromise is likely necessary. The use of the term valid_time, to me, implies that the values are forecasted values. This is strictly contextual, given how often valid_time is used when talking about forecasts. I get that most of the time we are dealing with forecasts and analysis simulations, so for this audience the word choice mostly makes sense. However, in the case of an aggregated value (e.g. mean daily discharge) or a measurement (e.g. observed discharge), valid_time doesn't seem fitting IMO. Instead I would offer the name value_time, as I think it's more inclusive and understandable in either an observation or forecast context.

With that said, as for the types of value_time (and reference_time), I agree with and support datetime64, just maybe not to the nanosecond; personally I think millisecond is fine for our use cases, as it fits well in JSON if web APIs were ever to implement hydrotools. Additionally, I think it's worth thinking about including Period as an acceptable value_time to cover the case of an aggregated value. Maybe this is too far, as datetimes and Periods often don't play well or as expected, but I do think it's worth brainstorming how to represent the time scale of a temporally aggregated value.
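As a purely illustrative sketch (not a proposal for the canonical format), a pandas.Period carries both the point in time and the aggregation scale:

```python
import pandas as pd

# A daily Period could stand in for the value_time of a mean daily
# discharge, carrying the aggregation scale along with the date.
daily = pd.Period("2022-06-15", freq="D")
print(daily.start_time)  # 2022-06-15 00:00:00
print(daily.end_time)    # 2022-06-15 23:59:59.999999999
```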


jarq6c commented on September 24, 2024

Thanks for the response! My responses are below.

value_time vs valid_time

I like value_time. It makes it more explicit that this time is specifically associated with value.

datetime64[ns]

No preference from me. Current tools are defaulting to nanoseconds, however. I believe this is the default behavior for numpy and pandas. Changing this may require special interventions.
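For example, both libraries land on nanoseconds unless told otherwise (behavior as of pandas 1.x; later versions relax this somewhat):

```python
import numpy as np
import pandas as pd

# pandas parses timestamps to nanosecond precision by default
s = pd.to_datetime(pd.Series(["2022-06-15 12:00"]))
print(s.dtype)  # datetime64[ns]

# even a coarser numpy unit is coerced back to nanoseconds
a = np.array(["2022-06-15"], dtype="datetime64[s]")
print(pd.to_datetime(a).dtype)  # datetime64[ns]
```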

Location Identifiers

I am also uncomfortable supporting explicit identifier types with special column names (usgs_site_code, nwm_feature_id, etc). I included those mostly to point out where the current model might be headed. In the long term, we probably want to replace all of these with a generic identifier column. We could employ the USGS identifier template with a provider:identifier format. For example: USGS:01013500, NWM:171, NWS:WOOM6.
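A rough sketch of that template (the identifier, provider, and site_id column names here are hypothetical):

```python
import pandas as pd

# Hypothetical generic identifier column using provider:identifier
df = pd.DataFrame({
    "identifier": pd.Categorical(
        ["USGS:01013500", "NWM:171", "NWS:WOOM6"]
    )
})

# provider and bare id stay easy to recover when needed
parts = df["identifier"].astype(str).str.split(":", n=1, expand=True)
df["provider"], df["site_id"] = parts[0], parts[1]
```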

Forecasts vs simulations

IMO this is just a label. We can accommodate this distinction by adding some kind of series_label column that contains a string/category tag. The string could be "forecast", "simulation", "short_range", "SRF", "analysis", "sally_model_run", or some other user managed label.

CRS EPSG:4326

I'm fine with this default. This means canonical dataframes are directly compatible with GeoJSON output.
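A quick illustration of why that works: the GeoJSON specification mandates WGS84 coordinates, so a frame already in EPSG:4326 serializes without reprojection (sketch, assuming geopandas):

```python
import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.GeoDataFrame(
    {"value": [1.2]},
    geometry=[Point(-68.58, 47.23)],
    crs="EPSG:4326",
)
print(gdf.to_json())  # a GeoJSON FeatureCollection string
```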

Aggregated values

My preference is to table this for now. It could become an issue, but we could also end up engineering something we don't need. Two immediate cases come to mind: resampling and reduction methods. In the case of resampling, the value column represents the new aggregated value. In the case of reduction (for example df.groupby(...).mean()), a new dataframe is produced with the aggregated values. In both cases, non-aggregated and aggregated values are not mixed in the same dataframe. We may not need a special column indicator to specify aggregated values because this can be effectively tracked at a level of abstraction above columns.
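Both cases in a small sketch (made-up data):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2022-06-15", periods=48, freq="H")
df = pd.DataFrame({"value": np.arange(48.0)}, index=idx)

# Resampling: the value column of the result is the aggregate
daily_mean = df.resample("D").mean()

# Reduction: groupby().mean() yields a new, fully aggregated frame
by_hour = df.groupby(df.index.hour).mean()
```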

Periods as value_time

I prefer not to mix datatypes. This will tend to make data processing messy. Methods that return data relevant to a certain time period could return a start and end for that period similar to what's produced by event detection methods. Isolated periods could also be recorded as pandas.Period, pandas.Timedelta or pandas.Interval. We may want to explore where and when these datatypes are most relevant to our uses.
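For reference, the candidate types look like this (values are illustrative):

```python
import pandas as pd

# Isolated periods, kept out of value_time entirely
event = pd.Interval(
    pd.Timestamp("2022-06-15 03:00"),
    pd.Timestamp("2022-06-16 18:00"),
    closed="both",
)
duration = pd.Timedelta("1 day 15 hours")
month = pd.Period("2022-06", freq="M")

# Or a start/end pair, like event detection output
events = pd.DataFrame({
    "start": pd.to_datetime(["2022-06-15 03:00"]),
    "end": pd.to_datetime(["2022-06-16 18:00"]),
})
```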


aaraney commented on September 24, 2024

For sure! See my responses below:

value_time vs valid_time

I like value_time. It makes it more explicit that this time is specifically associated with value.

Sweet 🚀

datetime64[ns]

No preference from me. Current tools are defaulting to nanoseconds, however. I believe this is the default behavior for numpy and pandas. Changing this may require special interventions.

Thanks for mentioning what you thought was the default behavior; having looked into it this morning, I agree with you that sticking with datetime64[ns] is probably best. I ran a few tests converting dates with pandas and numpy, and it seems it's in our best interest to stick with the default so as not to break things.

Location Identifiers

I am also uncomfortable supporting explicit identifier types with special column names (usgs_site_code, nwm_feature_id, etc). I included those mostly to point out where the current model might be headed. In the long term, we probably want to replace all of these with a generic identifier column. We could employ the USGS identifier template with a provider:identifier format. For example: USGS:01013500, NWM:171, NWS:WOOM6.

I agree with you that we are likely headed toward a more generic solution in the future. It may be worth looking into pandas.DataFrame.attrs as part of whatever solution is decided on. It's worth noting that pandas attrs is an experimental feature that may be removed in the future, but we can reevaluate when the time comes. The attrs property is attractive for storing other metadata that may not be appropriate as a populated dataframe field.
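A small sketch of what attrs offers (the metadata keys here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"value": [1.2, 3.4]})

# Frame-level metadata that never appears as a column
df.attrs["measurement_unit"] = "m3/s"
df.attrs["crs"] = "EPSG:4326"
print(df.attrs)

# Caveat: attrs is experimental, and propagation through every
# operation is not guaranteed; verify for the ops you depend on.
print(df.head(1).attrs)
```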

Forecasts vs simulations

IMO this is just a label. We can accommodate this distinction by adding some kind of series_label column that contains a string/category tag. The string could be "forecast", "simulation", "short_range", "SRF", "analysis", "sally_model_run", or some other user managed label.

Agreed 👍.

CRS EPSG:4326

I'm fine with this default. This means canonical dataframes are directly compatible with GeoJSON output.

Sweet 👍.

Aggregated values

My preference is to table this for now. It could become an issue, but we could also end up engineering something we don't need. Two immediate cases come to mind: resampling and reduction methods. In the case of resampling, the value column represents the new aggregated value. In the case of reduction (for example df.groupby(...).mean()), a new dataframe is produced with the aggregated values. In both cases, non-aggregated and aggregated values are not mixed in the same dataframe. We may not need a special column indicator to specify aggregated values because this can be effectively tracked at a level of abstraction above columns.

Right, I agree. I mainly just wanted to bring it up as a long-term conversation topic that we can point back to (I should have stated my intentions more explicitly). I think this topic may be overly pedantic with little gain. One case I am thinking of: in the context of a canonical output CSV format, how best do we communicate that the values represented are aggregated? And if they were temporally aggregated, how best should the scale and frequency from which each aggregate was derived be represented?

Periods as value_time

I prefer not to mix datatypes. This will tend to make data processing messy. Methods that return data relevant to a certain time period could return a start and end for that period similar to what's produced by event detection methods. Isolated periods could also be recorded as pandas.Period, pandas.Timedelta or pandas.Interval. We may want to explore where and when these datatypes are most relevant to our uses.

I agree with you that this may cause more headaches than it helps. This is related to my comment above about representing a temporal aggregate's time scale and frequency; I'm just curious how we might do that in the future. No need to figure that out now though.


jarq6c commented on September 24, 2024

Note: This was more words than I expected. I promise I care a lot less than this diatribe would suggest. 😁

It just occurred to me the series_label functionality mentioned may already be covered by the configuration column.

pandas.DataFrame.attrs is an interesting development that seems to address an outstanding issue with pandas.DataFrame. It's generally not a good idea to create new objects that derive from DataFrames, because the extra properties are not guaranteed to survive any operation that yields a copy of a DataFrame. The attrs solution might also be intended to allow easier integration with attributes as they are used by HDF5. Currently, you can store DataFrames directly in HDF5 as tabular datasets, but group and dataset attributes have to be added later and it's unclear if this is safe.

attrs is also a dict of global attributes for the dataset. So in most cases, this may only contain a configuration, measurement_unit, and crs. It's also unclear how these attrs would play with other output formats like GeoJSON, CSV, and SQL. Categorical columns are effectively a dict of values that apply to supersets, global or not. So we may already be using a suitable solution.
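That is, a categorical column already gives dict-like storage of a repeated value at little memory cost, and unlike attrs it reliably survives slicing, concatenation, and serialization (quick sketch):

```python
import pandas as pd

df = pd.DataFrame({"value": [1.2, 3.4, 5.6]})
df["measurement_unit"] = pd.Categorical(["m3/s"] * len(df))

# One stored string plus small integer codes per row
print(df["measurement_unit"].cat.categories)       # ['m3/s']
print(df["measurement_unit"].cat.codes.tolist())   # [0, 0, 0]

# Survives operations that copy the frame
print(df.iloc[:2]["measurement_unit"].dtype)       # category
```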


jarq6c commented on September 24, 2024

Based on the above discussion, I'll propose the following, subject to future expansion as needed:

"Canonical" labels are protected and part of a fixed lexicon. Canonical labels are shared among all hydrotools subpackages. Subpackage methods should avoid changing or redefining these columns where they appear to encourage cross-compatibility.

HydroTools Canonical DataFrame Column Labels

value [float32]: Indicates the real value of an individual measurement or simulated quantity.
value_time [datetime64[ns]]: Formerly value_date; indicates the valid time of value.
variable_name [category]: String category that indicates the real-world type of value (e.g. streamflow, gage height, temperature).
measurement_unit [category]: String category indicating the measurement unit (SI or standard) of value.
qualifiers [category]: String category that indicates any special qualifying codes or messages that apply to value.
series [int32]: Used to disambiguate multiple coincident time series returned by a data source.
configuration [category]: String category used as a label for a particular time series, often used to distinguish types of model runs (e.g. short_range, medium_range, assimilation).
reference_time [datetime64[ns]]: Formerly start_date; some reference time for a particular model simulation. Could be considered an issue time, start time, end time, or other meaningful reference time. Interpretation is simulation or forecast specific.
longitude [category]: float32 category, WGS84 decimal longitude.
latitude [category]: float32 category, WGS84 decimal latitude.
crs [category]: String category, Coordinate Reference System, typically "EPSG:4326".
geometry [geometry]: GeoPandas-compatible GeoSeries used as the default "geometry" column.
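As a sanity check on the dtypes above, a minimal frame with a subset of the canonical columns might be built like this (a sketch; the data values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "value": pd.array([1.2, 3.4], dtype="float32"),
    "value_time": pd.to_datetime(
        ["2022-06-15 00:00", "2022-06-15 01:00"]
    ),
    "variable_name": pd.Categorical(["streamflow"] * 2),
    "measurement_unit": pd.Categorical(["m3/s"] * 2),
    "configuration": pd.Categorical(["short_range"] * 2),
    "reference_time": pd.to_datetime(["2022-06-15 00:00"] * 2),
    "crs": pd.Categorical(["EPSG:4326"] * 2),
})
print(df.dtypes)  # confirms float32, datetime64[ns], and category
```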

"Non-Canonical" labels are subpackage specific extensions to the canonical standard. Packages may share these non-canonical lables, but cross-compatibility is not guaranteed. Examples of non-canonical labels are given below.

Non-Canonical Column Labels

usgs_site_code [category]: String category indicating the USGS Site Code/gage ID.
nwm_feature_id [category]: String category indicating the NWM reach feature ID/ComID.
nws_lid [category]: String category indicating the NWS Location ID/gage ID.
usace_gage_id [category]: String category indicating the USACE gage ID.
start [datetime64[ns]]: Datetime returned by event_detection indicating the beginning of an event.
end [datetime64[ns]]: Datetime returned by event_detection indicating the end of an event.

