Comments (6)
Looping in @christophertubbs
from hydrotools.
Thanks for writing up this issue. I agree with you that it is worth starting this likely tough-to-accomplish task. I have a few things I'd like to weigh in on, so I'll try to order them from what I think is easiest to address to what likely needs further discussion.
HydroTools Canonical DataFrame Column Definitions
`value` [float32]: Indicates the real value of an individual or aggregated measured, simulated, or forecasted quantity.

Just nit-picky: I can see us supporting either temporally or spatially aggregated metrics in the future. Unless we want to add a field like `aggregated_value` to be explicit, I think noting that `value` can be either an individual or an aggregated quantity is important. I'd love to hear thoughts.

IMO, adding "forecasted" in this context is helpful. I get the argument that a forecasted value is a simulated value, but in the context of hydrology I think the distinction is helpful given how split this field is between forecast hydrology and plain hydrological modeling.

`valid_time`

More on that later...
In general I think these fields and definitions are great! One question I have is whether location metadata columns (`usgs_site_code`, `nwm_feature_id`, etc.) should be included in the canonical format. Personally, since these columns will likely vary the most between subpackages, I wonder if we could instead create a standard way to communicate additional non-canonical location metadata fields in documentation, and/or a standard method across subpackages which lists the available or supported location metadata providers. To me, the "canonical" columns are `value`, `variable_name`, `valid_time`, `reference_time`, `measurement_unit`, `longitude`, `latitude`, and `geometry`. From my perspective, I can see the supported location metadata provider list growing as we support other evaluations and data sources (e.g. precipitation data).
`geometry`

I think we should include an expected CRS in the definition; `EPSG:4326` is likely the most appropriate. Additionally, specifying a projection on which any computations are done is probably a good idea, given `4326` is lat-lon rather than meters.

More on that later...
`valid_time`

Personally, I think `valid_time` is the toughest, and a compromise is likely necessary. The use of the term `valid_time`, to me, implies that the values are forecasted values. This is strictly contextual, given how often `valid_time` is used when talking about forecasts. I get that most of the time we are dealing with forecasts and analysis simulations, so for this audience the word choice mostly makes sense. However, in the case of an aggregated value (e.g. mean daily discharge) or a measurement (e.g. observed discharge), `valid_time` doesn't seem fitting IMO. Instead, I would offer the name `value_time` in place of `valid_time`, as I think it's more inclusive and understandable in either an observation or forecast context. With that being said, as for the types of a `value_time` (and `reference_time`), I agree with and support `datetime64[]`, just maybe not to the nanosecond; personally I think millisecond is fine for our use cases, as it fits well in JSON if web APIs were ever to implement `hydrotools`. Additionally, I think it's worth thinking about including `Period` as an acceptable `value_time` to cover the case of an aggregated value. Maybe this is too far, as `datetime`s and `Period`s often don't play well or as expected, but I do think it's worth brainstorming how to represent the time scale of a temporally aggregated value.
---
Thanks for the response! My responses are below.
`value_time` vs `valid_time`

I like `value_time`. It makes it more explicit that this time is specifically associated with `value`.
`datetime64[ns]`

No preference from me. Current tools are defaulting to nanoseconds, however. I believe this is the default behavior for `numpy` and `pandas`. Changing this may require special interventions.
Location Identifiers

I am also uncomfortable supporting explicit identifier types with special column names (`usgs_site_code`, `nwm_feature_id`, etc). I included those mostly to point out where the current model might be headed. In the long term, we probably want to replace all of these with a generic `identifier` column. We could employ the USGS identifier template with a `provider:identifier` format. For example: `USGS:01013500`, `NWM:171`, `NWS:WOOM6`.
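A minimal sketch of what that generic column might look like, assuming plain pandas; the `provider` and `provider_id` names below are illustrative, not an existing hydrotools API:

```python
import pandas as pd

# Hypothetical sketch (not hydrotools API): a single generic "identifier"
# column that follows the "provider:identifier" template described above.
df = pd.DataFrame({
    "identifier": ["USGS:01013500", "NWM:171", "NWS:WOOM6"],
    "value": [120.5, 118.0, 2.3],
})

# Split back into provider and provider-specific id when needed.
df[["provider", "provider_id"]] = df["identifier"].str.split(":", n=1, expand=True)
```

Splitting on the first `:` only (`n=1`) keeps identifiers that themselves contain colons intact.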
Forecasts vs simulations

IMO this is just a label. We can accommodate this distinction by adding some kind of `series_label` column that contains a string/category tag. The string could be "forecast", "simulation", "short_range", "SRF", "analysis", "sally_model_run", or some other user-managed label.
CRS `EPSG:4326`

I'm fine with this default. This means canonical dataframes are directly compatible with GeoJSON output.
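As a quick illustration of why `EPSG:4326` pairs naturally with GeoJSON: RFC 7946 mandates WGS84 lon/lat coordinate order, so canonical `longitude`/`latitude` values drop straight into a Feature. The values below are made up:

```python
import json

# EPSG:4326 longitude/latitude map directly onto GeoJSON coordinates,
# since GeoJSON (RFC 7946) mandates WGS84 lon/lat order.
row = {"longitude": -68.58, "latitude": 47.24, "value": 120.5}

feature = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [row["longitude"], row["latitude"]],
    },
    "properties": {"value": row["value"]},
}
print(json.dumps(feature))
```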
Aggregated values

My preference is to table this for now. It could become an issue, but we could also end up engineering something we don't need. Two immediate cases come to mind: resampling and reduction methods. In the case of resampling, the `value` column represents the new aggregated value. In the case of reduction (for example, `df.groupby(...).mean()`) a new dataframe is produced with the aggregated values. In both cases, non-aggregated and aggregated values are not mixed in the same dataframe. We may not need a special column indicator to specify aggregated values because this can be effectively tracked at a level of abstraction above columns.
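A small sketch of the resampling case, using plain pandas (nothing hydrotools-specific): the aggregated frame fully replaces the raw one, so raw and aggregated values never share a dataframe.

```python
import numpy as np
import pandas as pd

# 48 hourly values; the data here are invented for illustration.
idx = pd.date_range("2022-01-01", periods=48, freq="h")
raw = pd.DataFrame({"value": np.arange(48, dtype="float32")}, index=idx)

# Resampling yields a new frame containing only aggregated values.
daily = raw.resample("D").mean()  # two rows, each a daily mean
```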
Periods as `value_time`

I prefer not to mix datatypes. This will tend to make data processing messy. Methods that return data relevant to a certain time period could return a `start` and `end` for that period, similar to what's produced by event detection methods. Isolated periods could also be recorded as `pandas.Period`, `pandas.Timedelta`, or `pandas.Interval`. We may want to explore where and when these datatypes are most relevant to our uses.
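For illustration, one way to record an isolated period with `pandas.Interval` while keeping plain `start`/`end` timestamps; this is a sketch, not a proposed hydrotools API:

```python
import pandas as pd

# Explicit endpoints, as event detection methods produce.
start = pd.Timestamp("2022-01-01")
end = pd.Timestamp("2022-01-02")

# A half-open daily window; pandas.Interval keeps both endpoints together
# in one object without mixing Periods into a datetime column.
window = pd.Interval(start, end, closed="left")
```

Membership tests (`timestamp in window`) then make it easy to ask whether a value falls inside the aggregation window.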
---
For sure! See my responses below:
> `value_time` vs `valid_time`
>
> I like `value_time`. It makes it more explicit that this time is specifically associated with `value`.

Sweet 🚀
> `datetime64[ns]`
>
> No preference from me. Current tools are defaulting to nanoseconds, however. I believe this is the default behavior for `numpy` and `pandas`. Changing this may require special interventions.

Thanks for mentioning what you thought was the default behavior. Having looked into that this morning, I agree with you that sticking with `datetime64[ns]` is probably best. I ran a few tests converting dates with `pandas` and `numpy`, and it seems it is in our best interest to stick with the default so as to not break things.
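A quick check of that default behavior with plain pandas (nothing hydrotools-specific):

```python
import pandas as pd

# Parsing timestamps with pandas yields nanosecond precision by default.
ts = pd.to_datetime(["2022-01-01T00:00:00", "2022-01-01T01:00:00"])
print(ts.dtype)  # datetime64[ns]

# Round-tripping through numpy preserves the nanosecond unit.
arr = ts.to_numpy()
print(arr.dtype)  # datetime64[ns]
```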
> Location Identifiers
>
> I am also uncomfortable supporting explicit identifier types with special column names (`usgs_site_code`, `nwm_feature_id`, etc). I included those mostly to point out where the current model might be headed. In the long term, we probably want to replace all of these with a generic `identifier` column. We could employ the USGS identifier template with a `provider:identifier` format. For example: `USGS:01013500`, `NWM:171`, `NWS:WOOM6`.

I agree with you that we are likely headed toward a more generic solution in the future. It may be worth looking into `pandas.DataFrame.attrs` as part of whatever solution is decided on. It's worth noting that pandas `attrs` is an experimental feature that may be removed in the future, but we can reevaluate when the time comes. The `attrs` property is attractive for storing other metadata that may not be appropriate as a populated dataframe field.
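A minimal sketch of what storing such metadata in the experimental `attrs` dict could look like; the keys here are illustrative, and `attrs` itself may change or be removed in future pandas releases:

```python
import pandas as pd

# Frame-level metadata in DataFrame.attrs (experimental pandas feature).
df = pd.DataFrame({"value": [120.5, 118.0]})
df.attrs["measurement_unit"] = "m3/s"
df.attrs["crs"] = "EPSG:4326"

# attrs currently propagate through copy(), though survival across all
# operations is not guaranteed.
copied = df.copy()
```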
> Forecasts vs simulations
>
> IMO this is just a label. We can accommodate this distinction by adding some kind of `series_label` column that contains a string/category tag. The string could be "forecast", "simulation", "short_range", "SRF", "analysis", "sally_model_run", or some other user-managed label.

Agreed 👍.
> CRS `EPSG:4326`
>
> I'm fine with this default. This means canonical dataframes are directly compatible with GeoJSON output.

Sweet 👍.
> Aggregated values
>
> My preference is to table this for now. It could become an issue, but we could also end up engineering something we don't need. Two immediate cases come to mind: resampling and reduction methods. In the case of resampling, the `value` column represents the new aggregated value. In the case of reduction (for example, `df.groupby(...).mean()`) a new dataframe is produced with the aggregated values. In both cases, non-aggregated and aggregated values are not mixed in the same dataframe. We may not need a special column indicator to specify aggregated values because this can be effectively tracked at a level of abstraction above columns.

Right, I agree. I mainly just wanted to bring it up as a long-term conversation topic that we can point back to (I should have more explicitly stated my intentions). I think this topic may be overly pedantic with little gain. One case I am thinking of: in the context of a canonical output CSV format, how best to communicate that the values represented are aggregated? If they were temporally aggregated, how best should the scale and frequency from which each aggregate was derived be represented?
> Periods as `value_time`
>
> I prefer not to mix datatypes. This will tend to make data processing messy. Methods that return data relevant to a certain time period could return a `start` and `end` for that period, similar to what's produced by event detection methods. Isolated periods could also be recorded as `pandas.Period`, `pandas.Timedelta`, or `pandas.Interval`. We may want to explore where and when these datatypes are most relevant to our uses.

I agree with you that this may cause more headaches than it actually helps. This is related to my comment above regarding representing a temporal aggregate's time scale and frequency; I'm just curious how we may do that in the future. No need to figure that out now, though.
---
Note: This was more words than I expected. I promise I care a lot less than this diatribe would suggest. 😁
It just occurred to me that the `series_label` functionality mentioned may already be covered by the `configuration` column.
`pandas.DataFrame.attrs` is an interesting development that seems to address an outstanding issue with `pandas.DataFrame`. It's generally not a good idea to create new objects that derive from DataFrames because the extra properties are not guaranteed to survive any operation that yields a copy of a DataFrame. The `attrs` solution might also be intended to allow easier integration with attributes as they are used by HDF5. Currently, you can store DataFrames directly in HDF5 as tabular datasets, but group and dataset attributes have to be added later, and it's unclear if this is safe.
`attrs` is also a `dict` of global attributes for the dataset. So in most cases, this may only contain a `configuration`, `measurement_unit`, and `crs`. It's also unclear how these `attrs` will play with other output formats like GeoJSON, CSV, and SQL. `Categorical` columns are effectively a `dict` of values that apply to supersets, global or not. So we may already be using a suitable solution.
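A small illustration of that dict-like behavior of `Categorical` columns, in plain pandas: each row stores only a small integer code, while the unique values live in one shared lookup.

```python
import pandas as pd

# A category column: per-row codes plus a shared table of unique values,
# much like a dict applied across the frame.
configuration = pd.Series(
    ["short_range", "short_range", "analysis", "short_range"],
    dtype="category",
)
print(list(configuration.cat.categories))  # the shared lookup
print(list(configuration.cat.codes))       # compact per-row codes
```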
---
Based on the above discussion, I'll propose the following, subject to future expansion as needed:
"Canonical" labels are protected and part of a fixed lexicon. Canonical labels are shared among all hydrotools
subpackages. Subpackage methods should avoid changing or redefining these columns where they appear to encourage cross-compatibility.
HydroTools Canonical DataFrame Column Labels
- `value` [float32]: Indicates the real value of an individual measurement or simulated quantity.
- `value_time` [datetime64[ns]]: formerly `value_date`; indicates the valid time of `value`.
- `variable_name` [category]: string category that indicates the real-world type of `value` (e.g. streamflow, gage height, temperature).
- `measurement_unit` [category]: string category indicating the measurement unit (SI or standard) of `value`.
- `qualifiers` [category]: string category that indicates any special qualifying codes or messages that apply to `value`.
- `series` [integer32]: used to disambiguate multiple coincident time series returned by a data source.
- `configuration` [category]: string category used as a label for a particular time series, often used to distinguish types of model runs (e.g. short_range, medium_range, assimilation).
- `reference_time` [datetime64[ns]]: formerly `start_date`; some reference time for a particular model simulation. Could be considered an issue time, start time, end time, or other meaningful reference time. Interpretation is simulation- or forecast-specific.
- `longitude` [category]: float32 category, WGS84 decimal longitude.
- `latitude` [category]: float32 category, WGS84 decimal latitude.
- `crs` [category]: string category, coordinate reference system, typically "EPSG:4326".
- `geometry` [geometry]: `GeoPandas`-compatible `GeoSeries` used as the default "geometry" column.
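For illustration, a toy frame following these labels and dtypes might be built as below (the data values are invented, and the geometry column is omitted to keep the sketch dependency-free):

```python
import numpy as np
import pandas as pd

# Illustrative construction of a frame using the proposed canonical labels.
canonical = pd.DataFrame({
    "value": np.array([120.5, 118.0], dtype="float32"),
    "value_time": pd.to_datetime(["2022-01-01T00:00", "2022-01-01T01:00"]),
    "variable_name": pd.Categorical(["streamflow", "streamflow"]),
    "measurement_unit": pd.Categorical(["m3/s", "m3/s"]),
    "configuration": pd.Categorical(["short_range", "short_range"]),
    "reference_time": pd.to_datetime(["2022-01-01T00:00"] * 2),
    "crs": pd.Categorical(["EPSG:4326"] * 2),
})
print(canonical.dtypes)
```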
"Non-Canonical" labels are subpackage specific extensions to the canonical standard. Packages may share these non-canonical lables, but cross-compatibility is not guaranteed. Examples of non-canonical labels are given below.
Non-Canonical Column Labels
- `usgs_site_code` [category]: string category indicating the USGS site code/gage ID.
- `nwm_feature_id` [category]: string category indicating the NWM reach feature ID/ComID.
- `nws_lid` [category]: string category indicating the NWS location ID/gage ID.
- `usace_gage_id` [category]: string category indicating the USACE gage ID.
- `start` [datetime64[ns]]: datetime returned by `event_detection` that indicates the beginning of an event.
- `end` [datetime64[ns]]: datetime returned by `event_detection` that indicates the end of an event.