noaa-owp / hydrotools
Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data.
License: Other
Trying to limit the scope of the recent massive PR. I will address this later.
As shown, right now the version number requires manual updating. I need to explore how to improve this.
Tangentially, I think we could improve version bumping for this package and the sub-packages included. I know that bump2version is commonly used, but I think it makes sense to do some research to find the best option.
While developing a separate workflow I found that evaluation_tools.metrics.compute_contingency_table produced contingency tables where the false positive and false negative counts were switched. The unit test was built to pass this bad behavior.
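For reference, here is a minimal sketch of the convention at stake. It assumes paired boolean series, where a false positive is a simulated event with no observed event and a false negative is the reverse. The function name and dictionary keys are illustrative only and are not the evaluation_tools.metrics API.

```python
from collections import Counter

def contingency_counts(observed, simulated):
    """Count outcomes for paired boolean observations and simulations."""
    counts = Counter()
    for obs, sim in zip(observed, simulated):
        if obs and sim:
            counts["true_positive"] += 1
        elif sim and not obs:
            # Simulated an event that was not observed
            counts["false_positive"] += 1
        elif obs and not sim:
            # Missed an event that was observed
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    keys = ("true_positive", "false_positive", "false_negative", "true_negative")
    return {key: counts[key] for key in keys}

observed = [True, True, False, False, True]
simulated = [True, False, True, False, False]
table = contingency_counts(observed, simulated)
```

A unit test that uses asymmetric inputs like these (two misses, one false alarm) would fail if the two counts were swapped, which a symmetric test case cannot detect.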
For the benefit of users, it makes sense to demonstrate example usages of the evaluation_tools sub-packages. As of now, examples are provided in docstrings and READMEs. This form of documentation is suitable for quick reference; however, there is a need for more complete example-style documentation.
@jarq6c has written a great example detailing peak flow analysis for Little Hope Creek. This example is expected to be the first added to the repo and, in doing so, pave the way for future examples. Once added, the Little Hope Creek example should serve as a template for future example additions.
/examples/<name-of-example>
/examples/<name-of-example>/README.md
) that:
requirements.txt
file for installing reproducible dependencies.
When retrieving data using the nwis_client tool, something is happening when specifying the startDT and endDT options where the returned data is shifted forward in time by 1 hour. We may be able to clean up the date handling and hand off a lot to pandas.
The contents of CONTRIBUTING.md, specifically the guide to creating a new subpackage, are a little outdated. This is just a placeholder for when someone finds time to update the CONTRIBUTING.md document.
To support deployment to PyPI, let's add individual README.md documents to each subpackage.
obs = nexus._hydro_location.get_data("2015-12-01 00:00:00", "2015-12-30 23:00:00")
  File "/home/shengting.cui/ngen-cal-test/ngen-cal/venv/lib/python3.8/site-packages/hypy/hydrolocation/nwis_location.py", line 50, in get_data
    return self._nwis.get(self._station_id, startDT=start, endDT=end)
  File "/home/shengting.cui/ngen-cal-test/ngen-cal/venv/lib/python3.8/site-packages/hydrotools/nwis_client/iv.py", line 262, in get
    dfs.loc[:, "value"] = pd.to_numeric(dfs["value"], downcast="float")
  File "/home/shengting.cui/ngen-cal-test/ngen-cal/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/shengting.cui/ngen-cal-test/ngen-cal/venv/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: 'value'
Add a simple use case showcasing the gcp_client in the root README.md.
Need to update documentation workflow.
As I'm revisiting the nwis_client and gcp_client tools, I think this is a good time to be daring and document the fixed vocabulary that describes hydrotools canonical dataframes. I'm open to discussion on this topic, but so far I'm leaning toward the column definitions given below. These definitions are not exactly compatible with our internal services, but our internal services are not exactly compatible with each other. My motivation was to establish a vocabulary for hydrotools that was consistent and sufficiently descriptive. Not all of these columns will be present or relevant to all dataframes, but where present, definitions should be consistent.
There are two possible breaking changes: value_date is now valid_time and start_date is now reference_time.
These definitions also lean a lot on categorical types to avoid the use of multi-indexes. However, nothing prevents users from recasting these dataframes to use multi-indexes or from adding new columns to their data (like a custom_site_identifier column, for example). These column labels are just meant to cover and define what the various client tools might return.
value [float32]: Indicates the real value of an individual measurement or simulated quantity.
valid_time [datetime64[ns]]: formerly value_date, this indicates the valid time of value.
variable_name [category]: string category that indicates the real-world type of value (e.g. streamflow, gage height, temperature).
usgs_site_code [category]: string category indicating the USGS Site Code/gage ID
nwm_feature_id [category]: string category indicating the NWM reach feature ID/ComID
nws_lid [category]: string category indicating the NWS Location ID/gage ID
usace_gage_id [category]: string category indicating the USACE gage ID
measurement_unit [category]: string category indicating the measurement unit (SI or standard) of value
qualifiers [category]: string category that indicates any special qualifying codes or messages that apply to value
series [int32]: Use to disambiguate multiple coincident time series returned by a data source.
configuration [category]: string category used as a label for a particular model simulation configuration (e.g. short_range, medium_range)
reference_time [datetime64[ns]]: formerly start_date, some reference time for a particular model simulation. Could be considered an issue time, start time, end time, or other meaningful reference time. Interpretation is simulation or forecast specific.
longitude [category]: float32 category, WGS84 decimal longitude
latitude [category]: float32 category, WGS84 decimal latitude
geometry [geometry]: GeoPandas compatible GeoSeries
@aaraney @hellkite500
I think it's a good idea to explicitly document some peculiarities of dealing with pandas.Categorical values, which are quite common in evaluation_tools canonical pandas.DataFrame objects. Bare minimum, I'll add something like this to README.md. Thoughts?
pandas.Categorical data types
evaluation_tools uses pandas.DataFrame objects that contain pandas.Categorical values to increase memory efficiency. Depending upon your use-case, these values may require special consideration. To see if a DataFrame returned by evaluation_tools contains pandas.Categorical values, you can use pandas.DataFrame.info like so:
print(my_dataframe.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5706954 entries, 0 to 5706953
Data columns (total 7 columns):
# Column Dtype
--- ------ -----
0 value_date datetime64[ns]
1 variable_name category
2 usgs_site_code category
3 measurement_unit category
4 value float32
5 qualifiers category
6 series category
dtypes: category(5), datetime64[ns](1), float32(1)
memory usage: 141.5 MB
None
Columns with Dtype category are pandas.Categorical. It's important to note that these categories persist even if your DataFrame does not contain corresponding values. A possible consequence of this can be found in this Stack Overflow question.
Three possible solutions to this issue include:
Cast to string:
my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].astype(str)
Remove unused categories:
my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].cat.remove_unused_categories()
Use the observed option with groupby:
mean_flow = my_dataframe.groupby('usgs_site_code', observed=True)['value'].mean()
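To make the persistence of unused categories concrete, here is a small sketch using a made-up two-site frame. The column names follow the listing above; the site codes and values are placeholders, not real retrievals.

```python
import pandas as pd

# Toy frame with a categorical site column (values are illustrative only)
df = pd.DataFrame({
    "usgs_site_code": pd.Categorical(["01646500", "01646500", "02146600"]),
    "value": [10.0, 20.0, 5.0],
})

# Filtering rows does not drop the "02146600" category
subset = df[df["usgs_site_code"] == "01646500"]

# observed=False keeps every category, so the unused site appears with NaN
with_unused = subset.groupby("usgs_site_code", observed=False)["value"].mean()

# observed=True only reports categories that actually occur
observed_only = subset.groupby("usgs_site_code", observed=True)["value"].mean()

# Alternatively, trim the categories themselves
trimmed = subset["usgs_site_code"].cat.remove_unused_categories()
```

Passing observed explicitly is a good habit regardless, since the default has differed between pandas releases.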
Yeah good catch, I was not aware of that either. That sounds like something a pytest --run-slow should have caught. I need to create a new gh-action that runs our slow tests every Sunday if something new has been added since the last time the action ran.
Originally posted by @aaraney in #90 (comment)
Further elaborate on the GitFlow used in this repo. Clarify the use of the develop branch, forking, pull requests, and how to pull upstream changes into your fork to continue work.
In response to Andy's request to make individual tools more discoverable, please add more details to the repository's description.
For example,
"Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data."
Is there a better description?
It would be nice to have a dedicated documentation page via gh-pages. This issue is just stating the desire for this feature.
Due to the way requests-cache (used in _restclient) is implemented, if the cache has been "installed" and a downstream package uses requests, the downstream package's requests will be cached. This means that requests_cache implicitly changes the behavior of the requests package for all downstream callers.
Per the requests-cache documentation, they do provide a context manager to "disable" the cache. However, as @christophertubbs pointed out, their implementation is not thread safe:
@contextmanager
def cache_disabled(self):
    """
    Context manager for temporary disabling cache
    ::
        >>> s = CachedSession()
        >>> with s.cache_disabled():
        ...     s.get('http://httpbin.org/ip')
    """
    self._is_cache_disabled = True
    try:
        yield
    finally:
        self._is_cache_disabled = False
In discussing this issue, @hellkite500 mentioned the (actively developed) project CacheControl. Before actively developing a fix for this behavior, it's worth exploring a bit and determining what long-term solution to implement.
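One direction worth weighing, sketched here under assumptions: keep the "disabled" flag in thread-local storage so that disabling the cache in one thread does not leak into others. CacheSwitch and observe_other_thread are hypothetical names for illustration; this is not requests-cache or CacheControl code.

```python
import threading
from contextlib import contextmanager

class CacheSwitch:
    """Hypothetical per-thread cache toggle (not part of requests-cache)."""

    def __init__(self):
        self._local = threading.local()

    @property
    def disabled(self):
        # Threads that never touched the flag see the default (enabled)
        return getattr(self._local, "disabled", False)

    @contextmanager
    def cache_disabled(self):
        previous = self.disabled
        self._local.disabled = True
        try:
            yield
        finally:
            # Restore the prior state so nested use behaves correctly
            self._local.disabled = previous

def observe_other_thread(switch):
    """Disable the cache in this thread and report what a second thread sees."""
    seen = {}

    def worker():
        seen["worker_disabled"] = switch.disabled

    with switch.cache_disabled():
        thread = threading.Thread(target=worker)
        thread.start()
        thread.join()
        seen["caller_disabled"] = switch.disabled
    return seen
```

With a shared instance attribute, as in the quoted implementation, the worker thread would also see the cache as disabled, and whichever thread exits its context first would re-enable it for everyone.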
Good morning.
I am trying to create a list of events for the past 10 days at station FLRV2
https://water.weather.gov/ahps2/hydrograph.php?wfo=rnk&gage=flrv2
There has been a gage relocation due to a bridge construction. The function rolling_minimum is failing.
Thanks,
Alex
I noticed while looking around the package today that the event_detection module is nested one level deeper than typical and does not have a parent subsubpackage (a directory above it with an __init__.py).
More concretely, on the path hydrotools/python/events/src/hydrotools/events/event_detection/, there is an __init__.py in the event_detection directory but not in the events directory. @jarq6c can you clarify why this is the case? I assume there should be an __init__.py in the events directory as well.
Thanks!
Looks like we'll need to rename the package... again.
Importing the latest nwis-client using python 3.6
from hydrotools.nwis_client.iv import IVDataService
fails with
TypeError: 'type' object is not subscriptable
The error may be more ubiquitous than just the IVDataService import, but that is where we have had trouble with it.
A brief conversation with the developers suggested that a downstream dependency forces a bump to python 3.7 as the minimum requirement and they are considering workarounds to allow backward compatibility. The newer version is worth the effort, with order-of-magnitude faster retrieval speeds from the NWIS service. We do not have any real reason to continue using 3.6, so we may look at an upgrade path.
For now, uninstalling the nwis-client, -restclient, and events modules
pip uninstall hydrotools.nwis_client
pip uninstall hydrotools.-restclient
pip uninstall hydrotools.events
then reinstalling the following versions allowed us to continue using the service in the meantime on python 3.6.8
pip install hydrotools.-restclient==2.0.0a0
pip install hydrotools.nwis-client==2.0.0a0
So annoying to remember to transform units from gcp_client to compatible units with nwis_client.
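A small helper along these lines could take the guesswork out of it. This sketch assumes the mismatch in question is cubic meters per second versus cubic feet per second, which the issue does not spell out; the function names are illustrative and not part of either client.

```python
# The international foot is defined as exactly 0.3048 m
FEET_PER_METER = 1 / 0.3048
CFS_PER_CMS = FEET_PER_METER ** 3

def cms_to_cfs(value):
    """Convert a flow from cubic meters per second to cubic feet per second."""
    return value * CFS_PER_CMS

def cfs_to_cms(value):
    """Convert a flow from cubic feet per second to cubic meters per second."""
    return value / CFS_PER_CMS
```

Because both functions are plain arithmetic, they should also work element-wise on a pandas Series or NumPy array of values.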
This is a low priority, but CLIs for different evaluation tools may be useful to non-Python users and various old wizards that like doing everything through bash and system calls. The workflows of these users might benefit from evaluation_tools features like standardized data formats, efficient data retrieval, auto-parsing, caching, canned evaluations, etc. A CLI is a way to meet them halfway.
I can imagine something like:
$ evaluation_tools nwis_client --sites 02146600 --output USGS_site_data_02146600.csv
I had a good experience using Click to implement simple CLIs for my personal workflows. I'd like to explore this more.
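As a starting point for discussion, here is a sketch of the argument layout implied by the command above. It uses the standard library's argparse rather than Click to stay dependency-free, and only parses arguments; wiring the parsed values to nwis_client is left out. The option names mirror the example command and everything else is an assumption.

```python
import argparse

def build_parser():
    """Build a top-level parser with one subcommand per tool."""
    parser = argparse.ArgumentParser(prog="evaluation_tools")
    subparsers = parser.add_subparsers(dest="tool", required=True)

    nwis = subparsers.add_parser(
        "nwis_client", help="Retrieve USGS NWIS observations"
    )
    # One or more USGS site codes, kept as strings to preserve leading zeros
    nwis.add_argument("--sites", nargs="+", required=True)
    # Destination CSV file; "-" is used here to mean standard output
    nwis.add_argument("--output", default="-")
    return parser

args = build_parser().parse_args(
    ["nwis_client", "--sites", "02146600", "--output", "USGS_site_data_02146600.csv"]
)
```

The same subcommand structure maps directly onto Click groups and commands if Click ends up being the preferred option.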
The top level namespace package evaluation_tools does not have a setup.py script. This means there is no way to grab the whole toolbox. Some users may prefer to install all tools at once.
Possible bug. The event detection methods validate the index, but not the values. The presence of NaN in the value series may produce undefined behavior given the recursive nature of the filters. I want to see if this case needs handling.
The top level metapackage should either pull in gcp dependencies by default or include a target like hydrotools[gcp] that does.
In Google Colab, instantiating a RestClient (this is implicitly done by nwis_client.IVDataService) will cause a RuntimeError: This event loop is already running. This issue is well documented in the jupyter notebook repo. In that thread, a workaround using nest_asyncio was mentioned. The problem and a solution are shown below.
!pip install hydrotools.nwis_client
from hydrotools import nwis_client
client = nwis_client.IVDataService()
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-10-d37c2bf4ee70> in <module>()
----> 1 service = nwis_client.IVDataService()
4 frames
/usr/lib/python3.7/asyncio/base_events.py in _check_runnung(self)
521 def _check_runnung(self):
522 if self.is_running():
--> 523 raise RuntimeError('This event loop is already running')
524 if events._get_running_loop() is not None:
525 raise RuntimeError(
RuntimeError: This event loop is already running
!pip install hydrotools.nwis_client
import nest_asyncio
nest_asyncio.apply()
from hydrotools import nwis_client
client = nwis_client.IVDataService()
IMO the best way to get around this is to try/except where the error propagates from, then try to import nest_asyncio and call nest_asyncio.apply(). If nest_asyncio is not installed, throw a ModuleNotFoundError referencing this issue and noting to install nest_asyncio. Given that this is such an edge case and nest_asyncio is required by nbclient, which is required by nbconvert, which is required by jupyter notebook, it is unlikely that a user will ever not have nest_asyncio installed and run into this issue. Before I open a PR to resolve this, I'd like to hear your thoughts @jarq6c.
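Roughly, the proposal could look like the sketch below. The helper names are hypothetical, and where this would hook into RestClient is not shown. The module name is a parameter only so the failure path can be exercised without uninstalling anything.

```python
import asyncio
import importlib

def loop_is_running():
    """Return True when called from inside a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True

def apply_nest_asyncio(module_name="nest_asyncio"):
    """Import nest_asyncio and patch asyncio to allow nested event loops."""
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError as error:
        raise ModuleNotFoundError(
            "An event loop is already running (common in notebooks). "
            "Install nest_asyncio (pip install nest_asyncio) to work around this."
        ) from error
    module.apply()

def ensure_loop_can_nest():
    """Only patch asyncio when a loop is already running."""
    if loop_is_running():
        apply_nest_asyncio()
```

Checking for a running loop up front avoids patching asyncio in ordinary scripts, where the original error never occurs.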
To more closely mimic the default behavior of NWIS itself, the active status should default to ALL and users should configure their client for the more specific cases of "inactive" and "active".
Given the usage of numpy.typing in hydrotools.metrics, if a numpy version >= 1.20 is not required the user will get an error. See the release note for the feature addition in the numpy project.
Switch from calling multiprocessing directly to using concurrent.futures.ProcessPoolExecutor. This will allow process chunking, which may yield a slight performance increase.
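For context, chunking with concurrent.futures is exposed through the chunksize argument of Executor.map. The sketch below is a generic illustration, not the existing hydrotools code; parallel_map and square are placeholder names, and the executor class is a parameter so the helper can be tried with threads as well.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def parallel_map(func, items, chunksize=16, max_workers=None,
                 executor_cls=ProcessPoolExecutor):
    """Apply func to items in parallel, preserving input order.

    With ProcessPoolExecutor, items are sent to workers in batches of
    `chunksize`, which reduces inter-process overhead for many small tasks.
    ThreadPoolExecutor accepts the argument but ignores it.
    """
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(func, items, chunksize=chunksize))

def square(x):
    """Placeholder workload; real work would be parsing or retrieval."""
    return x * x
```

When using processes, func must be picklable (a module-level function), and on platforms that spawn workers the call should sit under an `if __name__ == "__main__":` guard.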
At the moment, all development deps, namely pytest, are installed at the namespace package level. I would like to come to an agreed-upon standard for specifying development deps to setuptools. Below are two example setup.py file implementations that solve this problem:
# file: setup.py
import subprocess
import sys

from setuptools import setup
from setuptools.command.develop import develop

DEVELOPMENT_REQUIREMENTS = ["pytest"]


# Development installation
class Develop(develop):
    def run(self):
        # Install development requirements with the interpreter running setup.py
        for dev_requirement in DEVELOPMENT_REQUIREMENTS:
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", dev_requirement]
            )
        develop.run(self)


setup(
    name="mypackage",
    description="some package that does something",
    install_requires=["pandas"],
    cmdclass={"develop": Develop},
)
The above is used as such: python setup.py develop. This will install all the deps, development deps, and the package in an "editable" form (i.e. like pip install -e .).
# file: setup.py
from setuptools import setup

DEVELOPMENT_REQUIREMENTS = ["pytest"]

setup(
    name="mypackage",
    description="some package that does something",
    install_requires=["pandas"],
    extras_require={"develop": DEVELOPMENT_REQUIREMENTS},
)
The second is used as follows: pip install -e ".[develop]" (the quotes are not required in bash, but are required in zsh and likely other shells).
Personally, I prefer the second option over the first. The second option uses pip, and thus files like pyproject.toml are regarded, whereas in the first they are not. Likewise, this functionality should work on PyPI. I am unsure at this moment how you would specify to include tests in the package though, so that may be a moot point. The second solution is also far less code and more maintainable across subpackages IMO.
Currently, nwis_client.iv.IVDataService sets up caching on initialization. As near as I can tell, _restclient.RestClient also decides whether to cache at initialization. The result is that there is no native way to disable caching prior to or when requesting data.
Previously, I've successfully used the requests_cache context manager to temporarily disable caching, but we might want to offer an interface through the tools themselves.
Add an evaluation_tools.metrics subpackage with standard methods used to compute evaluation statistics.
Add a parameter that will shift start times produced by evaluation_tools.events.event_detection.decomposition to a local minimum value.
nwis_client returns a value_date column. We will add a value_time_label="value_date" option to continue this behavior.
Update 1: Add value_time_label="value_date" to __init__ that will raise a deprecation warning. Default behavior will return value_date. value_time_label="value_time" will return value_time.
Update 2: Default behavior is value_time_label="value_time". Warn if this option is not explicitly set by user.
Update 3: Default behavior is value_time_label="value_time" with no warning.
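For illustration, the first stage of that plan might look something like the following. ExampleService is a hypothetical stand-in, not IVDataService, and this reads "Update 1" as warning only when the caller leaves the option unset; the actual rollout may differ.

```python
import warnings

class ExampleService:
    """Hypothetical stand-in for a client class; only label handling is shown."""

    def __init__(self, value_time_label=None):
        if value_time_label is None:
            # Caller relied on the default: keep old behavior, but warn
            warnings.warn(
                "value_time_label was not set; defaulting to 'value_date'. "
                "The default will change to 'value_time' in a future release.",
                DeprecationWarning,
                stacklevel=2,
            )
            value_time_label = "value_date"
        self.value_time_label = value_time_label
```

Moving to the later stages would then mean changing the fallback value to "value_time" and, eventually, removing the warning.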
These subpackages need to be added to the sphinx docs so they are deployed on gh-pages.
As is the examples fail with:
TypeError: get() missing 1 required positional argument: 'self'
Case study: A user needs to conduct a long evaluation and wants 25 years of streamflow data from 2000+ USGS gage locations.
Obstacles:
requests_cache?
DataFrame cache help us here?
Based on a use-case from @hellkite500
@aaraney not necessarily looking for solutions here yet. Just wanted to start the discussion. The nwis_client tool was really the first tool we fleshed out. So, it seems fitting to start discussions about scaling evaluation_tools here.
Pytest output:
============================= test session starts ==============================
platform linux -- Python 3.7.9, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/runner/work/evaluation_tools/evaluation_tools, configfile: pytest.ini
collected 96 items / 11 deselected / 85 selected
python/_restclient/tests/test_restclient.py .................
python/events/tests/test_decomposition.py .
python/gcp_client/tests/test_gcp.py ..
python/gcp_client/tests/test_utils.py ......
python/nwis_client/tests/test_nwis.py ......F....................................................
=================================== FAILURES ===================================
____________________________ test_get_throw_warning ____________________________
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7fb338867a10>
def test_get_throw_warning(monkeypatch):
def wrapper(*args, **kwargs):
return []
# Monkey patch get_raw method to return []
> monkeypatch.setattr(IVDataService, "get_raw", wrapper)
E NameError: name 'IVDataService' is not defined
python/nwis_client/tests/test_nwis.py:87: NameError
=============================== warnings summary ===============================
python/nwis_client/evaluation_tools/nwis_client/iv.py:910
/home/runner/work/evaluation_tools/evaluation_tools/python/nwis_client/evaluation_tools/nwis_client/iv.py:910: DeprecationWarning: invalid escape sequence \d
pattern = "^(-?)P(?=\d|T\d)(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)([DW]))?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?)?$"
python/nwis_client/tests/test_nwis.py::test_handle_dates[2020-08-10T04:15-05:00-2020-08-10T09:15+0000]
python/nwis_client/tests/test_nwis.py::test_handle_dates[test5-2020-08-10T09:15+0000]
/home/runner/work/evaluation_tools/evaluation_tools/python/nwis_client/evaluation_tools/nwis_client/iv.py:883: DeprecationWarning: parsing timezone aware datetimes is deprecated;
the date has been converted to UTC and the tz information has been dropped, ergo the date is now considered `naive` UTC.
See https://github.com/NOAA-OWP/evaluation_tools/issues/46
warnings.warn(warning_message, DeprecationWarning)
-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
FAILED python/nwis_client/tests/test_nwis.py::test_get_throw_warning - NameEr...
=========== 1 failed, 84 passed, 11 deselected, 3 warnings in 1.68s ============
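The "invalid escape sequence \d" warning in the log above comes from writing the regular expression as an ordinary string literal. A sketch of the usual fix is to use a raw string, shown here with the same pattern; parse_duration is a hypothetical wrapper and not a function in iv.py.

```python
import re

# Raw string literals keep backslashes intact, so \d reaches the regex engine
DURATION_PATTERN = re.compile(
    r"^(-?)P(?=\d|T\d)(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)([DW]))?"
    r"(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?)?$"
)

def parse_duration(text):
    """Return the captured groups of an ISO 8601 duration string.

    Groups are: sign, years, months, day/week count, 'D' or 'W',
    hours, minutes, seconds. Unused components are None.
    """
    match = DURATION_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an ISO 8601 duration: {text!r}")
    return match.groups()
```

The separate NameError in the failing test looks like a missing import of IVDataService in test_nwis.py, judging only from the traceback.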
HT currently has no formal unit of measurement handling. We may be able to incorporate a library like Pint to deal with this.
Need to update the Wiki page to point to a real documentation page.
https://github.com/NOAA-OWP/hydrotools/wiki
This link currently points to a non-existent repository.
@hellkite500 found an edge case where an nwis_client get request can throw a ValueError when pd.concat has nothing to concatenate. This often occurs when a user asks for data from a site that is 'active' even though the gage is not active. This should just return an empty df with the typical column headers present.
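A guard along these lines would give the behavior described. The column list is an assumption based on the columns discussed elsewhere in these issues, and concat_or_empty is a hypothetical helper, not the current nwis_client implementation.

```python
import pandas as pd

# Assumed canonical columns; adjust to whatever the client actually returns
EXPECTED_COLUMNS = [
    "value_time",
    "variable_name",
    "usgs_site_code",
    "measurement_unit",
    "value",
    "qualifiers",
    "series",
]

def concat_or_empty(frames, columns=EXPECTED_COLUMNS):
    """Concatenate frames, or return an empty frame with the usual headers."""
    # Drop missing or empty pieces before concatenating
    frames = [frame for frame in frames if frame is not None and not frame.empty]
    if not frames:
        # pd.concat([]) raises ValueError, so short-circuit here
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)
```

Callers can then test `df.empty` instead of wrapping every request in a try/except.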
from hydrotools.nwis_client.iv import IVDataService
data_service = IVDataService(value_time_label="value_time")
stage_df = data_service.get(
    sites=["01646500", "01013500"],
    startDT="2019-08-01",
    endDT="2019-09-01",
    parameterCd="00065"
)
Tagging @aaraney for advice. It seems #100 fixed the issue in Google Colab, but it still persists in Jupyter Notebook and Spyder (which may have further issues with asyncio).
Running the code above inside a Jupyter Notebook or from Spyder resulted in RuntimeError: This event loop is already running.
You can resolve the error by adding
import nest_asyncio
nest_asyncio.apply()
I didn't realize that we did not have unit tests via GHActions here. This is mainly just a note here for me to remember to add that.
The default settings limit retrieval to channel reaches that coincide with USGS gage sites. However, things quickly spiral out of control when attempting to conduct large scale analyses that require all 2.7+ million NWM reaches. This submodule needs a way to limit the maximum number of values held in memory. This likely means a bit of a redesign and a change to the default cache.
Each time the gcp client is used, it hits gcp to get the requested data. Given the size of the data and that repeated process, it only makes sense to implement some kind of cache. I propose that we use a file db (i.e. sqlite) to accomplish this for simplicity and broad support in python.
sqlite db using the URL path as the key
sqlite lib must support multiprocessing and batch commits
The sqlitedict library is mature, maintained, and seems to fit the bill for this feature. The lib lets you create/connect with a db and use it like you would a python dictionary. Most importantly, it supports multiprocessing.
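To make the idea concrete, here is a bare-bones sketch of a URL-keyed cache. It uses the standard library's sqlite3 module as a stand-in rather than sqlitedict, and it says nothing about multiprocessing safety. UrlCache and its methods are hypothetical names.

```python
import sqlite3

class UrlCache:
    """Minimal key-value cache keyed by URL path (illustrative only)."""

    def __init__(self, path=":memory:"):
        self._connection = sqlite3.connect(path)
        self._connection.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(url_path TEXT PRIMARY KEY, payload BLOB)"
        )

    def get(self, url_path):
        """Return cached bytes for url_path, or None on a cache miss."""
        row = self._connection.execute(
            "SELECT payload FROM cache WHERE url_path = ?", (url_path,)
        ).fetchone()
        return None if row is None else row[0]

    def set_many(self, items):
        """Store (url_path, payload) pairs in a single transaction."""
        # Using the connection as a context manager commits once at the end
        with self._connection:
            self._connection.executemany(
                "INSERT OR REPLACE INTO cache (url_path, payload) VALUES (?, ?)",
                items,
            )
```

Passing a file path instead of ":memory:" would persist the cache between runs, which is the point of the proposal.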
I had not realized certain options were non-functional with the current version of nwis_client. These options appear to work with the version on the restclient_transition branch. So, I would say that makes getting this branch merged into main a high priority.
Update event detection docstring with some suggested parameters, something like:
The evaluation_tools.events.event_detection.decomposition method has two main parameters: halflife and window. These parameters are passed directly to underlying filters used to remove noise and model the underlying trend (AKA baseflow) in a streamflow time series. Significant contiguous deviations from this trend are flagged as "events". This method was originally conceived to detect rainfall-driven runoff events in small watersheds from records of volumetric discharge or total runoff. Before using decomposition, you will want to have some idea of the event timescales you hope to detect in your original time series.
pandas.Timedelta compatible str to specify halflife and window
halflife larger than the expected frequency of noise, but smaller than the event frequency/timescale
window larger than the event frequency/timescale, but at least 4 to 8 times smaller than the entire length of the time series.
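The rules of thumb above can be expressed as a quick sanity check. This is only a sketch of the guidance, not part of the decomposition method: the function name and the noise/event timescale arguments are assumptions, and inputs are expected to be datetime.timedelta values (pandas.Timedelta is a subclass, so it should work too).

```python
from datetime import timedelta

def check_decomposition_parameters(halflife, window, noise_timescale,
                                   event_timescale, series_length):
    """Return a list of guideline violations (empty if none are found)."""
    problems = []
    # halflife should sit between the noise and event timescales
    if not noise_timescale < halflife < event_timescale:
        problems.append("halflife should fall between the noise and event timescales")
    # window should exceed the event timescale
    if window <= event_timescale:
        problems.append("window should be larger than the event timescale")
    # window should be at least 4 times smaller than the whole series
    if window * 4 > series_length:
        problems.append("window should be at least 4 times smaller than the series length")
    return problems

problems = check_decomposition_parameters(
    halflife=timedelta(hours=6),
    window=timedelta(days=7),
    noise_timescale=timedelta(hours=1),
    event_timescale=timedelta(days=2),
    series_length=timedelta(days=365),
)
```

The timescales in the usage example are arbitrary placeholders chosen to satisfy the checks, not recommended values.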