noaa-owp / dmod
Distributed Model on Demand infrastructure for OWP's Model as a Service
License: Other
There are several spots in our async service logic where we call sleep(), typically at the end of a looping periodic task to wait before running it again. These usages should be examined and, where appropriate, support for controlling the specific amount of time should be added via either externalized configuration or parameterization.
A particular example is in python/services/monitorservice/dmod/monitorservice/service.py within the exec_monitoring() function.
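As a rough illustration of the parameterization idea, the sketch below resolves the wait time from an explicit argument, then from an environment variable, then from a default. The variable name MONITOR_SLEEP_SECONDS, the default value, and both function names are assumptions made for this example; none of them come from the DMOD code.

```python
import asyncio
import os

# Hypothetical variable name and default; neither comes from the DMOD code base.
INTERVAL_ENV_VAR = "MONITOR_SLEEP_SECONDS"
DEFAULT_INTERVAL = 5.0


def resolve_interval(explicit=None, env=None):
    """Pick the interval: explicit argument first, then environment, then default."""
    if explicit is not None:
        return float(explicit)
    env = os.environ if env is None else env
    return float(env.get(INTERVAL_ENV_VAR, DEFAULT_INTERVAL))


async def run_periodically(task, interval, iterations=None):
    """Await `task()` repeatedly, sleeping `interval` seconds after each run.

    `iterations` bounds the loop for testing; None means run until cancelled.
    """
    count = 0
    while iterations is None or count < iterations:
        await task()
        count += 1
        await asyncio.sleep(interval)
    return count
```

A service could call resolve_interval() once at startup and pass the result to its periodic loop, which keeps the sleep duration out of the loop body itself.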
The env variable established in commit 30c48a3 contains the path to the secret file, not the secret itself, but several places in the service code assume that DOCKER_SECRET_REDIS_ is a valid env prefix for the password itself.
Services look for the ENV variable DOCKER_SECRET_REDIS_PASS, assuming it is the password.
Either DOCKER_SECRET_REDIS_PASS needs to be re-written, or a different ENV variable needs to be used which is set from the value in the secret file.
The redis service does this appropriately in the entrypoint.sh:
SECRET_FILE="/run/secrets/${DOCKER_SECRET_REDIS_PASS:?}"
REDIS_PASS="$(cat ${SECRET_FILE})"
But the other Python-based services do not read this secret appropriately. The function used to parse the env, _get_parsed_or_env_val in the __main__.py, checks that DOCKER_SECRET_REDIS_PASS exists, which it does, and uses its value (the path to the secret file in the container file system) as the password for its connection attempts, which fail.
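A Python counterpart to the entrypoint.sh logic might look like the sketch below. The function name read_docker_secret and its parameters are assumptions for illustration; this is not the existing _get_parsed_or_env_val implementation.

```python
import os
from pathlib import Path


def read_docker_secret(env_var="DOCKER_SECRET_REDIS_PASS", secrets_dir="/run/secrets", env=None):
    """Return the secret stored in the file that `env_var` names.

    Mirrors the entrypoint.sh logic: the variable holds the secret file's
    name, which is resolved relative to `secrets_dir`. An absolute path in
    the variable is used as-is, because joining onto an absolute path
    discards the left-hand side.
    """
    env = os.environ if env is None else env
    secret_name = env.get(env_var)
    if not secret_name:
        raise KeyError(f"{env_var} is not set")
    # Shell command substitution drops trailing newlines; do the same here.
    return (Path(secrets_dir) / secret_name).read_text().rstrip("\r\n")
```

The returned string, rather than the raw environment value, would then be what is handed to the Redis client as the password.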
Issue to track task of creating the initial set of components for handling observation data as needed to support evaluation and calibration routines.
This description should continue to be updated as the specifics of the initial working version of these continue to come into focus.
The main Docker image for running Nextgen has become out of date. It needs to be reviewed and made current. It may be worth considering whether it should be expanded in some ways also (e.g., building separate images for parallel versus serial framework executables).
One specific observation has been that an error occurs during image builds at the step when the framework is compiled. The Docker image config is not installing a supported Python version, but the lack of Python is not accounted for properly when the image build compiles the framework.
GUI fails to find/create asyncio event loop to establish session.
Start the gui app server locally:
MAAS_ENDPOINT_HOST=hostname MAAS_ENDPOINT_PORT=port ./manage.py runserver
From http://127.0.0.1:8000/ in the browser, configure a request.
The following error occurs in the server log/output:
EditView.post: making job request
client Making Job Request
PostFormJobRequestClient._acquire_session_info: getting session info
Session from ModelRequestClient: force_new=False
Connection to request handler web socket
Expecting exception to follow
Failed _acquire_session_info
Traceback (most recent call last):
  File "/Users/nels.frazier/workspace/DMOD/venv/lib/python3.8/site-packages/dmod/communication/client.py", line 349, in _acquire_new_session
    auth_details = asyncio.get_event_loop().run_until_complete(self.authenticate_over_websocket())
  File "/Users/nels.frazier/homebrew/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/asyncio/events.py", line 639, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'Thread-1'.
Session Info Return: False
client Unable to aquire session details
The sessions should establish and push the request to the MAAS_ENDPOINT
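The traceback shows asyncio.get_event_loop() being called from a thread other than the main one, where no event loop is created implicitly. One possible workaround, sketched here under the assumption that the coroutine can run on a short-lived loop, is to create a loop explicitly for the calling thread. The authenticate() coroutine below is a stand-in, not the real authenticate_over_websocket().

```python
import asyncio
import threading


async def authenticate():
    """Stand-in for the real coroutine (hypothetical; returns a dummy token)."""
    await asyncio.sleep(0)
    return "session-token"


def run_in_current_thread(coro):
    """Run `coro` to completion on a loop created for the calling thread.

    asyncio only creates a loop implicitly in the main thread, so worker
    threads (such as a web server's request threads) need to make their own.
    """
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()


def call_from_worker_thread():
    """Call the helper from a non-main thread and return its result."""
    results = []
    worker = threading.Thread(
        target=lambda: results.append(run_in_current_thread(authenticate()))
    )
    worker.start()
    worker.join()
    return results[0]
```

asyncio.run(coro) is an alternative that also creates and closes a fresh loop per call; which of the two fits better depends on whether the client needs to reuse a loop across requests.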
As new models/workflows are envisioned for use within DMOD, a clear separation of the DMOD services, such as request-service, and the model images, such as nwm, should be established.
The main stack is built and pushed including the model images from the same configuration files (docker/main/docker-build.yml and docker/main/docker-deploy.yml).
The main stack should build and deploy only required DMOD services, regardless of the available model images. A separate directory, like docker/main/models or maybe even docker/models, would hold the main build definitions (since we aren't deploying these as services, no deploy file is needed), and the upstack/control_stack utilities can be refactored to build and push model images (or verify they are available).
The only place this becomes potentially problematic is connecting the image_and_domain.yml that is used by the scheduler/launcher.
Compiling hdf5 1.10.4 on alpine 3.4 fails when it cannot find libexecinfo. While it is possible to get libexecinfo on 3.4, it is somewhat tricky.
Fails to build deps image.
Build dep image.
Track the discussion and decision making process on
a.) whether this feature is wanted/needed, and
b.) whether the feature should be part of the FIHM, September, or other deliverable
Support could be added for user-uploaded hydrofabrics, in addition to those statically supplied at MaaS environment start. It needs to be established whether this is a desired feature and, if so, on what time table it should be included (in particular, whether it belongs in this FIHM milestone).
Tracks tasks related to making sure worker-type Docker images (e.g., the ngen image for model execution) are available as needed, along with any dependencies needed for their operation or availability.
ngen image supports all BMI integrations
ngen image bundles already-linked submodules
Currently (perhaps after manually changing the urllib3 dependency; see #105), the following error is seen when loading the GUI:
type object 'MaaSRequest' has no attribute 'get_distribution_types'
Request Method: GET
Request URL: http://<server>/
Django Version: 2.2.24
Exception Type: AttributeError
Exception Value: type object 'MaaSRequest' has no attribute 'get_distribution_types'
Exception Location: /usr/maas_portal/MaaS/cbv/EditView.py in get, line 55
Python Executable: /usr/local/bin/python
Python Version: 3.8.12
Python Path: ['/usr/maas_portal',
'/usr/local/bin',
'/usr/local/lib/python38.zip',
'/usr/local/lib/python3.8',
'/usr/local/lib/python3.8/lib-dynload',
'/usr/local/lib/python3.8/site-packages']
This appears to be due to changes made in the communication library (and perhaps other places) that have yet to be properly reflected in the gui package.
Tracking of any remaining design, implementation, or bug fix tasks for initial version of a full, basic model execution workflow.
Task definition is currently still in progress and will require additional expansion.
At first glance, the class appears to be replicating some basic functionality of a Pandas DataFrame, i.e., a labeled 2D array. Unless there is some serious flaw in doing so, we can replace these inputs with a DataFrame.
It would be nice to have a github.io page for this project. We could use sphinx and autodocs to automatically build the API documentation and mirror the readme on the front page for starters. Is this a feature that is desired? If so, have guidelines been set for gh-pages? More or less just starting a conversation here.
Personally, using sphinx with the readthedocs theme on a gh-pages hosted site is what I would propose.
Using read() to read the secret file contents doesn't remove EOL/EOF markers from the string, which are then passed on to the redis DB, causing an auth failure.
The dmod.communication.MaaSRequest class currently treats the version attribute as a float; i.e., it is type-hinted, documented, and otherwise treated as such. However, it probably makes more sense to interpret version as a string.
This will need to be reflected in the class, the associated JSON schema, and anywhere else it is directly or indirectly used.
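A short sketch of why a float is a lossy representation for a version, and of how a string version can still be compared. The helper names and the one-field request body are hypothetical and are not part of MaaSRequest or its JSON schema.

```python
import json

# A float cannot tell "1.10" from "1.1", and has no form for "1.2.3" at all.
assert float("1.10") == float("1.1")


def parse_version(version):
    """Split a dotted version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


def serialize_request(version):
    """Serialize a minimal, hypothetical request body with a string version."""
    return json.dumps({"version": version})
```

Comparing parse_version("1.10") with parse_version("1.9") orders the releases as intended, whereas the float forms would place 1.10 before 1.9.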
Need some unit/integration tests for the GUI module
All gui testing is manual
Should have some automated tests for the GUI
Django has built-in testing support using Python unittest, and the tests can be executed using ./manage.py test.
There is also a solid pytest-django plugin for pytest, and a good writeup on setting up and using it here.
Pytest may also be worth looking at for all DMOD testing going forward (it works fine with traditional unittest tests out of the box).
Add an interface (likely via the GUI) to be able to generate partition configurations in an independent workflow; i.e., not part of a large model execution workflow.
pip has an editable option (pip install -e <package>?) for libraries so that changes can be made and used directly instead of treating libraries as static entities. I'd like to be able to pass a flag to update_packages.sh so that I can install the dmod libraries as editable, but not the third-party dependencies.
The change in resource management creates an explicit representation for Resource, requiring the raw metadata to be deserialized before being passed to the ResourceManager. Previously, these resources were added from raw dictionary definitions.
scheduler-service reads a list of raw dictionaries defining the managed resources and passes this list to the ResourceManager. This is currently broken.
scheduler-service deserializes the raw dictionaries into Resource objects appropriate for the ResourceManager to operate on.
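The expected behavior could be sketched roughly as follows. The Resource class here is a hypothetical stand-in with assumed fields (node_id, cpu_count, memory); the real class and its deserialization entry point may differ.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical stand-in for the real Resource class; fields are assumed."""
    node_id: str
    cpu_count: int
    memory: int

    @classmethod
    def from_dict(cls, raw):
        """Build a Resource from a raw dictionary, coercing numeric fields."""
        return cls(
            node_id=str(raw["node_id"]),
            cpu_count=int(raw["cpu_count"]),
            memory=int(raw["memory"]),
        )


def deserialize_resources(raw_resources):
    """Convert raw dictionaries into Resource objects, failing on bad entries."""
    resources = []
    for index, raw in enumerate(raw_resources):
        try:
            resources.append(Resource.from_dict(raw))
        except (KeyError, TypeError, ValueError) as error:
            raise ValueError(
                f"invalid resource definition at index {index}: {error!r}"
            ) from error
    return resources
```

Failing on the first malformed entry, with its index in the message, is one design choice; skipping bad entries and logging them would be another.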
Line 38 in ae5cae1
Currently missing in env but used in main/nwm docker builds
PYTHON_PACKAGE_DIST_NAME_ACCESS
PYTHON_PACKAGE_DIST_NAME_EXTERNAL_REQUESTS
The cache key is static, so as long as the cache has been built, the same one will be used each time. But since the DMOD packages are installed into this venv, if multiple changes to various package components are made in subsequent PRs, the testing is likely to fail.
Need to invalidate this venv cache when DMOD packages are changed, or not cache DMOD installs and only cache external dependencies.
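One way to make the key change whenever package sources change is to derive it from a content hash. The sketch below is a generic illustration in Python and makes no assumption about which CI system holds the cache; the function name and the key prefix are made up for the example.

```python
import hashlib
from pathlib import Path


def venv_cache_key(package_root, prefix="venv"):
    """Derive a cache key from every Python file under `package_root`."""
    digest = hashlib.sha256()
    root = Path(package_root)
    # Sort so the key does not depend on file system iteration order.
    for path in sorted(root.rglob("*.py")):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(path.read_bytes())
    return f"{prefix}-{digest.hexdigest()[:16]}"
```

Hashing only the external requirements files instead would correspond to the second option, where DMOD installs are never cached.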
DMOD/docker/main/base/Dockerfile
Line 41 in ae5cae1
The base image still requires an ONBUILD ssh directory copy. If we aren't using static keys copied into this base image, we need to remove this section from the build.
When building the py_sources stack (./scripts/control_stack.sh py_sources build), the build fails to build the required dependency wheels:
This is the failing dockerfile line:
Building wheels for collected packages: cffi
Building wheel for cffi (setup.py): started
Building wheel for cffi (setup.py): finished with status 'error'
ERROR: Command errored out with exit status 1:
command: /usr/local/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-cqqauhp5/cffi/setup.py'"'"'; __file__='"'"'/tmp/pip-install-cqqauhp5/cffi/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-p4kxp7xk --python-tag cp37
cwd: /tmp/pip-install-cqqauhp5/cffi/
Complete output (48 lines):
unable to execute 'gcc': No such file or directory
The referenced line has pip download the required dependencies, but if these downloads come in sdist form, they are not built into wheels. When a downstream service requires one of the source dependencies, it attempts to build it, but doesn't have the correct tool chain in the image file system to do so (e.g., scheduler-service requires cryptography, which requires cffi, which is only a source distribution, so the scheduler-service attempts to build the source and fails).
To build a NWM v2.1 docker image, the previous scripts can be used for the building process with the use of NWM v2.1 commit: NWM_COMMIT=4d0c8ad
For the testing of the image, changes to the nwm/Dockerfile and a new entry.sh are needed.
Run to finish successfully
Run to finish successfully
NWM_COMMIT=4d0c8ad docker-compose -f docker-build-custom.yml build nwm
docker run -d --cpus=2 127.0.0.1:5000/nwm-2.1
$ docker logs 3828b75fb4dbe9baa09f99f10dcd0a45480fa6edc55817e55f608abf8872b67e
Calling config noahlsm_offline
Calling config noahlsm_offline
WARNING: KDAY is deprecated and may be removed in a future version, please use KHOUR.
WARNING: KDAY is deprecated and may be removed in a future version, please use KHOUR.
WARNING: In land_driver_ini() - KHOUR < 0. DEFINED USING KDAY.
WARNING: In land_driver_ini() - KHOUR < 0. DEFINED USING KDAY.
reading from hydrotbl_f(HYDRO.TBL.nc) file ....
reading from hydrotbl_f(HYDRO.TBL.nc) file ....
WARNING: get2d_real: failed to find the variables: CHAN_DEPTH and CHAN_DEPTH
Before read LAKEPARM from NetCDF ./DOMAIN/LAKEPARM.nc
NLAKES = 1
read gwbasmskfil as nc format: ./DOMAIN/GWBASINS.nc
read GWBUCKPARM file as nc format: ./DOMAIN/GWBUCKPARM.nc
Resetting RESTART Accumulation Variables to 0... 1
Resetting RESTART Accumulation Variables to 0... 1
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_OVERFLOW_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
==> diag_hydro.00000 <==
***DATE=2011-09-01_17:00:00 294.39764 2.27589 Timing: 0.19 Cumulative: 40.39 SFLX: 0.00
***DATE=2011-09-01_18:00:00 295.02222 2.27288 Timing: 0.19 Cumulative: 40.58 SFLX: 0.00
***DATE=2011-09-01_19:00:00 295.32843 2.26986 Timing: 0.19 Cumulative: 40.77 SFLX: 0.00
***DATE=2011-09-01_20:00:00 295.39072 2.26685 Timing: 0.20 Cumulative: 40.96 SFLX: 0.00
***DATE=2011-09-01_21:00:00 295.25891 2.26384 Timing: 0.20 Cumulative: 41.16 SFLX: 0.00
***DATE=2011-09-01_22:00:00 294.83615 2.26082 Timing: 0.19 Cumulative: 41.35 SFLX: 0.00
***DATE=2011-09-01_23:00:00 294.21228 2.25781 Timing: 0.20 Cumulative: 41.55 SFLX: 0.00
yw check output restart at 2011-09-02_00:00
***DATE=2011-09-02_00:00:00 293.52155 2.25479 Timing: 0.27 Cumulative: 41.82 SFLX: 0.00
The model finished successfully.......
==> diag_hydro.00001 <==
***DATE=2011-09-01_17:00:00 294.60822 2.27589 Timing: 0.19 Cumulative: 40.38 SFLX: 0.00
***DATE=2011-09-01_18:00:00 295.29651 2.27288 Timing: 0.19 Cumulative: 40.57 SFLX: 0.00
***DATE=2011-09-01_19:00:00 295.65710 2.26986 Timing: 0.19 Cumulative: 40.77 SFLX: 0.00
***DATE=2011-09-01_20:00:00 295.75607 2.26685 Timing: 0.20 Cumulative: 40.96 SFLX: 0.00
***DATE=2011-09-01_21:00:00 295.64462 2.26384 Timing: 0.20 Cumulative: 41.16 SFLX: 0.00
***DATE=2011-09-01_22:00:00 295.20270 2.26082 Timing: 0.19 Cumulative: 41.35 SFLX: 0.00
***DATE=2011-09-01_23:00:00 294.53528 2.25781 Timing: 0.20 Cumulative: 41.55 SFLX: 0.00
yw check output restart at 2011-08-26_00:00
***DATE=2011-09-02_00:00:00 293.75571 2.25479 Timing: 0.27 Cumulative: 41.81 SFLX: 0.00
The model finished successfully.......
mpirun returned with a return value: 0
Supported model container images should try to employ multi-stage builds to cut down on the final image sizes used to run the models.
In some cases, this can be tricky. When using MPI-based applications, such as the NWM, the dependencies are runtime/dynamic. Is it possible to build these in a build image and copy the built libraries to an intermediate layer like the deps image is doing now? This would definitely help reduce the image sizes. At the very least, it should be possible to multi-stage the models themselves, building them in a build image and simply copying the artifacts to an image that is based on the deps image.
Some more work on this front is required to understand the balance between the level of effort required and the actual functionality needed.
When using scripts/control_stack.sh to rebuild and deploy the gui stack, it picks up stale builds of the gui because py-sources isn't pushed to the registry.
Edit GUI code, then run ./scripts/control_stack.sh nwm_gui build deploy. The stack gets re-deployed, but not with the changes.
Edit GUI code, then run ./scripts/control_stack.sh nwm_gui build deploy. py-sources gets updated AND pushed to the registry, then the gui services are built and deployed.
Need to add cryptography to the install_requires for this service package.
The pre-NextGen NWM does not operate properly with versions of NetCDF beyond 4.6. This is documented in NCAR/wrf_hydro_nwm_public#382. The DMOD workaround is to specify the version. Recently it was discovered this will need to be extended to the netcdf-fortran library also, as its latest is no longer compatible with the required older NetCDF versions.
A better long-term solution is needed. One potential option is an OWP fork of the NWM code, in which we patch the source as needed to work with the newer NetCDF code.
NWM (i.e., pre-NextGen) DMOD images cannot be built unless specific older versions of NetCDF C and Fortran libs are used.
It should be possible to use the latest NetCDF without issue.
Develop a forcing engine that can ingest AORC forcing data (and can be extended to include others) and process the data to the canonical format for the Nextgen framework.
We have a nice git usage markdown in doc/git_usage.md; it should be linked from the contributing doc.
Task to track development of initial formulation evaluation components to facilitate various MaaS workflows.
(Add and/or spin off to separate issues as needed)
Also, while not a subtask, this depends on #116.
This function appears in many test suites, and at the moment has multiple implementations. This should get consolidated and provided to all testing suites.
Documenting this so we get the expected behavior from the listener.
I was under the impression that the WebSocketInterface listener, which is implemented as the "server", was a single function call at run time, meaning that each connection would connect to the same listener. However, after trial and error on another project where I am using a consumer-producer pattern allowing for multiple connections at a given time, I determined empirically that this is not the case. In the websockets docs (https://websockets.readthedocs.io/en/stable/api.html#module-websockets.server), this behavior is documented, but not as explicitly as I would have liked.
websockets.server(...)
Whenever a client connects, the server accepts the connection, creates a WebSocketServerProtocol, performs the opening handshake, and delegates to the connection handler defined by ws_handler. Once the handler completes, either normally or with an exception, the server performs the closing handshake and closes the connection.
So, instead, each connection to the listener calls a new instance of the listener that runs within the event loop.
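The per-connection behavior can be observed with a small experiment. The sketch below deliberately uses the standard library's asyncio.start_server rather than the websockets package, so it is an analogous illustration and not the WebSocketInterface code: the handler passed to the server is invoked once for every accepted connection.

```python
import asyncio


async def count_handler_calls(client_count=3):
    """Open `client_count` connections and return how many times the handler ran."""
    calls = []
    done = asyncio.Event()

    async def handler(reader, writer):
        # Invoked as a fresh coroutine for every accepted connection.
        calls.append(writer.get_extra_info("peername"))
        writer.close()
        if len(calls) == client_count:
            done.set()

    # Port 0 asks the OS for any free port on the loopback interface.
    server = await asyncio.start_server(handler, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        for _ in range(client_count):
            reader, writer = await asyncio.open_connection("127.0.0.1", port)
            await reader.read()  # returns once the server closes its side
            writer.close()
            await writer.wait_closed()
        await asyncio.wait_for(done.wait(), timeout=5)
    return len(calls)
```

Any state that must be shared between connections therefore has to live outside the handler, for example on the object that owns it.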
There has been enough discussion about this that it is clearly a feature that we will eventually need to support. The first subtask for this, though, is establishing whether it needs to be in the time table of the current milestone/artifact.
If it is decided to be part of a later milestone, this will need to be unlinked from the current parent issue and otherwise adjusted to reflect that.
Regardless, this will also need to eventually be broken down into other subtasks.
Determine if using remotely hosted hydrofabrics should be supported.
If so, this will lead to additional issues, likely linked under #122.
To provide a solution for several data location and availability problems, an object storage service needs to be created. This (potentially meta) issue will track the requirements and progress.
Initial Usages:
Initial Design Notes:
As described in #101, there appear to be some problems in at least some environments with the current required version of urllib3. These need to be investigated and addressed.
The move to this version was done via an automated workflow (#80), so reverting to a previous version should not be done without careful examination.
In https://github.com/NOAA-OWP/DMOD/blob/master/doc/SUBSETTING_CLI_TOOL.md, the documentation appears to run the same script twice. Is that correct?
./scripts/update_package.sh python/lib/modeldata
./scripts/update_package.sh python/services/subsetservice
I also get a warning and an error when following documentation:
WARNING: Skipping dmod-subsetservice as it is not installed
ERROR: Could not find a version that satisfies the requirement dmod-modeldata>=0.3.0 (from dmod-subsetservice) (from versions: none)
ERROR: No matching distribution found for dmod-modeldata>=0.3.0
git clone git@github.com:NOAA-OWP/DMOD.git
cd <dmod_project_dir> # replace with appropriate local directory
python -m venv venv # or 'python3' if appropriate
source venv/bin/activate # enter the venv in this terminal
pip install --upgrade pip
pip install -r requirements.txt
./scripts/update_package.sh python/lib/modeldata
./scripts/update_package.sh python/services/subsetservice
deactivate # exit the venv in this terminal
There is currently no documentation of the required/supported version(s) of the Docker Engine, Docker Compose, or Docker Swarm. At least some limited detail on what is supported should be added.
The referenced line sets the logger to write to a file. When this module is then imported into a service, the service subsequently logs to that configured logger. The service/client should control this behavior, not the module. Removing this line allows the service to log to stdout and the logs to be easily viewed with docker service logs.
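The division of responsibility described here follows the usual Python logging convention, sketched below: the library only obtains a named logger and attaches a NullHandler, while the importing service decides where records go. The logger name example_library and both function names are hypothetical.

```python
import logging
import sys

# Library side: get a named logger and attach only a NullHandler, so that
# importing the module never decides where log records go.
library_logger = logging.getLogger("example_library")  # hypothetical name
library_logger.addHandler(logging.NullHandler())


def library_function():
    """Emit a record; whether it is shown is up to the importing application."""
    library_logger.info("doing library work")
    return True


def configure_service_logging(stream=None, level=logging.INFO):
    """Service side: route all records to a stream (stdout by default)."""
    handler = logging.StreamHandler(sys.stdout if stream is None else stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(level)
    return handler
```

With the handler on stdout, anything the library logs at INFO or above shows up in the container output that docker service logs reads.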
To support the NGEN model, subclasses of MaaSRequest and MaaSRequestResponse need to be implemented. These will be added to maas_request.py (for the time being) as NGENRequest and NGENRequestResponse.
As support for multiple models and workflows expands, we should rename agnostic components to be more clear about where in the overall architecture they reside.
For example, the main stack and service names should end up looking like dmod_request-service, where the stack is dmod and the service in the stack is request-service, scheduler-service, etc.
The only place we might want to retain the model-specific components in the names is in the worker services being started to support an instantiation of the model, so nwm_mpi-worker-tmpXXX would be appropriate to differentiate worker services.
Building of internal Python library for evaluations. Plan is to leverage the hydrotools project where possible.
Implement service component for managing forcing, configuration, and other category dataset handling.
This may need to be developed from the partially implemented datarequestservice package, perhaps with some refactoring/renaming of things.