dmod's Issues

Generalizing service and stack names.

As support for multiple models and workflows expands, we should rename the model-agnostic components to be clearer about where they reside in the overall architecture.

For example, the main stack and service names should end up looking like dmod_request-service, where the stack is dmod and the services in the stack are request-service, scheduler-service, etc.

The only place we might want to retain model-specific components in the names is in the worker services being started to support an instantiation of the model, so nwm_mpi-worker-tmpXXX would be appropriate to differentiate worker services.

Running local GUI server crashes connecting to MAAS_ENDPOINT

GUI fails to find/create asyncio event loop to establish session.

Current behavior

Start the gui app server locally:
MAAS_ENDPOINT_HOST=hostname MAAS_ENDPOINT_PORT=port ./manage.py runserver
From http://127.0.0.1:8000/ in the browser, configure a request.
The following error occurs in the server log/output:

EditView.post: making job request
client Making Job Request
PostFormJobRequestClient._acquire_session_info:  getting session info
Session from ModelRequestClient: force_new=False
Connection to request handler web socket
Expecting exception to follow

Failed _acquire_session_info
Traceback (most recent call last):
  File "/Users/nels.frazier/workspace/DMOD/venv/lib/python3.8/site-packages/dmod/communication/client.py", line 349, in _acquire_new_session
    auth_details = asyncio.get_event_loop().run_until_complete(self.authenticate_over_websocket())
  File "/Users/nels.frazier/homebrew/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/asyncio/events.py", line 639, in g
et_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'Thread-1'.
Session Info Return: False
client Unable to aquire session details

Expected behavior

The session should be established and the request pushed to the MAAS_ENDPOINT.
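
A minimal sketch of one possible workaround follows: create and register a new event loop when the call happens in a worker thread. The helper name and usage are illustrative, not the actual DMOD client code.

import asyncio

def _get_or_create_event_loop() -> asyncio.AbstractEventLoop:
    # asyncio.get_event_loop() raises RuntimeError in threads other than the
    # main thread unless a loop has been explicitly set for that thread.
    try:
        return asyncio.get_event_loop()
    except RuntimeError:
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        return loop

# Hypothetical usage inside the session-acquisition path:
# loop = _get_or_create_event_loop()
# auth_details = loop.run_until_complete(self.authenticate_over_websocket())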

Implementing multi-stage builds in model containers

Supported model container images should try to employ multi-stage builds to cut down on the final image sizes used to run the models.

In some cases, this can be tricky. When using MPI-based applications, such as the NWM, the dependencies are runtime/dynamic. Is it possible to build these in a build image and copy the built libraries to an intermediate layer like the deps image is doing now? This would definitely help reduce the image sizes. At the very least, it should be possible to multi-stage the models themselves, building them in a build image and simply copying the artifacts to an image that is based on the deps image.

Some more work on this front is required to understand the balance between the level of effort required and the actual functionality needed.

Stand Alone Interface for Partitioning

Add an interface (likely via the GUI) to be able to generate partition configurations in an independent workflow; i.e., not part of a large model execution workflow.

DMOD gh-page documentation

It would be nice to have a GitHub Pages site for this project. We could use Sphinx and autodoc to automatically build the API documentation and mirror the readme on the front page for starters. Is this a feature that is desired? If so, have guidelines been set for gh-pages? More or less just starting a conversation here.

Personally, using sphinx with the readthedocs theme on a gh-pages hosted site is what I would propose.
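
As a rough sketch of what that might look like, here is a minimal Sphinx conf.py using autodoc and the Read the Docs theme; the source path and project layout are assumptions, not the repository's current structure.

# docs/conf.py -- minimal Sphinx configuration sketch (paths are assumptions)
import os
import sys

# Make the DMOD Python packages importable for autodoc (hypothetical layout).
sys.path.insert(0, os.path.abspath("../python/lib"))

project = "DMOD"
extensions = [
    "sphinx.ext.autodoc",   # generate API docs from docstrings
    "sphinx.ext.napoleon",  # parse Google/NumPy style docstrings
]
html_theme = "sphinx_rtd_theme"  # requires: pip install sphinx-rtd-theme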

Support non-geojson hydrofabrics

There has been enough discussion about this that it is clearly a feature we will eventually need to support. The first subtask for this, though, is establishing whether it needs to be in the timetable of the current milestone/artifact.

If it is decided to be part of a later milestone, this will need to be unlinked from the current parent issue and otherwise adjusted to reflect that.

Regardless, this will also need to eventually be broken down into other subtasks.

Initial Formulation Evaluation Capabilities

Task to track development of initial formulation evaluation components to facilitate various MaaS workflows.

Subtasks

(Add and/or spin off to separate issues as needed)

  • #112
  • #140
  • #141
  • #99
  • Finalize design of complete Evaluation user workflow
  • Create appropriate GUI, file output, and internal distribution components for using Evaluation output

Also, while not a subtask, this depends on #116.

scheduler-service needs to deserialize resource definitions correctly

The change in resource management creates an explicit representation for Resource, requiring the raw metadata to be deserialized before being passed to the ResourceManager. Previously, these resources were added from raw dictionary definitions.

Current behavior

scheduler-service reads a list of raw dictionaries defining the managed resources and passes this list to the ResourceManager. This is currently broken.

Expected behavior

scheduler-service deserializes the raw dictionaries into Resource objects appropriate for the ResourceManager to operate on.
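
A minimal sketch of that deserialization step, using a stand-in Resource class since the exact class, field names, and factory method in the scheduler packages may differ:

# Sketch only: the real Resource class lives in the DMOD scheduler/resource
# packages; this stand-in just illustrates the deserialization the service needs.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Resource:  # hypothetical stand-in; field names are assumptions
    hostname: str
    cpu_count: int
    memory: int

    @classmethod
    def factory_init_from_dict(cls, raw: Dict) -> "Resource":
        return cls(hostname=raw["hostname"],
                   cpu_count=int(raw["cpu_count"]),
                   memory=int(raw["memory"]))


def deserialize_resources(raw_resources: List[Dict]) -> List[Resource]:
    # Convert raw resource dictionaries read from config into Resource objects
    # before handing them to the ResourceManager.
    return [Resource.factory_init_from_dict(raw) for raw in raw_resources]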

control_stack doesn't push py-sources to registry when building gui stack

When using the control_stack.sh script to rebuild and deploy the gui stack, it picks up stale builds of the gui because py-sources isn't pushed to the registry.

Current behavior

Edit GUI code, then run ./scripts/control_stack.sh nwm_gui build deploy. The stack gets re-deployed, but not with the changes.

Expected behavior

Edit GUI code, then run ./scripts/control_stack.sh nwm_gui build deploy. py-sources gets updated AND pushed to the registry, then the gui services are built and deployed.

Building nwm deps image fails to compile hdf5 1.10.4

Compiling hdf5 1.10.4 on alpine 3.4 fails when it cannot find libexecinfo. While it is possible to get libexecinfo on 3.4, it is somewhat tricky.

Current behavior

Fails to build deps image.

Expected behavior

Build dep image.

Create Nextgen-Canonical Forcing Engine Library Package

Develop a forcing engine that can ingest AORC forcing data (and can be extended to include others) and process the data to the canonical format for the Nextgen framework.

  • Process received raw data to canonical
  • Write canonical data to supported Dataset types
  • Read canonical data from supported Dataset types
  • Provide public interface for accessing canonical data directly

WebSocketInterface `listener(...)` creates instance for each connection

Documenting this so we get the expected behavior from the listener.

I was under the impression that the WebSocketInterface listener, which is implemented as the "server", was a single function call at run time, meaning that each connection would connect to the same listener. However, after trial and error on another project where I am using a consumer/producer pattern allowing for multiple connections at a given time, I determined empirically this is not the case. In the [websockets docs](https://websockets.readthedocs.io/en/stable/api.html#module-websockets.server), this behavior is documented but not as explicitly as I would have liked.

websockets.server(...)
Whenever a client connects, the server accepts the connection, creates a WebSocketServerProtocol, performs the opening handshake, and delegates to the connection handler defined by ws_handler. Once the handler completes, either normally or with an exception, the server performs the closing handshake and closes the connection.

So, instead, each connection to the listener calls a new instance of the listener that runs within the event loop.
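
A small standalone sketch (not DMOD's actual WebSocketInterface) demonstrating the documented behavior: the handler coroutine is invoked once per connection, so anything shared must live outside the handler.

import asyncio
import websockets  # pip install websockets

# Shared state lives outside the handler; each connection gets its own
# invocation of handler(), not a single long-lived "listener" call.
connected_clients = set()


async def handler(websocket):
    # In older websockets releases the signature is handler(websocket, path).
    connected_clients.add(websocket)
    try:
        async for message in websocket:
            await websocket.send(f"echo: {message}")
    finally:
        connected_clients.discard(websocket)


async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run forever


if __name__ == "__main__":
    asyncio.run(main())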

Forcing Data Handling

Design and implement necessary components for receipt/retrieval, management, and usage of "raw" AORC (and potentially other) forcing data.

  • #128
  • #121
  • Minimal dataset management CLI

NGen Docker Image Configuration Outdated

The main Docker image for running Nextgen has become out of date. It needs to be reviewed and made current. It may be worth considering whether it should be expanded in some ways also (e.g., building separate images for parallel versus serial framework executables).

One specific observation has been that an error occurs during image builds at the step when the framework is compiled. The Docker image config is not installing a supported Python version, but the lack of Python is not accounted for properly when the image build compiles the framework.

Complete Components for Full Model Exec Workflow

Tracking of any remaining design, implementation, or bug fix tasks for initial version of a full, basic model execution workflow.

Task definition is currently still in progress and will require additional expansion.

Implementation Details

  • #117
  • #120
  • Partitioning service support
  • #118
  • Ensure requests contain sufficient metadata to identify all necessary datasets for requested job
  • Ensure job objects contain/serialize state for referencing all required datasets
  • Add mechanism(s) for model containers to have forcing data (likely from object store files)
  • Add support for model output to be saved within object store
  • Ensure all config files (BMI init, realization, hydrofabric, etc.) are accessible to model containers (likely from object store)

Bugs

Hydrofabric Data Handling

Finalize design and implementation of functionality related to managing and providing hydrofabric data, with respect to the associated deliverable target.

Initial Subsystems For Observation Data Handling

Issue to track task of creating the initial set of components for handling observation data as needed to support evaluation and calibration routines.

This description should continue to be updated as the specifics of the initial working version of these continue to come into focus.

  • Add support to ingest uploaded observation data files
  • (Optional) Add support to ingest remotely downloaded observation data files
  • Add GUI views for managing and updating observation data
  • Design components and infrastructure for making observation data available to DMOD services as needed
  • Implement components and infrastructure for making observation data available to DMOD services as needed

Change MaaSRequest to treat version as string

The dmod.communication.MaaSRequest class currently treats the version attribute as a float; i.e., it is type-hinted, documented, and otherwise treated as such. However, it probably makes more sense to interpret version as a string.

This will need to be reflected in the class, the associated JSON schema, and anywhere else it is directly or indirectly used.
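
A minimal illustration of the intended shift (the real MaaSRequest in dmod.communication is far more involved; this only shows treating version as a string):

# Illustrative only; not the actual dmod.communication.MaaSRequest implementation.
class MaaSRequest:
    def __init__(self, version: str):
        # A string can represent multi-part versions (e.g., "2.1.1"), which a
        # float cannot, and avoids lossy conversions during (de)serialization.
        self.version = str(version)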

GUI error due to inconsistency with recent comms changes

Currently (perhaps after manually changing the urllib3 dependency; see #105), the following error is seen when loading the GUI:

type object 'MaaSRequest' has no attribute 'get_distribution_types'
Request Method: GET
Request URL: http://<server>/
Django Version: 2.2.24
Exception Type: AttributeError
Exception Value: type object 'MaaSRequest' has no attribute 'get_distribution_types'
Exception Location: /usr/maas_portal/MaaS/cbv/EditView.py in get, line 55
Python Executable: /usr/local/bin/python
Python Version: 3.8.12
Python Path: ['/usr/maas_portal',
 '/usr/local/bin',
 '/usr/local/lib/python38.zip',
 '/usr/local/lib/python3.8',
 '/usr/local/lib/python3.8/lib-dynload',
 '/usr/local/lib/python3.8/site-packages']

This appears to be due to changes made in the communication library (and perhaps other places) that have yet to be properly reflected in the gui package.

Decouple model image handling from DMOD required services

As new models/workflows are envisioned for use within DMOD, a clear separation should be established between the DMOD services, such as request-service, and the model images, such as nwm.

Current behavior

The main stack is built and pushed, including the model images, from the same configuration files (docker/main/docker-build.yml and docker/main/docker-deploy.yml).

Expected behavior

The main stack should build and deploy only the required DMOD services, regardless of the available model images. A separate directory, like docker/main/models or maybe even docker/models, would hold those build definitions (since we aren't deploying these as services, no deploy file is needed), and the upstack/control_stack utilities can be refactored to build and push model images (or verify they are available).

Discussion Points

The only place this becomes potentially problematic is connecting the image_and_domain.yml that is used by the scheduler/launcher.

Error installing - update_package.sh run twice

In https://github.com/NOAA-OWP/DMOD/blob/master/doc/SUBSETTING_CLI_TOOL.md

In the documentation, the same script is run twice. Is that correct?
./scripts/update_package.sh python/lib/modeldata
./scripts/update_package.sh python/services/subsetservice

I also get a warning and an error when following the documentation:
WARNING: Skipping dmod-subsetservice as it is not installed
ERROR: Could not find a version that satisfies the requirement dmod-modeldata>=0.3.0 (from dmod-subsetservice) (from versions: none)
ERROR: No matching distribution found for dmod-modeldata>=0.3.0

Steps to replicate behavior (include URLs)

git clone git@github.com:NOAA-OWP/DMOD.git
cd <dmod_project_dir> # replace with appropriate local directory
python -m venv venv # or 'python3' if appropriate
source venv/bin/activate # enter the venv in this terminal
pip install --upgrade pip
pip install -r requirements.txt
./scripts/update_package.sh python/lib/modeldata
./scripts/update_package.sh python/services/subsetservice
deactivate # exit the venv in this terminal

Decide on whether and when to add support for user-uploaded hydrofabrics

Track the discussion and decision making process on
a.) whether this feature is wanted/needed, and
b.) whether the feature should be part of the FIHM, September, or other deliverable

Support could be added for user-uploaded hydrofabrics, in addition to those statically supplied at MaaS environment start. It needs to be established whether this is a desired feature and, if so, on what timetable it should be included (in particular, whether it belongs in this FIHM milestone).

Develop foundational library for evaluations

Build an internal Python library for evaluations. The plan is to leverage the hydrotools project where possible. A sketch of one such metric follows the subtask list below.

  • General library foundational code for routine setup and data ingest
  • Implement support for AUC metric
  • Implement support for Mean Error metric
  • Implement support for Kling-Gupta metric
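
As an example of the kind of function this library would provide, here is a rough sketch of the Kling-Gupta Efficiency, KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2), where r is the Pearson correlation, alpha the ratio of simulated to observed standard deviations, and beta the ratio of means. This is not the eventual DMOD/hydrotools implementation.

import numpy as np


def kling_gupta_efficiency(simulated, observed) -> float:
    # Sketch only; a production version would handle NaNs, series alignment,
    # and alternative scaling factors.
    sim = np.asarray(simulated, dtype=float)
    obs = np.asarray(observed, dtype=float)
    r = np.corrcoef(sim, obs)[0, 1]   # linear correlation
    alpha = sim.std() / obs.std()     # variability ratio
    beta = sim.mean() / obs.mean()    # bias ratio
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)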

building py_sources fails

When building the py_sources stack (./scripts/control_stack.sh py_sources build), the build fails to build the required dependency wheels.
This is the failing Dockerfile line:

RUN mkdir /DIST && pip download --no-cache-dir --destination-directory /DIST -r /nwm_service/requirements.txt

Giving this error:

 Building wheels for collected packages: cffi
    Building wheel for cffi (setup.py): started
    Building wheel for cffi (setup.py): finished with status 'error'
    ERROR: Command errored out with exit status 1:
     command: /usr/local/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-cqqauhp5/cffi/setup.py'"'"'; __file__='"'"'/tmp/pip-install-cqqauhp5/cffi/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-p4kxp7xk --python-tag cp37
         cwd: /tmp/pip-install-cqqauhp5/cffi/
    Complete output (48 lines):
    unable to execute 'gcc': No such file or directory

Examine solutions to NetCDF compatibility issue

The pre-NextGen NWM does not operate properly with versions of NetCDF beyond 4.6. This is documented in NCAR/wrf_hydro_nwm_public#382. The DMOD workaround is to pin the version. Recently it was discovered that this will need to be extended to the netcdf-fortran library as well, since its latest version is no longer compatible with the required older NetCDF versions.

A better long-term solution is needed. One potential option is an OWP fork of the NWM code, in which we patch the source as needed to work with the newer NetCDF code.

Current behavior

NWM (i.e., pre-NextGen) DMOD images cannot be built unless specific older versions of NetCDF C and Fortran libs are used.

Expected behavior

It should be possible to use the latest NetCDF without issue.

Unit testing uses incorrect cache of DMOD package.

The cache key is static, so as long as the cache has been built, the same one will be used each time. But since the DMOD packages are installed into this venv, if multiple changes to various package components are made in subsequent PRs, the testing is likely to fail.

Need to invalidate this venv cache when DMOD packages are changed, or not cache DMOD installs and only cache external dependencies.

py_sources build layer doesn't actually build all python dependencies

RUN mkdir /DIST && pip download --no-cache-dir --destination-directory /DIST -r /nwm_service/requirements.txt

The referenced line has pip download the required dependencies, but if these downloads come in sdist form, they are not built into wheels. When a downstream service requires one of the source dependencies, it attempts to build it, but doesn't have the correct toolchain in the image file system to do so (e.g., scheduler-service requires cryptography, which requires cffi, which is only available as a source distribution, so scheduler-service attempts to build the source and fails).

Setup gui testing

Need some unit/integration tests for the GUI module

Current behavior

All gui testing is manual

Expected behavior

Should have some automated tests for the GUI

Discussion

Django has built-in testing support based on Python's unittest, which can be executed using ./manage.py test.

There is also a solid pytest-django plugin for pytest, and a good writeup on setting up and using it here.

Pytest may also be worth looking at for all DMOD testing going forward (it works fine with traditional unittest tests out of the box).
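
As a starting point, a first smoke test with Django's built-in framework could look something like the sketch below; the URL and expected status codes are assumptions about the GUI app, not its actual routes.

# Illustrative only: the GUI app's actual URL names and views may differ.
from django.test import TestCase


class GuiSmokeTest(TestCase):
    def test_root_page_responds(self):
        # Assumes the portal serves something at the site root.
        response = self.client.get("/")
        self.assertIn(response.status_code, (200, 302))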

Create Request Subclasses for NGEN requests

To support the NGEN model, subclasses of MaaSRequest and MaaSRequestResponse need to be implemented. These will be added to maas_request.py (for the time being) as NGENRequest and NGENRequestResponse.
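
A bare-bones sketch of the shape those subclasses might take; the import path, the model_name attribute, and the absence of constructor logic are assumptions rather than the final design.

# Placeholder sketch only; the real (de)serialization hooks and constructor
# arguments in the dmod.communication package will differ.
from dmod.communication import MaaSRequest, MaaSRequestResponse


class NGENRequest(MaaSRequest):
    """Request for a Nextgen (ngen) model run."""
    model_name = 'ngen'  # hypothetical class-level model identifier


class NGENRequestResponse(MaaSRequestResponse):
    """Response to an NGENRequest."""
    pass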

Support configurable settings for various async sleep times.

There are several spots in our async service logic where we call sleep(), typically at the end of a looping periodic task to wait before running it again. These usages should be examined and, where appropriate, support for controlling the specific amount of time should be added via either externalized configuration or parameterization.

A particular example is in python/services/monitorservice/dmod/monitorservice/service.py within the exec_monitoring() function.
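
One possible pattern, sketched with an environment-variable fallback; the variable name and loop structure are illustrative, not the existing monitorservice code.

import asyncio
import os

DEFAULT_MONITOR_INTERVAL_SECONDS = 60.0  # illustrative default


def get_monitor_interval() -> float:
    # Hypothetical env var; could equally be a constructor parameter on the service.
    return float(os.getenv("MONITOR_INTERVAL_SECONDS", DEFAULT_MONITOR_INTERVAL_SECONDS))


async def monitoring_loop(check_once):
    # Periodic task with a configurable sleep between iterations.
    interval = get_monitor_interval()
    while True:
        await check_once()
        await asyncio.sleep(interval)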

Invalid use of DOCKER_SECRET_REDIS_PASS

The env variable established in commit 30c48a3 contains the path to the secret file, not the secret itself, but several places in the service code assume that DOCKER_SECRET_REDIS_ is a valid env prefix for the password itself.

Current behavior

Services look for the env variable DOCKER_SECRET_REDIS_PASS, assuming it holds the password.

Expected behavior

Either DOCKER_SECRET_REDIS_PASS needs to be rewritten, or a different env variable needs to be used that is set from the value in the secret file.

The redis service does this appropriately in its entrypoint.sh:

SECRET_FILE="/run/secrets/${DOCKER_SECRET_REDIS_PASS:?}"
REDIS_PASS="$(cat ${SECRET_FILE})"

But the other Python-based services do not read this secret appropriately. The function used to parse the env, _get_parsed_or_env_val in __main__.py, checks that DOCKER_SECRET_REDIS_PASS exists (which it does) and uses its value, the path to the secret file in the container file system, as the password when attempting connections, which then fail.
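
A hedged sketch of how the Python services could mirror that shell logic, treating the variable's value as the secret file name under /run/secrets and reading the password from the file; the helper name is illustrative.

import os

def read_redis_password(default=None):
    # DOCKER_SECRET_REDIS_PASS names the secret file, not the password itself.
    secret_name = os.getenv("DOCKER_SECRET_REDIS_PASS")
    if not secret_name:
        return default
    secret_path = os.path.join("/run/secrets", secret_name)
    try:
        with open(secret_path) as secret_file:
            return secret_file.read().strip()
    except OSError:
        return default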

Ensure compatibility with urllib3 1.26.5

As described in #101, there appear to be some problems in at least some environments with the current required version of urllib3. These need to be investigated and addressed.

The move to this version was done via an automated workflow (#80), so reverting to a previous version should not be done without careful examination.

Service component for internal dataset handling

Implement a service component for managing the handling of forcing, configuration, and other categories of datasets.

This may need to be developed from the partially implemented datarequestservice package, perhaps with some refactoring/renaming of things.

  • #138
  • #153
  • #139
  • Data service support for dataset catalog and querying/search for datasets based on matching DataDomain or sub-domain
  • Data service support for checking if required data for a job is available
  • Data service support for serving canonical forcing data (directly and/or indirectly; e.g., via object store dataset)
  • Data service support for removing existing datasets

Implement Object Store Service

To provide a solution for several data location and availability problems, an object storage service needs to be created. This (potentially meta) issue will track the requirements and progress. A usage sketch follows the design notes below.

Initial Usages:

  • Movement and availability of forcing data
  • Getting job artifacts (i.e., configs, hydrofabrics) to stack nodes
  • Getting output out of the stack and back to the GUI for the user

Initial Design Notes:

  • Add separate stack for this
    • Clean separation of resource config
    • Will make it simple, in the future, to switch from using an internal implementation inside DMOD to an externally provided object store service if/when available
  • Use MinIO
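
Assuming MinIO is the backing implementation, a minimal sketch of pushing a forcing file into a bucket and pulling it back out with the MinIO Python SDK; the endpoint, credentials, bucket, and object names are placeholders.

# Placeholder endpoint/credentials; requires `pip install minio`.
from minio import Minio

client = Minio(
    "object-store:9000",   # hypothetical in-stack service endpoint
    access_key="ACCESS_KEY",
    secret_key="SECRET_KEY",
    secure=False,
)

bucket = "forcing-data"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# Move a raw forcing file into the store, then retrieve it on a worker node.
client.fput_object(bucket, "aorc/2016060100.nc", "/local/path/2016060100.nc")
client.fget_object(bucket, "aorc/2016060100.nc", "/tmp/2016060100.nc")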

Building a NWM v2.1 docker image

To build an NWM v2.1 Docker image, the previous scripts can be used for the build process with the NWM v2.1 commit: NWM_COMMIT=4d0c8ad

For testing the image, changes to nwm/Dockerfile and a new entry.sh are needed.

Current behavior

Run to finish successfully

Expected behavior

Run to finish successfully

Steps to replicate behavior (include URLs)

The build involves docker-build-custom.yml, base/Dockerfile, nwm/deps/Dockerfile, nwm/Dockerfile, and nwm/entry.sh

The instructions for downloading the example test case are located at

https://github.com/NCAR/wrf_hydro_nwm_public/tree/v5.2.0/tests

The step-by-step instructions on how to build, prepare initial input, and run wrf_hydro.exe using mpirun are located at

https://ral.ucar.edu/sites/default/files/public/projects/Technical%20Description%20%26amp%3B%20User%20Guides/wrf-hydrov5.1.1testcaseuserguide.pdf

In the parent directory of nwm/, run the following commands to build and run the test of nwm_v2.1 docker image:

NWM_COMMIT=4d0c8ad docker-compose -f docker-build-custom.yml build nwm
docker run -d --cpus=2 127.0.0.1:5000/nwm-2.1

Screenshots

$ docker logs 3828b75fb4dbe9baa09f99f10dcd0a45480fa6edc55817e55f608abf8872b67e
Calling config noahlsm_offline
Calling config noahlsm_offline
WARNING: KDAY is deprecated and may be removed in a future version, please use KHOUR.
WARNING: KDAY is deprecated and may be removed in a future version, please use KHOUR.
WARNING: In land_driver_ini() - KHOUR < 0. DEFINED USING KDAY.
WARNING: In land_driver_ini() - KHOUR < 0. DEFINED USING KDAY.
reading from hydrotbl_f(HYDRO.TBL.nc) file ....
reading from hydrotbl_f(HYDRO.TBL.nc) file ....
WARNING: get2d_real: failed to find the variables: CHAN_DEPTH and CHAN_DEPTH
Before read LAKEPARM from NetCDF ./DOMAIN/LAKEPARM.nc
NLAKES = 1
read gwbasmskfil as nc format: ./DOMAIN/GWBASINS.nc
read GWBUCKPARM file as nc format: ./DOMAIN/GWBUCKPARM.nc
Resetting RESTART Accumulation Variables to 0... 1
Resetting RESTART Accumulation Variables to 0... 1
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_OVERFLOW_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
==> diag_hydro.00000 <==
***DATE=2011-09-01_17:00:00 294.39764 2.27589 Timing: 0.19 Cumulative: 40.39 SFLX: 0.00
***DATE=2011-09-01_18:00:00 295.02222 2.27288 Timing: 0.19 Cumulative: 40.58 SFLX: 0.00
***DATE=2011-09-01_19:00:00 295.32843 2.26986 Timing: 0.19 Cumulative: 40.77 SFLX: 0.00
***DATE=2011-09-01_20:00:00 295.39072 2.26685 Timing: 0.20 Cumulative: 40.96 SFLX: 0.00
***DATE=2011-09-01_21:00:00 295.25891 2.26384 Timing: 0.20 Cumulative: 41.16 SFLX: 0.00
***DATE=2011-09-01_22:00:00 294.83615 2.26082 Timing: 0.19 Cumulative: 41.35 SFLX: 0.00
***DATE=2011-09-01_23:00:00 294.21228 2.25781 Timing: 0.20 Cumulative: 41.55 SFLX: 0.00
yw check output restart at 2011-09-02_00:00
***DATE=2011-09-02_00:00:00 293.52155 2.25479 Timing: 0.27 Cumulative: 41.82 SFLX: 0.00
The model finished successfully.......
==> diag_hydro.00001 <==
***DATE=2011-09-01_17:00:00 294.60822 2.27589 Timing: 0.19 Cumulative: 40.38 SFLX: 0.00
***DATE=2011-09-01_18:00:00 295.29651 2.27288 Timing: 0.19 Cumulative: 40.57 SFLX: 0.00
***DATE=2011-09-01_19:00:00 295.65710 2.26986 Timing: 0.19 Cumulative: 40.77 SFLX: 0.00
***DATE=2011-09-01_20:00:00 295.75607 2.26685 Timing: 0.20 Cumulative: 40.96 SFLX: 0.00
***DATE=2011-09-01_21:00:00 295.64462 2.26384 Timing: 0.20 Cumulative: 41.16 SFLX: 0.00
***DATE=2011-09-01_22:00:00 295.20270 2.26082 Timing: 0.19 Cumulative: 41.35 SFLX: 0.00
***DATE=2011-09-01_23:00:00 294.53528 2.25781 Timing: 0.20 Cumulative: 41.55 SFLX: 0.00
yw check output restart at 2011-08-26_00:00
***DATE=2011-09-02_00:00:00 293.75571 2.25479 Timing: 0.27 Cumulative: 41.81 SFLX: 0.00
The model finished successfully.......
mpirun returned with a return value: 0

Add editable flag to update_packages.sh

pip has an editable option (pip install -e <package>) for libraries so that changes can be made and used directly instead of treating libraries as static entities. I'd like to be able to pass a flag to update_packages.sh so that I can install the dmod libraries as editable, but not the third-party dependencies.
