Meillionen

Meillionen is an early-stage project to facilitate interoperability between simulation models written in different languages and frameworks for the physical and social sciences. Currently it is being used to experiment with different ways of connecting the SimpleCrop CLI/library with LandLab, and with making models conform to the PyMT interface.

Setup

In order to set up this project you'll have to install conda.

Then run

make setup

to set up the conda environment and install the needed packages.

Examples

Example models and workflows are in the examples directory.

Documentation

To build the documentation you'll need a meillionen Python development environment set up (see python/README.md for details), which will result in the ghp-import package being installed.

Instructions

Go into the docs folder

cd docs

Then build the jupyter book docs

jupyter book build .

Push those docs to the gh-pages branch on GitHub when the built docs look ready

ghp-import --no-jekyll -o _build/html -p


meillionen's Issues

SimpleCrop Output Unit Conversion

Plant Output

  • Day of Year
  • # of Leaf Nodes
  • Accumulated Temperature during Reproductive Phase (°C)
  • Plant Weight (g/m²)
  • Canopy Weight (g/m²)
  • Root Weight (g/m²)
  • Fruit Weight (g/m²)
  • Leaf Area Index (m²/m²)

Since output is independent of spatial scale, we don't need to scale output to match the overland flow model's output. If we need information like the total mass of fruit produced in a 10x10 m cell, then we need to convert units (multiply weights by 100 m²; average the accumulated temperature during the reproductive phase and the number of leaf nodes).
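
As a rough sketch of that conversion (assuming the plant output is a pandas dataframe; the column names here are illustrative, not the actual SimpleCrop schema):

import pandas as pd

CELL_AREA_M2 = 10 * 10  # a 10x10 m grid cell

def to_cell_totals(plant_output: pd.DataFrame) -> pd.DataFrame:
    out = plant_output.copy()
    # Extensive quantities (g/m²) scale with cell area, giving grams per cell.
    for col in ['plant_weight', 'canopy_weight', 'root_weight', 'fruit_weight']:
        out[col] = out[col] * CELL_AREA_M2
    # Intensive quantities (accumulated temperature, leaf node count, LAI)
    # are area averages and stay unchanged.
    return out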

Workplan and Project Goals

Goal

The goal of the OMF interoperability project is to be able to combine models written in different languages and environments with a low barrier to entry for model consumers.

Possible User Stories

Bob wants to couple two different models together. One is in Python (a rain routing model) and the other is in Fortran (a crop model). Bob wants to use the crop model as is, since it was created as a command line program. This means Bob first has to figure out the format of any files consumed and produced by the crop model. Then he has to create a wrapper library to make the model easy to call.

Adison wants to couple two different models together. One is in Python (a rain routing model) and the other is in Fortran (a crop model). Adison is more comfortable with Fortran (and with the model itself) so she wants to build the portions of the application she is using as a library. This means Adison has to adapt portions of the application to make it possible to call directly from C, Python etc.

Acme Inc wants to make use of Adison's wrapped crop model in their project. They install the model. First they need to understand how it works, so they experiment with it in the REPL using a built-in dataset and read documentation about its features. Then they integrate the model into their project.

Model Implementations

Models are often implemented as libraries or executables in compiled languages (C, C++, Fortran) or in interpreted/JITed languages (R, Python, Julia, NetLogo). The challenges in creating wrappers for these types of model implementations are outlined in this section.

Compiled Libraries

  • local (in-process/subprocess) wrappers
    • may not need to serialize/deserialize requests if data structures are compatible
    • can be loaded dynamically as shared libraries / dynamic link libraries
    • probably doesn't require its own container for dependency management since libraries are fairly self-contained
    • a local wrapper will not allow multiple instances if the library has global mutable state (see the sketch after this list)
  • remote wrappers
    • must deserialize requests and serialize responses
    • clients spawning an instance will need to know the model component's API, be able to make requests, and deserialize responses. This will be made easier by using an existing serialization library like Apache Arrow, Cap'n Proto or FlatBuffers.
    • probably won't require their own containers since remote models can be built fairly self-contained (but containers may be needed to get around OS requirements in development)
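
As a minimal sketch of the in-process case, a compiled model library can be loaded dynamically from Python with ctypes (the library name and exported function here are hypothetical; a real model would document its own symbols):

import ctypes

lib = ctypes.CDLL('./libsimplecrop.so')   # hypothetical shared library
lib.run_day.argtypes = [ctypes.c_double]  # e.g. daily rainfall in mm
lib.run_day.restype = ctypes.c_int        # e.g. a status code

status = lib.run_day(12.5)  # the float is passed directly, no serialization

Note that if the library keeps global mutable state, two such handles still share one model instance.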

Compiled Executables

  • local (in-process/subprocess) wrappers
    • can call the command line interface as a subprocess; also needs to be able to write input files and interpret output files (see the sketch after this list)
  • remote wrappers
    • must deserialize requests and serialize responses; also need to write inputs somewhere the executable can read from, in a format the executable understands, and interpret the results the executable produces into something that can be serialized
    • clients spawning an instance will need to know the model component's API, be able to make requests, and deserialize responses. This will be made easier by using an existing serialization library like Apache Arrow, Cap'n Proto or FlatBuffers.
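
A minimal sketch of the subprocess case (the file names and CLI contract here are assumptions, not the actual SimpleCrop interface):

import subprocess
import pandas as pd

def run_crop_model(weather: pd.DataFrame, workdir: str) -> pd.DataFrame:
    # Write the input file where the executable expects it...
    weather.to_csv(f'{workdir}/weather.csv', index=False)
    # ...invoke the CLI in that directory...
    subprocess.run(['simplecrop'], cwd=workdir, check=True)
    # ...and interpret the output file it produced.
    return pd.read_csv(f'{workdir}/plant.csv')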

Interpreted

  • local (in-process/subprocess) wrappers
    • depending on the programming languages involved, it is possible to make direct calls between them (for example, reticulate lets R call Python, and rpy2 lets Python call R)
    • setting up a container here is potentially more difficult because it requires unioning dependencies to form an image
  • remote wrappers
    • same challenges as remote compiled libraries but much more likely to benefit from a container setup since dependency management for scripting language / JITs is more difficult

Model Metadata

In order to find models or know how to connect to a model, we need metadata about that model. Ideally this could be dynamic, so that when a model instance is created the instance can be queried to determine its interface.
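
A minimal sketch of what queryable per-instance metadata could look like (all names here are hypothetical):

from dataclasses import dataclass, field

@dataclass
class ModelInterface:
    name: str
    sources: dict = field(default_factory=dict)  # input name -> schema
    sinks: dict = field(default_factory=dict)    # output name -> schema

class SimpleCropModel:
    def describe(self) -> ModelInterface:
        # A dynamic implementation could build this from configuration.
        return ModelInterface(
            name='simplecrop',
            sources={'daily': 'daily weather dataframe'},
            sinks={'plant': 'daily plant output dataframe'})

print(SimpleCropModel().describe())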

Work Plan / Roadmap

SimpleCrop / LandLab Example

What has already been done:

  • adapt simplecrop for use as a cli
  • adapt simplecrop for use in a library
  • adapt landlab overland flow for use in a library

What needs to be done (basics):

Short Term:

  • finish wrapping the cli interface using the pymt interface
  • finish wrapping the library using a pymt-like interface
  • create or make use of a spatial grid abstraction so that the crop model can be easily set up and run on multiple sites
  • write a generic data component to replace the weather and irrigation components
  • create a Python wrapper for local and remote simulation creation and scheduling using PyMT
  • add support for interpreted languages using PyO3 etcetera
  • adapt r.watershed / r.terraflow for use as a library, as a drop-in for landlab overland flow
  • containerize example models

Medium Term:

  • write a basic metadata library to make it possible to dynamically create instances
  • write a basic Python wrapper to call the basic metadata library (look at mint project)
  • documentation for libraries

Long Term:

  • tooling to publish and install models
  • create a library to reduce time needed to create local wrappers and simulators
  • create a library to reduce time needed to create remote wrappers and simulators
  • create a library to expose basic simulation interfaces
  • allow model instance APIs to depend on configuration

Cleanup simplecrop rust wrapper

  • see if a fixed width serialization format could replace our manual serialization / deserialization
  • use a macro to shorten the to_dict/from_dict methods
  • remove redundancy in schema declaration and in converting from record batch to struct and back

support for variable reference resource type

Being able to run multiple commands at once means it will be useful to have a reference to the model after a method call has run.

It could look something like

from meillionen.interface.resource import Memory

model = Memory(variable='model')

It will be most useful after support for multiple commands is done.

Making SimpleCrop PyMT compatible

In the Main.f03 of simplecrop you can see that models mutate state used by other models in the rate and integ phases. This makes adopting PyMT's single update method and getter/setter interface more difficult, because each model wants to change the others' state part way through their respective update methods (assuming that the rate and integ methods are inside the update method). We could get around this by more closely following a Moore machine model (adding a separate output method, or using getters at the beginning of the step) instead of just an update method (the Mealy machine model). Following a Moore machine model would result in a slightly different simulation, but it would also ensure that each component is receiving messages from its current time slice.
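
A minimal sketch of the Moore-style coupling (the components and their state update are illustrative):

class Component:
    def __init__(self, state=0.0):
        self.state = state

    def output(self):
        # Expose state as it stood at the end of the previous step.
        return self.state

    def update(self, inputs):
        # Rate and integration collapsed into one step for brevity.
        self.state += sum(inputs.values())

def step_all(components):
    # Moore style: read every component's output first, then update them
    # all, so no component observes another's mid-step state.
    outputs = {name: c.output() for name, c in components.items()}
    for name, c in components.items():
        c.update({k: v for k, v in outputs.items() if k != name})

step_all({'plant': Component(1.0), 'soil': Component(2.0)})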

Since SimpleCrop has a CLI, it would benefit from some multi-step optimizations for coupled models whose relationships form a DAG (if the coupled model is a DAG, then data for all time steps can be passed to the SimpleCrop model at once). This should be as easy as providing support for giving a year's worth of irrigation and weather data to the model.

Support a streaming interface

You should be able to send a stream of data to a model for processing. Currently we only allow passing data via files. A stream of data could be sent over a socket (we can hide this from the user with special loader and saver classes).
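
A rough sketch of the saver side, assuming Arrow's IPC stream format over a socket (the function and its contract are hypothetical):

import socket
import pyarrow as pa

def send_table(conn: socket.socket, table: pa.Table) -> None:
    # Stream record batches over the socket; a matching loader on the
    # model side would read the same Arrow IPC stream back.
    with conn.makefile('wb') as sink:
        with pa.ipc.new_stream(sink, table.schema) as writer:
            for batch in table.to_batches():
                writer.write_batch(batch)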

Create model import functionality

# How do we import models?
# - using regular python packages
# - using metadata files
#
# is there another way that would be useful?
crop_library = import_model_collection('your/model/here.yml')
crop_library.describe()     # Get basic documentation about the available models
crop_library.list_models()  # List all models available in this library
PineappleGrowth = crop_library.get_model('pineapple_growth')  # get a PyMT compatible model
Irrigation = crop_library.get_model('irrigation')

# Use the PyMT models
irrigation = Irrigation()
irrigation.initialize(**irrigation.setup())
pineapple_growth = PineappleGrowth()
pineapple_growth.initialize(**pineapple_growth.setup(), dataset='irrig.csv')

for t in range(0, 365):
  pineapple_growth.set_value('irrigation', irrigation.get_value('amount'))
  irrigation.update()
  pineapple_growth.update()

establish pluggable 2x2 crop models and hydrology models

To help keep the middleware focused on being useful, we should look at use cases that combine multiple models in multiple configurations to ask similar questions.

As a first cut we can work towards a 2x2 matrix combining a hydrology model and a crop model:

Crop models: SimpleCrop (should we graduate to DSSAT proper?), APSIM (https://www.apsim.info/)
Hydrology models: LandLab, r.watershed, r.terraflow

support dynamic interfaces

A method should be able to return a class interface from handling a request. This could be useful for models where configuration changes the signatures of methods and what data the model produces.

provide a tracing interface to make it easier for model users to call the library

Currently we call methods on a class manually, by building a method request and issuing it to the server.

Building a more class-like interface would be great. Maybe something like

landlab = Client(CLIRef('landlab'))
Overlandflow = landlab.get_class('overlandflow')
Infiltration = landlab.get_class('infiltration')

tracer = Tracer()

# not sure how to initialize
def initialize(overlandflow_cls, infiltration_cls, resources):
  overlandflow = overlandflow_cls(elevation=resources['elevation'], weather=resources['weather'])
  infiltration = infiltration_cls(soil_type=resources['soil_type'])
  return {'overlandflow': overlandflow, 'infiltration': infiltration }

def run_one_step(overlandflow: Model, infiltration: Model, duration: float, elapsed_time: float):
  dt = overlandflow.calc_time_step()
  dt = dt if dt + elapsed_time < duration else duration - elapsed_time
  overlandflow.run_one_step(dt=dt)
  infiltration.run_one_step(dt=dt)
  elapsed_time += dt
  return elapsed_time

def output(overlandflow: Model, infiltration: Model, swid: ResourcePayloadable):
  overlandflow.save('soil_water_infiltration_depth', swid)
  return {
      'soil_water_infiltration_depth': overlandflow.resource('soil_water_infiltration_depth')
  }

trace = tracer.run_until(
  models={
    'overlandflow': overlandflow, 
    'infiltration': infiltration
  },
  resources={
    'elevation': ...,
    'weather': ...,
    'soil_type': ...
  },
  initialize=initialize, 
  duration=60*30, 
  run_one_step=run_one_step, 
  output=output)

resources = trace.execute()

This would require overlandflow to have calc_time_step and run_one_step methods with the ability to return values.

use arrow to serialize requests to command line application / web framework

Right now the FuncRequest struct is being used. This has a number of drawbacks:

  • we have to copy data between Rust and the host language, or use language-specific FFI APIs to provide a view of the data. Using language-specific implementation details is costly in terms of development effort; copying data is fine for small structs, but it would be great to have the ability to pass Arrow record batches, tensors, JSON, etc.
  • resource implementations are restricted to Rust. It would be useful if users could create their own resource types without having to extend meillionen or write a Rust extension. Since Arrow has many language bindings, the host language can process an Arrow record batch and dispatch to the appropriate resource handlers (which could be written in Rust, Python, R, Java, Julia, etc.) to load and save data

The arrow request should have a sink struct column and a source struct column. The sink struct column will have one field for every sink declared in the interface; the source struct column will have one field for every source. This will also allow issuing multiple run instructions to one CLI program, which will be useful later.
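
A sketch of that layout in pyarrow (the field names are illustrative, not the actual meillionen interface):

import pyarrow as pa

# One row per run instruction; each struct field holds the resource
# (here a file path) bound to that sink or source.
sinks = pa.StructArray.from_arrays(
    [pa.array(['plant.feather']), pa.array(['soil.feather'])],
    names=['plant', 'soil'])
sources = pa.StructArray.from_arrays(
    [pa.array(['weather.feather'])],
    names=['daily'])
request = pa.RecordBatch.from_arrays([sinks, sources], names=['sink', 'source'])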

LandLab Integration Possibilities / BMIification

In order to interface SimpleCrop with LandLab for the purpose of crop modeling I see a few areas that will likely require changes for BMI compatibility and general interoperability:

  1. Input/Output - currently file inputs and outputs are hardwired into each model component. We will have to generalize or get rid of these somehow. As a library it would be better if the calling program can decide what data to keep after a model run and where to store that data. If compiled as a library this could mean converting to an API where we have model structs that get mutated or we at least have functions for building model structs for consumption by the calling program.
  2. Configuration - the model should be able to be configured via a function (right now it depends on hardwired configuration files).
  3. Allow Multiple Instances - since a parameterized model is currently global mutable state, it is not possible to run two instances of the same model if SimpleCrop were compiled as a shared object (library). We could get around this in various ways (like running multiple processes, always serializing and deserializing the model before calling the library, or shelling out to an executable), but it would push greater complexity into some sort of runner/scheduler, large portions of which would likely be model specific.
  4. BMI Compatibility - we could adopt the BMI in Fortran or in a wrapper written in Python / Rust etcetera. The benefit of the BMI wrapper being in Fortran is that it should be easy to integrate with other BMI models. However, if you wanted to expose the models in a non-BMI-compliant way, I'm not sure how stable / accepted modern (that is, object-oriented) Fortran FFI is. A wrapper in Rust or C would have a more standard FFI. Any thoughts or suggestions @mcflugen? Could you point me to an example of Fortran BMI and Python BMI compatibility if you think that is a reasonable choice?

@chporter thoughts? preferences?

Overall I think the choice of whether or not to put more time into a runner / scheduler for running single models is a matter of how common these global-mutable-state models are and how difficult they are to change. If they are common and difficult to change, then there may be some benefit in trying to figure out how to run them. I think that for this model in particular it is simpler and less work to remove the global mutable state and use structs. Of course we could attempt both eventually.

schema validation and basic metadata

  • DataFrames (uses Arrow data types)
  • Tensors (only support numeric types)
  • Basic metadata
    • label
    • description
    • unit of measure (the initial implementation will just be strings; the Python implementation could use gimli.units)
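
A sketch of what such a schema plus metadata could look like from Python, using Arrow data types (the names and the metadata convention are assumptions):

import pyarrow as pa

daily_weather = {
    'label': 'daily weather',
    'description': 'one row per day of rainfall and temperature',
    'schema': pa.schema([
        pa.field('rainfall', pa.float64(), metadata={'unit': 'mm'}),
        pa.field('temp_max', pa.float64(), metadata={'unit': 'degC'}),
    ]),
}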

client conversion of schema to handler

Handlers support particular resource types and depend on their source library (for example, pandas in Python and dplyr in R). Different libraries support different resource types, so the resource classes supported by one model as input may not be supported by another library.

An interactive function to list the handlers compatible with a particular schema would help model users try out a new model in the REPL.

Something like

compatible_handlers(schema)
> PandasHandler
>  matches with:
>    - Feather
>    - Parquet
>
> SparkHandler
>  matches with:
>    - Parquet

that shows the resource types supported by different handlers.

more useful interface for overlandflow / soil water infiltration wrapper

How the model currently works:

Takes a daily weather data frame (which has a precipitation (mm) column). Rain is assumed to fall uniformly on the elevation grid over the course of half an hour. Infiltration values (in mm) are reported for each grid cell for each day. The number of days to run is simply determined by the number of rows of the weather dataframe.

Parameters needed to make the model more useful:

  • duration of rainfall each day
  • allow multiple rainfall events each day (would need to change input weather data frame format)
  • support changing the soil type

Improvements to LandLabGridHandler:

  • needs to support model construction parameters
    • just asking the model user to construct a landlab model grid object and pickle it is easiest
    • supporting asc files requires additional information about which nodes should be fixed-value and which should be closed

Add project settings and scaffolding

A meillionen project settings file should describe how to call different models via

  • file system path
  • docker

This settings file should be usable from the Python wrapper to make it easy to set up a project and get workflows running.

A command line tool could be used to initialize a project and add models.
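
A hypothetical settings file (every key here is illustrative) might look like:

# meillionen.yml (hypothetical)
models:
  simplecrop:
    kind: path
    path: ./bin/simplecrop
  overlandflow:
    kind: docker
    image: example/overlandflow:latest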

Store Variable Interface

  • slicing - a store variable slice should be able to return a labeled n-dimensional array or a dataframe
  • dimension values - store variables should be indexable by their dimension values. For example, suppose a variable has a time dimension; then the variable should be indexable by date.
  • subsetting should be possible using a query builder interface
  • the Python interface should hide what type of backing store is used
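
Usage of the interface above might look something like this (every name is hypothetical):

swid = store.variable('soil_water_infiltration_depth')

# Index by dimension value: a date on the time dimension returns a
# labeled array (or dataframe) for that day.
day = swid.sel(time='2021-06-01')

# Subset with a query-builder interface; the backing store is hidden
# behind the same calls.
subset = swid.where(swid.dim('x').between(10, 20)).to_dataframe()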

reduce file path construction boilerplate in workflows

The example workflow connecting the overland flow model and the SimpleCrop model has a fairly large amount of code dedicated to constructing sink and source paths. We should think about ways of reducing the burden of organizing the data produced by computational models.
