ml4ai / delphi

Framework for assembling causal probabilistic models from text and software.

Home Page: http://ml4ai.github.io/delphi

License: Apache License 2.0

Makefile 0.16% Python 6.56% Fortran 87.75% Shell 0.08% CSS 1.32% Smarty 0.01% TeX 0.07% JavaScript 0.16% HTML 0.07% Dockerfile 0.01% Julia 0.03% C++ 2.68% CMake 0.05% Forth 0.03% Scheme 0.07% Gnuplot 0.91% Scala 0.04% Pascal 0.01% Pawn 0.01% C 0.01%

delphi's People

Contributors

adarshp, aishwarya34, bgyori, cl4yton, cthoyt, fan-luo, jastier, jiaminghao, lchamp87x, manujinda, min-yin-sri, pratikbhd, skdebray


delphi's Issues

DBN-JSON representation for Fortran files without a PROGRAM module

The current structure of the pgm.json file is as follows:

{
    "start": <name_of_PROGRAM_module>,
    "name": "pgm.json",
    "dateCreated": <date_of_creation>,
    "functions": [<list_of_functions>]
}

This means that the "start" key is only created when there is a PROGRAM module in the FORTRAN file. Neither PETPT.for nor PETASCE.for has a PROGRAM module; they only contain SUBROUTINEs. For these files, a "start" field is not created and the pgm/lambdas generation script crashes.

For now, I will add an initial check that searches for a PROGRAM module and, if none is found, adds a dummy "start" field (see the sketch below). Moving forward, how should we represent such FORTRAN files?
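
For reference, a minimal sketch of the interim check, assuming the parsed AST is a list of dicts with a "tag" field; the helper name and AST shape are hypothetical, not the actual generation-script internals:

def get_start_module(ast_nodes):
    """Return the PROGRAM module name, or a dummy value if there is none."""
    for node in ast_nodes:
        if node.get("tag") == "program":
            return node["name"]
    # No PROGRAM module (as in PETPT.for and PETASCE.for): fall back to a
    # dummy "start" value so the pgm/lambdas generation script does not crash.
    return "__dummy_start__"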

Locations of docstrings

(Quoting an earlier comment from @cl4yton:) "... stuff in here? Note, it does not seem like you can attach a docstring to any ..."

Replying to your comment, @cl4yton - we can't place docstrings in arbitrary places for Python to automatically process them, that's true. However, we can always modify the __doc__ attribute of objects to set their docstrings, for example:

def function_name():
    pass

function_name.__doc__ = "docstring"

Pickling lambda functions

@pauldhein and I were discussing this a while ago - it occurred to us that since the output of the program analysis pipeline is a pickled Python object, there is no reason, in principle, why the lambda functions couldn't be pickled alongside the rest of the output. For example, the following function in PETPT_lambdas.py:

def PETPT__lambda__TD_0(TMAX, TMIN):
    TD = ((0.6*TMAX)+(0.4*TMIN))
    return TD

Could be constructed as follows:

PETPT__lambda__TD_0 = eval("lambda TMAX, TMIN: ((0.6*TMAX)+(0.4*TMIN))")

Here, the string argument to eval could be constructed in the same way that the second line of each existing lambda function in the lambdas.py files is built up from parsing the XML AST output of OFP.

Alternatively (and this seems to me to be the right way), one could take advantage of type annotations and use the more powerful def syntax for declaring functions -

exec("def PETPT__lambda__TD_0(TMAX: float, TMIN: float) -> float: return ((0.6*TMAX)+(0.4*TMIN))")

(assuming we can get these types - can we?)

and later the PETPT__lambda__TD_0 object can be used as a value in the dict object produced by genPGM.py.
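
For concreteness, a minimal sketch of the exec-based construction, assuming the source string has already been assembled from the OFP AST output; exec-ing into an explicit namespace lets us pull the function object out and store it in the GrFN dict:

# The source string below is assumed to be generated by the pipeline.
namespace = {}
source = (
    "def PETPT__lambda__TD_0(TMAX: float, TMIN: float) -> float:\n"
    "    return ((0.6*TMAX)+(0.4*TMIN))"
)
exec(source, namespace)
PETPT__lambda__TD_0 = namespace["PETPT__lambda__TD_0"]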

Since functions are first class objects in Python, you can actually set attributes for functions as well - perhaps this might make it easier to keep track of things like the function type (assign/lambda/condition, etc.), the reference, and so on:

PETPT__lambda__TD_0.fn_type = "lambda"
PETPT__lambda__TD_0.reference = 9
PETPT__lambda__TD_0.target = "TD"

And then if someone wants to serialize the GrFN object to a JSON file, we could define the following function:

import inspect

def to_json_serialized_dict(function):
    return {
        "name": function.__name__,
        "type": function.fn_type,
        "target": function.target,
        # inspect.signature objects are not JSON-serializable, so just take
        # the parameter names.
        "sources": list(inspect.signature(function).parameters),
    }

Not super urgent, but I do think it might be an investment worth making to simplify things in the long run...

delphi depends on unreleased features from Indra

delphi/core.py imports the Influence class from the indra.statements module.

However, the latest indra release, 1.5, which is what pip installs by default from PyPI, does not include this class, since the feature was introduced after that release.

I suggest the following as an interim solution:

  • Update the requirements.txt file to point at the indra git repository instead of PyPI.
  • Add dependency_links in setup.py to make it install indra using git instead of PyPI.

And, for later releases:

  • Use released versions as dependencies and hold back any changes that depend on unreleased features.
  • Add version specifications in requirements.txt and setup.py to avoid confusion and prevent problems due to backwards-incompatible changes in any of the dependencies (see the sketch below).
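
For concreteness, hedged requirements.txt sketches of both approaches, assuming indra's upstream repository is sorgerlab/indra:

# Interim: install indra from its git repository instead of PyPI
git+https://github.com/sorgerlab/indra.git#egg=indra

# Later: pin released versions to avoid surprises from backwards-incompatible changes
indra==1.5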

Reduce the amount of stateful behaviour in genPGM.py

I'm having some issues with global state in genPGM.py - basically, the lambdas.py file seems to change upon multiple runs of the same function with the same inputs.

Steps to reproduce (assuming you are in the Delphi repo root directory) -

cd delphi/program_analysis/autoTranslate
./autoTranslate ../../../data/program_analysis/crop_yield.f
python

Then in the Python interpreter, do:

>>> from delphi.program_analysis.autoTranslate.scripts.genPGM import get_asts_from_files, create_pgm_dict
>>> asts = get_asts_from_files(['crop_yield.py'])
>>> pgm_dict = create_pgm_dict('lambdas.py', asts, 'pgm.json')

The lambdas.py file is unchanged by this call, and contains the following:

def UPDATE_EST__lambda__TOTAL_RAIN_0(TOTAL_RAIN, RAIN):
    TOTAL_RAIN = (TOTAL_RAIN+RAIN)
    return TOTAL_RAIN

def UPDATE_EST__lambda__IF_1_0(TOTAL_RAIN):
    return (TOTAL_RAIN<=40)

def UPDATE_EST__lambda__YIELD_EST_0(TOTAL_RAIN):
    YIELD_EST = (-((((TOTAL_RAIN-40)**2)/16))+100)
    return YIELD_EST

def UPDATE_EST__lambda__YIELD_EST_1(TOTAL_RAIN):
    YIELD_EST = (-(TOTAL_RAIN)+140)
    return YIELD_EST

def CROP_YIELD__lambda__MAX_RAIN_0():
    MAX_RAIN = 4.0
    return MAX_RAIN

def CROP_YIELD__lambda__CONSISTENCY_0():
    CONSISTENCY = 64.0
    return CONSISTENCY

def CROP_YIELD__lambda__ABSORPTION_0():
    ABSORPTION = 0.6
    return ABSORPTION

def CROP_YIELD__lambda__YIELD_EST_0():
    YIELD_EST = 0
    return YIELD_EST

def CROP_YIELD__lambda__TOTAL_RAIN_0():
    TOTAL_RAIN = 0
    return TOTAL_RAIN

def CROP_YIELD__lambda__RAIN_0(DAY, CONSISTENCY, MAX_RAIN, ABSORPTION):
    RAIN = ((-((((DAY-16)**2)/CONSISTENCY))+MAX_RAIN)*ABSORPTION)
    return RAIN

However, upon calling this function a second time:

>>> pgm_dict = create_pgm_dict('lambdas.py', asts, 'pgm.json')

The numbers at the end of the function names in lambdas.py get incremented by one.

def UPDATE_EST__lambda__TOTAL_RAIN_1(TOTAL_RAIN, RAIN):
    TOTAL_RAIN = (TOTAL_RAIN+RAIN)
    return TOTAL_RAIN

def UPDATE_EST__lambda__IF_1_1(TOTAL_RAIN):
    return (TOTAL_RAIN<=40)

def UPDATE_EST__lambda__YIELD_EST_2(TOTAL_RAIN):
    YIELD_EST = (-((((TOTAL_RAIN-40)**2)/16))+100)
    return YIELD_EST

def UPDATE_EST__lambda__YIELD_EST_3(TOTAL_RAIN):
    YIELD_EST = (-(TOTAL_RAIN)+140)
    return YIELD_EST

def CROP_YIELD__lambda__MAX_RAIN_1():
    MAX_RAIN = 4.0
    return MAX_RAIN

def CROP_YIELD__lambda__CONSISTENCY_1():
    CONSISTENCY = 64.0
    return CONSISTENCY

def CROP_YIELD__lambda__ABSORPTION_1():
    ABSORPTION = 0.6
    return ABSORPTION

def CROP_YIELD__lambda__YIELD_EST_1():
    YIELD_EST = 0
    return YIELD_EST

def CROP_YIELD__lambda__TOTAL_RAIN_1():
    TOTAL_RAIN = 0
    return TOTAL_RAIN

def CROP_YIELD__lambda__RAIN_1(DAY, CONSISTENCY, MAX_RAIN, ABSORPTION):
    RAIN = ((-((((DAY-16)**2)/CONSISTENCY))+MAX_RAIN)*ABSORPTION)
    return RAIN

This side effect needs to be eliminated.
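
One possible way to remove the global state, sketched below under the assumption that the suffix counters are currently module-level variables (the class and method names here are hypothetical): thread the per-variable counts through an explicit object that create_pgm_dict constructs afresh on every call.

from collections import defaultdict

class LambdaNamer:
    """Generates lambda function names without relying on module-level state."""

    def __init__(self):
        self.counts = defaultdict(int)

    def name(self, scope, target):
        index = self.counts[(scope, target)]
        self.counts[(scope, target)] += 1
        return "{}__lambda__{}_{}".format(scope, target, index)

# A fresh LambdaNamer per create_pgm_dict call means two runs with the same
# inputs produce identical lambdas.py contents.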

Implement updating of ProgramAnalysisGraph nodes to be in sync with the loop index

Right now, there is a 'delay' in the updating of nodes in a ProgramAnalysisGraph (which should be integrated better into the AnalysisGraph class): 'downstream' outputs like YIELD_EST are updated a couple of steps after 'upstream' outputs like RAIN (in the crop_yield.f example). This results in the DAY variable, which serves as the loop index, lagging behind the FORTRAN program by 2.

Add 'drop-in'/'setup' functionality.

Basically, this is the idea: an analyst should be able to 'set up' the workspace with a script before launching the visualizer/simulator. Thus, the app object should be available to import, and app.run() can be called after the setup to launch the app with the desired configuration.

Things that need to be able to be pulled into the workspace:

  • INDRA Statements
  • Eidos JSON output
  • Raw text
  • Text files
  • Directory of text files.
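
A minimal sketch of the setup workflow described above; the import path is an assumption and the loading steps are placeholders:

from delphi.app import app  # assumed location of the app object

# ... load INDRA statements, Eidos JSON output, raw text, text files, or a
#     directory of text files into the workspace here ...

app.run()  # launch the visualizer/simulator with the configured workspace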

Library Functions

Appropriate handlers need to be created for library functions (such as READ, WRITE, etc.) in the program analysis code. It is not clear to me at this moment whether they need to preserve the actual behavior of Fortran's library routines. By that I mean that I don't think delphi cares about the particulars of the call; I believe it only cares that the code is receiving input or producing output. If that is the case, rather than creating handlers to translate Fortran library calls into Python calls, it may be possible to replace all input calls with a single call to a function like 'input' and all output calls with a call to a function like 'output'. The user can then define what an input and an output call is (see the sketch below).
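
A minimal sketch of that idea; the function names are hypothetical, and the defaults just use stdin/stdout until the user overrides them:

def delphi_input():
    """Generic stand-in for Fortran READ: return whitespace-separated values from stdin."""
    return input().split()

def delphi_output(*values):
    """Generic stand-in for Fortran WRITE/PRINT: send the values to stdout."""
    print(*values)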

Implement functionality to process data from FEWSNET

The data is contained in shape files, so we will need to figure out how to use those.

  • Write script to programmatically download FEWSNET data.
  • Write script that gets IPC phase classifications for individual South Sudan districts for different time periods.
  • Connect the data from the shapefiles that contain IPC phase classification data with the shapefiles that contain administrative boundaries (see the sketch after this list).
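
A hedged sketch of the shapefile wiring, using geopandas (not currently a Delphi dependency; the file names are placeholders):

import geopandas as gpd

ipc = gpd.read_file("fewsnet_ipc_classifications.shp")          # IPC phase polygons
districts = gpd.read_file("south_sudan_admin_boundaries.shp")   # administrative boundaries

# Spatial join: attach IPC phase classifications to the districts they intersect.
districts_with_ipc = gpd.sjoin(districts, ipc, how="left", op="intersects")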

indra and networkx version

While creating a fresh virtual environment for delphi development using requirements.txt, I noticed this warning was generated:
indra 1.7.0 has requirement networkx==1.11, but you'll have networkx 2.1 which is incompatible.
Is this a concern?

GrFN spec interlanguage compatibility

@adarshp, I took a look at the spec for GrFN.
https://delphi.readthedocs.io/en/master/grfn_spec.html#top-level-grfn-specification
This is a great representation. Kind of like a higher level IR for translating between languages.

One thing I noticed is this note:

TODO: we think Fortran is restricted to integer values for iteration variables, which would include iteration over indexes into arrays. Need to double check this.

If the GrFN schema is going to work for multiple languages, it is going to need to support iterator loops like those in C++, Python, and Julia.

I guess you could have an int-loop construct and an iterator-loop construct, or a per-language loop construct (see the illustration below).
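
To illustrate the distinction, the two loop forms the schema would need to cover look like this in Python:

values = [1.0, 2.5, 4.0]
n = len(values)

# Fortran-style loop: the iteration variable is an integer index.
total = 0.0
for i in range(1, n + 1):
    total += values[i - 1]

# Iterator loop (as in C++, Python, Julia): the loop variable ranges over the
# collection itself, with no integer index.
total = 0.0
for value in values:
    total += value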

Update readme file with new configuration options

This is my current understanding of how the system has to be run:

Creation of the model:
root@59c89406cc39:/src/delphils# ./delphi.py --create_model --indra_statements data/sample_indra_statements.pkl --adjective_data data/adjectiveData.tsv --output_cag_json /out/testDelphiDanielJSON --output_dressed_cag /out/testDelphiDanielCAG --output_variables_path /out/testDelphiDanielVar

Execution of the model:
root@59c89406cc39:/src/delphils# ./delphi.py --execute_model --input_dressed_cag /out/testDelphiDanielCAG --input_variables_path /out/testDelphiDanielVar --output_sequences /out/DelphiSequencesResult.csv

Kill json_dev branch

@adarshp: I'd like to kill the json_dev branch. Rather than merge with master (which is far ahead), I'm going to create a new branch called pa_dev, which will be for program analysis. But before I go ahead and kill json_dev, I just wanted to run it by you.

Add flag for input adjective data

To incorporate delphi into a workflow, the inputs and outputs must be explicitly specified - right now the path to the gradable adjective data file is hard-coded into the system. @dgarijo

New handlers

The program analysis code has only been tested on some fairly small programs. So it is likely that there are program constructs it has not seen before, such as arrays, strings and while loops. Handlers will need to be added into the program analysis code for these constructs.

where to put for2py test inputs

I've been making up tests for various Fortran language constructs for2py will have to handle (currently: I/O and modules; soon: multi-dimensional arrays). Right now these tests are in a couple of different places: some are in delphi/tests/data and some are in delphi/delphi/program_analysis/autoTranslate/tests/test_data/. It would be good for all of these to live in the same place. Where should I put them?

Option to specify output location when creating model/dressed CAG

@dgarijo: Re: our email conversation - it is possible to set the output path of the result folder while creating the model using the --model_dir flag. However, this is probably not so clear from the help message. In any case, I'll add flags to the delphi CLI to specify the separate locations of the output model files.

program analysis also in delphi

@adarshp: Posting as a "question" for discussion, although I'm "stating" it here...

The program analysis project right now has the following components:

  1. Analyze Fortran to map to Python
  2. Analyze the Python AST to map to a CAG with functions that can be input to delphi
  3. Sensitivity analysis of the delphi CAG with functions

Item (3) will be in delphi (it has general use). For now, I'd like to put parts of (2) also in the delphi project, under the directory program_analysis/ (at the project root, sibling to sensitivity). For now I'll keep this in the sensitivity branch. This means probably adding Jon Stephens to the project.

Long term: this may move out, depending on whether we consider the Python side of program analysis a component of delphi (which I'm currently OK with).

[Feature request]: Running Delphi with only a subset of the variables in a CAG

In order to connect Delphi to other workflows in MINT, we may need to run the model with only a subset of the variables in the original CAG. In those cases, it would be useful to have a function that takes in a list of variables and removes the nodes that are not in that list from the CAG prior to execution (see the sketch below).
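
A minimal sketch of such a function, assuming the CAG is a networkx graph whose node identifiers are the variable names:

import networkx as nx

def restrict_to_variables(cag: nx.DiGraph, variables) -> nx.DiGraph:
    """Return a copy of the CAG containing only the requested variables."""
    keep = set(variables)
    restricted = cag.copy()
    restricted.remove_nodes_from([n for n in cag.nodes() if n not in keep])
    return restricted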

Set up a backcasting evaluation framework for Delphi.

  • Implement a function that takes the name (str) of a concept and the maximum depth for graph traversal (int) as parameters, and returns a CAG centered around the concept (see the sketch after this list).
  • Write a function that takes a concept-level CAG as an input parameter and connects each concept of the CAG with an indicator, while ensuring that indicators are not shared among concepts.
  • Set initial conditions for backcasting - get mean values of indicators from relevant data sources, for South Sudan in 2016.
  • Run the DBN for one time step
  • Compare predicted values of indicators to actual values.
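
A hedged sketch of the first item, assuming the CAG is a networkx graph and treating edges as undirected for the purpose of measuring depth:

import networkx as nx

def cag_around_concept(cag: nx.DiGraph, concept: str, max_depth: int) -> nx.DiGraph:
    """Return the subgraph of nodes within max_depth hops of the given concept."""
    lengths = nx.single_source_shortest_path_length(
        cag.to_undirected(), concept, cutoff=max_depth
    )
    return cag.subgraph(lengths.keys()).copy()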

Set up DB for storing parameterization data.

Right now, Delphi uses data tables stored as plain text files to parameterize its models. However, this will not scale with increasing amounts of data. Another concern is minimization of git repo bloat. For these reasons, it might be good to have an online database (hosted on vision or a SISTA server) that Delphi can query programmatically. I'm leaning towards Neo4j since that's the DB system I have the most experience with.

Handle Function Returns

The program analysis code does not yet handle function returns. This is in part because the handling of such returns requires a few non-trivial additions. As a note, a good resource on functions can be found here:
https://pages.mtu.edu/~shene/COURSES/cs201/NOTES/F90-Subprograms.pdf

The following additions must be made:

  • Functions must be "contained" within programs or subroutines, which are the only things that can call them. Thus, functions must be scoped so that multiple definitions of the same name in different CONTAINS scopes will not interfere with each other. This can be done by renaming functions. I have a few notes on this as well: (1) I believe CONTAINS statements can be nested. (2) I'm not sure, but it might be possible to access local variables in parent scopes.
  • The return value is encoded differently than in most other languages: it is stored in a variable with the same name as the function. Thus, for a function foo, the return value is set by assigning foo = value. Logic that is aware of this will need to be added to translate this correctly to Python (see the sketch after this list).
  • There is an actual return keyword that terminates a function/subroutine/program. I assume delphi expects an entire container/function to execute in its entirety; if that is the case, a program transformation will likely be necessary to preserve the return behavior.
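
An illustration of the return-value convention from the second bullet, and one way a direct Python translation could handle it (both the Fortran and the translation are illustrative, not output of the current pipeline):

# Fortran:
#     REAL FUNCTION MEAN(A, B)
#         MEAN = (A + B) / 2.0
#     END FUNCTION MEAN
#
# The return value is set by assigning to the function's own name, so the
# translation must turn that assignment into an explicit return.
def mean(a: float, b: float) -> float:
    mean_ = (a + b) / 2.0   # Fortran: MEAN = (A + B) / 2.0
    return mean_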

Request to change contents of "input" attribute of loop_plate specification

(Just realized I should have first created this as an issue!)

@stephensj2: @pauldhein is making good progress getting the DBN-JSON to DBN graph wiring working. Paul identified a change in the DBN-JSON output that we'd like to ask you to make -- this is just for the loop_plate specification. Up to this point, my thought was that the "input" attribute of the loop_plate spec should list all of the variables that are referenced within the loop_plate. It turns out it is much more useful to Paul to have this be the list of variable names that are set in the scope (container fn) that the loop_plate appears within. And in this case, we don't need actual <variable_reference>s (no need for the index info), just the <variable_name> (the base string name of the variable). I've updated the description of the Function Loop Plate Specification to reflect this (text in maroon).

Is it easy for you to make this change?
