
translator-openpredict's Introduction

๐Ÿ”ฎ๐Ÿ Translator OpenPredict


This repository contains the code for the OpenPredict Translator API available at openpredict.semanticscience.org, which serves a few prediction models developed at the Institute of Data Science.

  • various folders for different prediction models served by the OpenPredict API are available under src/:
    • the OpenPredict drug-disease prediction model in src/openpredict_model/
    • a model to compile the evidence path between a drug and a disease explaining the predictions of the OpenPredict model in src/openpredict_evidence_path/
    • a prediction model trained from the Drug Repurposing Knowledge Graph (aka. DRKG) in src/drkg_model/
  • the code for the OpenPredict API endpoints in src/trapi/ defines:
    • a TRAPI endpoint returning predictions for the loaded models

The data used by the models in this repository is versioned using dvc in the data/ folder, and stored on DagsHub at https://dagshub.com/vemonet/translator-openpredict

Deploy the OpenPredict API locally

Requirements: Python 3.8+ and pip installed

  1. Clone the repository with its submodules:

    git clone --recursive https://github.com/MaastrichtU-IDS/translator-openpredict.git
    cd translator-openpredict
  2. Pull the data required to run the models in the data folder with dvc:

    pip install dvc
    dvc pull

Start the API in development mode with Docker on http://localhost:8808; the API will automatically reload when you make changes to the code:

docker compose up api

Contributions are welcome! If you wish to help improve OpenPredict, see the instructions to contribute 👩‍💻 for more details on the development workflow.

Test the OpenPredict API

Run the tests locally with Docker:

docker compose run tests

See the TESTING.md file for more details on testing the API.

You can change the entrypoint of the test container to run other commands, such as training a model:

docker compose run --entrypoint "python src/openpredict_model/train.py train-model" tests
# Or with the helper script:
./resources/run.sh python src/openpredict_model/train.py train-model

Use the OpenPredict API

The user provides a drug or a disease identifier as a CURIE (e.g. DRUGBANK:DB00394 or OMIM:246300) and chooses a prediction model (only the Predict OMIM-DrugBank classifier is currently implemented).

The API will return predicted targets for the given drug or disease:

  • The potential drugs treating a given disease 💊
  • The potential diseases a given drug could treat 🦠

Feel free to try the API at openpredict.semanticscience.org

TRAPI operations

Operations to query OpenPredict using the Translator Reasoner API standards.

Query operation

The /query operation returns the same predictions as the /predict operation, using the ReasonerAPI format used within the Translator project.

The user sends a ReasonerAPI query asking for the predicted targets given a source and the relation to predict. The query is a graph with nodes and edges defined in JSON, and uses classes from the Biolink model.

You can use the default TRAPI query of the OpenPredict /query operation to try a working example.

Example of TRAPI query to retrieve drugs similar to a specific drug:

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "object": "n1",
                    "predicates": [
                        "biolink:similar_to"
                    ],
                    "subject": "n0"
                }
            },
            "nodes": {
                "n0": {
                    "categories": [
                        "biolink:Drug"
                    ],
                    "ids": [
                        "DRUGBANK:DB00394"
                    ]
                },
                "n1": {
                    "categories": [
                        "biolink:Drug"
                    ]
                }
            }
        }
    },
    "query_options": {
        "n_results": 3
    }
}
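
For reference, a minimal sketch of posting this query to the /query operation with Python's requests library (the endpoint path is the one documented above; reading message.results follows the TRAPI response layout):

import requests

# The TRAPI query from the example above: drugs similar to DRUGBANK:DB00394
trapi_query = {
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:similar_to"],
                }
            },
            "nodes": {
                "n0": {"categories": ["biolink:Drug"], "ids": ["DRUGBANK:DB00394"]},
                "n1": {"categories": ["biolink:Drug"]},
            },
        }
    },
    "query_options": {"n_results": 3},
}

response = requests.post(
    "https://openpredict.semanticscience.org/query",
    json=trapi_query,
    timeout=60,
)
response.raise_for_status()
# TRAPI responses list the matches under message.results
for result in response.json()["message"]["results"]:
    print(result)
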
Predicates operation

The /predicates operation will return the entities and relations provided by this API in a JSON object (following the ReasonerAPI specifications).

Try it at https://openpredict.semanticscience.org/predicates
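
A quick sketch of fetching and walking this object with requests, assuming the standard ReasonerAPI nested layout ({subject category: {object category: [predicates]}}):

import requests

# Assumes the standard ReasonerAPI /predicates layout:
# {subject_category: {object_category: [predicates]}}
meta = requests.get(
    "https://openpredict.semanticscience.org/predicates", timeout=30
).json()
for subject_category, objects in meta.items():
    for object_category, predicates in objects.items():
        print(f"{subject_category} -> {object_category}: {predicates}")
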

Notebook examples 📔

We provide Jupyter Notebooks with examples showing how to use the OpenPredict API:

  1. Query the OpenPredict API
  2. Generate embeddings with pyRDF2Vec, and import them in the OpenPredict API

Add embedding 🚉

The default baseline model is openpredict_baseline. You can choose the base model when you post new embeddings using the /embeddings call. Then the OpenPredict API will:

  1. add embeddings to the provided model
  2. train the model with the new embeddings
  3. store the features and model using a unique ID for the run (e.g. 7621843c-1f5f-11eb-85ae-48a472db7414)

Once the embeddings have been added, you can find the models previously generated (including openpredict_baseline), and use them as the base model when you ask for predictions or add new embeddings.
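
As a hypothetical sketch of this workflow (the parameter and field names emb_name, model_id, and embedding_file below are illustrative assumptions, not the documented contract; check the OpenAPI definition of the /embeddings call for the real one):

import requests

# Hypothetical sketch: parameter names below are assumptions, check the
# OpenAPI definition of the /embeddings call for the actual contract.
with open("drug_embeddings.csv", "rb") as f:
    response = requests.post(
        "https://openpredict.semanticscience.org/embeddings",
        params={
            "emb_name": "my_embeddings",         # assumed name for the new features
            "model_id": "openpredict_baseline",  # base model to retrain, per the docs
        },
        files={"embedding_file": f},
    )
# The response is expected to reference the new run ID (e.g. 7621843c-1f5f-11eb-...)
print(response.json())
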

Predict operation 🔮

Use this operation if you simply want to retrieve predictions for a given entity. The /predict operation takes 4 parameters (1 required):

  • A drug_id to get predicted diseases it could treat (e.g. DRUGBANK:DB00394)
    • OR a disease_id to get predicted drugs it could be treated with (e.g. OMIM:246300)
  • The prediction model to use (defaults to Predict OMIM-DrugBank)
  • The minimum score of the returned predictions, from 0 to 1 (optional)
  • The maximum number of results to return, starting from the highest score, e.g. 42 (optional)

The API will return the list of predicted targets for the given entity; the labels are resolved using the Translator Name Resolver API.

Try it at https://openpredict.semanticscience.org/predict?drug_id=DRUGBANK:DB00394
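
The same call from Python, as a minimal sketch: drug_id matches the URL above, while min_score and n_results are assumed names for the optional score threshold and result limit (verify them in the OpenAPI definition):

import requests

# Minimal sketch of the /predict call. "drug_id" matches the example URL above;
# "min_score" and "n_results" are assumed names for the optional parameters.
response = requests.get(
    "https://openpredict.semanticscience.org/predict",
    params={"drug_id": "DRUGBANK:DB00394", "min_score": 0.5, "n_results": 10},
    timeout=30,
)
response.raise_for_status()
print(response.json())
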


More about the data model 💽

Diagram of the data model used for OpenPredict, based on the ML Schema ontology (mls):

OpenPredict datamodel


Translator application

Service Summary

Query for drug-disease pairs predicted from pre-computed sets of graph embeddings.

Add new embeddings to improve the predictive models, with versioning and scoring of the models.

Component List

API component

  1. Component Name: OpenPredict API

  2. Component Description: Python API to serve pre-computed sets of drug-disease pair predictions from graph embeddings

  3. GitHub Repository URL: https://github.com/MaastrichtU-IDS/translator-openpredict

  4. Component Framework: Knowledge Provider

  5. System requirements

    5.1. Specific OS and version if required: Python 3.8

    5.2. CPU/Memory (for CI, TEST and PROD): 32 CPUs and 32 GB memory?

    5.3. Disk size/IO throughput (for CI, TEST and PROD): 20 GB?

    5.4. Firewall policies (does the team need access to infrastructure components?): the NodeNormalization API https://nodenormalization-sri.renci.org

  6. External Dependencies (any components other than current one)

    6.1. External storage solution: Models and database are stored in /data/openpredict in the Docker container

  7. Docker application:

    7.1. Path to the Dockerfile: Dockerfile

    7.2. Docker build command:

    docker build -t ghcr.io/maastrichtu-ids/openpredict-api .

    7.3. Docker run command:

    Replace ${PERSISTENT_STORAGE} with the path to persistent storage on host:

    docker run -d -v ${PERSISTENT_STORAGE}:/data/openpredict -p 8808:8808 ghcr.io/maastrichtu-ids/openpredict-api
  8. Logs of the application

    8.1. Format of the logs: TODO

Acknowledgments

Funded by the NIH NCATS Translator project.

translator-openpredict's People

Contributors

arifx, carlosug, elifozkn, pahmadi8740, rcelebi, vemonet


translator-openpredict's Issues

Inconsistent Docker build on different machines

Using the jupyter/all-spark-notebook image as base, and doing pip install .:
https://github.com/MaastrichtU-IDS/translator-openpredict/blob/master/Dockerfile#L10

See full Dockerfile here (as simple as possible):

FROM jupyter/all-spark-notebook

RUN pip install --upgrade pip

COPY . .
RUN pip install .

EXPOSE 8808
ENTRYPOINT [ "openpredict" ]
CMD [ "start-api" ]

The issue:

When doing docker build on different machines, it works or doesn't depending on the position of the stars in the sky.

On Ubuntu 18.04:

  • pip install . : works as expected
  • pip install -e . : won't work due to a "permission issue" with pytest (could be because the user is not root at this point)

On CentOS 7:

  • pip install . : this time this one does not work! (the build works, but openapi.yml cannot be found at runtime)
  • pip install -e . : but this one works!

Issue with pip install . on CentOS:

openpredict-api    |   File "/opt/conda/lib/python3.8/pathlib.py", line 1218, in open
openpredict-api    |     return io.open(self, mode, buffering, encoding, errors, newline,
openpredict-api    |   File "/opt/conda/lib/python3.8/pathlib.py", line 1074, in _opener
openpredict-api    |     return self._accessor.open(self, flags, mode)
openpredict-api    | FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.8/site-packages/openpredict/../openapi.yml'

How can Docker screw up that badly?
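
The traceback above hints at the root cause: openpredict resolves openapi.yml relative to the installed package (openpredict/../openapi.yml), a location that only exists in some install modes. A hedged sketch of a more robust approach, assuming the file is shipped inside the package as package data:

from pathlib import Path

# Sketch, not the repository's actual code: ship openapi.yml inside the
# openpredict package (declared as package_data) and resolve it relative to
# the module file, so it is found regardless of install mode or working dir.
OPENAPI_PATH = Path(__file__).resolve().parent / "openapi.yml"

if not OPENAPI_PATH.exists():
    raise FileNotFoundError(f"openapi.yml not found at {OPENAPI_PATH}")
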

ZAP Scan Baseline Report - https://openpredict.semanticscience.org/

View the following link to download the report.
RunnerID:436967534

No score attribute on KG edges

Describe the problem

In queries to OpenPredict's /query endpoint, the knowledge graph edges do not contain the model prediction scores. It looks like the scores are only contained in the results analyses. The scores should also be contained in the edge attributes (in addition to score in the analyses). This would allow the scoring data to be retained if OpenPredict's edges get incorporated into other multihop results. I think the score used to be present in edge attributes in previous versions of OpenPredict.

Provide the URL to the problematic OpenPredict API call

Provide the full URL or curl command to reproduce the problematic call if applicable.

To Reproduce

The example biolink:treated_by and biolink:similar_to queries in the OpenAPI documentation reproduce this issue. Seen in the dev, CI, and prod environments.

Look into CML to build models and DVC to version them

Build models using GitHub Actions or GitLab CI: https://cml.dev/

Version models with https://dvc.org/
See https://determined.ai/blog/building-an-enterprise-deep-learning-platform-2/

DVC is significantly more lightweight than Pachyderm, running locally and adding versioning on top of your local storage solution. DVC simply integrates into existing Git repositories to track the version of data that was used to run experiments. ML teams can also define and execute transformation pipelines with DVC; however, the biggest drawback of DVC is that those transformations run locally and are not automatically scaled to a cluster. Notably, DVC does not handle the storage of data, simply the versioning.

See also: https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9264-how-to-build-efficient-ml-pipelines-from-the-startup-perspective.pdf
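
For reference, a minimal sketch of reading a DVC-tracked file from Python with dvc.api (the tracked path and revision below are illustrative):

import dvc.api

# Illustrative sketch: read one DVC-tracked file at a given Git revision
# without pulling the whole data directory. The path is hypothetical.
with dvc.api.open(
    "data/openpredict_baseline.joblib",
    repo="https://github.com/MaastrichtU-IDS/translator-openpredict",
    rev="master",
    mode="rb",
) as f:
    model_bytes = f.read()
print(len(model_bytes))
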

Explore Kubeflow?

Empty tags are displayed by Swagger UI

The OpenAPI definition declares translator and reasoner tags for the API (required for proper indexing in the SmartAPI registry).

Unfortunately Swagger UI renders these empty tags, and there is no plan to change this behavior. See this issue: swagger-api/swagger-ui#3819

A way to avoid this behavior is to remove the translator and reasoner tags from the openapi.yml file, and fill them in directly via the SmartAPI web UI when registering the API. But this requires manual operations and creates a difference between the published OpenAPI definition and the one registered in SmartAPI.

We could change the ReasonerAPI definition to remove the query and predicates tags:

  • Put the /query call under the reasoner tag
  • Put the /predicates call under the translator tag

Invalid TRAPI

Describe the problem

This query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "a": {
                    "categories": [
                        "biolink:Drug"
                    ],
                    "ids": [
                        "DRUGBANK:DB00394"
                    ]
                },
                "b": {
                    "categories": [
                        "biolink:Disease"
                    ]
                }
            },
            "edges": {
                "ab": {
                    "subject": "a",
                    "object": "b",
                    "predicates": [
                        "biolink:treats"
                    ]
                }
            }
        },
        "knowledge_graph": {
            "nodes": {},
            "edges": {}
        },
        "results": []
    }
}

returns a response. However, the response does not validate against TRAPI 1.1.

The issue appears to be the categories in the knowledge graph:

            "nodes": {
                "DRUGBANK:DB00394": {
                    "categories": "biolink:Drug"
                },
                "OMIM:102100": {
                    "categories": "biolink:Disease"
                },
                "OMIM:102300": {
                    "categories": "biolink:Disease"
                },
                "OMIM:102400": {
                    "categories": "biolink:Disease"
                },
                "OMIM:102500": {
                    "categories": "biolink:Disease"
                },

The value of categories should be a list, like:

            "nodes": {
                "DRUGBANK:DB00394": {
                    "categories": "[biolink:Drug"]
                },
                "OMIM:102100": {
                    "categories": ["biolink:Disease"]
                },

Bug setting the API server URL in a Docker container

The servers URL in the OpenAPI YAML definition is not updated from the provided arguments when run in a Docker container.

The exact same command:
openpredict start-api --server-url http://myurl

Works fine locally: the server URL is updated in the exposed OpenAPI definition.

When run as the entrypoint of a Docker container... For some reason the arg is properly passed to the Python code, and the API is started correctly by the following line (with the right server_url):

api.add_api('openapi.yml', arguments={'server_url': server_url})

But for some reason the started API has an EMPTY servers URL (not even a /, which should be the default).

Seems to be an issue with Connexion.

Use an up-to-date pyRDF2Vec to generate embeddings?

See the pyRDF2Vec documentation:

https://pyrdf2vec.readthedocs.io/en/latest/readme.html#create-a-knowledge-graph-object

To create embeddings:

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import UniformSampler
from pyrdf2vec.walkers import RandomWalker

# Query DBpedia remotely instead of loading a local graph
kg = KG(location="https://dbpedia.org/sparql", is_remote=True)

# Random walks of depth 4, 5 walks per entity, sampled uniformly
walkers = [RandomWalker(4, 5, UniformSampler())]

transformer = RDF2VecTransformer(walkers=walkers, sg=1)
# Entities should be a list of URIs that can be found in the Knowledge Graph
embeddings = transformer.fit_transform(kg, entities)

type->attribute_type_id

Describe the problem

The response from OpenPredict does not validate against TRAPI 1.1.

Provide the URL to the problematic OpenPredict API call

Hitting https://openpredict.semanticscience.org/query with this query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "a": {
                    "categories": [
                        "biolink:Drug"
                    ],
                    "ids": [
                        "DRUGBANK:DB00394"
                    ]
                },
                "b": {
                    "categories": [
                        "biolink:Disease"
                    ]
                }
            },
            "edges": {
                "ab": {
                    "subject": "a",
                    "object": "b",
                    "predicates": [
                        "biolink:treats"
                    ]
                }
            }
        },
        "knowledge_graph": {
            "nodes": {},
            "edges": {}
        },
        "results": []
    }
}

To Reproduce

I'm getting results with attributes that look like:

{
                            "name": "model_id",
                            "source": "OpenPredict",
                            "type": "EDAM:data_1048",
                            "value": "openpredict-baseline-omim-drugbank"
                        },

Specs for the attributes can be found here: https://github.com/NCATSTranslator/ReasonerAPI/blob/master/docs/reference.md#attribute-

I think that type in this attribute should be attribute_type_id?
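
For comparison, a sketch of the same attribute reshaped to the TRAPI 1.1 field names (assuming name and source map to original_attribute_name and attribute_source):

# Sketch of the attribute above reshaped for TRAPI 1.1: "type" becomes
# "attribute_type_id"; "name" and "source" are assumed to map to the 1.1
# fields "original_attribute_name" and "attribute_source".
attribute = {
    "attribute_type_id": "EDAM:data_1048",
    "value": "openpredict-baseline-omim-drugbank",
    "original_attribute_name": "model_id",
    "attribute_source": "OpenPredict",
}
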

Add users

Possible options to store and serve ML models

DVC and CML (Continuous Machine Learning)

Build models using GitHub Actions or GitLab CI: https://cml.dev/

Version models with https://dvc.org/
See https://determined.ai/blog/building-an-enterprise-deep-learning-platform-2/

DVC is significantly more lightweight than Pachyderm, running locally and adding versioning on top of your local storage solution. DVC simply integrates into existing Git repositories to track the version of data that was used to run experiments. ML teams can also define and execute transformation pipelines with DVC; however, the biggest drawback of DVC is that those transformations run locally and are not automatically scaled to a cluster. Notably, DVC does not handle the storage of data, simply the versioning.

See also: https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9264-how-to-build-efficient-ml-pipelines-from-the-startup-perspective.pdf

Explore Kubeflow?

OpenML to share models

Concepts: https://openml.github.io/OpenML/#concepts

See how to publish a dataset:
https://openml.github.io/openml-python/master/examples/30_extended/datasets_tutorial.html#sphx-glr-examples-30-extended-datasets-tutorial-py

Should we also publish tasks?

ScaNN lib from Google

https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html

Clipper AI API

Clipper AI: serve ML models (TensorFlow, PyTorch, sklearn...) through an HTTP REST API (no built-in OpenAPI support)

Pachyderm

Build, train, and deploy your data science workloads on whatever Kubernetes deployment you call home.

https://www.pachyderm.com/getting-started/

Machine Learning model databases

ModelDB

Open Source ML Model Versioning, Metadata, and Experiment Management

https://github.com/VertaAI/modeldb

Video presentation: https://databricks.com/fr/session/modeldb-a-system-to-manage-machine-learning-models

MLDB

https://mldb.ai/

MLDB (Machine Learning Database) is an open-source database designed for machine learning. You can install it wherever you want and send it commands over a RESTful API to store data, explore it using SQL, train machine learning models, and expose them as APIs.

disease treats disease being returned on TRAPI query

Describe the problem

During today's stand-up meeting, there was a query for ChemicalEntity treats MONDO:0005180 (Parkinson disease), and OpenPredict was returning many results saying MONDO:0005180 (Parkinson disease) treats MONDO:0008199 (late-onset Parkinson disease). I believe this was in the dev environment.

PKs:
ARS: ecbaf378-e57e-4d6d-bc78-419e8c280a7e
OpenPredict result: https://arax.ncats.io/?r=4548bc8a-9466-485f-814e-f0829876b8e7

Sample edge:

        "e0": {
          "attributes": [
            {
              "attribute_type_id": "EDAM:data_1048",
              "description": "model_id",
              "value": "openpredict_baseline"
            },
            {
              "attribute_type_id": "EDAM:data_1772",
              "description": "score",
              "value": "0.9835270306548173"
            }
          ],
          "object": "MONDO:0008199",
          "predicate": "biolink:treats",
          "sources": [
            {
              "resource_id": "infores:openpredict",
              "resource_role": "primary_knowledge_source"
            },
            {
              "resource_id": "infores:cohd",
              "resource_role": "supporting_data_source"
            }
          ],
          "subject": "MONDO:0005180"
        },

Issue with Spark

Issue when running get_predict on node2 with Spark (it uses pandas as a fallback)

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
INFO:root:<SparkContext master=local[*] appName=pyspark-shell>
20/09/22 18:10:42 ERROR Executor: Exception in task 4.0 in stage 0.0 (TID 4)/ 8]
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/vemonet/.local/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 477, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
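
The error message itself suggests the fix: point the workers and the driver at the same interpreter before the SparkContext is created. A sketch:

import os
import sys

# Point both the Spark driver and its workers at the interpreter running this
# script, so their Python minor versions can never diverge.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark import SparkContext  # import after the env vars are set

sc = SparkContext(master="local[*]", appName="openpredict-test")
print(sc.parallelize(range(8)).sum())  # quick smoke test
sc.stop()
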

"extra fields not permitted" when empty aux graph included in message

Describe the problem

When Workflow Runner queries a TRAPI service, it includes an empty auxiliary_graphs object in the message. OpenPredict responds with

{
    "detail": [
        {
            "loc": [
                "body",
                "message",
                "auxiliary_graphs"
            ],
            "msg": "extra fields not permitted",
            "type": "value_error.extra"
        }
    ]
}

Provide the URL to the problematic OpenPredict API call

Tested in dev and CI environments

To Reproduce

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": [
                        "biolink:Drug"
                    ]
                },
                "n1": {
                    "ids": [
                        "MONDO:0004979"
                    ],
                    "categories": [
                        "biolink:Disease"
                    ]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": [
                        "biolink:treats"
                    ]
                }
            }
        },
        "auxiliary_graphs": {}
    },
    "workflow": [
        {
            "id": "lookup"
        }
    ],
    "submitter": "Workflow Runner"
}

Also occurs if auxiliary_graphs is null

Additional context

Currently a blocker for the CQS Path E query that targets OpenPredict. WFR needs to include the empty aux graphs to work around a current issue with how ARAX validates TRAPI against the 1.4.0 schema. Including an empty aux graph shouldn't affect how queries are performed, so it would be helpful if OpenPredict could ignore empty or null aux graphs in the TRAPI query. Can this be accommodated?
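
A workaround sketch on the OpenPredict side, assuming the "extra fields not permitted" error comes from pydantic-style validation of the request body: drop empty or null auxiliary_graphs before validation runs.

# Workaround sketch (assumes the error comes from pydantic validation):
# drop empty or null auxiliary_graphs from the raw body before it is validated.
def strip_empty_aux_graphs(body: dict) -> dict:
    message = body.get("message", {})
    if not message.get("auxiliary_graphs"):  # catches {}, None, and absent
        message.pop("auxiliary_graphs", None)
    return body
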
