webyrd / medikanren

Proof-of-concept for reasoning over the SemMedDB knowledge base, using miniKanren + heuristics + indexing.

License: MIT License

Racket 55.69% Scheme 41.81% Shell 0.02% Awk 0.02% Python 0.02% HTML 2.42% R 0.01% Perl 0.01% Makefile 0.01%

racket minikanren ncats-translator

medikanren's Introduction

mediKanren

*** FOR RESEARCH PURPOSES ONLY ***

Proof-of-concept for reasoning over medical knowledge graphs, using miniKanren + heuristics + indexing.

There are several prototypes, each in its own directory:

attic/code is the original prototype.
medikanren is the working prototype.
medikanren2 is the next-generation prototype.

Contributed use cases, queries and applications are now located in a directory separate from medikanren itself:

contrib/

If you have previously contributed code applying medikanren and can't find yours, look there.

medikanren's Issues

contract violation, expected real

I am building the index for the SemMedDB database.

The sample_semmed.csv runs successfully, but the full data exported from the 'PREDICATION' table does not. Please see the error in the attachment.

medikanren.error.txt

Integrate dbSNP for variant interpretation

https://www.ncbi.nlm.nih.gov/snp/

Chembl prefix being used

Queries sent to Unsecret Agent return compounds from ChEMBL with the "chembl" prefix rather than the "chembl.compound" prefix that is used in TRAPI.

Better logging on TRAPI server

Unsecret Agent not returning results for connections to UMLS identifiers for vomiting/nausea

In attempting to recreate the results of the TIDBIT regarding cyclic vomiting, I sent out several queries for connections to various identifiers related to vomiting, nausea, etc. There are connections to these identifiers present in SemMedDB, but no results are returned from Unsecret. This query graph was submitted to Unsecret through the ARS.

{
  "message": {
    "query_graph": {
      "nodes": [
        {
          "id": "n0",
          "set": false,
          "curie": "UMLS:C0520909",
          "type": "ChemicalSubstance"
        },
        {
          "id": "n1",
          "type": "named_thing",
          "set": false
        }
      ],
      "edges": [
        {
          "id": "e0",
          "source_id": "n1",
          "target_id": "n0"
        }
      ]
    }
  }
}

The following identifiers were also used without results being returned:
UMLS:C0027497
UMLS:C0027498
UMLS:C0718572
UMLS:C0722001
UMLS:C0520904
UMLS:C0520909

Support queries to find all proteins that interact with a given protein

For example, find all human proteins that are known to interact with a given human protein.

(Feature request from a clinical researcher.)

Support queries to find proteins with certain domains

For example, find all human proteins that contain a specific domain

(Feature request from a clinical researcher.)

Add "entities like these" to mediKanren

Google used to have a feature where you could enter a set of things, and then ask it to extend the list with more "similar" entities.

For example, if you entered "dog, cat, cow", it might add "horse, pig, chicken."

It would be cool if we could do this in mediKanren. Given a list of entities, we could find overlapping properties, e.g. each one INHIBITS c for the same concept c. Then, we could look for other inhibitors of c.

We could rank candidates by the number of shared predicates, weighting each predicate by the number of elements in the core set that share it.

For example, if 2 out of 3 items in the core set share a predicate, then satisfying this predicate is worth 2/3 (about 0.67) points in the ranking.
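The ranking idea above can be sketched as follows. This is a minimal illustration, not mediKanren code: the edge list, the entity names, and the function `similar_entities` are all hypothetical, and a "property" is assumed to be a (predicate, object) pair.

```python
from collections import defaultdict

# Hypothetical toy edges (subject, predicate, object); not real SemMedDB data.
EDGES = [
    ("drugA", "INHIBITS", "geneC"),
    ("drugB", "INHIBITS", "geneC"),
    ("drugC", "TREATS", "diseaseD"),
    ("drugX", "INHIBITS", "geneC"),
    ("drugY", "TREATS", "diseaseD"),
]

def similar_entities(core, edges):
    """Rank non-core entities by the properties they share with the core set."""
    # Collect the (predicate, object) properties of every subject.
    props = defaultdict(set)
    for s, p, o in edges:
        props[s].add((p, o))

    # A property's weight is the fraction of core entities that have it.
    weights = defaultdict(float)
    for entity in core:
        for prop in props.get(entity, ()):
            weights[prop] += 1 / len(core)

    # A candidate's score is the sum of the weights of the properties it satisfies.
    scores = {}
    for entity, entity_props in props.items():
        if entity in core:
            continue
        score = sum(weights.get(prop, 0.0) for prop in entity_props)
        if score > 0:
            scores[entity] = score

    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

ranked = similar_entities({"drugA", "drugB", "drugC"}, EDGES)
```

Here two of the three core entities inhibit geneC, so drugX, which also inhibits geneC, scores 2/3 and ranks ahead of drugY, which shares a property with only one core entity.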

Add query template for phenotypic drug repurposing

Problem: We want to be able to recommend drugs based on what a disease does to a patient.

We want to be able to run queries of the form:

"y such that [disease A] increases x (for some x) AND drug y decreases x."

and

"y such that [disease A] decreases x (for some x) AND drug y increases x."

I think the most general query template would be:

"y such that R(A,x) and R’(y,x) for some x."

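As a rough sketch of the general template, the query can be read as a join on the shared x. The edges, names, and the function `repurposing_candidates` below are hypothetical and only illustrate the shape of the query, not how mediKanren evaluates it.

```python
# Hypothetical toy edges (subject, predicate, object).
EDGES = [
    ("diseaseA", "INCREASES", "x1"),
    ("diseaseA", "DECREASES", "x2"),
    ("drug1", "DECREASES", "x1"),
    ("drug2", "INCREASES", "x2"),
    ("drug3", "INCREASES", "x1"),
]

def repurposing_candidates(edges, a, r, r_prime):
    """Return sorted (y, x) pairs such that r(a, x) and r_prime(y, x) both hold."""
    # All x with R(A, x).
    xs = {o for s, p, o in edges if s == a and p == r}
    # All y with R'(y, x) for one of those x.
    return sorted(
        (s, o) for s, p, o in edges if p == r_prime and o in xs and s != a
    )

# "y such that diseaseA increases x and drug y decreases x"
decreasers = repurposing_candidates(EDGES, "diseaseA", "INCREASES", "DECREASES")
# "y such that diseaseA decreases x and drug y increases x"
increasers = repurposing_candidates(EDGES, "diseaseA", "DECREASES", "INCREASES")
```

Returning the intermediate x alongside each y keeps the explanation for why a drug was suggested.
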
ITRB deployments

I know that medikanren's server is being rebuilt. This is a reminder that when that is complete, we need an ITRB deployment in the prod environment, correctly annotated in the smartAPI registry.

Connection being dropped on long server requests

The connection is dropped even though computation continues on the server.

Add tissue-specific filter

Given a list of gene names (and maybe metabolites too), filter those that have high expression in a particular tissue type.

This would be particularly useful for the output.

For example, "restrict output to all genes highly expressed in the uterus."

[This came up as a request during the May hackathon working with an SME.]
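A minimal sketch of such a filter, assuming an expression table keyed by gene and tissue is available. The table, its units, the threshold, and the function `filter_by_tissue` are hypothetical.

```python
# Hypothetical expression table: gene -> tissue -> expression level (arbitrary units).
EXPRESSION = {
    "gene1": {"uterus": 85.0, "liver": 3.0},
    "gene2": {"uterus": 2.0, "liver": 90.0},
    "gene3": {"uterus": 40.0},
}

def filter_by_tissue(genes, tissue, expression, threshold):
    """Keep genes whose expression in `tissue` is at least `threshold`.

    Genes with no recorded value for the tissue are treated as not expressed.
    """
    return [
        gene
        for gene in genes
        if expression.get(gene, {}).get(tissue, 0.0) >= threshold
    ]

uterine = filter_by_tissue(["gene1", "gene2", "gene3", "gene4"], "uterus", EXPRESSION, 30.0)
```

What counts as "high expression" would have to come from the data source; the fixed threshold here is only a placeholder.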

A bunch of semi-related tasks to help transition to mediKanren 2

Since we are moving to mediKanren 2, there are a bunch of related things to do:

port the webserver to mediKanren 2
implement TRAPI 1.1 compliance for the May Relay
augment our server with “pragma” style directives that extend TRAPI, so we can do reasoning that TRAPI doesn’t currently support, but which is useful to PMI use cases
implement the ability for TRAPI requests to span multiple KGs
implement lightweight reasoning / query expansion
implement / improve node and edge normalization
make sure the NCATS TRAPI queries we are getting don’t break or DoS the server, and return reasonable answers

Integrate COSMIC for knowledge of precision oncology

Eddy Yang recommends:

http://www.sanger.ac.uk/science/tools/cosmic
http://cancer.sanger.ac.uk/cosmic

building index with rust

I've started porting code/csv-semmed-ordered-unique-enum.rkt to Rust. It works on sample_semmed.csv.

On the SemMedDB page I am seeing a ~2GB gzipped SQL dump of PREDICATION. Can you confirm this is the right data source?

If so, the CSV file it produces is about 9.5GB. I think it would be simpler to decompress and process the sql.gz directly, as it's really almost identical to CSV anyway. I'd like to make the tool usable for everyone, though, so if you need explicit CSV support, that would be good to know.

Add post-translational modifications as relations

For example, A PHOSPHORYLATES B.

Other relations:

Ubiquitinates
Glycosylates

[This came up with SME at May hackathon]

Create web interface

Now that researchers are asking to use the tool, a web interface would avoid the need for people to install Racket. More importantly, it would avoid the need for them to deal with processing or downloading data sources.

It is critical that the web interface retains the interactive feel and responsiveness of the tool.

Make sure the NCATS TRAPI test queries return valid answers

Part of #69.

This appears to contain the canonical TRAPI ARA response validator:

https://github.com/NCATSTranslator/reasoner-validator

It uses jsonschema to validate TRAPI payloads.

Implement lightweight reasoning / query expansion

Implement the ability for TRAPI requests to span multiple KGs

Improved concept search

Searching for 'beta-catenin' does not return 'beta catenin', although perhaps it should.

Searching for 'betacatenin' returns nothing.

Consider using edit distance as well.
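One possible approach, sketched below, is to normalize names before comparing them and then allow a small edit distance. The function names and the default threshold of 1 are assumptions for illustration; the edit distance is the standard Levenshtein dynamic program.

```python
import re

def normalize(name):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]

def search(query, names, max_distance=1):
    """Return names whose normalized form is within max_distance of the query's."""
    q = normalize(query)
    return [n for n in names if edit_distance(q, normalize(n)) <= max_distance]

NAMES = ["beta catenin", "beta-catenin", "alpha catenin"]
hits = search("betacatenin", NAMES)
```

With this normalization, 'beta-catenin', 'beta catenin', and 'betacatenin' all compare as equal, and the edit-distance threshold additionally tolerates a single-character typo. A linear scan like this would be too slow for a full concept index, so it shows the matching rule rather than a search strategy.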

Implement TRAPI 1.1 compliance for the May Relay

Add metabolic databases

We want to add resources like biocyc/metacyc.

When working on different aspects of the phenotype for a disease with an unknown genetic cause, it may be useful to do "pathway intersections" between two different genes implicated in different aspects of the phenotype.

Import gene expression data sets and single gene up/down-regulation query

I want to be able to search for " UPREGULATES|DOWNREGULATES " based on L1000/connectivity map/other gene expression data sets

Make sure the NCATS TRAPI queries we are getting don’t break or DoS the server

Making this a separate issue: ", and return reasonable answers"

How to construct queries

Import wikidata

We should also consider using wikidata as our "patch" system when we correct errors in the NLP-generated data.

Implement / improve node and edge normalization

Deploy TRAPI 1.3

Here's your official ticket:
"https://arax.ncats.io/?smartapi=1 shows that mediKanren has 1.1 in development (infores:unsecret-agent). Please deploy 1.3 of this tool into development ASAP."

Tickets were created for all tools that do not currently have 1.3 in development. We know you are rebuilding the tool and making tremendous progress. But we would be remiss if we didn't keep you in the loop that we're moving to 1.3. Please keep us posted and keep on trucking.

Generalize queries supported in current GUI

Right now the Racket GUI supports:

Concept 1 -> Predicate 1 -> X -> Predicate 2 -> Concept 2

and

Concept 1 -> Predicate -> X

and

X -> Predicate -> Concept 2

where X is some unspecified concept.

However, the existing interface does not support the direct connection between two specified concepts and a specified predicate:

Concept 1 -> Predicate -> Concept 2

It would also be useful to be able to specify the middle concept in the two predicate query above:

Concept 1 -> Predicate 1 -> Concept 3 -> Predicate 2 -> Concept 2

Also, it would be handy to have the ability to specify the synthetic predicate 'any predicate', and the synthetic concept 'any concept' (which would subsume the X above).

It would be useful to be able to specify the types of an underspecified concept: gene product, disease, phenotype, etc. We could support SemMedDB semantic types, but we probably should also support synthetic concept types, since the SemMedDB types are rather messy.

We should support sorting and filtering of answers.
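All of the query shapes listed above can be treated as one alternating concept/predicate path in which any position may be left unspecified. The sketch below assumes that representation; `ANY`, `match_path`, and the toy edges are hypothetical, and concept types, sorting, and filtering are left out.

```python
ANY = None  # stands for the synthetic 'any concept' / 'any predicate'

# Hypothetical toy edges (subject, predicate, object).
EDGES = [
    ("c1", "INHIBITS", "x1"),
    ("x1", "CAUSES", "c2"),
    ("c1", "TREATS", "c2"),
    ("c1", "INHIBITS", "x2"),
]

def match_path(edges, pattern):
    """Match a pattern [concept, predicate, concept, predicate, concept, ...].

    Each position is either a fixed name or ANY. Returns matching paths as tuples.
    """
    if len(pattern) % 2 == 0:
        raise ValueError("pattern must alternate concept, predicate, ..., concept")
    nodes = {s for s, _, _ in edges} | {o for _, _, o in edges}
    # Start with every node allowed by the first concept slot.
    paths = [[n] for n in sorted(nodes) if pattern[0] in (ANY, n)]
    # Extend each partial path by one edge per (predicate, concept) pair.
    for i in range(1, len(pattern), 2):
        pred, concept = pattern[i], pattern[i + 1]
        paths = [
            path + [p, o]
            for path in paths
            for s, p, o in edges
            if s == path[-1] and pred in (ANY, p) and concept in (ANY, o)
        ]
    return [tuple(path) for path in paths]

direct = match_path(EDGES, ["c1", "TREATS", "c2"])
two_hop = match_path(EDGES, ["c1", "INHIBITS", ANY, "CAUSES", "c2"])
```

Under this representation the direct query, the two-predicate query with a fixed middle concept, and the existing X queries differ only in which slots are set to `ANY`.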

Use GEO chipseq data to infer gene-gene regulatory relationships

Example query: Given a transcription factor, what genes does it upregulate/downregulate?

Add a parameter: how proximate / distal (+/- 5kb upstream)?

[Added by an SME at the May hackathon.]

Make sure the NCATS TRAPI test queries return nonempty answers

Part of #69

Add "mechanisms in common" query

Given a set of concepts (likely drugs), find all mechanisms they touch in common.

For example, if drug X inhibits gene Y and drug Z inhibits gene Y, then "inhibits gene Y" is a mechanism in common.
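Read literally, this is a set intersection over each concept's (predicate, object) pairs. A minimal sketch with hypothetical toy edges and a hypothetical function name:

```python
# Hypothetical toy edges (subject, predicate, object).
EDGES = [
    ("drugX", "INHIBITS", "geneY"),
    ("drugZ", "INHIBITS", "geneY"),
    ("drugX", "STIMULATES", "geneW"),
]

def mechanisms_in_common(concepts, edges):
    """Return the (predicate, object) pairs shared by every given concept."""
    per_concept = [
        {(p, o) for s, p, o in edges if s == concept} for concept in concepts
    ]
    return set.intersection(*per_concept) if per_concept else set()

common = mechanisms_in_common(["drugX", "drugZ"], EDGES)
```

For the example in the text, the result is the single mechanism ("INHIBITS", "geneY").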

suggestion from Melissa Haendel: document licensing of KSs used

For the reasoners: you may wish to examine licensing as a criterion for inclusion. If you are uncertain, you can request a license evaluation here:
http://reusabledata.org/
https://github.com/reusabledata/reusabledata/issues/new
or you can also curate your own and make a PR. I would greatly appreciate it if we could ensure that all Translator Knowledge sources are curated for licensing information, and we would be grateful for assistance.

Augment our server with “pragma” style directives that extend TRAPI

Augment our server with “pragma” style directives that extend TRAPI, so we can do reasoning that TRAPI doesn’t currently support, but which is useful to PMI use cases.

License?

I could be missing it, but I didn't see one committed to the repository.

Make sure the NCATS TRAPI test queries return reasonable answers

Need a way to export results

While demoing the tool, folks have been asking if I could export results for them to take away.

A simple text file dump would be a good enough way to start.

Later on, .csv, .json, etc. export would be great.
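The three formats mentioned above can be produced with the Python standard library alone. This sketch assumes results are (subject, predicate, object) triples, which may not match the tool's actual answer format; the field names and function names are hypothetical.

```python
import csv
import io
import json

# Assumed column names for a result row; the real answer format may differ.
FIELDS = ("subject", "predicate", "object")

def to_text(rows):
    """Plain text dump: one tab-separated row per line."""
    return "".join("\t".join(row) + "\n" for row in rows)

def to_csv(rows):
    """CSV with a header row; csv.writer handles quoting of commas and quotes."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return buf.getvalue()

def to_json(rows):
    """JSON array of objects keyed by field name."""
    return json.dumps([dict(zip(FIELDS, row)) for row in rows], indent=2)

ROWS = [("c1", "INHIBITS", "x1"), ("c1", "TREATS", "c2")]
csv_text = to_csv(ROWS)
```

Each function returns a string, so writing a file is a single `open(path, "w", newline="")` followed by `write`.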

Integrate Orange Book Data

Output not being flushed

I'm trying to use mediKanren with the entire SemMedDB. I obtained the source db and created a .csv following the same format as indicated in code/sample_semmed.csv, at about 10GB. Everything seems to run fine but it's not creating any output. It seems that everything is kept in RAM until the end - which means the machine ran out of RAM and swap well before that. There doesn't seem to be anything wrong with printing to file on my system (macOS + Racket v7.3), only with flushing at these lines:

mediKanren/code/csv-semmed-ordered-unique-enum.rkt

Lines 42 to 43 in 8b1157b

    
           (flush-output out-predicate)
           (flush-output out-semtype))

I noticed the todo in README.md dated Nov 27, 2017:
TODO: add SemMedDB files, along with terms of use information for SemMedDB.

What is the recommended solution for this problem?

Integrate Pharos data

A good first cut would be to import INHIBITS/STIMULATES relationships on genes/targets.