
nameresolution's Issues

Handle possessives and plurals better

"Parkinson's" returns hits, but "Parkinsons" does not. Other punctuation like commas or hyphens appear to be ignored by solr. Can we also ignore apostrophes?

Also see Alzheimer: searching for "Alzheimer's" or "Alzheimer" returns over 20 hits, including things from CHEBI as well as MONDO, but "Alzheimers" returns only one. Things like this will be handled inconsistently across naming schemes, so we should probably do some work here to make sure it doesn't matter.
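One lightweight option (a sketch only, not how the service currently works) is to fold apostrophes out of both the indexed names and the query string, so that possessive and plural-like forms collapse together; the same rule could alternatively be expressed as a Solr analysis filter such as PatternReplaceFilterFactory:

```python
import re

def fold_apostrophes(text: str) -> str:
    """Drop apostrophes so "Parkinson's" and "Parkinsons" analyze identically.
    Intended to be applied the same way at index time and at query time."""
    return re.sub(r"'", "", text)

assert fold_apostrophes("Parkinson's disease") == "Parkinsons disease"
assert fold_apostrophes("Alzheimer's") == "Alzheimers"
```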

Look for cases where the same synonym is mapped to multiple identifiers

We know this happens sometimes (e.g. Beta-Sitosterol maps to both UMLS:C0106127 and PUBCHEM.COMPOUND:222284). This usually indicates that cliques have been incorrectly split, but it might also indicate that a synonym is being incorporated incorrectly (e.g. CHEMBL.COMPOUND:CHEMBL4526634, UniProtKB:P07659, UniProtKB:P00573, NCIT:C41338 and PR:000037458 are very different things that all have the synonym "1").

Logically, this sort of test should be done in Babel Validation (BV), but that would require BV to do cross-matching, which would take even more memory than it currently uses. Instead, we should first investigate whether we can get these results from the Solr database, which should be optimized for label querying. This StackOverflow answer might be a good starting point. It may also be possible to do this efficiently in Souffle.
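As a starting point for that Solr-side investigation, a JSON Facet query along these lines could surface synonyms indexed under more than one CURIE (a sketch only; the core name and the `name`/`curie` field names are assumptions about the schema):

```python
import requests

SOLR = "http://localhost:8983/solr/name_lookup/select"

# JSON Facet API query: bucket documents by name, count distinct CURIEs per
# bucket, and report names attached to more than one CURIE.
facet_query = {
    "query": "*:*",
    "limit": 0,
    "facet": {
        "shared_names": {
            "type": "terms",
            "field": "name",             # assumed field name
            "limit": 1000,
            "mincount": 2,               # name must appear in >= 2 documents
            "facet": {"n_curies": "unique(curie)"},
        }
    },
}

resp = requests.post(SOLR, json=facet_query, timeout=300)
for bucket in resp.json()["facets"]["shared_names"]["buckets"]:
    if bucket["n_curies"] > 1:           # same synonym, multiple identifiers
        print(bucket["val"], bucket["n_curies"])
```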

Steps:

  • Make a list of all cases where a preferred name is shared by multiple cliques.
  • Make a list of all cases where a preferred name is identical to a synonym in another clique.
  • Make a list of all cases where two cliques share the same synonym.

There are some known examples of this issue:

How to deal with overlapping concepts

Thanks to the inclusion of all of UMLS, we now have some cases where searching for something results in a few well-normalized identifiers, followed by lots of UMLS identifiers that overlap or are a perfect subset of the well-normalized identifiers.

e.g. https://name-resolution-sri.renci.org/lookup?string=Neonatal%20withdrawal&offset=0&limit=10

concerns:

  1. Will this be confusing for downstream users, or do they have a way to differentiate these similar concepts from each other?
  2. Are we scoring these matches properly so that the best-normalized nodes are returned before more vaguely normalized nodes?

Need to be able to differentiate between different species in genes, proteins, etc.

e.g. if you search NameRes for "fgf8", you get back several different UniProt IDs for human, zebrafish, etc. The UI team wants to make sure that users can clearly differentiate between the identifiers by means of the label (note that they use the canonical label from NodeNorm, NOT the label(s) from NameRes).

The ultimate fix for this is going to be the metadata service that will eventually fit into NodeNorm, but it would be good to have a short-term fix. This could be entirely on the UI's end (e.g. calling mygene.info for each UniProtKB identifier), but eventually the Node Metadata Service is the solution here.
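As a rough illustration of the short-term, UI-side idea, each UniProtKB identifier could be resolved to a species via mygene.info (a sketch; the exact query field and response shape should be checked against the mygene.info docs):

```python
import requests

def species_for_uniprot(uniprot_id: str) -> dict:
    """Ask mygene.info for the gene symbol, name and NCBI taxon of a
    UniProtKB accession. Sketch only; the query field may need adjusting."""
    resp = requests.get(
        "https://mygene.info/v3/query",
        params={"q": f"uniprot:{uniprot_id}", "fields": "symbol,name,taxid"},
        timeout=30,
    )
    hits = resp.json().get("hits", [])
    return hits[0] if hits else {}

print(species_for_uniprot("P13569"))   # human CFTR, expected taxid 9606
```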

Alternative: prioritize names with brackets to the canonical name in NodeNorm.

Host on correct URL

Right now we're hosting on the robokop server, but it should really be at nameresolution-sri.renci.org. There is an instance at that URL, but it's broken.

Add benchmarking

It would be great to have benchmarking to check the speed of a set of queries and to load-test NameRes instances.

Possible options:

We do have the NodeNorm/NameRes/Babel validation sheet, which we could use to set up integration tests. Note that that spreadsheet can include failing tests for functionality that is still in development.
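As one possible starting point, a small script can time a fixed set of queries against the documented `string`/`limit` parameters of `/lookup` (a sketch, not a full load test; a real benchmark would also vary limit/offset and run concurrent requests):

```python
import statistics
import time

import requests

NAMERES = "https://name-resolution-sri.renci.org/lookup"
QUERIES = ["diabetes", "asthma", "cystic fibrosis", "parkinson", "fgf8"]

def time_query(string: str, repeats: int = 5) -> float:
    """Median latency (seconds) for POSTing a single lookup query."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        requests.post(NAMERES, params={"string": string, "limit": 10}, timeout=60)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

for q in QUERIES:
    print(f"{q:20s} {time_query(q) * 1000:8.1f} ms")
```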

Ensure a smooth autocomplete experience

Translator UI calls NameRes as an autocomplete provider, i.e. it queries NameRes for matching concepts every time the user enters a new character. For a smooth autocomplete experience, NameRes should meet three criteria:

  1. Adding additional characters should not radically change search results.
  2. Once the expected result has shown up, adding additional characters shouldn't get rid of it.
  3. Every additional character should move the expected results further up the search results.

I think this was an artifact of our previous Solr query, but we should write a test to make sure the new query doesn't have the same problem.
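A test along these lines could catch regressions against criterion 2 (a sketch; the expected term/CURIE pair is just an assumed example):

```python
import requests

NAMERES = "https://name-resolution-sri.renci.org/lookup"

def top_curies(string: str, limit: int = 10) -> list:
    """CURIEs of the top `limit` hits for a query string."""
    resp = requests.post(NAMERES, params={"string": string, "limit": limit}, timeout=60)
    return list(resp.json())

def test_expected_hit_stays_once_it_appears():
    """Criterion 2: once the expected CURIE shows up for some prefix, typing
    more of the same term should not push it out of the top results."""
    term, expected = "diabetes mellitus", "MONDO:0005015"  # assumed example pair
    seen = False
    for i in range(4, len(term) + 1):
        hits = top_curies(term[:i])
        if expected in hits:
            seen = True
        elif seen:
            raise AssertionError(f"{expected} disappeared at prefix {term[:i]!r}")
    assert seen, f"{expected} never appeared while typing {term!r}"
```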

Should fix NCATSTranslator/Feedback#315, NCATSTranslator/Feedback#312

Add support for Tox

I think this is the recommended way to run Python tests now, especially for testing multiple Python versions.

When we do this, it would be great to add a GitHub action to run the tests for every new pull request.

Query for "HIV" fails...

import requests

nr_url = 'https://name-resolution-sri.renci.org/lookup'
params = {'string': "HIV", 'limit': -1}
r = requests.post(nr_url, params=params)
print(r.json())

I get a "JSONDecodeError: Expecting value: line 1 column 1 (char 0)"
I try other names for HIV. "human immunodeficiency virus" gives the same error, but "HIV/AIDS" returns CURIEs.

-Kamileh from Multiomics

TS.5.c Add documentation

Update the README to include user-level documentation: what the service is and how to use it.

Acetaminophen issues on 2023may18 NodeNorm and NameRes

We have three different acetaminophen issues:

UMLS Identifiers Supported?

I am trying to find the particular CURIE we use in the Clinical Trials Multiomics KG for cystic fibrosis in the Name Resolver. It is UMLS:C0010674.

I POST to find all potential CURIEs for the string "cystic fibrosis".

import requests

nr_url = 'https://name-resolution-sri.renci.org/lookup'

params = {'string':"cystic fibrosis", 'limit':-1}
r = requests.post(nr_url, params=params)
curies = list(r.json().keys())
print(curies)

I get this list of CURIEs

['MONDO:0009061', 'UMLS:C0455371', 'UMLS:C0494350', 'UMLS:C0392164', 'UMLS:C0357179', 'UMLS:C4510747', 'UMLS:C0260930', 'MONDO:0005219', 'UMLS:C0420028', 'UMLS:C3839063', 'UMLS:C4020619', 'UMLS:C2911653', 'UMLS:C0948452', 'UMLS:C0054504', 'MESH:D040501', 'UMLS:C4281719', 'PR:000014417', 'UniProtKB:P05109', 'MONDO:0054868', 'NCBIGene:6279', 'UMLS:C2711060', 'UMLS:C1135187', 'UMLS:C1527396', 'UMLS:C4546077', 'UMLS:C1997690', 'UMLS:C4546076', 'NCBIGene:6280', 'UMLS:C2599388', 'UMLS:C1859047', 'UMLS:C0809945', 'UMLS:C0428295', 'UMLS:C3825312', 'UMLS:C4546078', 'UMLS:C3669165', 'UMLS:C3669166', 'MESH:C026422', 'UMLS:C0056888', 'UMLS:C4698912', 'UMLS:C1997110', 'MONDO:0013112', 'MONDO:0013087', 'UMLS:C2827436', 'MONDO:0008887', 'UMLS:C3696937', 'UMLS:C1830847', 'UMLS:C2242739', 'UMLS:C3888831', 'UMLS:C1262474', 'UMLS:C2242728', 'UMLS:C0055725', 'UMLS:C2163825', 'UMLS:C0180269', 'UMLS:C0200917', 'UMLS:C2985286', 'UMLS:C4510748', 'UMLS:C1998161', 'UMLS:C5243752', 'UMLS:C4027850', 'UMLS:C2956101', 'UMLS:C4065544', 'UMLS:C3873346', 'UMLS:C0200920', 'UMLS:C1294306', 'UMLS:C1960658', 'UMLS:C0972869', 'UMLS:C0268409', 'UMLS:C1135342', 'UMLS:C1398240', 'UMLS:C3873356', 'UMLS:C2732778', 'UMLS:C2733432', 'UMLS:C2599389', 'UMLS:C2181591', 'UMLS:C4324649', 'UMLS:C1960705', 'UMLS:C0200918', 'UMLS:C4016407', 'MONDO:0005413', 'UMLS:C0410967', 'UMLS:C3873270', 'UMLS:C3873354', 'UMLS:C3696911', 'UMLS:C0348815', 'UMLS:C4050451', 'UMLS:C2931413', 'UMLS:C5188872', 'UMLS:C0348816', 'MONDO:0009062', 'UMLS:C2183116', 'UMLS:C4303495', 'UMLS:C4016791', 'UMLS:C2956100', 'UMLS:C5439238', 'UMLS:C0010676', 'UMLS:C1960626', 'UMLS:C0200919', 'UMLS:C3805037', 'UMLS:C3873257', 'UMLS:C2163824', 'UMLS:C2735559', 'UMLS:C3873179', 'UNII:PW7453NX3R', 'UMLS:C0056889', 'PR:000001044', 'UniProtKB:P13569', 'PANTHER.FAMILY:PTHR24223:SF19', 'GO:0005260', 'UMLS:C2938941', 'NCBIGene:1080', 'UMLS:C2262295', 'UMLS:C3805192', 'UMLS:C4729746', 'REACT:R-HSA-5678895', 'UMLS:C3873281', 'UMLS:C2874308', 'UMLS:C2735558', 'UMLS:C5188873', 'PUBCHEM.COMPOUND:504670', 'UMLS:C0548105', 'UMLS:C5229443', 'UniProtKB:P34158', 'EC:5.6.1.6', 'UNII:V89TAKK8SJ', 'UMLS:C2720898', 'UniProtKB:P26361', 'UMLS:C5539201', 'UMLS:C3873342', 'UMLS:C5161645', 'UMLS:C3873160', 'UMLS:C5208972', 'UMLS:C0651197', 'UMLS:C5230996', 'PR:000001045', 'PUBCHEM.COMPOUND:132758', 'UMLS:C0297062', 'MESH:C093349', 'UniProtKB:Q1LX78', 'UMLS:C4274847', 'UMLS:C3267922', 'UMLS:C4521703', 'UNII:0189J4XE1H', 'UMLS:C3659855', 'NCBIGene:140871', 'NCBIGene:107080633', 'NCBIGene:106481718', 'UMLS:C5208133', 'UMLS:C3873231', 'UMLS:C4742376', 'UMLS:C0483174', 'UniProtKB:P34158-1', 'UniProtKB:P13569-1', 'UniProtKB:P13569-3', 'UniProtKB:P26361-1', 'UniProtKB:P13569-2', 'UMLS:C4721985', 'UMLS:C5213231', 'UniProtKB:P26361-3', 'UniProtKB:P26361-2', 'UMLS:C0403805', 'UMLS:C4070204', 'PR:000046699', 'UMLS:C5186148', 'UMLS:C3248145', 'UMLS:C3516215', 'UMLS:C1840270', 'UMLS:C3689054', 'UMLS:C3248144', 'UMLS:C1326420', 'UMLS:C3248142', 'UMLS:C4729743', 'UMLS:C4521924', 'UMLS:C1326417', 'UMLS:C3702386', 'UMLS:C5161646', 'UMLS:C1857424', 'UMLS:C3703925', 'UMLS:C3811819', 'UMLS:C3525605', 'UMLS:C5412100', 'UMLS:C3248143', 'UMLS:C3248141', 'UMLS:C5439689', 'UMLS:C5439687', 'UMLS:C4709464', 'UMLS:C4051855', 'UMLS:C4732261', 'UMLS:C4316220']

But I am unable to find UMLS:C0010674 in the list of CURIEs returned by Name Resolver.

if "UMLS:C0010674" in curies:
    print("true")
else:
    print("false")

I get false...

-Kamileh from Multiomics

Add a data-loading Kubernetes job

I'm currently working on separating the data-loading components of this repository into a separate directory (PR #28). Running these locally requires copying the synonym files to your computer, and running them on Hatteras would require an sbatch file. Instead, it would be neat to come up with a Kubernetes job that (1) mounts a particular Babel results directory from RENCI Projects, (2) generates the necessary Solr index, and (3) writes it back into RENCI Projects for deployment.

Add support for filtering to a set of prefixes

Filtering to a set of prefixes would allow searches to take the view of a single ontology or identifier source, allowing large and detailed identifier sources like UMLS to be filtered out when not needed. The specific need here is from the UI team, who have found that filtering to MONDO (and possibly HP) removes a lot of problematic matches; they would like NameRes to handle this filtering so that they don't end up with fewer results than requested when filtering on their end.

Theoretically, there are two ways in which we could do this:

  1. Filter only on the preferred identifier, i.e. concept A with a preferred ID of MONDO:1234 and a secondary ID of HP:4567 will be returned if filtering on MONDO but not if filtering on HP.
  2. Filter on any identifier, i.e. concept A with a preferred ID of MONDO:1234 and a secondary ID of HP:4567 will be returned if filtering on MONDO AND if filtering on HP, but not on other prefixes.

I think the first approach is fine for an initial implementation, but we can also implement the second if needed. Of course a complete implementation would include both filter_only_prefix and filter_out_prefix fields, but to meet this use case we only need filter_only_prefix.
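For illustration, a request using the proposed parameter might look like this (note that `filter_only_prefix` is the field proposed in this issue; it does not exist in the deployed API yet):

```python
import requests

resp = requests.post(
    "https://name-resolution-sri.renci.org/lookup",
    params={
        "string": "diabetes",
        "limit": 10,
        "filter_only_prefix": "MONDO",   # proposed, hypothetical parameter
    },
    timeout=60,
)
print(resp.json())
```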

Implement an internal filter list

There are several types of concepts we might want to filter:

There are also several places we could implement filtering:

  1. In Babel: UMLS:C0277562 should not be included in any cliques in Node Normalization or Name Resolution. This would prevent "adult disease" from showing up in autocomplete or in Translator results.
  2. In Name Resolution: UMLS:C0277562 could be included in NodeNorm, but should not be included in NameRes' Solr index. "Adult disease" could still show up in Translator results, but won't show up in autocomplete (see the sketch after this list).
  3. In UI: calling NameRes will still return UMLS:C0277562, but the UI will filter it out when displaying autocomplete results.
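A sketch of option 2: an internal filter list applied while loading synonyms into the NameRes Solr index (the record shape is an assumption; only the UMLS CURIE comes from this issue):

```python
# CURIEs that should never be loaded into the NameRes index.
FILTERED_CURIES = {
    "UMLS:C0277562",   # "adult disease" and similarly unhelpful broad concepts
}

def keep_record(record: dict) -> bool:
    """Return False for records whose CURIE is on the internal filter list."""
    return record.get("curie") not in FILTERED_CURIES

records = [
    {"curie": "MONDO:0004979", "name": "asthma"},
    {"curie": "UMLS:C0277562", "name": "adult disease"},
]
print([r for r in records if keep_record(r)])   # only the MONDO record survives
```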

Add GET method for `/lookup`

Currently, /lookup is a POST-only endpoint. If it could also be used via a GET request, this would make it easier to link to a particular query.
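A minimal sketch of what the GET variant could look like, assuming the service is a FastAPI app; `lookup_names` stands in for whatever Solr-backed helper the existing POST handler uses:

```python
from fastapi import FastAPI

app = FastAPI()

async def lookup_names(string: str, offset: int = 0, limit: int = 10) -> dict:
    """Stand-in for the Solr-backed lookup the existing POST handler calls."""
    return {}

@app.get("/lookup")
async def lookup_get(string: str, offset: int = 0, limit: int = 10):
    # Same parameters as the POST endpoint, so a specific query can be linked
    # to directly, e.g. /lookup?string=Neonatal%20withdrawal&limit=10
    return await lookup_names(string, offset, limit)
```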

MCM6

Gene MCM6 is NCBIGene:4175, but that isn't the first hit... (not sure if it shows up after the first few)

Restore previous sorting method

May fix #50

Need to restore old sort from:

query = f"http://{SOLR_HOST}:{SOLR_PORT}/solr/name_lookup/select"
params = {
"query": name_filters,
"limit": 0,
"sort": "length ASC",
"facet": {
"categories": {
"type": "terms",
"field": "curie",
"sort": "x asc",
"offset": offset,
"limit": limit,
"facet": {
"x": "min(length)",
},
"numBuckets": True,
}
}
}

Fix terms of service in OpenAPI

NameRes OpenAPI provides a link to the terms of service at robokop.renci.org:7055/tos?service_long=Name+Resolver&provider_long=the+Translator+Consortium, but this link no longer works. We should replace this with an updated, fixed link, preferably hosted in our GitHub repository so we can track changes to it.

Related to TranslatorSRI/NodeNormalization#170

Other minor fixes:

Fold into Node Normalization update

On an SRI call @cmungall suggested that this service and node normalization should be fed by the same pipeline. We should have a single ingest pipeline that takes in different vocabularies / ontologies, and collects xrefs or other identity assertions, labels and lexical synonyms, and keeps that all in some form. Could be RDF triples but doesn't really matter IMO.

Then that set of information can be e.g. the starting point for creating a name resolution service. But at that point, I think it makes sense to have a single service that has functions for both id and name lookup. Related to #4

Add support for conflation into NameRes

We don't want to conflate automatically, but would like to add a conflation flag that can be used to turn on conflation in an identical manner to that used by NodeNorm.

  1. NameRes could store the conflation information, and then combine conflation results on the fly if the conflation flag is turned on.
  2. Babel could generate conflated entries, in which case we'd need some way to differentiate between conflated and non-conflated entries in NameRes.

Add the canonical label and Biolink type to NameRes results

At the moment, the Translator UI queries NameRes to find identifiers to display, then queries NodeNorm for the canonical label, Biolink type and other information. Both the canonical label and the Biolink type (as needed for #39 anyway) would be great to include in the NameRes results so that a second NodeNorm query is not needed.

Data-loading Makefile doesn't include step 7 from the README

The data-loading Makefile doesn't currently contain instructions for step 7 of the README file:

7. Generate the backup tarball. At the moment, this is expected to be in the format
`var/solr/data/snapshot.backup/[index files]`. The easiest way to generate this tarball correctly is to run:
```shell
$ mkdir -p data/var/solr/data
$ mv /var/solr/name_lookup_shard1_replica_n1/data/snapshot.backup data/var/solr/data
$ cd data
$ tar zcvf snapshot.backup.tar.gz var
```

This is particularly silly because it's exactly four bash commands -- it'll be extremely easy to automate!

Split Protein.txt and SmallMolecule.txt to avoid crashing Solr

It's unclear how large a file Solr can ingest without problems (the 20G Gene.txt file appears to work), but the 100+GB Protein.txt file definitely doesn't. We should split it into smaller files for the ingest. The best place to do this is to include it in the Makefile.
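For reference, the split could be done with the Unix `split` utility in the Makefile, or with a small Python helper like this sketch (the chunk size is a guess, not a measured Solr limit):

```python
from pathlib import Path

def split_synonym_file(path: str, lines_per_chunk: int = 5_000_000) -> list:
    """Split a large synonym file (e.g. Protein.txt) into numbered chunks so
    that no single file is too big for the Solr ingest."""
    src = Path(path)
    chunks, out, count, idx = [], None, 0, 0
    with src.open("r", encoding="utf-8") as infile:
        for line in infile:
            # Start a new chunk file whenever the current one is full.
            if out is None or count >= lines_per_chunk:
                if out is not None:
                    out.close()
                idx += 1
                chunk = src.with_name(f"{src.stem}.part{idx:03d}{src.suffix}")
                out = chunk.open("w", encoding="utf-8")
                chunks.append(chunk)
                count = 0
            out.write(line)
            count += 1
    if out is not None:
        out.close()
    return chunks

# e.g. split_synonym_file("Protein.txt") -> [Protein.part001.txt, Protein.part002.txt, ...]
```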

New synonym format leads to much worse querying

I've set up a NameRes instance on Sterling (accessible in the RENCI VPN only) at http://name-resolution-sri-dev.apps.renci.org/docs using the new synonym format we've built for NameRes (#46, helxplatform/translator-devops#634, TranslatorSRI/Babel#113).

You can also directly access the underlying Solr database by running:

$ kubectl port-forward -n translator-exp name-lookup-solr-dep-0 8983:8983

and then accessing http://localhost:8983/ on your computer.

The bad news is that both directly querying Solr and querying it through the NameRes frontend return significantly worse results than we get with the old system. For example, querying https://name-resolution-sri.renci.org/docs for blood gives us UBERON:0000178, NCIT:C12434 and UMLS:C0851353 (all meaning "blood") followed by UMLS:C0851353 ("bloody"). But running the same query on http://name-resolution-sri-dev.apps.renci.org/docs gives us UMLS:C5169928 ("JWH-073 3-hydroxybutyl (synthetic cannabinoid metabolite) | Blood | Drug toxicology"), UMLS:C5171063 ("Lindane | Blood | Drug toxicology"), UMLS:C0312901 ("Blood group antigen IBH") and a bunch of others.

Searching with Solr gives slightly more relevant results, but not the really good results that https://name-resolution-sri.renci.org/docs gives.

One possible reason for this is that I've indexed the names field as a multiValued field (since it contains multiple values). Changing it to a non-multiValued field definitely helps with the results in Solr, but it causes NameRes to no longer work. I'll try fixing that and see if that solves this bug. If not, I'll probably need some help with the Solr querying and indexing aspect of all this.

Not returning expected CURIE

This issue is to report a missing CURIE. Specifically, ICEES KG calls Name Resolver to map ICEES feature variables to CURIEs, using the search_term and limit parameters. For diseases, we have been using search terms structured as 'disease X diagnosis'. This approach was tested extensively and has been yielding more accurate mappings with less noise. However, in our latest ICEES KG instance, 'asthma diagnosis' was not correctly mapped to MONDO:0004979, which was not the case previously, at least I don't think so, given that I've successfully used that CURIE in numerous queries. I tested several other search terms using Name Resolver and these return the expected CURIEs (see example below). So, I believe the issue is specific to "asthma diagnosis" and MONDO:0004979.

Example returns for "idiopathic bronchiectasis diagnosis":

{
  "MONDO:0018956": [
    "Idiopathic bronchiectasis (diagnosis)",
    "idiopathic bronchiectasis",
    "bronchiectasis idiopathic",
    "Idiopathic bronchiectasis",
    "Idiopathic bronchiectasis (disorder)"
  ]
}

Offset gives 500

Querying any term with the offset set higher than the number of results causes a 500.

Figure out why NameRes is showing up as a TRAPI endpoint on SmartAPI

This appears to be because the URL https://trapi-openapi.apps.renci.org/utility/infores%3Asri-name-resolver?version=%2A has been registered in SmartAPI as https://smart-api.info/registry?q=9995fed757acd034ef099dbb483c4c82, and is therefore displayed in the ARAX SmartAPI list -- note this only has the translator tag, just like the actual NameRes production OpenAPI, but it's possible the "TRAPI vQuality" in the title might be causing ARAX to think that this is a TRAPI endpoint.

Change result format

The current result is a JSON dictionary. But the results are meant to be ordered by the elasticsearch match score. So these should instead come back as a list.
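For illustration, the current dictionary response can already be read in rank order (Python dicts preserve insertion order), and a list-shaped response might look roughly like this sketch:

```python
import requests

resp = requests.post(
    "https://name-resolution-sri.renci.org/lookup",
    params={"string": "diabetes", "limit": 10},
    timeout=60,
)

# Today: a dict of {curie: [synonyms, ...]}, in the order the backend returned.
# A list-shaped response (one possible format) would make that ranking explicit.
as_list = [
    {"curie": curie, "synonyms": synonyms}
    for curie, synonyms in resp.json().items()
]
print(as_list[:3])
```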

Make NR type-aware

When the UI is calling name-resolution, it frequently knows what type it wants. Right now there is no ability to filter by type, so even if you know you want chemicals, you might get back a bunch of diseases. Not only that, but NR doesn't even have the data to return for filtering; we make people go to NN to get the types.

Somehow we should probably bring type info into this. I wonder if really all of NR should merge with NN in general...

@gaurav do you have any thoughts here?

See NCATSTranslator/Feedback#42

Return all names?

Currently if I look up a term that has many synonyms, I get back the identifiers and then the list of lexical synonyms for that entity, but if I use different query terms, I get back different lists of names. I suspect that it's the list that Solr decided matched my query, but I'm not sure.

Should we get all the names back? Maybe the matches could be marked somehow? It might be useful in determining which of the multiple entities that come back is the one you want.

Example: "Parkinson 19" returns 10 synonyms for MONDO:0014231 but "Parkinson 19 early" only returns 1.

Normalize output

Currently if I look up "diabetes" I get both HP and MONDO terms for T2D. But in Translator, we've unified these things. So you should only get back the MONDO term. In other words, the results should be normalized.

However, you still want to search the HP lexical synonyms.

Probably the eventual solution will involve #7 but for now we could call the node norm service.
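A sketch of that interim approach: run the NameRes hits through NodeNorm's get_normalized_nodes endpoint and merge entries that normalize to the same preferred identifier:

```python
import requests

NAMERES = "https://name-resolution-sri.renci.org/lookup"
NODENORM = "https://nodenormalization-sri.renci.org/get_normalized_nodes"

# Look up "diabetes", then collapse hits onto their NodeNorm-preferred CURIEs
# so that, e.g., HP and MONDO entries for the same concept are merged.
hits = requests.post(NAMERES, params={"string": "diabetes", "limit": 20}, timeout=60).json()
norm = requests.get(NODENORM, params={"curie": list(hits)}, timeout=60).json()

normalized = {}
for curie, synonyms in hits.items():
    entry = norm.get(curie) or {}                        # None for unknown CURIEs
    preferred = (entry.get("id") or {}).get("identifier", curie)
    normalized.setdefault(preferred, []).extend(synonyms)

print(list(normalized)[:5])
```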
