medtype's Introduction

...

Improving Medical Entity Linking with Semantic Type Prediction

What is MedType?

MedType is a BERT-based entity disambiguation module that can be incorporated into any existing medical entity linker to enhance its performance. For a given input text, MedType takes the set of identified mentions, along with their lists of candidate concepts, as input. Then, for each mention, MedType predicts its semantic type based on its context in the text. The predicted semantic type is used to disambiguate extracted mentions by filtering the candidate concepts. The figure below summarizes the entire process. The results demonstrate that MedType achieves state-of-the-art performance on the medical entity linking task. Please refer to the paper for more details.
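The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the function name `filter_candidates`, the `cui_to_types` lookup, the placeholder CUIs, the scores, and the fallback to the unfiltered list when nothing survives are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

Candidate = Tuple[str, float]  # (CUI, score) pair produced by the base entity linker


def filter_candidates(
    candidates: List[Candidate],
    predicted_types: Set[str],
    cui_to_types: Dict[str, Set[str]],
) -> List[Candidate]:
    """Keep only candidates whose semantic types overlap the predicted types."""
    kept = [
        (cui, score)
        for cui, score in candidates
        if cui_to_types.get(cui, set()) & predicted_types
    ]
    # Assumption: if filtering removes every candidate, keep the original list.
    return kept if kept else candidates


# Hypothetical candidates for the ambiguous mention "cold".
cui_to_types = {
    "CUI_DISEASE": {"Disease or Syndrome"},
    "CUI_TEMPERATURE": {"Natural Phenomenon or Process"},
}
candidates = [("CUI_TEMPERATURE", 0.62), ("CUI_DISEASE", 0.58)]

filtered = filter_candidates(candidates, {"Disease or Syndrome"}, cui_to_types)
print(filtered)  # [('CUI_DISEASE', 0.58)]
```

Here the base linker ranks the temperature sense first, and the predicted type removes it so that only the disease sense remains.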

...

Contents

We make the following resources available in this repository:

  • medtype-as-service is inspired by bert-as-service, which provides a scalable implementation of the BERT model for encoding thousands of documents in seconds. Along similar lines, medtype-as-service scales MedType by serving a pretrained MedType model through an API. It takes in a list of variable-length texts and returns entity linking output in the following form:

    Input: ["Symptoms of common cold includes cough, fever, high temperature and nausea."]
    Output: 
    [
        {
            "text": "Symptoms of common cold includes cough, fever, high temperature and nausea.",
            "mentions":[
                {
                    "mention": "Surface form of mention",
                    "start_offset": "Character offset indicating start of the mention",
                    "end_offset": "Character offset indicating end of the mention",
                    "predicted_type": ["List containing predicted semantic types for the mention"],
                    "candidates": ["Contains list of [CUI, Score] pairs given by base entity linker"],
                    "filtered_candidates": ["Contains MedType output: filtered list of [CUI, Score] pairs based on mention's predicted semantic types"]
                },
                {}
            ]
        }   
    ]
    • We provide three pre-trained models for tackling different domains.
    • Currently, we support the following entity linkers: cTakes, MetaMap, MetaMapLite, QuickUMLS, and ScispaCy.
    • To run medtype-as-service, follow the instructions given in its readme.md.
    • Similar to bert-as-service, medtype-as-service is 🔭 State-of-the-art, 🐣 Easy-to-use, ⚡ Fast, 🐙 Scalable, and 💎 Reliable.
  • medtype-trainer is for training a MedType model from scratch, which can later be used by medtype-as-service. All the details of the training and evaluation code for entity linking are provided in ./medtype-trainer.

    ...
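As a rough illustration of how the medtype-as-service output format shown above might be consumed, the sketch below picks the highest-scoring entry from `filtered_candidates` for each mention. The helper name `top_filtered_cuis` is hypothetical, and the response is hand-written with made-up CUIs, scores, and type labels; it is not real service output.

```python
from typing import Any, Dict, List, Optional, Tuple


def top_filtered_cuis(document: Dict[str, Any]) -> List[Tuple[str, Optional[str]]]:
    """Return (mention surface form, best filtered CUI or None) for each mention."""
    results = []
    for mention in document.get("mentions", []):
        if not mention:  # skip empty placeholder entries
            continue
        pairs = mention.get("filtered_candidates", [])
        # Each pair is [CUI, score]; pick the highest-scoring one, if any.
        best = max(pairs, key=lambda pair: pair[1], default=None)
        results.append((mention["mention"], best[0] if best else None))
    return results


# Hand-written response with placeholder CUIs and scores (assumptions).
response = [
    {
        "text": "Symptoms of common cold includes cough, fever, high temperature and nausea.",
        "mentions": [
            {
                "mention": "common cold",
                "predicted_type": ["Disease or Syndrome"],
                "candidates": [["CUI_A", 0.9], ["CUI_B", 0.8]],
                "filtered_candidates": [["CUI_B", 0.8]],
            },
            {
                "mention": "nausea",
                "predicted_type": ["Sign or Symptom"],
                "candidates": [],
                "filtered_candidates": [],
            },
        ],
    }
]

for document in response:
    print(top_filtered_cuis(document))
```

Mentions with no surviving candidates are reported with `None` rather than dropped, so the caller can decide how to handle them.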

Datasets

We present two new, automatically-created datasets (available on Google Drive):

  • WikiMed: Over 1 million mentions of biomedical concepts in Wikipedia pages
    • Mentions were automatically identified based on links to Wikipedia pages for medical concepts.
    • Mentions of concepts not linked to Wikipedia pages are not included in the dataset.
    • Manual evaluation of 100 random samples found 91% accuracy in the automatic annotations at the level of UMLS concepts, and 95% accuracy in terms of semantic type.
  • PubMedDS: Over 57 million mentions of biomedical concepts in abstracts of biomedical research papers on PubMed.
    • Mentions were automatically identified using distant supervision and a machine learning NER model in scispaCy.
    • Concept identification focused on MeSH headers assigned to the papers.
    • Comparison with manually-annotated datasets found 75-90% precision in the automatic annotations.

Dataset statistics:

Datasets      #Docs        #Sents        #Mentions    #Unique Concepts
NCBI          792          7,645         6,817        1,638
Bio CDR       1,500        14,166        28,559       9,149
Sharecorpus   431          27,246        17,809       1,719
MedMentions   4,392        42,602        352,496      34,724
WikiMed       393,618      11,331,321    1,067,083    57,739
PubMedDS      13,197,430   127,670,590   57,943,354   44,881

Formatting information:

  • Both WikiMed and PubMedDS are in JSON format, with one document per line. Each document has the following structure:

    {
        "_id":  "A unique identifier of each document",
        "text": "Contains text over which mentions are ",
        "title": "Title of Wikipedia/PubMed Article",
        "split": "[Not in PubMedDS] Dataset split: <train/test/valid>",
        "mentions": [
            {
                "mention": "Surface form of the mention",
                "start_offset": "Character offset indicating start of the mention",
                "end_offset": "Character offset indicating end of the mention",
                "link_id": "UMLS CUI. In case of multiple CUIs, they are concatenated using '|', i.e., CUI1|CUI2|..."
            },
            {}
        ]
    }
  • We also make two public datasets, MedMentions and the NCBI Disease corpus, available in the same format. The mapping from Wikipedia to UMLS used for creating the WikiMed dataset has also been made available.

  • All the datasets along with the mapping from Wikipedia to UMLS can be downloaded using the following script:

    ./download_datasets.sh
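The one-document-per-line layout described above can be read with the standard library alone. The sketch below is a minimal example under stated assumptions: the helper name `iter_mentions`, the in-memory sample document, its integer offsets, and its placeholder CUIs are all made up for illustration.

```python
import io
import json
from typing import Iterable, Iterator, List, Tuple


def iter_mentions(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[str]]]:
    """Yield (document id, mention surface form, list of CUIs) for each mention."""
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        document = json.loads(line)
        for mention in document.get("mentions", []):
            # Multiple CUIs are concatenated with '|' in the link_id field.
            cuis = mention["link_id"].split("|")
            yield document["_id"], mention["mention"], cuis


# In-memory stand-in for a dataset file; the content is hypothetical.
sample = io.StringIO(
    json.dumps(
        {
            "_id": "doc-1",
            "text": "Fever is common.",
            "title": "Example",
            "mentions": [
                {
                    "mention": "Fever",
                    "start_offset": 0,
                    "end_offset": 5,
                    "link_id": "CUI_A|CUI_B",
                }
            ],
        }
    )
    + "\n"
)

rows = list(iter_mentions(sample))
print(rows)  # [('doc-1', 'Fever', ['CUI_A', 'CUI_B'])]
```

For a downloaded dataset, an open file handle can be passed in place of `sample`, since iterating over a file also yields one line at a time. This keeps memory use flat even for the larger files.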

Citation

Please consider citing our paper if you use this code in your work.

@ARTICLE{medtype2020,
       author = {{Vashishth}, Shikhar and {Joshi}, Rishabh and {Newman-Griffis}, Denis and
         {Dutt}, Ritam and {Rose}, Carolyn},
        title = "{MedType: Improving Medical Entity Linking with Semantic Type Prediction}",
      journal = {arXiv e-prints},
     keywords = {Computer Science - Computation and Language},
         year = 2020,
        month = may,
          eid = {arXiv:2005.00460},
        pages = {arXiv:2005.00460},
archivePrefix = {arXiv},
       eprint = {2005.00460},
 primaryClass = {cs.CL},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2020arXiv200500460V},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

For any clarification, comments, or suggestions please create an issue or contact Shikhar.

Acknowledgements:

This work was funded in part by NSF grants IIS 1917668 and IIS 1822831, Dow Chemical and UPMC Enterprises/Abridge, and the National Library of Medicine of the National Institutes of Health under award number T15 LM007059.

medtype's People

Contributors

drgriffis, gaurav, svjan5

medtype's Issues

Update entity_linkers.py

Hi,

There were problems when running the scispacy linker v0.2. In particular, there was no config.cfg file to read. I solved that by installing v0.4.

However, in the new release there are a few changes with respect to nlp.add_pipe.
I had to manually change your source code file entity_linkers.py to make it run.

MedType Demo - not working

When I go to your demo website, it asks me to accept the certificate in a popup and when I click Ok, takes me to https://128.2.204.127:8124/run_linker site but this site fails saying 'This site can't be reached'. I have tried both Chrome and Edge. I am not behind any proxy server. I am trying from my home PC.

About Table 2

Hi, thanks for your code!
In the paper, what is the difference between Oracle(F), Oracle(C), MedType in Table 2?
Shouldn't MedType contain one of Oracle(F) and Oracle(C)?
Thanks~

pip version?

I'm using pip-21.1.1 to install the requirements for the server as follows:
pip install -r ./medtype-as-service/server/requirements.txt
Output:

Looking in indexes: https://pypi.org/simple, https://:****@pkgs.dev.azure.com/DevOps-RD/daa07a13-a918-496b-9b75-929313115fba/_packaging/az-artifacts-pypi/pypi/simple/
ERROR: Could not find a version that satisfies the requirement torch==1.4.0 (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2, 1.7.1, 1.8.0, 1.8.1)
ERROR: No matching distribution found for torch==1.4.0

It seems that the requirements file cannot be satisfied with pip-21.1.1. Could you let me know which version of pip you are using, or how I can solve the above problem?

UMLS authentication change

Hi,

One of your linkers (cTAKES) has changed the way it connects to UMLS. I thought you would like to know so you can update your README.

More info here.

medtype-as-service

Hi,
I am running medtype-as-service. When I start the server using the following command:

medtype-serving-start --model_path $PWD/resources/pretrained_models/pubmed_model.bin \
                      --type_remap_json $PWD/../config/type_remap.json \
                      --type2id_json $PWD/../config/type2id.json \
                      --umls2type_file $PWD/resources/umls2type.pkl \
                      --entity_linker scispacy

I get the following error:

I:VENTILATOR:[__i:__i: 64]:freeze, optimize and export graph, could take a while...
Traceback (most recent call last):
  File "/opt/conda/bin/medtype-serving-start", line 33, in <module>
    sys.exit(load_entry_point('medtype-serving-server==1.0.0', 'console_scripts', 'medtype-serving-start')())
  File "/opt/conda/lib/python3.6/site-packages/medtype_serving_server-1.0.0-py3.6.egg/medtype_serving/server/cli/__init__.py", line 4, in main
    with MedTypeServer(get_run_args()) as server:
  File "/opt/conda/lib/python3.6/site-packages/medtype_serving_server-1.0.0-py3.6.egg/medtype_serving/server/__init__.py", line66, in __init__
    self.model_params   = self.load_model(args.model_path)
  File "/opt/conda/lib/python3.6/site-packages/medtype_serving_server-1.0.0-py3.6.egg/medtype_serving/server/__init__.py", line74, in load_model
    state               = torch.load(model_path, map_location="cpu")
  File "/opt/conda/lib/python3.6/site-packages/torch/serialization.py", line 527, in load
    with _open_zipfile_reader(f) as opened_zipfile:
  File "/opt/conda/lib/python3.6/site-packages/torch/serialization.py", line 224, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)

I tried loading the pretrained model with torch version 1.8.0 using this code:
model = torch.load('pubmed_model.bin', map_location="cpu")
It loaded successfully.
But when I did the same with torch version 1.4.0, I got the same error as above.

Was the pretrained model in the medtype GitHub repo trained with a different PyTorch version, or is there some other problem? Please let me know how to solve it.

Thanks

Python package instead of medtype-as-service

I think it'd be really helpful to pull out the entity linkers that are Python libraries (all except for cTAKES) and make a Python package that can be used in notebooks and installed easily in conda/pip environments.

I've made some changes and gotten the scispacy + medtype part to work without the server (if that's helpful), however I haven't checked the other linkers or done any extensive testing. Forked repo: https://github.com/vsocrates/medtype

pretrained model URLs - `general text` and `Electronic Health Records (EHR)` same file?

@svjan5
Thanks for sharing.
Just wondering

  1. Are the pretrained model URLs for general text and Electronic Health Records linked to the same file? They are different URLs, but the downloaded files have the same size and the same file name (general_model.zip, 1185789292 bytes).
  2. The online demo does not seem to be working for me (Linux 64, Chrome v.87 and Firefox v.78). Which browsers do you know to be working?
    Thanks

About pretrained_model

Hi, thanks for your code!
How does the pretrained model provided on GitHub work?
Thanks!

Semantic type/category information in PubMedDS and WikiMedDS datasets

Hi there!

I downloaded the datasets included with MedType by running download_datasets.sh. I noticed that some datasets (ncbi.json, medmentions.json) include category information, while others (wikimed.json, pubmed_ds) don't. I couldn't find any documentation explaining why category information is not included. I notice that figure 4 from the MedType paper specifically mentions the Semantic Type of the term. Is that information available in these datasets somewhere? If so, could you please document how to access that information in this repository?

Thanks so much!

The count of docs and sents in PubMedDS

I am using PubMedDS as a training corpus for my project.
I notice that the counts of documents and sentences are inconsistent between arXiv v1 and v2/v3.
Did you add new documents to PubMedDS?

Semantic type prediction results in paper

In Table 5, you provide results for MT <- WikiMed and MT <- PubMedDS.
I am curious about the semantic type prediction results of MedType trained on WikiMed (or PubMedDS) without fine-tuning.
