MedLinker

ECIR 2020 - MedLinker: Medical Entity Linking with Neural Representations and Dictionary Matching

Link to paper: https://link.springer.com/chapter/10.1007/978-3-030-45442-5_29

Note: This is a poorly documented initial release, precipitated by some requests to have access to the code. As I have more time available, and if others remain interested, I'll try to continue improving the codebase and documentation.

Installation

After cloning this repository and moving to the root folder, follow the steps below.

1. Download and extract data:

UPDATE - Check the discussion here first: #2

This archive contains data adapted from UMLS; please ensure you hold the required license before downloading it. Download data.zip (153MB) from Google Drive, and then:

unzip data.zip

Check here for the files you're expected to have in the data/ directory.

If data.zip is not available, the create_umls_kb.py script should help in re-creating the UMLS data required to run MedLinker.

2. Download and extract models:

Download models.zip (1.8GB) from Google Drive, and then:

unzip models.zip

Check here for the files you're expected to have in the models/ directory.

3. Create an environment for this project:

conda create -n medlinker python=3.6.5 anaconda

4. Switch to this environment:

conda activate medlinker

5. Change the default pip version (the default version fails to install the dependencies):

pip install pip==9.0.3

6. Install dependencies:

pip install -r requirements.txt
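
Before running anything, it can help to confirm that the extracted archives ended up where the code looks for them. The sketch below is not part of MedLinker: `find_missing` and `EXPECTED_PATHS` are hypothetical names, and the paths are only those that appear in the default configuration under Usage and in the issue reports further down, so the list may not be exhaustive.

```python
from pathlib import Path

# Paths copied from the default configuration in the Usage section and from
# the reported traceback; this list is illustrative, not authoritative.
EXPECTED_PATHS = [
    'data/UMLS/umls.2017AA.active.st21pv.json',
    'models/ContextualNER/mm_st21pv_SCIBERT_uncased/',
    'models/SimString/umls.2017AA.active.st21pv.aliases.3gram.5toks.db',
    'models/SimString/umls.2017AA.active.st21pv.aliases.5toks.map',
    'models/Classifiers/softmax.cui.h5',
    'models/Classifiers/softmax.sty.h5',
]


def find_missing(paths, root='.'):
    """Return the entries of `paths` that do not exist under `root`."""
    root = Path(root)
    return [p for p in paths if not (root / p).exists()]


# Run from the repository root, after unzipping data.zip and models.zip.
missing = find_missing(EXPECTED_PATHS)
if missing:
    print('Missing files or directories:')
    for p in missing:
        print('  ' + p)
else:
    print('All expected paths found.')
```

If anything is reported missing, re-check the extraction step (or the create_umls_kb.py route) before debugging the code itself.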

Usage

For this initial release, we recommend using MedLinker with the parameters defined in medlinker.py.

You can test if your setup is correctly configured by simply running:

python medlinker.py

After loading the models, you should see the following output:

{'sentence': 'Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity.',
 'tokens': ['Myeloid',
  'derived',
  'suppressor',
  'cells',
  '(MDSC)',
  'are',
  'immature',
  'myeloid',
  'cells',
  'with',
  'immunosuppressive',
  'activity.'],
 'spans': [{'start': 0,
   'end': 4,
   'text': 'Myeloid derived suppressor cells',
   'st': ('T017', 1.0),
   'cui': ('C4277543', 1.0)},
  {'start': 4,
   'end': 5,
   'text': '(MDSC)',
   'st': ('T017', 0.54723495),
   'cui': ('C4277543', 0.99998283)},
  {'start': 7,
   'end': 9,
   'text': 'myeloid cells',
   'st': ('T017', 1.0),
   'cui': ('C0887899', 1.0)}]}

This output should be reproducible with the following code, which can be adapted for other applications:

from medner import MedNER
from medlinker import MedLinker
from umls import umls_kb_st21pv as umls_kb

# default models, best configuration from paper
# to experiment with different configurations, just comment/uncomment components

cx_ner_path = 'models/ContextualNER/mm_st21pv_SCIBERT_uncased/'
em_ner_path = 'models/ExactMatchNER/umls.2017AA.active.st21pv.nerfed_nlp_and_matcher.max3.p'
ngram_db_path = 'models/SimString/umls.2017AA.active.st21pv.aliases.3gram.5toks.db'
ngram_map_path = 'models/SimString/umls.2017AA.active.st21pv.aliases.5toks.map'
st_vsm_path = 'models/VSMs/mm_st21pv.sts_anns.scibert_scivocab_uncased.vecs'
cui_vsm_path = 'models/VSMs/mm_st21pv.cuis.scibert_scivocab_uncased.vecs'
cui_clf_path = 'models/Classifiers/softmax.cui.h5'
sty_clf_path = 'models/Classifiers/softmax.sty.h5'
cui_val_path = 'models/Validators/mm_st21pv.lr_clf_cui.dev.joblib'
sty_val_path = 'models/Validators/mm_st21pv.lr_clf_sty.dev.joblib'

print('Loading MedNER ...')
medner = MedNER(umls_kb)
medner.load_contextual_ner(cx_ner_path)

print('Loading MedLinker ...')
medlinker = MedLinker(medner, umls_kb)

medlinker.load_string_matcher(ngram_db_path, ngram_map_path)  # simstring approximate string matching

# medlinker.load_st_VSM(st_vsm_path)
medlinker.load_sty_clf(sty_clf_path)
# medlinker.load_st_validator(sty_val_path, validator_thresh=0.45)

# medlinker.load_cui_VSM(cui_vsm_path)
medlinker.load_cui_clf(cui_clf_path)
# medlinker.load_cui_validator(cui_val_path, validator_thresh=0.70)

s = 'Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity.'
r = medlinker.predict(s)
print(r)
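
To consume the result, the following sketch flattens a prediction into one row per span. It assumes every span carries both 'st' and 'cui' as (label, score) pairs, as in the example output above; whether that always holds is not confirmed here. `spans_to_rows` and `min_cui_score` are hypothetical names for a post-hoc filter, not MedLinker parameters.

```python
# Tokens and spans copied from the expected output shown earlier.
result = {
    'tokens': ['Myeloid', 'derived', 'suppressor', 'cells', '(MDSC)', 'are',
               'immature', 'myeloid', 'cells', 'with', 'immunosuppressive',
               'activity.'],
    'spans': [
        {'start': 0, 'end': 4, 'text': 'Myeloid derived suppressor cells',
         'st': ('T017', 1.0), 'cui': ('C4277543', 1.0)},
        {'start': 4, 'end': 5, 'text': '(MDSC)',
         'st': ('T017', 0.54723495), 'cui': ('C4277543', 0.99998283)},
        {'start': 7, 'end': 9, 'text': 'myeloid cells',
         'st': ('T017', 1.0), 'cui': ('C0887899', 1.0)},
    ],
}


def spans_to_rows(result, min_cui_score=0.0):
    """Flatten a prediction into (text, semantic type, CUI, CUI score) tuples.

    `min_cui_score` is a hypothetical filter applied after prediction.
    """
    rows = []
    for span in result['spans']:
        sty, _sty_score = span['st']
        cui, cui_score = span['cui']
        if cui_score >= min_cui_score:
            rows.append((span['text'], sty, cui, cui_score))
    return rows


# In this example, 'start' and 'end' index the token list (end exclusive).
for span in result['spans']:
    tokens = result['tokens'][span['start']:span['end']]
    assert ' '.join(tokens) == span['text']

for row in spans_to_rows(result):
    print(row)
```

With the real pipeline, `result` would simply be the value returned by `medlinker.predict(s)`.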

medlinker's People

Contributors

danlou


medlinker's Issues

umls.2017AA.active.st21pv.json not in the dataset

Hello, I got everything set up in my VM and downloaded the dataset from the link you provided. When I ran:
python medlinker.py
I got the following error:

"""
Traceback (most recent call last):
  File "medlinker.py", line 282, in <module>
    from umls import umls_kb_st21pv as umls_kb
  File "/media/mnt/keshav/medlinker/umls.py", line 56, in <module>
    umls_kb_st21pv = UMLS_KB('umls.2017AA.active.st21pv')
  File "/media/mnt/keshav/medlinker/umls.py", line 10, in __init__
    self.load(umls_version)
  File "/media/mnt/keshav/medlinker/umls.py", line 14, in load
    with open(json_path, 'r') as json_f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/UMLS/umls.2017AA.active.st21pv.json'
"""

Could you please check this and let me know whether I have missed something? If not, where can I find that JSON file?

create_umls_kb.py

Hi, I am not familiar with sqlite3. When I run create_umls_kb.py, I get:

'''
Collecting info from 'descriptions' table ...
Traceback (most recent call last):
  File "/Users/sunjian/Desktop/code/Entity_linking/MedLinker-master/scripts/create_umls_kb.py", line 44, in <module>
    for row_idx, row in enumerate(c.execute('SELECT * FROM descriptions')):
sqlite3.OperationalError: no such table: descriptions
'''

Could you please let me know what I have to do?
Thanks
Regards

Data is not downloadable

I tried running the code given in ReadMe.md, but I am running into an error related to the data folder. I wasn't able to download data.zip because the link takes me to a 404 page. Can you guide me?

Thanks.

Training End to End.

Hello sir,
I would like to train the whole model end to end from scratch, for both NER and entity linking. For NER training, I ran train_allennlp_st21pv.sh from the scripts directory after editing the required pretrained model paths, but that only trains the BiLSTM with CRF for NER. Is there a way to train the model end to end for both NER and entity linking, covering STY and CUI linking?

Negation detection

Thank you very much for making the code available. I followed the steps (except for using 2020AB instead of 2017AA) and, after a minimal change in the code to avoid key lookup errors for missing CUIs, I successfully ran the code.
I have done several tests with simple phrases that include negated terms, but the pipeline does not skip them.
For example, see below:

{'sentence': 'No voiding syndrome. No diarrhea or nausea. No abdominal pain. No other symptomatology of interest', 'tokens': ['No', 'voiding', 'syndrome.', 'No', 'diarrhea', 'or', 'nausea.', 'No', 'abdominal', 'pain.', 'No', 'other', 'symptomatology', 'of', 'interest'], 'spans': [{'start': 1, 'end': 3, 'text': 'voiding syndrome.', 'st': ('T038', 0.6669729688499156), 'cui': ('C0243095', 0.9999871)}, {'start': 3, 'end': 5, 'text': 'No diarrhea', 'st': ('T033', 0.99344856), 'cui': ('C0011991', 0.8528028654224417)}, {'start': 6, 'end': 7, 'text': 'nausea.', 'st': ('T033', 0.7728953), 'cui': ('C0243095', 0.99999225)}, {'start': 8, 'end': 10, 'text': 'abdominal pain.', 'st': ('T033', 0.92418957), 'cui': ('C0243095', 0.998923)}, {'start': 12, 'end': 13, 'text': 'symptomatology', 'st': ('T033', 0.9771756), 'cui': ('C1457887', 1.0)}]}

Will you include negation detection in the future? Any hints?

Thanks again,
