Code Monkey home page Code Monkey logo

medcat's Introduction

Medical oncept Annotation Tool

Build Status Documentation Status Latest release pypi Version

MedCAT can be used to extract information from Electronic Health Records (EHRs) and link it to biomedical ontologies like SNOMED-CT and UMLS. Paper on arXiv.

Official Docs here

News

Demo

A demo application is available at MedCAT. This was trained on MIMIC-III and all of SNOMED-CT.

Tutorials

A guide on how to use MedCAT is available at MedCAT Tutorials. Read more about MedCAT on Towards Data Science.

Related Projects

  • MedCATtrainer - an interface for building, improving and customising a given Named Entity Recognition and Linking (NER+L) model (MedCAT) for biomedical domain text.
  • MedCATservice - implements the MedCAT NLP application as a service behind a REST API.
  • iCAT - A docker container for CogStack/MedCAT/HuggingFace development in isolated environments.

Install using PIP (Requires Python 3.7+)

  1. Upgrade pip pip install --upgrade pip
  2. Install MedCAT
  • For macOS/linux: pip install --upgrade medcat
  • For Windows (see PyTorch documentation): pip install --upgrade medcat -f https://download.pytorch.org/whl/torch_stable.html
  1. Quickstart (MedCAT v1.2+):
from medcat.cat import CAT

# Download the model_pack from the models section in the github repo.
cat = CAT.load_model_pack('<path to downloaded zip file>')

# Test it
text = "My simple document with kidney failure"
entities = cat.get_entities(text)
print(entities)

# To run unsupervised training over documents
data_iterator = <your iterator>
cat.train(data_iterator)
#Once done, save the whole model_pack 
cat.create_model_pack(<save path>)
  1. Quick start with separate models: New Models (MedCAT v1.2+) need the spacy en_core_web_md while older ones use the scispacy models, install the one you need or all if not sure. If using model packs you do not need to download these models:
python -m spacy download en_core_web_md
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_md-0.4.0.tar.gz
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz
from medcat.vocab import Vocab
from medcat.cdb import CDB
from medcat.cat import CAT
from medcat.meta_cat import MetaCAT

# Load the vocab model you downloaded
vocab = Vocab.load(vocab_path)
# Load the cdb model you downloaded
cdb = CDB.load('<path to the cdb file>') 

# Download the mc_status model from the models section below and unzip it
mc_status = MetaCAT.load("<path to the unziped mc_status directory>")
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab, meta_cats=[mc_status])

# Test it
text = "My simple document with kidney failure"
entities = cat.get_entities(text)
print(entities)

# To run unsupervised training over documents
data_iterator = <your iterator>
cat.train(data_iterator)
#Once done you can make the current pipeline into a model_pack 
cat.create_model_pack(<save path>)
  1. Quick start with to create CDB and vocab models using local data and a config file:
# Run model creator with local config file
python medcat/utils/model_creator.py <path_to_model_creator_config_file>

# Run model creator with example file
python medcat/utils/model_creator.py tests/model_creator/config_example.yml
Model creator parameter Description
concept_csv_file Path to file containing UMLS concepts, including primary names, synonyms, types and source ontology. See examples and tests/model_creator/umls_sample.csv for format description and examples.
unsupervised_training_data_file Path to file containing text dataset used for spell checking and unsupervised training.
output_dir Path to output directory for writing the CDB and vocab models.
medcat_config_file Path to optional config file for adjusting MedCAT properties, see configs, medcat/config.py and tests/model_creator/medcat.txt
unigram_table_size Optional parameter for setting the initialization size of the unigram table in the vocab model. Default is 100000000, while for testing with a small unsupervised training data file a much smaller size could work.

Models

A basic trained model is made public. It contains ~ 35K concepts available in MedMentions.

ModelPacks

  • MedMentions with Status (Is Concept Affirmed or Negated/Hypothetical) Download

Separate models

  • Vocabulary Download - Built from MedMentions

  • CDB Download - Built from MedMentions

  • MetaCAT Status Download - Built from a sample from MIMIC-III, detects is an annotation Affirmed (Positve) or Other (Negated or Hypothetical)

(Note: This was compiled from MedMentions and does not have any data from NLM as that data is not publicaly available.)

SNOMED-CT and UMLS

If you have access to UMLS or SNOMED-CT, you can download the pre-built CDB and Vocab for those databases by signing in and filling out the online form.

Acknowledgements

Entity extraction was trained on MedMentions In total it has ~ 35K entites from UMLS

The vocabulary was compiled from Wiktionary In total ~ 800K unique words

Powered By

A big thank you goes to spaCy and Hugging Face - who made life a million times easier.

Citation

@ARTICLE{Kraljevic2021-ln,
  title="Multi-domain clinical natural language processing with {MedCAT}: The Medical Concept Annotation Toolkit",
  author="Kraljevic, Zeljko and Searle, Thomas and Shek, Anthony and Roguski, Lukasz and Noor, Kawsar and Bean, Daniel and Mascio, Aurelie and Zhu, Leilei and Folarin, Amos A and Roberts, Angus and Bendayan, Rebecca and Richardson, Mark P and Stewart, Robert and Shah, Anoop D and Wong, Wai Keong and Ibrahim, Zina and Teo, James T and Dobson, Richard J B",
  journal="Artif. Intell. Med.",
  volume=117,
  pages="102083",
  month=jul,
  year=2021,
  issn="0933-3657",
  doi="10.1016/j.artmed.2021.102083"
}

medcat's People

Contributors

adammorrissirrommada avatar alexhandy1 avatar antsh3k avatar baixiac avatar dependabot[bot] avatar gimoai avatar jthteo avatar lcreteig avatar lrog avatar myrthemh avatar sandertan avatar tomolopolis avatar w-is-h avatar willmaclean avatar zethson avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.