
medcat's Introduction

Archived

This project is archived and no longer maintained. CogStack-Nifi is the successor to this project and continues to be actively maintained.

Introduction

CogStack is a lightweight, distributed, fault-tolerant database processing architecture and ecosystem, intended to make NLP processing and preprocessing easier in resource-constrained environments. It comprises multiple components; CogStack Pipeline, the one covered in this documentation, has been designed to provide configurable data processing pipelines for working with EHR data. For the moment it mainly uses databases and files as the primary sources of EHR data, with the possibility of adding custom data connectors soon. It makes use of the Java Spring Batch framework in order to provide a fully configurable data processing pipeline, with the goal of generating annotated JSON files that can be readily indexed into ElasticSearch, stored as files or pushed back to a database.

Documentation

For the most up-to-date documentation about usage of CogStack, building, running with example deployments please refer to the official CogStack Confluence page.

Discussion

If you have any questions, why not reach out to the community Discourse forum?

Quick Start Guide

Introduction

This simple tutorial demonstrates how to get CogStack Pipeline running on a sample electronic health record (EHR) dataset stored initially in an external database. The CogStack ecosystem has been designed to handle both structured and unstructured EHR data efficiently. It shows its strength when working with unstructured data, especially as some input data can be provided as documents in PDF or image formats. For the moment, however, we only show how to run CogStack on a set of structured and free-text EHRs that have already been digitised. The part covering unstructured data in the form of PDF documents, images and other clinical notes, which need to be processed prior to analysis, is covered in the official CogStack Confluence page.

This tutorial is divided into 3 parts:

  1. Getting CogStack (link),
  2. A brief description of how the CogStack pipeline and its ecosystem work (link),
  3. Running the CogStack pipeline 'out-of-the-box' using the dataset already preloaded into a sample database (link).

To skip the brief description and get hands-on with running the CogStack pipeline, please head directly to the Running CogStack part.

The main directory with resources used in this tutorial is available in the CogStack bundle under examples/. This tutorial is based on Example 2; however, there are more examples available to play with.

Getting CogStack

The most convenient way to get the CogStack bundle is to download it directly from the official github repository, either by cloning the source using git:

git clone https://github.com/CogStack/CogStack-Pipeline.git

or by downloading the bundle from the repository's Releases page and decompressing it.

How does CogStack work

Data processing workflow

The data processing workflow of the CogStack pipeline is based on the Java Spring Batch framework. Without dwelling too much on technical details, the general idea is this: the data is read from a predefined data source, then passes through a number of processing operations, with the final result stored in a predefined data sink. The CogStack pipeline implements a variety of data processors, readers and writers, with scalability mechanisms that can be selected in the CogStack job configuration. Although the data can be read from different sources, the most frequently used data sink is ElasticSearch. For more details about CogStack functionality, please refer to the CogStack Documentation.

(Figure: the CogStack pipeline data processing workflow)

In this tutorial we only focus on a simple and very common use-case, where the CogStack pipeline reads and processes structured and free-text EHR data from a single PostgreSQL database. The result is then stored in ElasticSearch, where the data can be easily queried in the Kibana dashboard. However, the CogStack pipeline data processing engine also supports multiple data sources -- please see Example 3, which covers such a case.

A sample CogStack ecosystem

The CogStack ecosystem consists of multiple inter-connected microservices running together. For ease of use and deployment we use Docker (more specifically, Docker Compose), and provide Compose files for configuring and running the microservices. The selection of running microservices depends mostly on the specification of the EHR data source(s) and the data extraction and processing requirements.

In this tutorial the CogStack ecosystem is composed of the following microservices:

  • samples-db -- PostgreSQL database loaded with a sample dataset under the name db_samples,
  • cogstack-pipeline -- CogStack data processing pipeline with worker(s),
  • cogstack-job-repo -- PostgreSQL database for storing information about CogStack jobs,
  • elasticsearch-1 -- ElasticSearch search engine (single node) for storing and querying the processed EHR data,
  • kibana -- Kibana data visualization tool for querying the data from ElasticSearch.

Since all the examples share a common configuration for the microservices used, the base Docker Compose file is provided in examples/docker-common/docker-compose.yml. The Docker Compose file with the microservice configuration overridden for this example can be found in examples/example2/docker/docker-compose.override.yml. Both configuration files are automatically used by Docker Compose when deploying CogStack, as will be shown later.

Sample datasets

The sample dataset used in this tutorial consists of two types of EHR data:

  • Synthetic - structured, synthetic EHRs, generated using the Synthea application,
  • Medical reports - unstructured, medical health report documents obtained from MTsamples.

These datasets, although unrelated, are used together to compose a combined dataset.

Full description of these datasets can be found in the official CogStack Confluence page.

Running CogStack platform

Running CogStack pipeline for the first time

For ease of use, CogStack is deployed and run using Docker. However, before starting the CogStack ecosystem for the first time, one needs to have the database dump files for the sample data, either by creating them locally or by downloading them from Amazon S3. To download the database dumps, just run the following in the main examples/ directory:

bash download_db_dumps.sh

Next, a setup script needs to be run locally to prepare the Docker images and configuration files for the CogStack data processing pipeline. The script is available in the examples/example2/ directory and can be run as:

bash setup.sh

As a result, a temporary directory __deploy/ will be created containing all the necessary artifacts to deploy CogStack.

Docker-based deployment

Next, we can proceed to deploy the CogStack ecosystem using Docker Compose. It will configure and start the microservices based on the provided Compose files:

  • common base configuration, copied from examples/docker-common/docker-compose.yml ,
  • example-specific configuration copied from examples/example2/docker/docker-compose.override.yml. Moreover, the PostgreSQL database container comes with a pre-initialised database dump ready to be loaded directly into the database.

In order to run CogStack, run the following in the examples/example2/__deploy/ directory:

docker-compose up
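As a side note, Docker Compose automatically merges a docker-compose.yml with a docker-compose.override.yml found in the same directory, which is why both configuration files listed above are picked up without any extra flags. If the files ever need to be referenced explicitly, the roughly equivalent invocation would be:

docker-compose -f docker-compose.yml -f docker-compose.override.yml up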

The console will print status logs of the currently running microservices. For the moment, however, they may not be very informative (sorry, we're working on that!).

Connecting to the microservices

CogStack ecosystem

The picture below sketches a general idea of how the microservices run and communicate within the sample CogStack ecosystem used in this tutorial.

(Figure: microservices running and communicating within the sample CogStack ecosystem)

Assuming that everything is working fine, we should be able to connect to the running microservices. Selected services (elasticsearch-1 and kibana) have their ports forwarded to the host's localhost.

Kibana and ElasticSearch

The Kibana dashboard used to query the EHRs can be accessed directly in a browser via the URL http://localhost:5601/. The data can be queried using a number of ElasticSearch indices, e.g. sample_observations_view. Usually, each index will correspond to the database view in db_samples (the samples-db PostgreSQL database) from which the data was ingested. However, when entering the Kibana dashboard for the first time, an index pattern needs to be configured in the Kibana management panel -- for more information about its creation, please refer to the official Kibana documentation.

In addition, the ElasticSearch REST endpoint can be accessed via the URL http://localhost:9200/. It can be used to perform manual queries or by other external services -- for example, one can list the available indices:

curl 'http://localhost:9200/_cat/indices'

or query one of the available indices -- sample_observations_view:

curl 'http://localhost:9200/sample_observations_view'
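To peek at a few indexed documents rather than just the index metadata, the standard ElasticSearch search API can be used as well -- for instance, the following request (illustrative only) returns the first five documents from that index:

curl 'http://localhost:9200/sample_observations_view/_search?size=5&pretty'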

For more information about possible documents querying or modification operations, please refer to the official ElasticSearch documentation.

As a side note, the name of the ElasticSearch node in the Docker Compose configuration has been set to elasticsearch-1. The -1 suffix emphasizes that for larger-scale deployments multiple ElasticSearch nodes can be used -- typically, a minimum of 3.

PostgreSQL sample database

Moreover, the PostgreSQL database with the input sample data is exposed directly at localhost:5555. The database name is db_samples, with user test and password test. To connect, one can run:

psql -U 'test' -W -d 'db_samples' -h localhost -p 5555

Publications

CogStack - Experiences Of Deploying Integrated Information Retrieval And Extraction Services In A Large National Health Service Foundation Trust Hospital, Richard Jackson, Asha Agrawal, Kenneth Lui, Amos Folarin, Honghan Wu, Tudor Groza, Angus Roberts, Genevieve Gorrell, Xingyi Song, Damian Lewsley, Doug Northwood, Clive Stringer, Robert Stewart, Richard Dobson. BMC medical informatics and decision making 18, no. 1 (2018): 47.


medcat's People

Contributors

adam-sutton-1992, adammorrissirrommada, alexhandy1, antsh3k, baixiac, dependabot[bot], gimoai, imipenem, jamesbrandreth, jenniferjiangkells, jkgenser, jthteo, lcreteig, lrog, mart-r, myrthemh, sandertan, shubham-s-agarwal, tomolopolis, w-is-h, willmaclean, zack-kimble, zethson


medcat's Issues

cui2icd10 question + some setup feedback

Hi! Thanks for building this great tool. I had some issues with the setup, but here's what I have in terms of a rqs.txt (in case this is helpful).

blis==0.7.4 catalogue==1.0.0 certifi==2021.5.30 chardet==4.0.0 click==7.1.2 cymem==2.0.5 datasets==1.6.0 dill==0.3.3 elasticsearch==7.10.0 filelock==3.0.12 Flask==1.1.0 fsspec==2021.6.0 gensim==3.8.0 huggingface-hub==0.0.10 idna==2.10 importlib-metadata==4.5.0 itsdangerous==2.0.1 Jinja2==3.0.1 joblib==1.0.1 MarkupSafe==2.0.1 multiprocess==0.70.11.1 murmurhash==1.0.5 numpy==1.20.0 packaging==20.9 pandas==1.2.4 pathy==0.5.2 plac==1.1.3 preshed==3.0.5 pyarrow==4.0.1 pydantic==1.7.4 pyparsing==2.4.7 python-dateutil==2.8.1 pytz==2021.1 regex==2021.4.4 requests==2.25.1 sacremoses==0.0.45 scikit-learn==0.24.0 scipy==1.6.3 six==1.16.0 smart-open==3.0.0 spacy==2.3.5 spacy-legacy==3.0.5 srsly==1.0.5 thinc==7.4.5 threadpoolctl==2.1.0 tokenizers==0.10.3 torch==1.8.1 tqdm==4.49.0 transformers==4.5.1 typer==0.3.2 typing-extensions==3.10.0.0 urllib3==1.26.5 wasabi==0.8.2 Werkzeug==2.0.1 xxhash==2.0.2 zipp==3.4.1

Basically I went through your setup.py manually, removing the ~= and making it ==. Also, for sklearn, I installed scikit-learn==0.24.0; not sure what sklearn~=0.0 does. The version of spacy is also different: I was seeing some errors in loading the spacy model and followed this ticket to resolve them; I'm using spacy==2.3.5. More generally, I'm using Python 3.7 as that's what your medium post was using, and am on a Mac, OS version 11.4 (Big Sur).

Also the config for your medmen trained CDB class uses the en_core_sci_lg model, not the en_core_sci_md model.

Anyways, the question I have is, in order for the CDB class to have the cui2icd10 key in addl_info filled out, do I need the UMLS license? Seems like that mapping is blank in the provided medmen trained model, and wanted to see if the model trained on NLM would have the icd codes filled out. I tried applying for the license, but am getting 500s on the sign up page right now, will check again later.

MedCAT model creator

Hi @w-is-h at our hospital we're using our own "MedCAT model creator", which is basically a pipeline containing the steps that MedCAT documented in Jupyter notebooks. Our code:

  • Loads input concepts from CSV and input documents from txt
  • Creates vocab, create CDB, do unsupervised training, optionally do supervised training
  • Write files to a configured location
  • Also contains an integration test with a sample of wikipedia data and UMLS concepts to verify some expected entities are found. That looks a bit like what you are doing in MedCAT/tests/medmentions/. We could adjust our test to use the MedMentions data already included in MedCAT.

It might be nice to put this functionality into MedCAT itself. Are you open for a PR for this? We can also discuss it in more detail in a call if you want.

Remove logging handlers from MedCAT

The function `utils.loggers.add_handlers` is used in several modules, which results in independent logging handlers when MedCAT is imported as a library.

The creation of handlers in a library is considered an anti-pattern by the logging maintainers: https://docs.python.org/3/howto/logging-cookbook.html#patterns-to-avoid

It creates handlers that are difficult for the end application to suppress or modify.

If you're amenable to the change, I'll gladly submit a PR.
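For reference, a minimal sketch of the pattern recommended in the logging cookbook: the library attaches only a NullHandler and leaves handler configuration to the application (module and function names below are illustrative, not MedCAT's actual code):

import logging

# Library side: attach only a NullHandler so the application keeps full control.
logger = logging.getLogger(__name__)
logger.addHandler(logging.NullHandler())

def do_work():
    logger.info("work started")  # emitted only if the application configures handlers

# Application side: configure handlers/levels once, at the top level.
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    do_work()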

Dependencies for 3.0 with scispacy

The 3.0 version of the code base seems not to be compatible with scispacy versions 0.4.0 (uses spacy 3.0) or 0.5.0 (uses spacy 3.2), since MedCAT depends on spacy 3.1. Should we use scispacy 0.5.0 and spacy 3.2?

spacy add_pipe error on medcat-1.0.40

If I install MedCAT 1.0.40, I get the error below when calling the MedCAT Service. This error is fixed by installing medcat-1.0.39.

Error:

ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got functools.partial(<function tag_skip_and_punct at 0x7ff0b0e12cb0>, config=<medcat.config.Config object at 0x7ff16c125350>) (name: 'tag_skip_and_punct').

- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.

- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.

- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.

Steps to reproduce:

  1. pip install -r medcat_service/requirements.txt

  2. pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_md-0.4.0.tar.gz

Which generates the dependency error below:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
medcat 1.0.40 requires spacy==2.3.4, but you have spacy 3.0.7 which is incompatible.
Successfully installed catalogue-2.0.6 en-core-sci-md-0.4.0 pathy-0.6.0 pydantic-1.8.2 spacy-3.0.7 spacy-legacy-3.0.8 srsly-2.4.1 thinc-8.0.8 typer-0.3.2
  3. Run . start-service-prod.sh, then:
    curl -XPOST http://localhost:5000/api/process -H 'Content-Type: application/json' -d '{"content":{"text":"The patient was diagnosed with leukemia."}}'

  4. Receive the following error:

[2021-08-25 21:41:19,751] [ERROR] medcat_service.app.app: Exception on /api/process [POST]
Traceback (most recent call last):
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 804, in get
    return self._context[key]
KeyError: <class 'medcat_service.nlp_service.nlp_service.NlpService'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 804, in get
    return self._context[key]
KeyError: <class 'medcat_service.nlp_processor.medcat_processor.MedCatProcessor'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/flask_injector/__init__.py", line 95, in wrapper
    return injector.call_with_injection(callable=fun, args=args, kwargs=kwargs)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 1024, in call_with_injection
    owner_key=self_.__class__ if self_ is not None else callable.__module__,
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 111, in wrapper
    return function(*args, **kwargs)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 1069, in args_to_inject
    instance = self.get(interface)  # type: Any
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 963, in get
    result = scope_instance.get(interface, binding.provider).get(self)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 111, in wrapper
    return function(*args, **kwargs)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 806, in get
    provider = InstanceProvider(provider.get(self.injector))
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 291, in get
    return injector.create_object(self._cls)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 990, in create_object
    self.call_with_injection(cls.__init__, self_=instance, kwargs=additional_kwargs)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 1024, in call_with_injection
    owner_key=self_.__class__ if self_ is not None else callable.__module__,
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 111, in wrapper
    return function(*args, **kwargs)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 1069, in args_to_inject
    instance = self.get(interface)  # type: Any
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 963, in get
    result = scope_instance.get(interface, binding.provider).get(self)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 111, in wrapper
    return function(*args, **kwargs)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 806, in get
    provider = InstanceProvider(provider.get(self.injector))
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 291, in get
    return injector.create_object(self._cls)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 990, in create_object
    self.call_with_injection(cls.__init__, self_=instance, kwargs=additional_kwargs)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/injector/__init__.py", line 1030, in call_with_injection
    return callable(*full_args, **dependencies)
  File "/home/elisa/MedCATservice/medcat_service/nlp_processor/medcat_processor.py", line 63, in __init__
    self.cat = self._create_cat()
  File "/home/elisa/MedCATservice/medcat_service/nlp_processor/medcat_processor.py", line 234, in _create_cat
    return CAT(cdb=cdb, config=conf, vocab=vocab, meta_cats=meta_models)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/medcat/cat.py", line 75, in __init__
    additional_fields=['is_punct'])
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/medcat/pipe.py", line 38, in add_tagger
    self.nlp.add_pipe(tagger, name='tag_' + name, first=True)
  File "/opt/conda/envs/medcat/lib/python3.7/site-packages/spacy/language.py", line 755, in add_pipe
    raise ValueError(err)
ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got functools.partial(<function tag_skip_and_punct at 0x7ff0b0e12cb0>, config=<medcat.config.Config object at 0x7ff16c125350>) (name: 'tag_skip_and_punct').

- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.

- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.

- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.
[25/Aug/2021:21:41:19 +0000] [ACCESSS] 127.0.0.1 "POST /api/process HTTP/1.1" 500 "-" "curl/7.52.1"
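For context, the registration pattern that the [E966] message describes for spaCy 3.x looks roughly like the sketch below; the component name and body here are purely illustrative and not MedCAT's actual tag_skip_and_punct implementation:

import spacy
from spacy.language import Language

@Language.component("my_tagger")
def my_tagger(doc):
    # a registered function component receives the Doc, may modify it, and returns it
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("my_tagger", first=True)  # referenced by its registered string name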

Upgrade SpaCy dependencies

When trying to add MedCAT as a dependency to my project I run into:

Using version ^1.1.3 for medcat

Updating dependencies
Resolving dependencies... (29.8s)

  SolverProblemError

      Because no versions of spacy match >3.0.1,<3.0.2 || >3.0.2,<3.0.3 || >3.0.3,<3.0.4 || >3.0.4,<3.0.5 || >3.0.5,<3.0.6 || >3.0.6,<3.0.7 || >3.0.7,<3.1.0
   and spacy (3.0.1) depends on typer (>=0.3.0,<0.4.0), spacy (>=3.0.1,<3.0.2 || >3.0.2,<3.0.3 || >3.0.3,<3.0.4 || >3.0.4,<3.0.5 || >3.0.5,<3.0.6 || >3.0.6,<3.0.7 || >3.0.7,<3.1.0) requires typer (>=0.3.0,<0.4.0).
      And because spacy (3.0.2) depends on typer (>=0.3.0,<0.4.0), spacy (>=3.0.1,<3.0.3 || >3.0.3,<3.0.4 || >3.0.4,<3.0.5 || >3.0.5,<3.0.6 || >3.0.6,<3.0.7 || >3.0.7,<3.1.0) requires typer (>=0.3.0,<0.4.0).
      And because spacy (3.0.3) depends on typer (>=0.3.0,<0.4.0)
   and spacy (3.0.4) depends on typer (>=0.3.0,<0.4.0), spacy (>=3.0.1,<3.0.5 || >3.0.5,<3.0.6 || >3.0.6,<3.0.7 || >3.0.7,<3.1.0) requires typer (>=0.3.0,<0.4.0).
      And because spacy (3.0.5) depends on typer (>=0.3.0,<0.4.0)
   and spacy (3.0.6) depends on typer (>=0.3.0,<0.4.0), spacy (>=3.0.1,<3.0.7 || >3.0.7,<3.1.0) requires typer (>=0.3.0,<0.4.0).
      Because no versions of medcat match >1.1.3,<2.0.0
   and medcat (1.1.3) depends on spacy (>=3.0.1,<3.1.0), medcat (>=1.1.3,<2.0.0) requires spacy (>=3.0.1,<3.1.0).
      Thus, medcat (>=1.1.3,<2.0.0) requires typer (>=0.3.0,<0.4.0) or spacy (3.0.7).
  (1) So, because spacy (3.0.7) depends on typer (>=0.3.0,<0.4.0), medcat (>=1.1.3,<2.0.0) requires typer (>=0.3.0,<0.4.0).
  
      Because no versions of typer match >0.3.0,<0.3.1 || >0.3.1,<0.3.2 || >0.3.2,<0.4.0
   and typer (0.3.0) depends on click (>=7.1.1,<7.2.0), typer (>=0.3.0,<0.3.1 || >0.3.1,<0.3.2 || >0.3.2,<0.4.0) requires click (>=7.1.1,<7.2.0).
      And because typer (0.3.1) depends on click (>=7.1.1,<7.2.0)
   and typer (0.3.2) depends on click (>=7.1.1,<7.2.0), typer (>=0.3.0,<0.4.0) requires click (>=7.1.1,<7.2.0).
      And because medcat (>=1.1.3,<2.0.0) requires typer (>=0.3.0,<0.4.0) (1), medcat (>=1.1.3,<2.0.0) requires click (>=7.1.1,<7.2.0)
      So, because ehrapy depends on both click (^8.0.2) and medcat (^1.1.3), version solving failed.

From what I can see you are pinning a version of spaCy which requires typer <=0.4.0
The latest spaCy allows for typer dependencies up to 0.5.0. This version has added support for click 8.x : tiangolo/typer@b972981

Could you please upgrade the spaCy version, (test whether it works with Click 8.x) and release a new version?

This would be highly appreciated. Happy to provide more detailed if required. I urgently need this to work with my environment.

Thanks!

Guidance on extraction

Are there results we can leverage to decide how to use the provided 'medcat_acc'? As in, precision-recall performance at various thresholds. Thanks!

Python 3.9 Compatibility not available

Torch==1.4.0 is not available and triggers build errors with Python 3.9, meaning that this application also cannot be installed. Tested on Ubuntu 18. Changing the torch version then leads to a failure in the TorchVision dependency. Changing that to the latest version triggers a tokenizer installation error (because 0.8.0 is not PEP compliant).

Minimum reproducible example:

$ python3.9 -m venv medcat_install
$ source medcat_install/bin/activate
(medcat_issue) $ pip install MedCat
Collecting MedCat
  Downloading medcat-0.4.0.6-py3-none-any.whl (70 kB)
     |████████████████████████████████| 70 kB 650 kB/s
Collecting tokenizers~=0.8.0
  Downloading tokenizers-0.8.1.tar.gz (97 kB)
     |████████████████████████████████| 97 kB 903 kB/s
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting numpy~=1.18
  Downloading numpy-1.20.1-cp39-cp39-manylinux2010_x86_64.whl (15.4 MB)
     |████████████████████████████████| 15.4 MB 16.3 MB/s
Collecting gensim~=3.7
  Downloading gensim-3.8.3.tar.gz (23.4 MB)
     |████████████████████████████████| 23.4 MB 9.2 MB/s
ERROR: Could not find a version that satisfies the requirement torch~=1.4.0 (from MedCat) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2, 1.7.1, 1.8.0)
ERROR: No matching distribution found for torch~=1.4.0 (from MedCat)

Prebuilt SNOMED model

Hello, I am getting the error below while using the prebuilt models. How can I resolve this?

ConfigValidationError:

Config validation error

ner -> incorrect_spans_key extra fields not permitted

{'nlp': <spacy.lang.en.English object at 0x00000138780A7670>, 'name': 'ner', 'incorrect_spans_key': None, 'model': {'@architectures': 'spacy.TransitionBasedParser.v2', 'state_type': 'ner', 'extra_state_tokens': False, 'hidden_width': 64, 'maxout_pieces': 2, 'use_upper': True, 'nO': None, 'tok2vec': {'@architectures': 'spacy.Tok2Vec.v2', 'embed': {'@architectures': 'spacy.MultiHashEmbed.v2', 'width': 96, 'attrs': ['NORM', 'PREFIX', 'SUFFIX', 'SHAPE'], 'rows': [5000, 2500, 2500, 2500], 'include_static_vectors': True}, 'encode': {'@architectures': 'spacy.MaxoutWindowEncoder.v2', 'width': 96, 'depth': 4, 'window_size': 1, 'maxout_pieces': 3}}}, 'moves': None, 'update_with_oracle_cut_size': 100, '@factories': 'n

Older version of scispacy

OSError: [E050] Can't find model 'en_core_sci_md'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Fix: Use the latest version of scispacy

Add support for more abbreviations

Hi,

the MIMIC-IV dataset (triage.csv) uses a couple of abbreviations which are vital for extracting diagnoses etc. from free text, but which are currently not really well supported by MedCAT.

I was using your hosted instance (https://medcat.rosalind.kcl.ac.uk/) for these quick tests.

Examples:

  1. Hypertension, R Leg numbness, R Shoulder pain -> extracts 3 findings, but does not add the "R" as attributes. It is absolutely essential that these findings were only on the right part of the body.
  2. Dizziness, Malaise, n/v/d -> extracts 2 findings, but misses the n/v/d which is short for nausea/vomiting/diarrhea

Do you think that you (we) can improve the support for such abbreviations in a reasonable way?

Creation of CAT object fails with: Can't find model 'en_core_sci_lg'.

Hey,
I'm using medcat==1.2.8.

I'm trying to create a CAT object from my vocab and cdb like this:

from medcat.vocab import Vocab
from medcat.cdb import CDB
from medcat.cat import CAT

# load vocab
vocab = Vocab.load("./vocab.dat")
# load cdb
cdb = CDB.load("./cdb-medmen-v1.dat")
# create model
cat = CAT(cdb=cdb, vocab=vocab)

...

However this results in :

/home/myname/anaconda3/envs/poetry3_8/lib/python3.8/site-packages/medcat/cat.py:100  │
│ in __init__                                                                               │
│                                                                                           │
│     97 │   │   │   self.config = config                                                   │
│     98 │   │   │   self.cdb.config = config                                               │
│     99 │   │   self._meta_cats = meta_cats                                                │
│ ❱  100 │   │   self._create_pipeline(self.config)                                         │
│    101 │                                                                                  │
│    102 │   def _create_pipeline(self, config):                                            │
│    103 │   │   # Set log level                                                            │
│                                                                                           │
│ /home/myname/anaconda3/envs/poetry3_8/lib/python3.8/site-packages/medcat/cat.py:107  │
│ in _create_pipeline                                                                       │
│                                                                                           │
│    104 │   │   self.log.setLevel(config.general['log_level'])                             │
│    105 │   │                                                                              │
│    106 │   │   # Build the pipeline                                                       │
│ ❱  107 │   │   self.pipe = Pipe(tokenizer=spacy_split_all, config=config)                 │
│    108 │   │   self.pipe.add_tagger(tagger=tag_skip_and_punct,                            │
│    109 │   │   │   │   │   │   │    name='skip_and_punct',                                │
│    110 │   │   │   │   │   │   │    additional_fields=['is_punct'])                       │
│                                                                                           │
│ /home/myname/anaconda3/envs/poetry3_8/lib/python3.8/site-packages/medcat/pipe.py:40  │
│ in __init__                                                                               │
│                                                                                           │
│    37 │   log = add_handlers(log)                                                         │
│    38 │                                                                                   │
│    39 │   def __init__(self, tokenizer: Tokenizer, config: Config) -> None:               │
│ ❱  40 │   │   self._nlp = spacy.load(config.general['spacy_model'],                       │
│       disable=config.general['spacy_disabled_components'])                                │
│    41 │   │   if config.preprocessing['stopwords'] is not None:                           │
│    42 │   │   │   self._nlp.Defaults.stop_words = set(config.preprocessing['stopwords'])  │
│    43 │   │   self._nlp.tokenizer = tokenizer(self._nlp, config)                          │
│                                                                                           │
│ /home/myname/anaconda3/envs/poetry3_8/lib/python3.8/site-packages/spacy/__init__.py: │
│ 51 in load                                                                                │
│                                                                                           │
│   48 │   │   keyed by section values in dot notation.                                     │
│   49 │   RETURNS (Language): The loaded nlp object.                                       │
│   50 │   """                                                                              │
│ ❱ 51 │   return util.load_model(                                                          │
│   52 │   │   name, vocab=vocab, disable=disable, exclude=exclude, config=config           │
│   53 │   )                                                                                │
│   54                                                                                      │
│                                                                                           │
│ /home/myname/anaconda3/envs/poetry3_8/lib/python3.8/site-packages/spacy/util.py:354  │
│ in load_model                                                                             │
│                                                                                           │
│    351 │   │   return load_model_from_path(name, **kwargs)                                │
│    352 │   if name in OLD_MODEL_SHORTCUTS:                                                │
│    353 │   │   raise IOError(Errors.E941.format(name=name, full=OLD_MODEL_SHORTCUTS[name] │
│ ❱  354 │   raise IOError(Errors.E050.format(name=name))                                   │
│    355                                                                                    │
│    356                                                                                    │
│    357 def load_model_from_package(                                                       │
╰───────────────────────────────────────────────────────────────────────────────────────────╯
OSError: [E050] Can't find model 'en_core_sci_lg'. It doesn't seem to be a Python package or 
a valid path to a data directory.

Any ideas on why this is failing? I tried to install the en_core_sci_lg model (just downloaded it) but it cannot be installed from spacy directly.

Any help would be appreciated!
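As a hedged note (not an official fix): scispacy models are normally installed with pip from their release tarballs, so assuming the same URL pattern as the en_core_sci_md link quoted in another issue above, something like the following should make the model importable; alternatively, the configured model can be switched to one that is already installed, e.g. cdb.config.general['spacy_model'] = 'en_core_sci_md':

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz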

Discerning between was found and was not found

Hi,

great work!
I was working through the full MedCAT pipeline tutorial but am puzzled about one thing: discerning whether an entity has a positive connotation (= was indeed found) or a negative connotation (= was not actually found).
I thought that the status (which I cannot see here?) should tell us that?
The output does not allow me to classify this:

[('H',
  {'entities': {1: {'acc': 1.0,
     'context_similarity': 1.0,
     'cui': 'C0035078',
     'detected_name': 'kidney~failure',
     'end': 21,
     'id': 1,
     'meta_anns': {},
     'pretty_name': 'Kidney Failure',
     'source_value': 'kidney failure',
     'start': 7,
     'tuis': ['T047'],
     'types': ['Disease or Syndrome']}},
   'text': 'He has kidney failure',
   'tokens': []}),
 ('S',
  {'entities': {2: {'acc': 1.0,
     'context_similarity': 1.0,
     'cui': 'C0035078',
     'detected_name': 'kidney~failure',
     'end': 32,
     'id': 2,
     'meta_anns': {},
     'pretty_name': 'Kidney Failure',
     'source_value': 'kidney failure',
     'start': 18,
     'tuis': ['T047'],
     'types': ['Disease or Syndrome']}},
   'text': 'She does not have kidney failure',
   'tokens': []}),

Update unsupervised training section in tutorial to use cat.train()

The updated Google Colab tutorial uses cat.train = True and cat(text, do_train=True) in a for loop for unsupervised training, but the MedCAT inline code documentation strongly suggests using cat.train(). It might be good to update the tutorial to reflect the preferred method.
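A rough sketch of what the preferred call might look like, assuming cat.train() accepts an iterable of raw text documents as the inline documentation suggests (the data variable is illustrative; check the exact signature in your MedCAT version):

# 'data' is assumed to be a pandas DataFrame with a 'text' column of clinical notes
texts = (text for text in data['text'].values)
cat.train(texts)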

Check multi argument logs

A recent PR introduced a bug with logs, instead of

log.info('something %s and another %s', (1, 2))

we need to do:

log.info('something %s and another %s', 1, 2)

All logs with more than one argument have this problem, we need to find and fix them.

Getting issue while training the custom vocab and cdb

Facing the issue below while trying to train a custom vocab with word2vec vectors. The cdb is the core SNOMED concept database. While training on the clinical text of 5000 visits, I have noticed that it throws an error for certain texts while other texts are fine.

I am using the code below.

cat.spacy_cat.PREFER_FREQUENT = True
cat.spacy_cat.PREFER_ICD10 = False
cat.spacy_cat.WEIGHTED_AVG = True
cat.spacy_cat.MIN_CONCEPT_LENGTH = 3 # Ignore concepts (diseases) <= 3 characters
cat.spacy_cat.MIN_ACC = 0.2 # Confidence cut-off, everything bellow will not be displayed

for i, text in enumerate(data['text'].values):
    # This will now run the training in the background
    try:
        _ = cat(text)
        # So we know how things are moving
        # if i % 100 == 0:
        #     print("Finished {} - text blocks".format(i))
    except KeyboardInterrupt:
        print('Manually Exited')
        break
    except:
        print(data[data['id'] == i]['name'])
        continue

Print statistics on the CDB after training

cat.cdb.print_stats()

Disable the training mode

cat.train = False

File "", line 11, in
_ = cat(text)

File "C:\ProgramData\Anaconda3\lib\site-packages\medcat\cat.py", line 92, in call
return self.nlp(text)

File "C:\ProgramData\Anaconda3\lib\site-packages\medcat\utils\spacy_pipe.py", line 53, in call
return self.nlp(text)

File "C:\ProgramData\Anaconda3\lib\site-packages\spacy\language.py", line 439, in call
doc = proc(doc, **component_cfg.get(name, {}))

File "C:\ProgramData\Anaconda3\lib\site-packages\medcat\spacy_cat.py", line 470, in call
self.cat_ann.add_ann(raw_name, tkns, doc, self.to_disamb, doc_words)

File "C:\ProgramData\Anaconda3\lib\site-packages\medcat\basic_cat_ann.py", line 40, in add_ann
self._cat._add_ann(cui, doc, tkns, acc=1, name=name)

File "C:\ProgramData\Anaconda3\lib\site-packages\medcat\spacy_cat.py", line 373, in _add_ann
self._add_cntx_vec(cui, doc, tkns)

File "C:\ProgramData\Anaconda3\lib\site-packages\medcat\spacy_cat.py", line 291, in _add_cntx_vec
negs = self.vocab.get_negative_samples(n=self.CNTX_SPAN * 2)

File "C:\ProgramData\Anaconda3\lib\site-packages\medcat\utils\vocab.py", line 99, in get_negative_samples
inds = np.random.randint(0, len(self.unigram_table), n)

File "mtrand.pyx", line 745, in numpy.random.mtrand.RandomState.randint

File "_bounded_integers.pyx", line 1363, in numpy.random._bounded_integers._rand_int32

ValueError: low >= high

Use MetaCAT.save() with auto_save

Hi @w-is-h

Great work on the updated meta_cat, not sure what the functional changes were but the performance on our test sets significantly increased.

One thing I noticed was that when doing a run with auto_save_model set to True, to save the Config, I still needed to do a manual MetaCAT.save(). To prevent this, it might be nice to use this save function:

def save(self, save_dir_path):

instead of the auto_save_model's own save function:

torch.save(model.state_dict(), path)

How to get ICD10 codes if we are using medcat CDB ?

Hi @w-is-h ,
I need the ICD-10 code for the disease. As described in the Unsupervised Training and NER+L tutorial 3.2, when we run the code

Let's test the multi processing function first

in_data = [(3, "cancer")]
results = cat.multi_processing(in_data, nproc=2)
print(results)

we got
[(3, {'entities': [{'cui': 'C0006826', 'tui': 'T191', 'type': 'Neoplastic Process', 'source_value': 'cancer', 'acc': '0.7050448862662964', 'start': 0, 'end': 6, 'id': '0', 'pretty_name': 'malignant tumours', 'icd10': '', 'umls': '', 'snomed': ''}], 'text': 'cancer'})]

In this output we are not getting the 'icd10', 'umls' and 'snomed' values. How can we access them? Please suggest...
Thanks
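As a hedged illustration rather than an authoritative answer: if the CDB's addl_info contains a populated 'cui2icd10' mapping (another issue above notes that it is empty in some distributed models), the codes could be looked up per CUI after annotation, roughly like this:

# hypothetical lookup; returns an empty list when the mapping is absent or unpopulated
cui = 'C0006826'
icd10_codes = cdb.addl_info.get('cui2icd10', {}).get(cui, [])
print(icd10_codes)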

Exact text value

From paper Kraljevic, Zeljko, et al. "MedCAT--medical concept annotation tool." arXiv preprint arXiv:1912.10166 (2019), "An annotation by MedCAT is considered correct only if the exact text value was found and the annotation was linked to the correct concept in the CDB."

My understanding is that the exact text value would include the start and end index of the prediction, but from cat.py it seems that the end index is not included, only the start and the CUI? Thank you.

MedCat fails to correctly detect enumerations of (negative) diagnoses

Hey,

First of all thanks for the great package.

I'm using medcat 1.2.8 and I noticed the following issue:

Example:

text = "Patient suffers from diabetes. Denies hypertension, psychosis and glaucoma"

# let cat be the CAT object, that has been trained and initialized using the model pack/example data from the docs
annotated_text = cat.get_entities(text)

This results in:

{'entities': {2: {'pretty_name': 'Diabetes',
   'cui': 'C0011847',
   'type_ids': ['T047'],
   'types': ['Disease or Syndrome'],
   'source_value': 'diabetes',
   'detected_name': 'diabetes',
   'acc': 0.6452550625169893,
   'context_similarity': 0.6452550625169893,
   'start': 20,
   'end': 28,
   'icd10': [],
   'ontologies': [],
   'snomed': [],
   'id': 2,
   'meta_anns': {'Status': {'value': 'Affirmed',
     'confidence': 0.999997079372406,
     'name': 'Status'}}},
  3: {'pretty_name': 'Hypertensive disease',
   'cui': 'C0020538',
   'type_ids': ['T047'],
   'types': ['Disease or Syndrome'],
   'source_value': 'hypertension',
   'detected_name': 'hypertension',
   'acc': 0.6790682188733697,
   'context_similarity': 0.6790682188733697,
   'start': 37,
   'end': 49,
   'icd10': [],
   'ontologies': [],
   'snomed': [],
   'id': 3,
   'meta_anns': {'Status': {'value': 'Other',
     'confidence': 0.9918639063835144,
     'name': 'Status'}}},
  4: {'pretty_name': 'Psychotic Disorders',
   'cui': 'C0033975',
   'type_ids': ['T048'],
   'types': ['Mental or Behavioral Dysfunction'],
   'source_value': 'psychosis',
   'detected_name': 'psychosis',
   'acc': 0.3484492297815132,
   'context_similarity': 0.3484492297815132,
   'start': 51,
   'end': 60,
   'icd10': [],
   'ontologies': [],
   'snomed': [],
   'id': 4,
   'meta_anns': {'Status': {'value': 'Affirmed',
     'confidence': 0.8026704788208008,
     'name': 'Status'}}},
  5: {'pretty_name': 'Glaucoma',
   'cui': 'C0017601',
   'type_ids': ['T047'],
   'types': ['Disease or Syndrome'],
   'source_value': 'glaucoma',
   'detected_name': 'glaucoma',
   'acc': 0.3833850208933218,
   'context_similarity': 0.3833850208933218,
   'start': 65,
   'end': 73,
   'icd10': [],
   'ontologies': [],
   'snomed': [],
   'id': 5,
   'meta_anns': {'Status': {'value': 'Affirmed',
     'confidence': 0.9999270439147949,
     'name': 'Status'}}}},
 'tokens': []}

As one can see, MedCAT correctly determines that there is a diabetes diagnosis but no hypertension diagnosis. However, the "denies" context seems to get lost/ignored in the enumeration after hypertension, so psychosis and glaucoma are labelled as "Affirmed" although they should also be "Other" (i.e. negated).

Is this a known issue? Are there any approaches to solving such issues?

Many thanks in advance ;)

Question about the MedCAT model

Thank you for sharing the files. If it is possible could you answer the following question?

  1. Are the context vectors that are used for disambiguation included in the CDB file? I am asking because I am not sure whether, when we use the CDB and the VOCAB files, we will be using the pre-trained model that was trained on MIMIC-III.

  2. How many words are included in the UMLS vocabulary?

  3. Is there only one MetaCAT Status pre-trained model (the one included in the README.md), or is there another (UMLS) model? And does it only use the lstm.dat model?

medcat.utils.preprocess_snomed.Snomed - FileNotFoundError

Hi there,

Whenever I attempt to use the Snomed preprocess utility set, I have file not found errors:

from medcat.utils.preprocess_snomed import Snomed
snomed = Snomed("C:/path/to/dir/uk_sct2cl_32.7.0_20211124000001Z/")
cdf = snomed.to_concept_df()

Returns

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-23-5eb639e435ed> in <module>
----> 1 cdf = snomed.to_concept_df()

~\Projects\nlp\env\lib\site-packages\medcat\utils\preprocess_snomed.py in to_concept_df(self)
     50                     snomed_v = m.group(1)
     51 
---> 52             int_terms = parse_file(f'{contents_path}/sct2_Concept_Snapshot_{snomed_v}_{snomed_release}.txt')
     53             active_terms = int_terms[int_terms.active == '1']
     54             del int_terms

~\Projects\nlp\env\lib\site-packages\medcat\utils\preprocess_snomed.py in parse_file(filename, first_row_header, columns)
      7 
      8 def parse_file(filename, first_row_header=True, columns=None):
----> 9     with open(filename, encoding='utf-8') as f:
     10         entities = [[n.strip() for n in line.split('\t')] for line in f]
     11         return pd.DataFrame(entities[1:], columns=entities[0] if first_row_header else columns)

FileNotFoundError: [Errno 2] No such file or directory: 'C:/path/to/dir/uk_sct2cl_32.7.0_20211124000001Z/SnomedCT_UKClinicalRefsetsRF2_PRODUCTION_20211124T000001Z\\Snapshot\\Terminology/sct2_Concept_Snapshot_INT_20211124.txt'

whereas the actual file is named sct2_Concept_UKCRSnapshot_GB1000000_20211124.txt

Best wishes,

Keiron

create_model_pack creates directories recursively rather than storing model pack in the path

Hey,

consider the following example:

# some setup code here, cat is the medcat object
cat.create_model_pack("~/MyDir/MyProjects/MyModelPack")

This code results in a directory named ~ inside the current working directory, which itself contains MyDir, which contains MyProjects, which contains the actual model pack named MyModelPack.
I expected this code to save the model pack at the given path instead of recreating that path inside the cwd. I guess that's due to the line os.makedirs(save_dir_path, exist_ok=True) in the create_model_pack function. Is this intended behaviour?

Best ;)
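A possible workaround sketch (using the example path from above): expand the '~' yourself before handing the path to MedCAT, since os.makedirs() treats '~' as a literal directory name rather than the home directory:

import os

save_path = os.path.expanduser("~/MyDir/MyProjects/MyModelPack")  # resolves '~' to the user's home directory
cat.create_model_pack(save_path)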

multiprocessing function returns an error

When I attempt to call the multiprocessing function, I get a pickling error. My code is as follows:

import pandas as pd
import numpy as np
from medcat.vocab import Vocab
from medcat.cdb import CDB
from medcat.cat import CAT
from medcat.config import Config
from tqdm.notebook import tqdm
from medcat.meta_cat import MetaCAT

data = pd.read_csv('mri_reports.csv')
vocab = Vocab.load('C:/Users/Tom/Documents/Christie/MRI Medcat/Vocab and CBD/vocab.dat')

Config

config = Config()
config.general['spacy_model'] = 'en_core_sci_md'

tui_filter = ['T047'] # Detect only Disease and Mental Disorders
cui_filters = set()
for tui in tui_filter:
    cui_filters.update(cdb.addl_info['type_id2cuis'][tui])
config.linking['filters']['cuis'] = cui_filters

Get the status model for meta_annotations

mc_status = MetaCAT(save_dir='C:/Users/Tom/Documents/Christie/MRI Medcat/Vocab and CBD/mc_status')
mc_status.load()
cdb = CDB.load("test_cdb.dat")
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)

batch_size = 10
batch = []
cnt = 0
for id, row in data.iterrows():
    text = row['report_formatted']
    # Skip text if under 10 characters, not really necessary as we have filtered before,
    # but I like to be sure.
    if len(text) > 10:
        batch.append((id, text))

    if len(batch) > batch_size or id == len(data) - 1:
        # Update the number of processors depending on your machine.
        results = cat.multiprocessing(batch, nproc=2)

When I get to calling the multiprocessing function, the code errors with the following:

PicklingError Traceback (most recent call last)
in
11 if len(batch) > batch_size or id == len(data) - 1:
12 # Update the number of processors depending on your machine.
---> 13 results = cat.multiprocessing(batch, nproc=2)
14
15 for pair in results:

~\AppData\Roaming\Python\Python38\site-packages\medcat\cat.py in multiprocessing(self, in_data, nproc, batch_size_chars, max_ram_percentage, only_cui, addl_info)
747 p = Process(target=self._mp_cons, kwargs={'in_q': in_q, 'out_dict': out_dict, 'pid': i, 'only_cui': only_cui,
748 'addl_info': addl_info, 'max_ram_percentage': max_ram_percentage})
--> 749 p.start()
750 procs.append(p)
751

~.conda\envs\medcat\lib\multiprocessing\process.py in start(self)
119 'daemonic processes are not allowed to have children'
120 _cleanup()
--> 121 self._popen = self._Popen(self)
122 self._sentinel = self._popen.sentinel
123 # Avoid a refcycle if the target function holds an indirect

~.conda\envs\medcat\lib\multiprocessing\context.py in _Popen(process_obj)
222 @staticmethod
223 def _Popen(process_obj):
--> 224 return _default_context.get_context().Process._Popen(process_obj)
225
226 class DefaultContext(BaseContext):

~.conda\envs\medcat\lib\multiprocessing\context.py in _Popen(process_obj)
325 def _Popen(process_obj):
326 from .popen_spawn_win32 import Popen
--> 327 return Popen(process_obj)
328
329 class SpawnContext(BaseContext):

~.conda\envs\medcat\lib\multiprocessing\popen_spawn_win32.py in init(self, process_obj)
91 try:
92 reduction.dump(prep_data, to_child)
---> 93 reduction.dump(process_obj, to_child)
94 finally:
95 set_spawning_popen(None)

~.conda\envs\medcat\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
58 def dump(obj, file, protocol=None):
59 '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60 ForkingPickler(file, protocol).dump(obj)
61
62 #

PicklingError: Can't pickle <function at 0x000001A8DBFC8DC0>: attribute lookup on medcat.config failed

This seems to be an error with Python's native multiprocessing module not supporting lambda functions?

I'm running python 3.8.8.

Any help would be much appreciated.

Question - local creation UMLS concepts

Hi,
I am running some experiments with medcat. I have a UMLS license and was wondering whether there are instructions for running the build process anywhere? I've noticed the colab on custom vocabs and perhaps the process for UMLS is the same?

Thanks

MedCAT annotations- empty "pretty_name" field?

I am using the get_entities() method from the CAT class on some arbitrary text (i.e. cat.get_entities(text)) to extract annotations. Sometimes, the "pretty_name" field is empty even though the rest of the fields are populated. Example below. Has anyone else encountered this and know if it's something that would be happening on my end?

Entity 1
{   'acc': 0.2766362802720986,
    'context_similarity': 0.2766362802720986,
    'cui': 'C0542502',
    'detected_name': 'iodination',
    'end': 684,
    'icd10': [],
    'id': 59,
    'meta_anns': {},
    'ontologies': ['CSP', 'MTH', 'MSH', 'AOD'],
    'pretty_name': 'Iodination reaction',
    'snomed': [],
    'source_value': 'iodination',
    'start': 674,
    'tuis': ['T070'],
    'types': ['Natural Phenomenon or Process']}

Entity 2
{   'acc': 1.0,
    'context_similarity': 1.0,
    'cui': 'C0459768',
    'detected_name': 'mercurial',
    'end': 714,
    'icd10': [],
    'id': 60,
    'meta_anns': {},
    'ontologies': ['CHV', 'SNOMEDCT_US'],
    'pretty_name': '',
    'snomed': ['S-280907001'],
    'source_value': 'mercurial',
    'start': 705,
    'tuis': ['T121'],
    'types': ['Pharmacologic Substance']}

4.3 tutorial fails due to missing config.json file

Hi,

your 4.3 tutorial https://colab.research.google.com/drive/1apaFscR1a5shzuhg6nLM4lWxgvVbn8f1#scrollTo=TvvCIyv0afMZ fails due to:


FileNotFoundError                         Traceback (most recent call last)

<ipython-input-7-1892cf65204d> in <module>()
     18 
     19 # Get the status model for meta_annotations
---> 20 mc_status = MetaCAT.load("mc_status")
     21 
     22 # Create the full pipeline with models for meta-annotations

1 frames

/content/MedCAT/medcat/config.py in load(cls, save_path)
    105 
    106         # Read the jsonpickle string
--> 107         with open(save_path) as f:
    108             config_dict = jsonpickle.decode(f.read())
    109 

FileNotFoundError: [Errno 2] No such file or directory: 'mc_status/config.json'

at

# Get the status model for meta_annotations
mc_status = MetaCAT.load("mc_status")

Thanks!

about Meta Annotations

Thanks for addressing my previous question.

It seems that the mc_status file is not available.

Also, the Meta Annotations with MedCAT tutorial seems incomplete. I did not find the code for training a model for meta annotations.

In practice, would you suggest that I just use mc_status, if available later, for contextual detection (experiencer, negation, temporality, etc.)?

Best wishes,
A

Handling of words containing diacritics

Many functions in MedCAT seem to be tailored for English words, in which diacritics are quite rare. However, in Dutch, and I think in some other languages as well, they can be quite common. Because MedCAT often does not take words containing diacritics into account -- for example in the spell checker

letters = 'abcdefghijklmnopqrstuvwxyz'

and in tokenization

infix_re = re.compile(r'''[^A-Za-z0-9\@]''')

-- this could lead to mistakes during processing.

Perhaps some of these issues can be resolved by replacing usage of A-Za-z with [A-Za-zÀ-ÖØ-öø-ÿ] (see https://stackoverflow.com/a/26900132/4141535). I checked some of these occurrences in the MedCAT codebase but it might be better to have expert @w-is-h look into this :)

For an English text and concept test example we can use Ménière's disease:

Ménière's disease (MD) is a disorder of the inner ear that is characterized by episodes of vertigo, tinnitus, hearing loss, and a fullness in the ear. (https://en.wikipedia.org/wiki/Ménière%27s_disease)
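A small illustrative comparison of the two character classes mentioned above, using the Ménière example (this only demonstrates the regex behaviour, not MedCAT's actual tokenizer):

import re

ascii_infix = re.compile(r"[^A-Za-z0-9@]")
extended_infix = re.compile(r"[^A-Za-zÀ-ÖØ-öø-ÿ0-9@]")

print(ascii_infix.split("Ménière"))     # ['M', 'ni', 're'] -- accented letters act as split points
print(extended_infix.split("Ménière"))  # ['Ménière'] -- the word stays intact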

Problems installing on a Windows machine due to package dependencies

Hi,

Currently having an issue installing the medcat package due to the dependencies it's installing first.

Running the pip install medcat:

Collecting medcat
Note: you may need to restart the kernel to use updated packages.
  Using cached medcat-0.4.0.2-py3-none-any.whl (70 kB)
Collecting sklearn~=0.0
  Using cached sklearn-0.0.tar.gz (1.1 kB)
Requirement already satisfied: scipy~=1.4 in c:\users\mcheng\anaconda3\lib\site-packages (from medcat) (1.5.0)
Collecting spacy==2.2.4
  Using cached spacy-2.2.4-cp38-cp38-win_amd64.whl (10.1 MB)

ERROR: Could not find a version that satisfies the requirement torch~=1.4.0 (from medcat) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch~=1.4.0 (from medcat)

Looking into this, there are only macOS and Linux versions of torch 1.4.0.

Not sure on how I can resolve this on my end. Would a switch from torch to pytorch work?

Running this as a user on a Windows Server 2012 (R2 Standard) with Intel Xeon CPU E5-2650 v4 @ 2.20Ghz.
(I'm currently trying to figure out docker to see if I can use iCAT instead - it's looking unlikely)

Complete UMLS model download

Hello,

How can I download the complete UMLS model? I have a UMLS license but am unable to find it.

Additionally, is there any way to map the synonym information to the correct pretty name?

E.g. the term "calcaneal fracture" is identified as different entities and not as one, possibly because it doesn't map the word calcaneal to calcaneus. But it maps "heel bone fracture" correctly to calcaneus. Would this synonym resolution be possible if we integrate the complete UMLS here?

FileNotFoundError: [Errno 2] No such file or directory: 'mc_status/config.json'

Hey everyone,

great work with MedCAT!

I do have one issue, I can't figure out. Could you help me out how to load the status model for meta_annotations?

Im getting the same error, both local and in the colab (https://colab.research.google.com/drive/1apaFscR1a5shzuhg6nLM4lWxgvVbn8f1#scrollTo=TvvCIyv0afMZ):

FileNotFoundError: [Errno 2] No such file or directory: 'mc_status/config.json'

I have downloaded all required files as per the tutorials/colabs. All the paths are existing paths. It seems there is no config.json file in the mc_status folder after unzipping it.

Any suggestions?

Incorrect schema in CSVs used in Google Colab

Hi,
While working through your TDS article, the first Colab notebook under the Building custom models heading contains CSV files that use the "name" header instead of the expected "str" header. This caused an error and required a manual edit and re-upload. Thought you might want to know about that.

MedCAT annotations in displaCy

The previous version of MedCAT (<1.0) showed useful info (CUI, primary name of concept, TUI, name of TUI, context similarity) when rendering a MedCAT doc with displaCy. The latest version just shows "CONCEPT".

Any plans of returning this or similar functionality? It's great for generating screenshots for presentations :)

Additionally, it would be nice to configure what data is displayed, and add colors (e.g. per type)

Question about approximate string search

Quick question: are there any recommendations for how to incorporate approximate string search into the pipeline? I noticed MedCAT doesn't quite handle cases like couhg

Cdb.config() parameters definition ?

Is there any wiki/help guide/readme on the cdb.config parameters (e.g. cdb.config.ner, cdb.config.linking, etc.)?
I just want to know what these parameters do and how to use them.

Vocab.dat and cdb-medmen-v1.dat unable to be read

Hi again,

This issue is a little more serious. The 2 .dat files you link in the readme are encoded as bytes and cannot be read in by the Vocab class. I could not decode them using the code below or a range of variants I found on Stack Overflow. Code to decode: [line.decode() for line in io.open('vocab.dat', 'rb')]

I was not able to get past this today, so I don't know if the CDB file is any different. Am I doing something wrong? Missing a step?
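As a hedged note: the two .dat files are binary model dumps rather than text, so they are meant to be loaded through the MedCAT classes instead of being decoded line by line. In recent MedCAT versions that looks roughly like the following (older releases expose slightly different loading methods, so treat this as a sketch):

from medcat.vocab import Vocab
from medcat.cdb import CDB

vocab = Vocab.load("vocab.dat")
cdb = CDB.load("cdb-medmen-v1.dat")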
