
biomedicus's People

Contributors

benknoll-umn, dependabot[bot]


biomedicus's Issues

Make concepts DB easier to debug

Currently the process for searching the concept dictionary is very involved, requiring multiple steps to check whether a CUI is in the databases. Provide utilities to perform this check easily.
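
One possible shape for such a utility, as a sketch only: it assumes the concept dictionary can be opened as a dbm-style key-value store keyed by CUI, which may not match the actual BioMedICUS storage format.

# Hypothetical CUI lookup helper; the storage format is an assumption.
import argparse
import dbm

def main():
    parser = argparse.ArgumentParser(
        description='Check whether a CUI is present in the concepts DB.')
    parser.add_argument('db_path', help='Path to the concepts key-value store (assumed format).')
    parser.add_argument('cui', help='UMLS CUI to look up, e.g. C0011849.')
    args = parser.parse_args()
    with dbm.open(args.db_path, 'r') as db:
        key = args.cui.encode()
        if key in db:
            print('{}: present ({})'.format(args.cui, db[key].decode()))
        else:
            print('{}: not found'.format(args.cui))

if __name__ == '__main__':
    main()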

Concepts

  • Copy biomedicus concepts code over
  • Create concepts processor
  • Test normalization processor vs normalization lookup
  • Create concepts performance test

Script for running pipeline should have host parameter

Currently, if BioMedICUS is deployed on another server, each individual component address needs to be specified.

A global --pipeline-host parameter should be added that changes the default host from 127.0.0.1 while keeping the default ports for all components.
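
A sketch of the flag with argparse; the port table and wiring are illustrative, not the actual deployment script.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--pipeline-host', default='127.0.0.1',
                    help='Default host for all pipeline components; ports are unchanged.')
args = parser.parse_args()

# Hypothetical component ports, for illustration only.
DEFAULT_PORTS = {'events': 10100, 'sentences': 10102}
addresses = {name: '{}:{}'.format(args.pipeline_host, port)
             for name, port in DEFAULT_PORTS.items()}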

Performance Tests Statistics Collection

Create a global fixture for pytests that allows performance tests to report metrics.

Possibly attach performance metrics to processor documentation.

Should include documentation about the tests' gold standard corpus.
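
A minimal sketch of such a fixture in conftest.py; the names are illustrative, not the actual test suite's API.

# conftest.py -- session-scoped collector that performance tests can write to.
import json
import pytest

@pytest.fixture(scope='session')
def performance_metrics():
    metrics = {}
    yield metrics  # tests record entries, e.g. metrics['acronyms'] = {'recall': 0.93}
    # After the session, persist everything for reporting/documentation.
    with open('performance_metrics.json', 'w') as f:
        json.dump(metrics, f, indent=2)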

Test acronyms for precision

Currently only recall is tested / recorded for acronyms. Precision could be tested in the same way that it is for concepts.
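
For reference, precision needs only a count of false positives in addition to the true positives already tallied for recall (sketch):

def precision_recall(true_positives, false_positives, false_negatives):
    # Precision: of the acronyms we expanded, how many were correct?
    precision = true_positives / (true_positives + false_positives)
    # Recall: of the gold-standard acronyms, how many did we find?
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall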

Unable to build 3.0-beta.8 due to unavailable mtap Maven dependency

Describe the bug
Unable to build 3.0-beta.8 due to unavailable mtap Maven dependency

To Reproduce
python setup.py build

Expected behavior

Build to complete without errors.

Terminal Output

python setup.py build
running build
running build_py
Starting a Gradle Daemon (subsequent builds will be faster)

FAILURE: Build failed with an exception.

* What went wrong:
Could not determine the dependencies of task ':shadowJar'.
> Could not resolve all dependencies for configuration ':runtimeClasspath'.
   > Could not find any version that matches edu.umn.nlpie:mtap:[0.8.0, ).
     Versions that do not match:
       - 0.7.0
       - 0.6.0
       - 0.5.0
       - 0.4.0
       - 0.3.0
       - + 1 more
     Searched in the following locations:
       - https://repo.maven.apache.org/maven2/edu/umn/nlpie/mtap/maven-metadata.xml
       - file:/home/dave/.m2/repository/edu/umn/nlpie/mtap/
     Required by:
         project :

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 10s
error: Java build failed.

Environment

  • OS: Ubuntu 18.04
  • Version: 3.0-beta.8
  • Python Version: 3.8
  • Java Version: openjdk 11.0.9.1

Additional context
None

PyTorch-based sentences processor (and others?) is bottlenecking multiprocessing.

Above 8 threads, the Python GIL in the sentences processor starts to bottleneck pipeline throughput. There are currently three options for resolving this:

  • Use a (torch.)multiprocessing.Pool to create multiple processes inside the processor for parallelism.
  • Deploy multiple instances of the sentences processor and use round-robin load balancing at the gRPC channel (see the sketch after this list).
  • Use a PyTorch model server (https://github.com/pytorch/serve) to serve the sentences model, with a single multithreaded processor that calls out to the model server.
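
A minimal sketch of the second option, using gRPC's client-side round-robin policy; the processor addresses are hypothetical.

import grpc

# Two deployed instances of the sentences processor; the "ipv4:" target
# accepts a comma-separated address list, and the channel option selects
# client-side round-robin load balancing between them.
channel = grpc.insecure_channel(
    'ipv4:127.0.0.1:10102,127.0.0.1:10112',
    options=[('grpc.lb_policy_name', 'round_robin')],
)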

Model download script fails on missing VERSION.txt

The script for downloading BioMedICUS models fails when the data directory exists but does not contain a "VERSION.txt" file.

The script should be updated such that it tests for the existence of VERSION.txt in the same way it tests for the existence of the data directory.
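
A sketch of the intended check, reusing names that appear in the script's traceback elsewhere in this tracker; the exact wiring is an assumption.

from pathlib import Path

def check_data(data_dir, download_url):
    data = Path(data_dir)
    # Treat a data directory without VERSION.txt the same as a missing directory.
    if not (data / 'VERSION.txt').exists():
        download_data_to(download_url, data)  # existing helper in deploy_biomedicus.py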

Normalization?

Need to determine whether normalization should exist as a processor or whether we should just do on-the-fly normalization lookup in the other components that need it (a minimal lookup sketch follows the list below).

  • Normalization lookup code that can be used from anywhere
  • Normalization processor
  • Normalization service with custom endpoints?
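
Whatever form it takes, the core is a lookup from an inflected form (plus part of speech) to a base form; a minimal in-memory sketch, with illustrative table contents:

class NormalizationLookup:
    """Maps (token, pos) pairs to normalized base forms. Illustrative only."""

    def __init__(self, table):
        self._table = table  # e.g. {('running', 'VBG'): 'run'}

    def normalize(self, token, pos):
        # Fall back to the lowercased token when no entry exists.
        return self._table.get((token.lower(), pos), token.lower())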

NegEx

Add NegEx for negation / uncertainty detection. NegEx is probably the fastest option for getting negation / uncertainty into BioMedICUS immediately, and it will be a useful baseline for our work on #38.
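
As a reminder of the approach, NegEx matches trigger phrases within a window around a concept mention; a toy sketch with a tiny trigger subset (not the real NegEx trigger list):

import re

# Tiny illustrative subset of pre-negation triggers.
PRE_NEGATION = re.compile(r'\b(no|denies|without|negative for)\b', re.IGNORECASE)

def is_negated(sentence, concept_start, window=40):
    """True if a negation trigger appears shortly before the concept mention."""
    preceding = sentence[max(0, concept_start - window):concept_start]
    return PRE_NEGATION.search(preceding) is not None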

Example of Calling Remote Hosted Pipeline

Could someone provide an example of what parameters/config to set when calling services hosted on another server (or docker)?

Say I want to run the basic biomedicus run command and I have biomedicus deployed on another server.

Or maybe even how to run python/examples/sentences.py with biomedicus deployed on another host.

In my experiments, I think it's working, but it breaks down when the server tries to deliver the response back.

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Broken pipe"
        debug_error_string = "{"created":"@1588255304.736909283","description":"Error received from peer ipv4:127.0.0.1:10100","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Broken pipe","grpc_status":14}"

Thanks!
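
The broken pipe on 127.0.0.1:10100 suggests the events-service address is still defaulting to localhost, so the remote host needs to be passed for the events client as well as for each processor. A hedged sketch along the lines of mtap's Python API; host names and ports are examples, and exact signatures should be treated as assumptions.

from mtap import Event, EventsClient, Pipeline, RemoteProcessor

# 10100 is the default events-service port seen in the error above;
# 10102 is a hypothetical processor port on the remote host.
client = EventsClient(address='my-server:10100')
pipeline = Pipeline(
    RemoteProcessor('biomedicus-sentences', address='my-server:10102'),
)
with Event(event_id='example.txt', client=client) as event:
    document = event.create_document('plaintext', 'The patient is sick.')
    pipeline.run(document)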

PoS Tagging

  • Port pos tagging code from biomedicus
  • NEWT processor for pos tagging
  • Performance test for pos tagging

Sentences Update

Sentences needs to be updated for the general public release.

  • Resolve any tensorflow warnings
  • Move input mapping and vocabulary lookup into TF code as much as possible
  • Retrain model using public/releasable word embeddings
  • Write blog post about sentences

Processor "biomedicus-selective-dependencies" failed

Describe the bug
While running a batch of files through the pipeline, this error occurred. It happens sporadically: a file on which the process fails may process normally the next time the same batch of files is run.
More information: using multiple threads seems to be the problem. I've been running batches of 100 files using a single thread (biomedicus run --threads 1 /input /output) with no issues. As soon as I set threads to anything but 1, the process consistently fails.
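
If the shared Stanza pipeline object is not thread-safe (an assumption consistent with the mismatched batch sizes in the traceback below), one workaround is to serialize access to the model; a sketch:

import threading

_nlp_lock = threading.Lock()

def parse_sentence(nlp, text):
    # Allow only one thread into the shared Stanza model at a time.
    with _nlp_lock:
        return nlp([text])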

To Reproduce
biomedicus run /input_folder /output_folder

Terminal Output


13:14:01.162 [main] INFO  edu.umn.nlpie.mtap.processing.DefaultProcessorService - Server for processor_id: biomedicus-section-headers started on port: 10109
Done starting all processors
Processor "biomedicus-selective-dependencies" failed while processing event with id: 46-10.txt
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mtap/processing/_runners.py", line 85, in call_process
    result = self.processor.process(event, p)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mtap/processing/base.py", line 296, in process
    return self.process_document(document, params)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/biomedicus/dependencies/stanza_selective_parser.py", line 91, in process_document
    stanza_doc = self.nlp([sentence.text])
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/stanza/pipeline/core.py", line 166, in __call__
    doc = self.process(doc)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/stanza/pipeline/core.py", line 160, in process
    doc = self.processors[processor_name].process(doc)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/stanza/pipeline/lemma_processor.py", line 67, in process
    ps, es = self.trainer.predict(b, self.config['beam_size'])
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/stanza/models/lemma/trainer.py", line 91, in predict
    preds, edit_logits = self.model.predict(src, src_mask, pos=pos, beam_size=beam_size)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/stanza/models/common/seq2seq_model.py", line 214, in predict
    return self.predict_greedy(src, src_mask, pos=pos)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/stanza/models/common/seq2seq_model.py", line 179, in predict_greedy
    h_in, (hn, cn) = self.encode(enc_inputs, src_lens)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/stanza/models/common/seq2seq_model.py", line 119, in encode
    packed_h_in, (hn, cn) = self.encoder(packed_inputs, (self.h0, self.c0))
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 574, in forward
    self.check_forward_args(input, hx, batch_sizes)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 528, in check_forward_args
    self.check_hidden_size(hidden[0], expected_hidden_size,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 195, in check_hidden_size
    raise RuntimeError(msg.format(expected_hidden_size, tuple(hx.size())))
RuntimeError: Expected hidden[0] size (2, 5, 100), got (2, 7, 100)

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mtap/processing/_runners.py", line 85, in call_process
    result = self.processor.process(event, p)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mtap/processing/base.py", line 296, in process
    return self.process_document(document, params)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/biomedicus/dependencies/stanza_selective_parser.py", line 91, in process_document
    stanza_doc = self.nlp([sentence.text])
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/stanza/pipeline/core.py", line 166, in __call__
    doc = self.process(doc)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/stanza/pipeline/core.py", line 160, in process
    doc = self.processors[processor_name].process(doc)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/stanza/pipeline/lemma_processor.py", line 67, in process
    ps, es = self.trainer.predict(b, self.config['beam_size'])
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/stanza/models/lemma/trainer.py", line 91, in predict
    preds, edit_logits = self.model.predict(src, src_mask, pos=pos, beam_size=beam_size)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/stanza/models/common/seq2seq_model.py", line 214, in predict
    return self.predict_greedy(src, src_mask, pos=pos)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/stanza/models/common/seq2seq_model.py", line 179, in predict_greedy
    h_in, (hn, cn) = self.encode(enc_inputs, src_lens)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/stanza/models/common/seq2seq_model.py", line 119, in encode
    packed_h_in, (hn, cn) = self.encoder(packed_inputs, (self.h0, self.c0))
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 574, in forward
    self.check_forward_args(input, hx, batch_sizes)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 528, in check_forward_args
    self.check_hidden_size(hidden[0], expected_hidden_size,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 195, in check_hidden_size
    raise RuntimeError(msg.format(expected_hidden_size, tuple(hx.size())))
RuntimeError: Expected hidden[0] size (2, 5, 100), got (2, 7, 100)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mtap/processing/service.py", line 192, in Process
    result, times, added_indices = self._runner.call_process(request.event_id, params)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mtap/processing/_runners.py", line 91, in call_process
    raise ProcessingError() from e
mtap.processing._runners.ProcessingError

Environment

  • OS: MacOS High Sierra
  • Version 10.13.6
  • Python Version 3.8
  • Java Version [e.g. OpenJDK 8.0]

Maximum Iterations reached while processing dependency graph.

Describe the bug
The batch process consistently fails while processing a specific file. I opened stanza_selective_parser.py and increased MAX_ITER to 50000, but that didn't help.

Terminal Output

Processor "biomedicus-selective-dependencies" failed while processing event with id: 9456997-182295206.txt
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mtap/processing/_runners.py", line 85, in call_process
    result = self.processor.process(event, p)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mtap/processing/base.py", line 296, in process
    return self.process_document(document, params)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/biomedicus/dependencies/stanza_selective_parser.py", line 100, in process_document
    raise ValueError(
ValueError: Maximum Iterations reached while processing dependency graph.

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mtap/processing/_runners.py", line 85, in call_process
    result = self.processor.process(event, p)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mtap/processing/base.py", line 296, in process
    return self.process_document(document, params)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/biomedicus/dependencies/stanza_selective_parser.py", line 100, in process_document
    raise ValueError(
ValueError: Maximum Iterations reached while processing dependency graph.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mtap/processing/service.py", line 192, in Process
    result, times, added_indices = self._runner.call_process(request.event_id, params)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mtap/processing/_runners.py", line 91, in call_process
    raise ProcessingError() from e
mtap.processing._runners.ProcessingError

Environment

  • OS: MacOS
  • Version 3.0b6
  • Python Version 3.8
  • Java Version OpenJDK 8.0

Additional context
Here's the offending file:
9456997-182295206.txt

Easy RTF conversion-only pipeline

A pipeline that only converts RTF documents to plain text, deployable and runnable from the "biomedicus" command.

Add processor-specific metadata to documents

Currently, not much ties the output artifacts to the source code version or data model versions in biomedicus, which makes debugging difficult. Resolve this by adding metadata fields like "concepts_biomedicus_version" and "concepts_umls_version" to the document during processing.
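
A sketch of what this could look like using the event metadata mapping; the field values are examples, and the exact mtap API details are an assumption.

# Fragment of a processor method: record provenance on the event being processed.
def process_document(self, document, params):
    event = document.event
    event.metadata['concepts_biomedicus_version'] = '3.0-beta.8'  # example value
    event.metadata['concepts_umls_version'] = '2020AB'            # example value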

Update UIMA RTF code to use NEWT

RTF uses consumers that write output to a UIMA CAS; this needs to be updated to use NEWT events.

  • More generalized sink for parsed RTF data.
  • Use an input stream instead of a reader.
  • Keyword actions should do the work themselves.
  • Hex escapes should use the RTF document's current ANSI code page.
  • Properly handle encoding switches.
  • Properly handle \upr Unicode pair destinations.
  • Unit tests for existing RTF code.
  • Create "convert to plaintext" sink.
  • Create NEWT sink.

Negation / Uncertainty

  • Literature Review
  • Find negation gold standard / test set
  • Create performance tests
  • Implement baseline (NegEx?)
  • Experiment with other approaches

Sentences: Use variable length sequences

Use variable-length sequences so that the entire sequence can be processed during prediction. This will require retraining the model.

Update the TensorFlow code to be compatible with 2.0.

Acronyms

  • Copy Acronym code from biomedicus
  • Create NEWT processor for acronyms
  • Create performance test for acronyms

Sentence Parsing Differences Between 3.0b7 and 3.0b8

Describe the bug
(See attachment for inputs/outputs.)
Sentence parsing on a simple document with headers is slightly off between 3.0b7 and 3.0b8. The issues in 3.0b8 seem to affect the ability to identify headers/sections. The failure happens early in the document but has a lasting effect on section parsing.

To Reproduce
Run the full pipeline for the sample Source Doc in both 3.0b7 and 3.0b8, then compare results.

Expected behavior
Sentence and section parsing should be consistent between 3.0b7 and 3.0b8.

Terminal Output
None

Environment

  • OS: Ubuntu 18.04
  • Version: 3.0b8
  • Python Version: 3.6.9
  • Java Version: OpenJDK 11.0.10+9

Additional context
The test scenario and results are in the attached sheet.
20210218-SentencesIssue.xlsx

Initial distribution release

  • Include the jar in the Python release
  • Script for launching all processors
  • Script for processing text files
  • Script that checks for data files, downloading them if necessary

Latest vocab lacks Novel 2019 Coronavirus terms & concepts

Describe the bug
The most current data download (biomedicus-3.0b4-umls-license-required-data) does not contain the Novel 2019 Coronavirus / Covid-19 terms added to UMLS Metathesaurus vocabularies, such as SNOMED and MSH, in early 2020. Without these updates the tool does not support analysis of Covid-19-related EHR histories.

To Reproduce
Run extraction over documents containing references to Covid-19, Wuhan Virus, Novel 2019 Coronavirus, etc.; the related terms are not detected with the available dataset.

Expected behavior
The dataset should be amended to include the newer terms summarized here: https://metamap.nlm.nih.gov/Covid19Terms.shtml.

Terminal Output
N/A

Environment
N/A

Additional context
None

Unable to download data [SSL: CERTIFICATE_VERIFY_FAILED]

I followed the instructions to install biomedicus v3 on both my Mac and a clean Ubuntu VM. Both produced the same error below. I tried every suggestion I found online:

  • pip3 install certifi
  • running Install Certificates.command on the Mac
  • running Update Shell Profile.command

Nothing seems to help.

alex@labrat:~$ biomedicus deploy --download-data
No existing data folder.
Starting download: https://nlpie.umn.edu/downloads/open/biomedicus-3.0b6-standard-data.zip
Traceback (most recent call last):
  File "/usr/lib/python3.8/urllib/request.py", line 1326, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.8/http/client.py", line 1240, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1286, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1235, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1006, in _send_output
    self.send(msg)
  File "/usr/lib/python3.8/http/client.py", line 946, in send
    self.connect()
  File "/usr/lib/python3.8/http/client.py", line 1409, in connect
    self.sock = self._context.wrap_socket(self.sock,
  File "/usr/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/lib/python3.8/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/usr/lib/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/alex/.local/bin/biomedicus", line 8, in <module>
    sys.exit(main())
  File "/home/alex/.local/lib/python3.8/site-packages/biomedicus/cli.py", line 77, in main
    f(conf)
  File "/home/alex/.local/lib/python3.8/site-packages/biomedicus/deployment/deploy_biomedicus.py", line 101, in deploy
    check_data(conf.download_data)
  File "/home/alex/.local/lib/python3.8/site-packages/biomedicus/deployment/deploy_biomedicus.py", line 71, in check_data
    download_data_to(download_url, data)
  File "/home/alex/.local/lib/python3.8/site-packages/biomedicus/deployment/deploy_biomedicus.py", line 86, in download_data_to
    urllib.request.urlretrieve(download_url, filename=temporary_file.name,
  File "/usr/lib/python3.8/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1369, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/usr/lib/python3.8/urllib/request.py", line 1329, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)>
alex@labrat:~$
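
For anyone hitting this, one workaround that often resolves it is handing urllib an SSL context built from certifi's CA bundle; a sketch of the equivalent download (the downloader itself would need the same change):

import ssl
import urllib.request

import certifi

# Build an SSL context whose trust store is certifi's CA bundle.
context = ssl.create_default_context(cafile=certifi.where())
url = 'https://nlpie.umn.edu/downloads/open/biomedicus-3.0b6-standard-data.zip'
with urllib.request.urlopen(url, context=context) as response, \
        open('biomedicus-3.0b6-standard-data.zip', 'wb') as out:
    out.write(response.read())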
