baderlab / saber

Saber is a deep-learning based tool for information extraction in the biomedical domain. Pull requests are welcome! Note: this is a work in progress. Many things are broken, and the codebase is not stable.

Home Page: https://baderlab.github.io/saber/

License: MIT License

Python 92.25% Jupyter Notebook 7.61% Dockerfile 0.14%

information-extraction deep-learning biomedical-text-mining biomedical-named-entity-recognition spacy machine-learning bioinformatics

saber's Introduction

Saber

Saber (Sequence Annotator for Biomedical Entities and Relations) is a deep-learning based tool for information extraction in the biomedical domain.

Installation • Quickstart • Documentation

Installation

Note! This is a work in progress. Many things are broken, and the codebase is not stable.

To install Saber, you will need Python 3.6.

Latest PyPI stable release

(saber) $ pip install saber

The install from PyPI is currently broken. Please install using the instructions below.

Latest development release on GitHub

Pull and install straight from GitHub

(saber) $ pip install git+https://github.com/BaderLab/saber.git

or install by cloning the repository

(saber) $ git clone https://github.com/BaderLab/saber.git
(saber) $ cd saber

and then using either pip

(saber) $ pip install -e .

or setuptools

(saber) $ python setup.py install

See the documentation for more detailed installation instructions.

Quickstart

If your goal is to use Saber to annotate biomedical text, then you can either use the web-service or a pre-trained model. If you simply want to check Saber out, without installing anything locally, try the Google Colaboratory notebook.

Google Colaboratory

The fastest way to check out Saber is by following along with the Google Colaboratory notebook. To run the cells, select "Open in Playground" or, alternatively, save a copy to your own Google Drive account (File > Save a copy in Drive).

Web-service

To use Saber as a local web-service, run

(saber) $ python -m saber.cli.app

or, if you prefer, you can pull & run the Saber image from Docker Hub

# Pull Saber image from Docker Hub
$ docker pull pathwaycommons/saber
# Run docker (use `-dt` instead of `-it` to run container in background)
$ docker run -it --rm -p 5000:5000 --name saber pathwaycommons/saber

There are currently two endpoints, /annotate/text and /annotate/pmid. Both expect a POST request with a JSON payload, e.g.,

{
  "text": "The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53."
}

{
  "pmid": 11835401
}

For example, running the web-service locally and using cURL

$ curl -X POST 'http://localhost:5000/annotate/text' \
--data '{"text": "The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53."}'

Documentation for the Saber web-service API can be found here.
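
The same request can also be built from Python. The following is a minimal sketch using only the standard library. It assumes the web-service is running locally on port 5000 as described above. The helper names `build_annotate_request` and `annotate_text` are hypothetical and not part of Saber, and both the JSON `Content-Type` header and the JSON-encoded response are assumptions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumes the local web-service from above


def build_annotate_request(text, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /annotate/text endpoint."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/annotate/text",
        data=payload,
        # Assumption: the service accepts a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def annotate_text(text, base_url=BASE_URL):
    """Send the request and decode the response (requires a running service).

    Assumption: the service replies with a JSON body.
    """
    request = build_annotate_request(text, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `annotate_text("The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.")` would send the same payload as the cURL example.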

Pre-trained models

First, import the Saber class. This is the interface to Saber

from saber.saber import Saber

then create a Saber object

saber = Saber()

and then load the model of our choice

saber.load('PRGE')

To annotate text with the model, just call the Saber.annotate() method

saber.annotate("The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.")

See the documentation for more details on using pre-trained models.

Documentation

Documentation for the Saber package can be found here. The web-service API has its own documentation here.

You can also call help() on any Saber method for more information

from saber.saber import Saber

saber = Saber()

help(saber.annotate)

or pass the --help flag to any of the command-line interfaces

(saber) $ python -m saber.cli.train --help

Feel free to open an issue or reach out to us on our Slack channel for more help.

saber's Issues

Implement Named Entity Normalization (NEN)

This is likely to be extremely hard. One solution is to completely outsource this by calling some other program. Either way, we should implement even a naive solution soon.

Spacy tokenization is different than training data

The tokenization performed by Spacy is likely inadequate for biomedical text. It does not break up many tokens that, in the training data, are broken up. For example:

Spacy

"... mutations in the ataxia-telangiectasia ..." --> ["mutations", "in", "the", "ataxia-telangiectasia"]

Training data

"... mutations in the ataxia-telangiectasia ..." --> ["mutations", "in", "the", "ataxia", "-", "telangiectasia"]

This is a big problem, because the data the model is deployed on will look slightly different to the data it was trained on, and therefore performance will suffer.

There are a couple of options to fix this:

write custom rules for the spaCy tokenizer
find another off-the-shelf tokenizer for biomedical text and use it in combination with spaCy (spaCy's Doc object allows you to feed it already tokenized text).
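
As a rough illustration of the first option, the sketch below splits text so that every punctuation character becomes its own token, which reproduces the training-data example above. It is only a sketch: the actual tokenization rules of the training data are not documented here, so the pattern is an assumption, and `tokenize_like_training_data` is a hypothetical name.

```python
import re

# Assumption: the training data splits every punctuation character
# (including hyphens) into its own token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_like_training_data(text):
    """Split text into runs of word characters and single punctuation marks."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize_like_training_data("mutations in the ataxia-telangiectasia")
print(tokens)
```

A token list like this could then be handed to spaCy as pre-tokenized text, as described in the second option.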

Find way to incorporate Igor's grounding function into the annotation process

There is now a function (utils/grounding_utils/ground.py) which grounds annotations made by Saber. Find way to incorporate this into the annotation pipeline.

Beef up unit tests

The current suite of unit tests leaves a lot to be desired. Steps to improve this:

Set aside enough time to learn the ins and outs of pytest and make sure I am using it to its full advantage. A general brush-up on best practices for unit testing would also be a good idea.
Work toward 100% coverage. At the very least, cross the 90% coverage mark.
See if I can re-write the slowest unit tests to speed things up.
Set up TravisCI for testing on macOS and Windows
Write unit tests for the web-app. See here.
Optional: move code coverage to codacy from coveralls, just to use one less tool and simplify things.

Loading pre-trained models packaged with Saber is confusing

If a user just wants to load the pre-trained models that come with Saber (i.e., they do not want to train and load their own models), it is not clear how they might do this.

Need to:

Provide some mechanism for loading pre-trained models provided with saber
Update the using_pretrained_models notebook with better instructions for how to do this

Dockerize (flask app)

Multi-GPU model doesn't work

When trying to train Saber across multiple GPUs, the following error is thrown:

Traceback (most recent call last):
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.6.3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/scratch/johnmg/dev/saber/saber/train.py", line 44, in <module>
    main()
  File "/scratch/johnmg/dev/saber/saber/train.py", line 31, in main
    sp.create_model()
  File "/scratch/johnmg/dev/saber/saber/sequence_processor.py", line 287, in create_model
    model.compile_()
  File "/scratch/johnmg/dev/saber/saber/models/multi_task_lstm_crf.py", line 201, in compile_
    crf_loss_function = self.model[i].layers[-1].loss_function
AttributeError: 'Concatenate' object has no attribute 'loss_function'

This looks to be caused by the way I try to grab the CRF loss function. It raises other questions, though: do I need to grab the loss function on a per-model basis, or simply grab it once?

Also, multi-gpu training doesn't work unless you use tensorflow-gpu==1.7.0. No idea why.

TODOS

Fix the bug. Get a model to train across multiple GPUs
Play around and see if we need a single loss function object or one per model
Explain multi-gpu training in the documentation
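
One possible direction for the first TODO, sketched under stated assumptions: instead of assuming the last layer is the CRF, search the layers for one that actually exposes a `loss_function` attribute. The classes below are stand-ins for illustration only and are not Keras layers, and `find_crf_loss_function` is a hypothetical helper. Whether the multi-GPU model exposes the CRF layer in its layer list at all has not been verified here.

```python
def find_crf_loss_function(layers):
    """Return the loss function of the last layer that defines one."""
    for layer in reversed(layers):
        loss_function = getattr(layer, "loss_function", None)
        if loss_function is not None:
            return loss_function
    raise ValueError("No layer with a 'loss_function' attribute was found.")


# Stand-in classes for illustration only; they are not Keras layers.
class FakeCRF:
    def loss_function(self, y_true, y_pred):
        return 0.0


class FakeConcatenate:
    pass


# The last layer is a Concatenate, as in the traceback above.
example_layers = [FakeCRF(), FakeConcatenate()]
crf_loss = find_crf_loss_function(example_layers)
```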

Add support for python 3.5

Add support for python>=3.5. This simply requires changing the python requirement in the setup.py file and adding a flag to the TravisCI config.

Transfer learning from corpus with more entity types

Currently, you cannot perform transfer learning from corpus A to corpus B if corpus A contains labels not contained in corpus B.

It would be nice if this were possible. The basic idea would be to not load the final CRF layer when performing a transfer.

Add pytest to test_requires

pytest dependency should be added to test_requires section of setup.py. Also, configure pytest such that

$ python setup.py test

Installs all test dependencies and runs the test suite.

Resources

this SO post

Create an examples folder, with useful scripts

~~An example folder containing useful scripts (not unlike https://github.com/keras-team/keras) would likely be helpful for anyone who wants to use the tool.~~

For now, Jupyter notebooks might actually be a better idea. Clean up the notebooks I have and move them into their own directory.

Setup documentation with MKDocs

Need some documentation other than the MKDocs

TODO

Setup a simple page with MKDocs
Choose a nice theme!
Find someone else's documentation to use as inspiration (maybe Keras?)
"Sync" information across the readme, documentation, and notebook.

Tensorboard support

There is currently little to no TensorBoard support. It would be helpful if this were properly set up, though it is hardly a priority at this point in time.

Implement proper logging

I need to implement proper logging (in place of the current scattered print statements).

Resources

Use the documentation and this resource to make sure I do this properly.

Tasks

Replace all print statements with info level logs where appropriate
Include logging statements in all handled error exceptions.
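
A minimal sketch of both tasks using the standard `logging` module. The logger name and the `load_embeddings` function are hypothetical and exist only to show the pattern: an info-level log where a print statement would have been, and a logged exception inside a handled error.

```python
import logging

logger = logging.getLogger("saber_example")  # hypothetical logger name


def load_embeddings(filepath):
    """Hypothetical loader used only to show the logging pattern."""
    # Replaces a bare print statement with an info-level log.
    logger.info("Loading embeddings from %s", filepath)
    try:
        with open(filepath) as f:
            return f.read().splitlines()
    except OSError:
        # logger.exception logs at ERROR level and includes the traceback.
        logger.exception("Could not load embeddings from %s", filepath)
        return []
```

Calling `logging.basicConfig(level=logging.INFO)` once at program start-up would make these messages visible on the console.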

Create a landing page for the web-service.

There is currently no landing page for the web-service.

Create a simple landing page for the web-service which explains the API. A lot of inspiration can be drawn from here.

Simple web interface for annotating text

For now, this is low on the priority scale, but is something we will likely have to implement. Many text-mining tools include simple web interfaces for annotating text (ex, ex, ex).

We could do something similar, with Flask. Has the added bonus that it facilitates easy testing of Saber.

Add more performance metrics

It would be useful to add more performance metrics like AuROC, AuPR, TP, FP, FN. These should be printed by default in the table displayed after each epoch.

TODO

TP, FP, and FN are already computed. Just add them to the table (and probably the JSON output as well).
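
For reference, a small sketch of how precision, recall, and F1 follow from the TP, FP, and FN counts, and how a table row including the raw counts might be formatted. The function names and the row layout are assumptions for illustration, not Saber's actual output format.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts, guarding against division by zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def format_row(label, tp, fp, fn):
    """Format one row of a per-label metrics table that also shows TP, FP, and FN."""
    precision, recall, f1 = precision_recall_f1(tp, fp, fn)
    return f"{label:<8}{precision:>8.2%}{recall:>8.2%}{f1:>8.2%}{tp:>6}{fp:>6}{fn:>6}"


print(format_row("PRGE", tp=8, fp=2, fn=2))
```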

Error with Python environment setup process

Trying to set up a Python environment with Anaconda, following the instructions outlined in the repository:

Error 1: installing saber with "pip install ." inside the saber folder

Error 2: installing en_coref_md, tried running twice but the same errors come up

Save config to pre-trained model folder

When saving a model, it would be useful if the configuration used to train that model was saved as an .ini file to the model folder.

TODO

Save a .ini file containing the configuration used to train a model to the model's saved folder.

Try just adding self.config.save(filepath) to SequenceProcessor.save().
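
As a sketch of what such a save step could look like with the standard `configparser` module: the function below writes a nested dictionary of settings to an .ini file inside a model folder. The `save_config` name, the section and option names, and the dictionary layout are all assumptions for illustration; the real Config class may store its values differently.

```python
import configparser
import os


def save_config(settings, model_dir, filename="config.ini"):
    """Write a nested dict ({section: {option: value}}) as an .ini file in model_dir."""
    parser = configparser.ConfigParser()
    for section, options in settings.items():
        # ConfigParser only stores strings, so convert every value.
        parser[section] = {key: str(value) for key, value in options.items()}
    os.makedirs(model_dir, exist_ok=True)
    filepath = os.path.join(model_dir, filename)
    with open(filepath, "w") as f:
        parser.write(f)
    return filepath
```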

The TravisCI builds fail with a cryptic error message

The TravisCI builds fail with a cryptic error message. A SO post seems to suggest this is an issue with installing Gensim on a google cloud instance and suggests a few workarounds. Try them out and get builds passing again.

Train models for each major entity class

Need to train models for each major entity class: PRGE, LIVB, DISO, CHED. The first three are fairly straightforward. As for the last, there are multiple levels of granularity to the entity annotations; for now, we might just cheat and collapse everything under the CHED tag.

For relations, we are at the mercy of what datasets are available. Right now, we could train a model for adverse drug events using the ADE corpus.

There should be a base and large version for each model. In the case of BERT, this would correspond to whether the BERT base or large model was used. Any model not implemented should raise a NotImplementedError (see #155).

Finally, the model names should follow a convention. Maybe [model-name]-[entity or relation]-[base or large], e.g. bert-for-ner-prge, bert-for-ner-prge-lg. See PyTorch Transformers or SpaCy for inspiration.

BERT

Entities

Relations

Train ADE

Config file in output should reflect actual arguments used

Currently, during training, saber creates an output folder and saves the config.ini file used to specify model/training details to this output folder. However, if saber was called in the command line and command line args were passed in this call, this is not captured in that saved config file.

The config.ini file saved to the output folder during training should reflect any arguments passed at the command line.
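
One way to get there, sketched with `argparse` and `configparser`: give every command-line option a default of `None`, then overwrite the loaded config only with options that were actually supplied before the config is saved. The option names, the `settings` section, and the `apply_cli_overrides` helper are hypothetical.

```python
import argparse
import configparser


def apply_cli_overrides(config, cli_args, section="settings"):
    """Overwrite config values with the command-line arguments that were supplied."""
    if not config.has_section(section):
        config.add_section(section)
    for key, value in vars(cli_args).items():
        if value is not None:  # None means the option was not given
            config.set(section, key, str(value))
    return config


cli_parser = argparse.ArgumentParser()
# Hypothetical options; default=None distinguishes "not supplied" from a real value.
cli_parser.add_argument("--epochs", type=int, default=None)
cli_parser.add_argument("--optimizer", default=None)

config = configparser.ConfigParser()
config.read_string("[settings]\nepochs = 50\noptimizer = nadam\n")

# Only --epochs is supplied, so only that value changes.
apply_cli_overrides(config, cli_parser.parse_args(["--epochs", "10"]))
```

Saving `config` after this step would then record the values the run actually used.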

Cyclic learning rates

We should see if using the cyclic learning rate finder (paper: here) along with an adaptive learning rate optimizer (e.g., adam) improves on our current optimizer (nadam).

Todo

Use the Cyclic LR Keras Callback to determine an optimal learning rate.
Try this learning rate with a few different optimizers (starting with Adam). Does it beat our current optimization method?

Resources

Successive loading of Saber model

Currently, you can only load one Saber model under a Saber object at a time. It would be great if multiple calls to saber.load() loaded multiple models. Then, when annotate is called, it would combine the annotations for multiple models into a single annotation.

todo

change:

self.datasets = [Dataset()]
self.datasets[0].type_to_idx = model_attributes['type_to_idx']
self.datasets[0].idx_to_tag = model_attributes['idx_to_tag']

to:

self.datasets.append(Dataset())
self.datasets[-1].type_to_idx = model_attributes['type_to_idx']
self.datasets[-1].idx_to_tag = model_attributes['idx_to_tag']

so that loading models keeps appending new ones. Then update annotate accordingly.

--filepath command line argument has no effect

The --filepath argument, when supplied at the command line, has no effect. This is because the config file is loaded before any command line arguments are considered.

Fix this!

Allow saving / loading of state of optimizer

Currently, 'loading' a model loads its architecture and its weights, which is useful for transfer learning. However, a mechanism for loading the entire model, including the state of the optimizer, would allow for the resuming of training.

saber should come with one 'default' dataset

It would be useful if saber came with a default dataset (pointed to by the default config file). This would make it easier to follow along with the notebook (which currently serves as documentation) or the readme directly after installing saber.

No idea how to reference files internally in the config.ini file though. Asked a question on stackoverflow to get some help: https://stackoverflow.com/questions/52391209/specify-relative-filepath-in-ini-file-with-pythons-configparser

Saving only the best model weights does not work as expected

When saving only the best model weights (i.e., when save_all_weights is false), multiple sets of weights are still saved. Name the output files the same so they overwrite each other.

Should not have to provide the model name when loading a pre-trained model

Currently, when loading a pre-trained model, you still need to provide the 'name' of that model (i.e. its architecture). This doesn't make sense.

todo

Save the model's name along with other model attributes in the model_attributes dictionary which is pickled in Saber.load()
Load the model name attribute in Saber.load() and pass it to model_utils.load_pretrained_model().

Improvements to how pre-trained word embeddings are handled.

A couple of improvements to how pre-trained embeddings are handled are required.

1. Out-source loading of vectors

Currently, the code to load pre-trained embeddings was written by me. That means it's likely fragile and slow. See if I can load embeddings using Gensim, which is likely to be faster and more reliable.

Use Gensim to load word embeddings

Note, this ~~might~~ actually solve the problem below.

~~2. Handle binary or plain text format~~

Currently, pre-trained embeddings in binary format (.bin) must be manually converted to a plain text format (.txt) to be used with Saber. This is an unnecessary additional step imposed on the user. Automatically detect if the embeddings are in binary or plain text format, and convert from binary to plain text if necessary. To fix:

~~- [ ] Determine if embeddings are binary or plain text~~
~~- [ ] Use Gensim to convert from binary to plain text if necessary~~

Attending to character embeddings

Currently, we take the character embedding and the word embedding and simply concatenate them together. This paper uses a 2-layer neural network with an attention mechanism to combine the embeddings, allowing the model to assign importance to either the word or char embedding.

This might be a cool thing to explore along with the other attention mechanisms.

Some command line arguments are useless

Due to the current implementation of the Config class, any argument whose default value is True in the config file and which has the action=store_true property is useless from the command line.

If not provided at the command line, it will be True. If provided, it will be True. Therefore there needs to be a better way of harmonizing these arguments with boolean values.
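
One common workaround, sketched here as an assumption rather than the project's chosen fix: let the flag take an explicit true/false value and default to `None`, so that the config file decides whenever the flag is omitted. The `--verbose` flag and the `str2bool` and `resolve` helpers are hypothetical.

```python
import argparse


def str2bool(value):
    """Parse common spellings of true/false given on the command line."""
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"Expected a boolean, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser()
    # Hypothetical flag. default=None means "not supplied, defer to the config file".
    parser.add_argument("--verbose", type=str2bool, default=None)
    return parser


def resolve(config_value, cli_value):
    """Prefer the command-line value when one was supplied."""
    return config_value if cli_value is None else cli_value


# A config default of True can now be switched off from the command line.
args = build_parser().parse_args(["--verbose", "false"])
verbose = resolve(True, args.verbose)
```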

Try weighting the importance of target classes

The annotated entities in a given corpus roughly follow a Zipfian distribution. This means that some entities are repeated many many times (e.g. Human, Mouse, p53, glucose), but most entities only appear a very small number of times.

Thus, during training the model is given many many examples of some entities and very few of others. It would therefore be useful to weight the cost of making a wrong prediction on these rare entities higher, in order for the model to "pay more attention" to them.

Keras provides a nice way to do this (see class_weight argument to the fit() function). The only challenge is coming up with the weighting scheme!

todo

Try some super simple strategy, like weighting words according to their inverse relative frequency.
Does that improve the model's recall (I think it will) and hurt the model's precision (I think it will)?
Does the F1 get a boost overall (I think it will)?
If this looks promising, switch the inverse relative frequency with TF-IDF or SoCal.
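
The first step above could be sketched as follows: count how often each label occurs and weight it by the inverse of its relative frequency. The label names are made up for illustration and `inverse_frequency_weights` is a hypothetical helper. The Keras `class_weight` argument expects class indices as keys, so the label keys would still need to be mapped to indices before use.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each label by total / count, i.e. the inverse of its relative frequency."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: total / count for label, count in counts.items()}


# Made-up label sequence: the frequent label gets a low weight, the rare one a high weight.
example_labels = ["O"] * 8 + ["B-PRGE"] * 2
weights = inverse_frequency_weights(example_labels)
```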

Input formats

This is of low priority, but eventually we would need to support multiple types of input format (BioC, Standoff, etc).

What this really means is that we need to convert from each of the formats to CoNLL (the most straightforward format to prepare for training). For now, try to find converters that we can link out to.

Try Elmo Embeddings

We should see if ELMo embeddings offer any appreciable improvement in performance.

Download the pre-trained embeddings here, see if we can load them into Saber.
Train our own set of embeddings on PubMed + PMC + Wiki corpus. Compare performance to the word embeddings we are currently using.

Do residual connections help?

Given that our network has two word-level LSTM layers, it is reasonable to expect that a residual connection between these layers might help. Try it!

Todo

Create a residual connection between the two word-level LSTM layers. Does it help?
What about between the character- and word-level LSTMs?

Resources

This blog post for how to create residual connections in keras.

Some methods should allow arguments

Some methods, such as SequenceProcessor.load_dataset() and SequenceProcessor.load_embeddings(), should allow the user to specify filepaths. Otherwise, using Saber as a Python library is pretty useless, as you have to set these files in the config.ini file every time.

Arguments passed to these methods should update the Config object

TODO

Add filepath argument to SequenceProcessor.load_dataset() and SequenceProcessor.load_embeddings() (really, any method that operates on a filepath)
These arguments should update the corresponding values in config instance.
Confirm that updating config attributes works as expected.

Keras displays the wrong epoch number

For some reason, Keras displays epoch 1/1 during training, for every epoch. Fix!

Publish this tool on PyPI

We need to properly package Saber and publish it on PyPI so that it can be installed with pip install saber. I don't know how to do this, so I need to do my homework first and then publish it. See here for some inspiration.

Resources

this blog

Update Readme

Cleanup the readme, add the new logo
Add a white version of the logo to the Slate docs
Add the logo to the documentation

Replace multi_task_lstm_crf load method with call to super

Replace

def load(self, weights_filepath, model_filepath):
    with open(model_filepath) as f:
        model = model_from_json(f.read(), custom_objects={'CRF': CRF})
        model.load_weights(weights_filepath)
        self.models.append(model)

with

def load(self, weights_filepath, model_filepath):
    super().load(weights_filepath, model_filepath, custom_objects={'CRF': CRF})

in multi_task_lstm_crf.py

debug argument should partially load a dataset

Currently, the debug argument loads only 10K embeddings, which is helpful for debugging. It would be useful if it also loaded only a proportion of the sentences!

Unit tests are breaking on Travis CI

The unit tests are breaking on Travis CI but all pass when run locally. Try to figure out what the issue is here.

Cross-validation breaks on the nth fold

The cross-validation training loop seems to break on the 2nd fold. Figure this out.

Need more robust saving/loading of models

Currently, the saving and loading of models is very fragile. Right now, saving a model requires pickling multiple python classes (Config and Dataset). This is an issue, because if these classes change between the saving and loading of models, an error is thrown.

Specifically, instead of pickling and un-pickling Dataset, I should simply pickle what I need (I think just Dataset.type_to_idx), and use this to update the attributes of a new Dataset class when we load. Similarly, for Config, I should only pickle the information I need.

Change SequenceProcessor.save() to only pickle Dataset.idx_to_tag instead of an entire Dataset object
Change SequenceProcessor.save() to only pickle Dataset.type_to_idx instead of an entire Dataset object
Change SequenceProcessor.load() to create a new Dataset instance, and update its Dataset.type_to_idx attribute with the pickled one created in SequenceProcessor.save().
Change SequenceProcessor.load() to create a new Dataset instance, and update its Dataset.idx_to_tag attribute with the pickled one created in SequenceProcessor.save().
Change SequenceProcessor.load() to update SequenceProcessor.config with word_embedding_dim and char_embedding_dim values from pre-trained model.
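
The idea behind these tasks can be sketched as follows: pickle only plain dictionaries, which stay loadable even if the Dataset or Config classes change later. The function names and the example values are hypothetical, and the structure of `type_to_idx` and `idx_to_tag` shown here is an assumption.

```python
import pickle


def save_attributes(filepath, type_to_idx, idx_to_tag):
    """Pickle only plain dictionaries instead of whole Dataset/Config objects."""
    attributes = {"type_to_idx": type_to_idx, "idx_to_tag": idx_to_tag}
    with open(filepath, "wb") as f:
        pickle.dump(attributes, f)


def load_attributes(filepath):
    """Load the dictionaries back; the caller copies them onto a fresh Dataset."""
    with open(filepath, "rb") as f:
        return pickle.load(f)
```

On load, the returned dictionaries would be assigned to the attributes of a new Dataset instance, so no pickled class definitions are involved.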

Implement Transfer Learning

I need to implement a mechanism for transfer learning. Preferably, this would be extremely straightforward, e.g.,

train a model,
save that model
load that model
begin training again

Preferably, there would be a way to load just a model's weights (and not the state of the optimizer) and a way to load both. The former is useful for transfer learning, and the latter would be useful for resuming training. Additionally, there should be some mechanism for freezing layers, so that they are not trained in the transfer. Pushing this all to its own issue.

Allow for the freezing of some layers during transfer (or in general really). Check this out, should be able to get a list of layer names from the user and freeze them.
Allow for the loading of only some layers when calling sp.load()
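
A sketch of the freezing step, assuming layers expose `name` and `trainable` attributes as Keras layers do. The stand-in objects and layer names below are made up, and `freeze_layers` is a hypothetical helper. With a real Keras model, the model generally needs to be compiled again after changing `trainable` for the change to take effect.

```python
from types import SimpleNamespace


def freeze_layers(layers, names_to_freeze):
    """Set trainable=False on every layer whose name is in names_to_freeze."""
    names_to_freeze = set(names_to_freeze)
    frozen = []
    for layer in layers:
        if layer.name in names_to_freeze:
            layer.trainable = False
            frozen.append(layer.name)
    missing = names_to_freeze - set(frozen)
    if missing:
        raise ValueError(f"Unknown layer names: {sorted(missing)}")
    return frozen


# Stand-ins for real layers; the names are hypothetical.
example_layers = [
    SimpleNamespace(name="char_embedding", trainable=True),
    SimpleNamespace(name="word_lstm", trainable=True),
    SimpleNamespace(name="crf", trainable=True),
]
frozen_names = freeze_layers(example_layers, ["char_embedding", "word_lstm"])
```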

Resolve coreferences

Saber should be able to resolve coreferences, i.e. recognize that in

"Mary has a dog. She loves him"

She = Mary and him = a dog.

There is a library built on top of spaCy for just this purpose, and it should be rather painless to add to Saber. The tricky part is how to actually resolve the entities.
