johngiorgi / seq2rel

The corresponding code for our paper: A sequence-to-sequence approach for document-level relation extraction.

Home Page: https://share.streamlit.io/johngiorgi/seq2rel/main/demo.py

License: Apache License 2.0

Python 28.20% Jsonnet 15.83% Jupyter Notebook 55.97%
named-entity-recognition relation-extraction information-extraction seq2seq coreference-resolution entity-extraction seq2rel pytorch allen

seq2rel's Introduction

seq2rel: A sequence-to-sequence approach for document-level relation extraction


The corresponding code for our paper: A sequence-to-sequence approach for document-level relation extraction. Check out our demo here!


Notebooks

The easiest way to get started is to follow along with one of our notebooks:

  • Training your own model Open In Colab
  • Reproducing results Open In Colab

Or to open the demo:

Open in Streamlit

Note: Unfortunately, the demo is liable to crash as the free resources provided by Streamlit are insufficient to run the model. To run the demo locally, please follow the instructions below.

Installation

This repository requires Python 3.8 or later.

Setting up a virtual environment

Before installing, you should create and activate a Python virtual environment. If you need pointers on setting up a virtual environment, please see the AllenNLP install instructions.

Installing the library and dependencies

If you do not plan on modifying the source code, install from git using pip

pip install git+https://github.com/JohnGiorgi/seq2rel.git

Otherwise, clone the repository and install from source using Poetry:

# Install poetry for your system: https://python-poetry.org/docs/#installation
# E.g. for Linux, macOS, Windows (WSL)
curl -sSL https://install.python-poetry.org | python3 -

# Clone and move into the repo
git clone https://github.com/JohnGiorgi/seq2rel
cd seq2rel

# Install the package with poetry
poetry install

Usage

Preparing a dataset

Datasets are tab-separated files where each example is contained on its own line. The first column contains the text, and the second column contains the relations. Relations themselves must be serialized to strings.

Take the following example, which expresses a gene-disease association ("@GDA@") between ESR1 ("@GENE@") and schizophrenia ("@DISEASE@"):

Variants in the estrogen receptor alpha (ESR1) gene and its mRNA contribute to risk for schizophrenia. estrogen receptor alpha ; ESR1 @GENE@ schizophrenia @DISEASE@ @GDA@

For convenience, we provide a second package, seq2rel-ds, which makes it easy to generate data in this format for various popular corpora. See our paper for more details on serializing relations.
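
As a concrete illustration, here is a minimal sketch (the helper and file path below are illustrative, not part of seq2rel) of writing a single example in this tab-separated format:

# A minimal sketch: write one training example in the format described above.
# The relation strings in the second column are already serialized.
from typing import List


def to_tsv_line(text: str, relations: List[str]) -> str:
    return f"{text}\t{' '.join(relations)}"


with open("train.tsv", "w", encoding="utf-8") as f:
    line = to_tsv_line(
        "Variants in the estrogen receptor alpha (ESR1) gene and its mRNA contribute to risk for schizophrenia.",
        ["estrogen receptor alpha ; ESR1 @GENE@ schizophrenia @DISEASE@ @GDA@"],
    )
    f.write(line + "\n")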

Training

To train the model, use the allennlp train command with one of our configs (or write your own!)

For example, to train a model on the BioCreative V CDR task corpus, first, preprocess this data with seq2rel-ds

seq2rel-ds cdr main "path/to/preprocessed/cdr"

Then, call allennlp train with the CDR config we have provided

train_data_path="path/to/preprocessed/cdr/train.tsv" \
valid_data_path="path/to/preprocessed/cdr/valid.tsv" \
dataset_size=500 \
allennlp train "training_config/cdr.jsonnet" \
    --serialization-dir "output" \
    --include-package "seq2rel" 

The best model checkpoint (measured by micro-F1 score on the validation set), vocabulary, configuration, and log files will be saved to --serialization-dir. This can be changed to any directory you like. Please see the training notebook for more details.
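
Once training finishes, a quick sanity check (a minimal sketch, assuming the default --serialization-dir "output" above) is to load the packaged archive with Seq2Rel:

# AllenNLP packages the best checkpoint as output/model.tar.gz at the end of training.
from seq2rel import Seq2Rel

model = Seq2Rel("output/model.tar.gz")
print(model("Variations in the monoamine oxidase B (MAOB) gene are associated with Parkinson's disease (PD)."))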

Inference

To use the model to extract relations, import Seq2Rel and pass it some text

from seq2rel import Seq2Rel
from seq2rel.common import util

# Pretrained models are stored on GitHub and will be downloaded and cached automatically.
# See: https://github.com/JohnGiorgi/seq2rel/releases/tag/pretrained-models.
pretrained_model = "gda"

# Models are loaded via a simple interface
seq2rel = Seq2Rel(pretrained_model)

# Flexible inputs. You can provide...
# - a string
# - a list of strings
# - a text file (local path or URL)
input_text = "Variations in the monoamine oxidase B (MAOB) gene are associated with Parkinson's disease (PD)."

# Pass any of these to the model to generate the raw output
output = seq2rel(input_text)
output == ["monoamine oxidase b ; maob @GENE@ parkinson's disease ; pd @DISEASE@ @GDA@"]

# To get a more structured (and useful!) output, use the `extract_relations` function
extract_relations = util.extract_relations(output)
extract_relations == [
  {
    "GDA": [
      ((("monoamine oxidase b", "maob"), "GENE"),
      (("parkinson's disease", "pd"), "DISEASE"))
    ]
  }
]
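
To iterate over this structured output, something like the following small sketch works; it relies only on the structure shown above, not on any additional seq2rel API:

for doc in extract_relations:
    for rel_label, relations in doc.items():
        for relation in relations:
            # Each entity is a (mentions, entity_type) pair, where mentions is a tuple of strings
            entities = [f"{' ; '.join(mentions)} ({label})" for mentions, label in relation]
            print(f"{rel_label}: {' -> '.join(entities)}")
# GDA: monoamine oxidase b ; maob (GENE) -> parkinson's disease ; pd (DISEASE)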

See the list of available PRETRAINED_MODELS in seq2rel/seq2rel.py

python -c "from seq2rel import PRETRAINED_MODELS ; print(list(PRETRAINED_MODELS.keys()))"

Running the demo locally

To run the demo locally, you will additionally need to install streamlit and pyvis (see here), then

streamlit run demo.py

Reproducing results

To reproduce the main results of the paper, use the allennlp evaluate command with one of our pretrained models

For example, to reproduce our results on the BioCreative V CDR task corpus, first, preprocess this data with seq2rel-ds

seq2rel-ds cdr main "path/to/preprocessed/cdr"

Then, call allennlp evaluate with the pretrained CDR model

allennlp evaluate "https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/cdr.tar.gz" \
    "path/to/preprocessed/cdr/test.tsv" \
    --output-file "output/test_metrics.jsonl" \
    --cuda-device 0 \
    --predictions-output-file "output/test_predictions.jsonl" \
    --include-package "seq2rel"

The results and predictions will be saved to --output-file and --predictions-output-file. Please see the reproducing-results notebook for more details.
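
To take a quick look at what was written (a minimal sketch; it assumes the metrics file holds a single JSON object and the predictions file holds one JSON object per line, matching the flags above):

import json

# Overall metrics written by --output-file
with open("output/test_metrics.jsonl") as f:
    print(json.load(f))

# Per-example predictions written by --predictions-output-file
with open("output/test_predictions.jsonl") as f:
    for line in list(f)[:3]:  # peek at the first few
        print(json.loads(line))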

Citing

If you use seq2rel in your work, please consider citing our paper:

@inproceedings{giorgi-etal-2022-sequence,
	title        = {A sequence-to-sequence approach for document-level relation extraction},
	author       = {Giorgi, John and Bader, Gary and Wang, Bo},
	year         = 2022,
	month        = may,
	booktitle    = {Proceedings of the 21st Workshop on Biomedical Language Processing},
	publisher    = {Association for Computational Linguistics},
	address      = {Dublin, Ireland},
	pages        = {10--25},
	doi          = {10.18653/v1/2022.bionlp-1.2},
	url          = {https://aclanthology.org/2022.bionlp-1.2}
}

seq2rel's People

Contributors

dependabot[bot], johngiorgi


seq2rel's Issues

Demo Link not working

Hey,

I am trying to open the demo, but it's not working. Can you please help? I need to go through it for my college project.

Thanks
Reewa

Re-train any pretrained models with AllenNLP 2.0

The current pre-trained models were trained with AllenNLP <2.0. They don't load properly with the new AllenNLP 2.0 codebase, so some tests are breaking. I am in the process of retraining them and will upload the new models when they are ready.

Does target string tokenization match predicted string?

It looks like there may be a mismatch between the target strings in our training dataset and the predicted strings produced by the model. For example:

from seq2rel import Seq2Rel

pretrained_model = "ade"
seq2rel = Seq2Rel(pretrained_model)
input_text = "Acute myocardial infarction due to coronary spasm associated with L-thyroxine therapy."

seq2rel(input_text)
>>> ['<ADE> l - thyroxine <DRUG> acute myocardial infarction <EFFECT> </ADE> <ADE> l - thyroxine <DRUG> coronary spasm <EFFECT> </ADE>']

But the target string in the dataset is represented like:

"<EFFECT> </ADE> <ADE> L-thyroxine <DRUG> coronary spasm <EFFECT> </ADE>"

I am almost sure this is solved by make_output_human_readable, but we should confirm; otherwise, the F1-score will be overly pessimistic.

Training the model

Hey,

I am trying to train the model on my dataset. Everything goes well, but at the end I get this error every time. I am not able to figure out what's wrong. Any help would be great.
[Screenshot: Screenshot from 2023-03-24 15-00-41(1)]

python environment

Hello, could you export and share your Python environment? I'm having trouble installing the package. I'm looking forward to your reply.

Training without entity type token

I am using seq2rel on a dataset that has only a single entity type, so I'm wondering if I can remove the @entity_type@ token to make the output look like:
entity1 ; entity2 ; entity3 @predicate@

The problem I've found is that after training, the output sometimes contains unknown tokens.

For example, here is what I got:

amphotericin ; flucytosine ; ketoconazole ; fluconazole @ @ unknown @ @ @ @ unknown @ @ @ @ unknown @ @ @ @ unknown @ @

Could you please let me know if there's a way of solving this?

Loading the weights

How can I continue training by loading weights from the checkpoints I saved earlier?

Why are the decodings lowercased?

For some reason, the copied tokens in the decoder's output are all lower-cased, even if they appeared as uppercase in the source sequence. I have a feeling this is because of the source_token_ids. See here.

I don't even really know if this is a problem. Lowercasing everything might even simplify things. I wanted to document this behaviour somewhere though.

Add option to ignore order of entities in a relation

Currently, FBetaMeasureSeq2Rel considers an extracted relation correct only if the order of entities matches the gold standard. However, in some cases, the relation may not have an inherent order, and so we don't want to consider this when producing a score.

I propose we add some option to FBetaMeasureSeq2Rel, such as ordered_ents, which defaults to True, but can be set to False.
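
A rough sketch (not the FBetaMeasureSeq2Rel implementation; the function and argument names are only illustrative) of how an order-insensitive comparison might look:

def relations_match(predicted, gold, ordered_ents: bool = True) -> bool:
    # Each relation is a tuple of ((mentions, ...), entity_type) pairs.
    if ordered_ents:
        return predicted == gold
    # Ignore entity order by sorting before comparing.
    return sorted(predicted) == sorted(gold)

pred = ((("pd",), "DISEASE"), (("maob",), "GENE"))
gold = ((("maob",), "GENE"), (("pd",), "DISEASE"))
assert relations_match(pred, gold, ordered_ents=False)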

To use or not to use the cross-attention output projection?

The MultiheadAttention implementation in PyTorch, which we use for our cross-attention mechanism, applies a linear transformation (an output projection) to the concatenation of all the attention heads.

Because of how AllenNLP implements attention mechanisms, we were not using this projection matrix, and were instead directly taking a weighted sum of the attention weights and the encoder's output.

There is nothing particularly wrong with this, but we should investigate whether the output projection improves performance. I have created a branch with the necessary fix if we decide to merge: https://github.com/JohnGiorgi/seq2rel/tree/use-multihead-attn-proj
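
A toy illustration of the two alternatives (shapes and names here are assumptions, not the seq2rel code), using torch.nn.MultiheadAttention:

import torch
import torch.nn as nn

batch, src_len, tgt_len, embed_dim = 2, 7, 5, 16
mha = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

decoder_states = torch.randn(batch, tgt_len, embed_dim)  # queries
encoder_output = torch.randn(batch, src_len, embed_dim)  # keys and values

# Option 1: the module's output, which includes the learned output projection
# applied to the concatenated heads.
attn_output, attn_weights = mha(decoder_states, encoder_output, encoder_output)

# Option 2: what is described above -- a weighted sum of the (head-averaged)
# attention weights and the encoder output, with no output projection.
weighted_sum = torch.bmm(attn_weights, encoder_output)

print(attn_output.shape, weighted_sum.shape)  # both (batch, tgt_len, embed_dim)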

Error loading weights

When I ran this code in the reproducing results colab notebook:

!allennlp evaluate "$pretrained_model_url" "$preprocessed_data_dir/test.tsv" \
    --output-file "$output_dir/test_metrics.jsonl" \
    --cuda-device 0 \
    --predictions-output-file "$output_dir/test_predictions.jsonl" \
    --include-package "seq2rel"

I got this error:

2024-04-12 10:27:06,782 - INFO - allennlp.common.plugins - Plugin allennlp_models available
2024-04-12 10:27:08,988 - INFO - cached_path - cache of https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/cdr_hints.tar.gz is up-to-date
2024-04-12 10:27:08,989 - INFO - allennlp.models.archival - loading archive file https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/cdr_hints.tar.gz from cache at /root/.allennlp/cache/5d845bebc5887213bab7c90a311e51d6dff9a03fb60648a6498d58be8397166c.82548b1687f75978154d471c6ead95e2dd4d865a01baaba9fa7873d62232ffbe
2024-04-12 10:27:08,990 - INFO - allennlp.models.archival - extracting archive file /root/.allennlp/cache/5d845bebc5887213bab7c90a311e51d6dff9a03fb60648a6498d58be8397166c.82548b1687f75978154d471c6ead95e2dd4d865a01baaba9fa7873d62232ffbe to temp dir /tmp/tmp0mo17roo
2024-04-12 10:27:15,245 - INFO - allennlp.models.archival - removing temporary unarchived model dir at /tmp/tmp0mo17roo
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/transformers/configuration_utils.py", line 616, in _get_config_dict
    resolved_config_file = cached_path(
  File "/usr/local/lib/python3.8/site-packages/transformers/utils/hub.py", line 284, in cached_path
    output_path = get_from_cache(
  File "/usr/local/lib/python3.8/site-packages/transformers/utils/hub.py", line 508, in get_from_cache
    raise OSError(
OSError: Distant resource does not have an ETag, we won't be able to reliably ensure reproducibility.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.8/site-packages/allennlp/__main__.py", line 39, in run
    main(prog="allennlp")
  File "/usr/local/lib/python3.8/site-packages/allennlp/commands/__init__.py", line 120, in main
    args.func(args)
  File "/usr/local/lib/python3.8/site-packages/allennlp/commands/evaluate.py", line 135, in evaluate_from_args
    return evaluate_from_archive(
  File "/usr/local/lib/python3.8/site-packages/allennlp/commands/evaluate.py", line 242, in evaluate_from_archive
    archive = load_archive(
  File "/usr/local/lib/python3.8/site-packages/allennlp/models/archival.py", line 232, in load_archive
    dataset_reader, validation_dataset_reader = _load_dataset_readers(
  File "/usr/local/lib/python3.8/site-packages/allennlp/models/archival.py", line 268, in _load_dataset_readers
    dataset_reader = DatasetReader.from_params(
  File "/usr/local/lib/python3.8/site-packages/allennlp/common/from_params.py", line 604, in from_params
    return retyped_subclass.from_params(
  File "/usr/local/lib/python3.8/site-packages/allennlp/common/from_params.py", line 636, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/usr/local/lib/python3.8/site-packages/allennlp/common/from_params.py", line 206, in create_kwargs
    constructed_arg = pop_and_construct_arg(
  File "/usr/local/lib/python3.8/site-packages/allennlp/common/from_params.py", line 314, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/usr/local/lib/python3.8/site-packages/allennlp/common/from_params.py", line 348, in construct_arg
    result = annotation.from_params(params=popped_params, **subextras)
  File "/usr/local/lib/python3.8/site-packages/allennlp/common/from_params.py", line 604, in from_params
    return retyped_subclass.from_params(
  File "/usr/local/lib/python3.8/site-packages/allennlp/common/from_params.py", line 638, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.8/site-packages/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 72, in __init__
    self.tokenizer = cached_transformers.get_tokenizer(
  File "/usr/local/lib/python3.8/site-packages/allennlp/common/cached_transformers.py", line 204, in get_tokenizer
    tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 547, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 725, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/transformers/configuration_utils.py", line 561, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/transformers/configuration_utils.py", line 656, in _get_config_dict
    raise EnvironmentError(
OSError: Can't load config for 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext' is the correct path to a directory containing a config.json file

After upgrading to the latest version of transformers (4.39.3), that error was resolved, but I got a new one:

2024-04-12 10:30:32,153 - INFO - allennlp.common.plugins - Plugin allennlp_models available
2024-04-12 10:30:34,344 - INFO - cached_path - cache of https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/cdr_hints.tar.gz is up-to-date
2024-04-12 10:30:34,345 - INFO - allennlp.models.archival - loading archive file https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/cdr_hints.tar.gz from cache at /root/.allennlp/cache/5d845bebc5887213bab7c90a311e51d6dff9a03fb60648a6498d58be8397166c.82548b1687f75978154d471c6ead95e2dd4d865a01baaba9fa7873d62232ffbe
2024-04-12 10:30:34,345 - INFO - allennlp.models.archival - extracting archive file /root/.allennlp/cache/5d845bebc5887213bab7c90a311e51d6dff9a03fb60648a6498d58be8397166c.82548b1687f75978154d471c6ead95e2dd4d865a01baaba9fa7873d62232ffbe to temp dir /tmp/tmpq6ecqnkx
2024-04-12 10:30:41,043 - INFO - allennlp.data.vocabulary - Loading token dictionary from /tmp/tmpq6ecqnkx/vocabulary.
2024-04-12 10:30:43,684 - INFO - allennlp.modules.token_embedders.embedding - Loading a model trained before embedding extension was implemented; pass an explicit vocab namespace if you want to extend the vocabulary.
2024-04-12 10:30:44,093 - INFO - allennlp.models.archival - removing temporary unarchived model dir at /tmp/tmpq6ecqnkx
Traceback (most recent call last):
  File "/usr/local/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.8/site-packages/allennlp/__main__.py", line 39, in run
    main(prog="allennlp")
  File "/usr/local/lib/python3.8/site-packages/allennlp/commands/__init__.py", line 120, in main
    args.func(args)
  File "/usr/local/lib/python3.8/site-packages/allennlp/commands/evaluate.py", line 135, in evaluate_from_args
    return evaluate_from_archive(
  File "/usr/local/lib/python3.8/site-packages/allennlp/commands/evaluate.py", line 242, in evaluate_from_archive
    archive = load_archive(
  File "/usr/local/lib/python3.8/site-packages/allennlp/models/archival.py", line 235, in load_archive
    model = _load_model(config.duplicate(), weights_path, serialization_dir, cuda_device)
  File "/usr/local/lib/python3.8/site-packages/allennlp/models/archival.py", line 279, in _load_model
    return Model.load(
  File "/usr/local/lib/python3.8/site-packages/allennlp/models/model.py", line 438, in load
    return model_class._load(config, serialization_dir, weights_file, cuda_device)
  File "/usr/local/lib/python3.8/site-packages/allennlp/models/model.py", line 380, in _load
    raise RuntimeError(
RuntimeError: Error loading state dict for CopyNetSeq2Rel
	Missing keys: []
	Unexpected keys: ['_source_embedder.token_embedder_tokens.transformer_model.embeddings.position_ids']

Could you please help fix this problem? Thanks for your great work!

Using Seq2rel function with cuda

Here's how I pass the device argument to Seq2Rel:

from seq2rel import Seq2Rel
from seq2rel.common import util
model = 'model.tar.gz'
kwargs = {'cuda_device': 1}
seq2rel = Seq2Rel(model, **kwargs)

and got this error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-15-d951d9bfd1c5> in <cell line: 2>()
      1 kwargs = {'cuda_device': 1}
----> 2 seq2rel = Seq2Rel(model, **kwargs)

12 frames
/content/drive/MyDrive/seq2rel/seq2rel/seq2rel.py in __init__(self, pretrained_model_name_or_path, **kwargs)
     86         if "overrides" in kwargs:
     87             overrides.update(kwargs.pop("overrides"))
---> 88         archive = load_archive(pretrained_model_name_or_path, overrides=overrides, **kwargs)
     89         self._predictor = Predictor.from_archive(archive, predictor_name="seq2seq")
     90 

/usr/local/lib/python3.9/dist-packages/allennlp/models/archival.py in load_archive(archive_file, cuda_device, overrides, weights_file)
    233             config.duplicate(), serialization_dir
    234         )
--> 235         model = _load_model(config.duplicate(), weights_path, serialization_dir, cuda_device)
    236 
    237         # Load meta.

/usr/local/lib/python3.9/dist-packages/allennlp/models/archival.py in _load_model(config, weights_path, serialization_dir, cuda_device)
    277 
    278 def _load_model(config, weights_path, serialization_dir, cuda_device):
--> 279     return Model.load(
    280         config,
    281         weights_file=weights_path,

/usr/local/lib/python3.9/dist-packages/allennlp/models/model.py in load(cls, config, serialization_dir, weights_file, cuda_device)
    436             # get_model_class method, that recurses whenever it finds a from_archive model type.
    437             model_class = Model
--> 438         return model_class._load(config, serialization_dir, weights_file, cuda_device)
    439 
    440     def extend_embedder_vocab(self, embedding_sources_mapping: Dict[str, str] = None) -> None:

/usr/local/lib/python3.9/dist-packages/allennlp/models/model.py in _load(cls, config, serialization_dir, weights_file, cuda_device)
    341         # in sync with the weights
    342         if cuda_device >= 0:
--> 343             model.cuda(cuda_device)
    344         else:
    345             model.cpu()

/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py in cuda(self, device)
    686             Module: self
    687         """
--> 688         return self._apply(lambda t: t.cuda(device))
    689 
    690     def xpu(self: T, device: Optional[Union[int, device]] = None) -> T:

/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
    576     def _apply(self, fn):
    577         for module in self.children():
--> 578             module._apply(fn)
    579 
    580         def compute_should_use_set_data(tensor, tensor_applied):

/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
    576     def _apply(self, fn):
    577         for module in self.children():
--> 578             module._apply(fn)
    579 
    580         def compute_should_use_set_data(tensor, tensor_applied):

/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
    576     def _apply(self, fn):
    577         for module in self.children():
--> 578             module._apply(fn)
    579 
    580         def compute_should_use_set_data(tensor, tensor_applied):

/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
    576     def _apply(self, fn):
    577         for module in self.children():
--> 578             module._apply(fn)
    579 
    580         def compute_should_use_set_data(tensor, tensor_applied):

/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
    576     def _apply(self, fn):
    577         for module in self.children():
--> 578             module._apply(fn)
    579 
    580         def compute_should_use_set_data(tensor, tensor_applied):

/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
    599             # `with torch.no_grad():`
    600             with torch.no_grad():
--> 601                 param_applied = fn(param)
    602             should_use_set_data = compute_should_use_set_data(param, param_applied)
    603             if should_use_set_data:

/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py in <lambda>(t)
    686             Module: self
    687         """
--> 688         return self._apply(lambda t: t.cuda(device))
    689 
    690     def xpu(self: T, device: Optional[Union[int, device]] = None) -> T:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

What is the correct way of using Seq2Rel with CUDA? Thank you!
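
(A hedged note, not an official fix: "invalid device ordinal" means the requested GPU index does not exist on the machine, and the traceback shows that cuda_device is forwarded to AllenNLP's load_archive. On a machine with a single GPU, only index 0 exists, so a call like the following sketch should work.)

from seq2rel import Seq2Rel

# cuda_device=0 selects the first (and often the only) GPU
seq2rel = Seq2Rel("model.tar.gz", cuda_device=0)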

Alternative way for poetry

I have trouble installing Poetry on our school's clusters. Is there an alternative way to run the code (e.g., by directly running a .py file)?

Formalize schema of deserialized output with Pydantic

Right now, we have a deserialized output format that looks like:

[
    {
        "ADE": [
            (("fenoprofen", "DRUG"), ("pure red cell aplasia", "EFFECT"))
        ]
    }
]

It would be nice to capture this schema with a Pydantic model. I think this would simplify some of the code that has to interact with this object, and it would also make it easier to save/load it from disk.
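
One possible shape for such a schema (an illustrative sketch, not an agreed-upon design; the class and field names are assumptions):

from typing import Dict, List, Tuple

from pydantic import BaseModel


class Entity(BaseModel):
    mentions: Tuple[str, ...]  # coreferent mentions, e.g. ("fenoprofen",)
    label: str                 # entity type, e.g. "DRUG"


class Relation(BaseModel):
    entities: List[Entity]


class DocumentAnnotation(BaseModel):
    relations: Dict[str, List[Relation]]  # keyed by relation label, e.g. "ADE"


doc = DocumentAnnotation(
    relations={
        "ADE": [
            Relation(
                entities=[
                    Entity(mentions=("fenoprofen",), label="DRUG"),
                    Entity(mentions=("pure red cell aplasia",), label="EFFECT"),
                ]
            )
        ]
    }
)
print(doc.json())  # straightforward to save to and load from disk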
