seq2rel: A sequence-to-sequence approach for document-level relation extraction


This repository contains the code for our paper: A sequence-to-sequence approach for document-level relation extraction. Check out our demo here!

Notebooks

The easiest way to get started is to follow along with one of our notebooks:

  • Training your own model (open in Colab)
  • Reproducing results (open in Colab)

Installation

This repository requires Python 3.8 or later.

Setting up a virtual environment

Before installing, you should create and activate a Python virtual environment. If you need pointers on setting up a virtual environment, please see the AllenNLP install instructions.

Installing the library and dependencies

If you do not plan on modifying the source code, install from git using pip:

pip install git+https://github.com/JohnGiorgi/seq2rel.git

Otherwise, clone the repository and install from source using Poetry:

# Install poetry for your system: https://python-poetry.org/docs/#installation
# E.g. for Linux, macOS, Windows (WSL)
curl -sSL https://install.python-poetry.org | python3 -

# Clone and move into the repo
git clone https://github.com/JohnGiorgi/seq2rel
cd seq2rel

# Install the package with poetry
poetry install

Usage

Preparing a dataset

Datasets are tab-separated files, where each example is contained on its own line. The first column contains the text, and the second column contains the relations. Relations themselves must be serialized to strings.

Take the following example, which expresses a gene-disease association ("@GDA@") between ESR1 ("@GENE@") and schizophrenia ("@DISEASE@"):

Variants in the estrogen receptor alpha (ESR1) gene and its mRNA contribute to risk for schizophrenia.	estrogen receptor alpha ; ESR1 @GENE@ schizophrenia @DISEASE@ @GDA@

For convenience, we provide a second package, seq2rel-ds, which makes it easy to generate data in this format for various popular corpora. See our paper for more details on serializing relations.
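As a rough illustration of the format described above, the following sketch builds one dataset line by hand. The helpers `serialize_relation` and `to_tsv_line` are hypothetical names introduced here; they are not part of seq2rel or seq2rel-ds. Joining multiple relations with a single space is an assumption; see the paper for the authoritative serialization scheme.

```python
def serialize_relation(entities, relation_label):
    """Serialize one relation to a string (hypothetical helper, not seq2rel's API).

    `entities` is a list of (mentions, entity_type) pairs, where `mentions`
    is a list of coreferent mention strings for that entity.
    """
    parts = []
    for mentions, entity_type in entities:
        # Coreferent mentions are separated by " ; ", followed by the entity type token.
        parts.append(f"{' ; '.join(mentions)} @{entity_type}@")
    # The relation type token closes the relation.
    parts.append(f"@{relation_label}@")
    return " ".join(parts)


def to_tsv_line(text, serialized_relations):
    """Join the text and its serialized relations with a tab (assumed layout)."""
    return f"{text}\t{' '.join(serialized_relations)}"


text = (
    "Variants in the estrogen receptor alpha (ESR1) gene and its mRNA "
    "contribute to risk for schizophrenia."
)
relation = serialize_relation(
    [(["estrogen receptor alpha", "ESR1"], "GENE"), (["schizophrenia"], "DISEASE")],
    "GDA",
)
line = to_tsv_line(text, [relation])
print(line)
```

For real corpora, seq2rel-ds is likely the better choice, since it handles the corpus-specific preprocessing that this sketch ignores.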

Training

To train the model, use the allennlp train command with one of our configs (or write your own!).

For example, to train a model on the BioCreative V CDR task corpus, first preprocess the data with seq2rel-ds:

seq2rel-ds cdr main "path/to/preprocessed/cdr"

Then, call allennlp train with the CDR config we have provided:

train_data_path="path/to/preprocessed/cdr/train.tsv" \
valid_data_path="path/to/preprocessed/cdr/valid.tsv" \
dataset_size=500 \
allennlp train "training_config/cdr.jsonnet" \
    --serialization-dir "output" \
    --include-package "seq2rel" 

The best model checkpoint (measured by micro-F1 score on the validation set), vocabulary, configuration, and log files will be saved to --serialization-dir. This can be changed to any directory you like. Please see the training notebook for more details.
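For readers unfamiliar with the checkpoint-selection metric, here is a minimal sketch of micro-averaged F1 over sets of relations. It pools true positives, false positives, and false negatives across all documents before computing precision and recall. The function `micro_f1` and the string representation of relations are assumptions made for illustration; this is not necessarily how seq2rel computes the metric internally.

```python
def micro_f1(predicted, gold):
    """Micro-averaged F1 over per-document sets of relations (illustrative only)."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        tp += len(pred & ref)  # relations predicted and present in the gold set
        fp += len(pred - ref)  # relations predicted but absent from the gold set
        fn += len(ref - pred)  # gold relations the model missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two toy documents; each relation is represented by an opaque string.
predicted = [{"r1", "r2"}, {"r3"}]
gold = [{"r1"}, {"r3", "r4"}]
score = micro_f1(predicted, gold)
print(round(score, 4))
```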

Inference

To use the model to extract relations, import Seq2Rel and pass it some text:

from seq2rel import Seq2Rel
from seq2rel.common import util

# Pretrained models are stored on GitHub and will be downloaded and cached automatically.
# See: https://github.com/JohnGiorgi/seq2rel/releases/tag/pretrained-models.
pretrained_model = "gda"

# Models are loaded via a simple interface
seq2rel = Seq2Rel(pretrained_model)

# Flexible inputs. You can provide...
# - a string
# - a list of strings
# - a text file (local path or URL)
input_text = "Variations in the monoamine oxidase B (MAOB) gene are associated with Parkinson's disease (PD)."

# Pass any of these to the model to generate the raw output
output = seq2rel(input_text)
output == ["monoamine oxidase b ; maob @GENE@ parkinson's disease ; pd @DISEASE@ @GDA@"]

# To get a more structured (and useful!) output, use the `extract_relations` function
extract_relations = util.extract_relations(output)
extract_relations == [
  {
    "GDA": [
      ((("monoamine oxidase b", "maob"), "GENE"),
      (("parkinson's disease", "pd"), "DISEASE"))
    ]
  }
]
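To make the mapping from raw output to structured output more concrete, the sketch below parses a single linearized string by hand. The function `parse_linearized` is a hypothetical name and a simplified stand-in written for this example; it is not the implementation behind `util.extract_relations`, which should be preferred in practice. It assumes that every `@...@` token preceded by text marks an entity type, and every `@...@` token that directly follows another special token closes a relation.

```python
import re

SPECIAL_TOKEN = re.compile(r"@([^@\s]+)@")


def parse_linearized(raw):
    """Parse one linearized relation string into a dict (illustrative only)."""
    relations = {}
    entities = []  # entities collected for the relation currently being read
    buffer = []    # plain tokens seen since the last special token
    for token in raw.split():
        match = SPECIAL_TOKEN.fullmatch(token)
        if match is None:
            buffer.append(token)
            continue
        label = match.group(1)
        if buffer:
            # Text followed by a special token: an entity and its type.
            mentions = tuple(m.strip() for m in " ".join(buffer).split(" ; "))
            entities.append((mentions, label))
            buffer = []
        elif entities:
            # A special token with no preceding text: the relation type.
            relations.setdefault(label, []).append(tuple(entities))
            entities = []
    return relations


raw = "monoamine oxidase b ; maob @GENE@ parkinson's disease ; pd @DISEASE@ @GDA@"
parsed = parse_linearized(raw)
print(parsed)
```

Malformed strings (for example, a missing relation token) are silently dropped here; a production parser would need to handle such cases deliberately.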

The available pretrained models are listed in PRETRAINED_MODELS in seq2rel/seq2rel.py. To print them:

python -c "from seq2rel import PRETRAINED_MODELS ; print(list(PRETRAINED_MODELS.keys()))"

Reproducing results

To reproduce the main results of the paper, use the allennlp evaluate command with one of our pretrained models.

For example, to reproduce our results on the BioCreative V CDR task corpus, first preprocess the data with seq2rel-ds:

seq2rel-ds cdr main "path/to/preprocessed/cdr"

Then, call allennlp evaluate with the pretrained CDR model:

allennlp evaluate "https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/cdr.tar.gz" \
    "path/to/preprocessed/cdr/test.tsv" \
    --output-file "output/test_metrics.jsonl" \
    --cuda-device 0 \
    --predictions-output-file "output/test_predictions.jsonl" \
    --include-package "seq2rel"

The results and predictions will be saved to --output-file and --predictions-output-file. Please see the reproducing-results notebook for more details.

Citing

If you use seq2rel in your work, please consider citing our paper:

@inproceedings{giorgi-etal-2022-sequence,
	title        = {A sequence-to-sequence approach for document-level relation extraction},
	author       = {Giorgi, John and Bader, Gary and Wang, Bo},
	year         = 2022,
	month        = may,
	booktitle    = {Proceedings of the 21st Workshop on Biomedical Language Processing},
	publisher    = {Association for Computational Linguistics},
	address      = {Dublin, Ireland},
	pages        = {10--25},
	doi          = {10.18653/v1/2022.bionlp-1.2},
	url          = {https://aclanthology.org/2022.bionlp-1.2}
}
