DyGIE++

Implements the model described in the paper Entity, Relation, and Event Extraction with Contextualized Span Representations.

This repository is under construction and we're in the process of adding support for more datasets.

Dependencies

This code was developed using Python 3.7. To create a new Conda environment using Python 3.7, do conda create --name dygiepp python=3.7.

The necessary dependencies can be installed with pip install -r requirements.txt.

The only dependencies for the modeling code are AllenNLP 0.9.0 and PyTorch 1.2.0. It may run with newer versions, but this is not guaranteed. For PyTorch GPU support, follow the instructions on the PyTorch website.
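
One way to confirm that an environment matches these pins is to compare the installed versions against them. This is a minimal sketch, not part of this repository: `PINNED`, `installed_versions`, and `find_mismatches` are hypothetical names, and the pip distribution names `allennlp` and `torch` are assumptions.

```python
# Hypothetical helper names; nothing here ships with the dygiepp repository.
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:
    from importlib_metadata import PackageNotFoundError, version  # backport for 3.7

# Versions named in this README; the pip distribution names are assumed.
PINNED = {"allennlp": "0.9.0", "torch": "1.2.0"}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(installed, pinned=PINNED):
    """Return {name: (installed, pinned)} for every version that differs."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in pinned.items()
        if installed.get(name) != wanted
    }
```

Calling `find_mismatches(installed_versions(PINNED))` returns an empty dict when both pins are met. The comparison is an exact string match, so an installed version string that carries a build suffix would be reported as a mismatch.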

For data preprocessing, a few additional data and string processing libraries are required, including Pandas and Beautiful Soup 4.

Finally, you'll need SciBERT for the scientific datasets. Run python scripts/pretrained/get_scibert.py to download and extract the SciBERT model to ./pretrained.

Training a model

SciERC

To train a model for named entity recognition, relation extraction, and coreference resolution on the SciERC dataset:

  • Download the data. From the top-level folder for this repo, enter bash ./scripts/data/get_scierc.sh. This will download the SciERC dataset into the folder ./data/scierc.
  • Train the model. Enter bash ./scripts/train/train_scierc.sh [gpu-id]. The gpu-id should be an integer like 1, or -1 to train on CPU. The program will train a model and save it at ./models/scierc.

GENIA

The steps are similar to SciERC.

  • Download the data. From the top-level folder for this repo, enter bash ./scripts/data/get_genia.sh.
  • Train the model. Enter bash ./scripts/train/train_genia.sh [gpu-id]. The program will train a model and save it at ./models/genia.
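
If you launch training from Python instead of typing the commands, the SciERC and GENIA invocations above can be assembled as argument lists. This is only a sketch: `TRAIN_SCRIPTS` and `train_command` are hypothetical names, and the script paths are copied from the steps above.

```python
# Hypothetical wrapper; the script paths are the ones named in this README.
TRAIN_SCRIPTS = {
    "scierc": "./scripts/train/train_scierc.sh",
    "genia": "./scripts/train/train_genia.sh",
}


def train_command(dataset, gpu_id):
    """Build the argument list for `bash <train script> <gpu-id>`."""
    if dataset not in TRAIN_SCRIPTS:
        raise ValueError(f"no training script listed for {dataset!r}")
    if not isinstance(gpu_id, int) or gpu_id < -1:
        raise ValueError("gpu_id must be a non-negative integer, or -1 for CPU")
    return ["bash", TRAIN_SCRIPTS[dataset], str(gpu_id)]


# Example (run from the top-level folder of the repo; -1 trains on CPU):
#   import subprocess
#   subprocess.run(train_command("scierc", -1), check=True)
```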

ACE05 (ACE for entities and relations)

Creating the dataset

For more information on ACE relation and event preprocessing, see DATA.md and this issue.

We use preprocessing code adapted from the DyGIE repo, which is in turn adapted from the LSTM-ER repo. The following software is required:

  • Java, to run CoreNLP.
  • Perl.
  • zsh. If this isn't available on your system, you can create a conda environment and install zsh.

First, we need to download Stanford CoreNLP:

bash scripts/data/ace05/get_corenlp.sh

Then, run the driver script to preprocess the data:

bash scripts/data/get_ace05.sh [path-to-ACE-data]

The results will go in ./data/ace05/processed-data. The intermediate files will go in ./data/ace05/raw-data.

Training a model

In progress.

ACE05 Event

Creating the dataset

The preprocessing code I wrote breaks with the newest version of spaCy. So unfortunately, we need to create a separate Conda environment that uses an older version of spaCy and use that for preprocessing.

conda deactivate
conda create --name ace-event-preprocess python=3.7
conda activate ace-event-preprocess
pip install -r scripts/data/ace-event/requirements.txt
python -m spacy download en

Then, collect the relevant files from the ACE data distribution with

bash ./scripts/data/ace-event/collect_ace_event.sh [path-to-ACE-data]

The results will go in ./data/ace-event/raw-data.

Now, run the script

python ./scripts/data/ace-event/parse_ace_event.py [output-name] [optional-flags]

You can see the available flags by calling parse_ace_event.py -h. For detailed descriptions, see DATA.md. The results will go in ./data/ace-event/processed-data/[output-name]. We require an output name because you may want to preprocess the ACE data multiple times using different flags. For default preprocessing settings, you could do:

python ./scripts/data/ace-event/parse_ace_event.py default-settings

When finished, you should conda deactivate the ace-event-preprocess environment and re-activate your modeling environment.

Training the model

In progress.

Evaluating a model

To check the performance of one of your models or a pretrained model, you can use the allennlp evaluate command.

Note that allennlp commands will only be able to discover the code in this package if:

  • You run the commands from the root folder of this project, dygiepp, or:
  • You add the code to your Python path by running conda develop . from the root folder of this project.

Otherwise, you will get an error ModuleNotFoundError: No module named 'dygie'.
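
Before running an allennlp command, you can check whether the package is visible to the interpreter. A minimal sketch using the standard library; `package_is_discoverable` is a hypothetical helper name, not something provided by this repo.

```python
import importlib.util


def package_is_discoverable(name="dygie"):
    """Return True if a top-level package called `name` is on the Python path."""
    return importlib.util.find_spec(name) is not None
```

Running `python -c "import importlib.util; print(importlib.util.find_spec('dygie') is not None)"` from the project root should print True, since the current directory is on the path in that case. From another directory, it prints True only if the code has been added to your Python path.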

In general, you can evaluate a model like this:

allennlp evaluate \
  [model-file] \
  [data-path] \
  --cuda-device [cuda-device] \
  --include-package dygie \
  --output-file [output-file] # Optional; if not given, prints metrics to console.

For example, to evaluate the pretrained SciERC model, you could do

allennlp evaluate \
  pretrained/scierc.tar.gz \
  data/scierc/processed_data/json/test.json \
  --cuda-device 2 \
  --include-package dygie

To evaluate a model you trained on the SciERC data, you could do

allennlp evaluate \
  models/scierc/model.tar.gz \
  data/scierc/processed_data/json/test.json \
  --cuda-device 2  \
  --include-package dygie \
  --output-file models/scierc/metrics_test.json
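
The invocations above all share one shape, so they can be generated from a small helper when scripting evaluations. This is a sketch under assumptions: `evaluate_command` is a hypothetical name, and it uses only the flags shown in this section.

```python
def evaluate_command(model_file, data_path, cuda_device, output_file=None):
    """Build the `allennlp evaluate` argument list used in this README."""
    cmd = [
        "allennlp", "evaluate",
        model_file,
        data_path,
        "--cuda-device", str(cuda_device),
        "--include-package", "dygie",
    ]
    # --output-file is optional; without it, metrics are printed to the console.
    if output_file is not None:
        cmd += ["--output-file", output_file]
    return cmd
```

Pass the result to `subprocess.run(cmd, check=True)` from the root folder of the project so that the dygie package can be discovered.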

Pretrained models

We have versions of DyGIE++ trained on SciERC and GENIA available. More coming soon.

Downloads

Run ./scripts/pretrained/get_dygiepp_pretrained.sh to download all the available pretrained models to the pretrained directory. If you only want one model, here are the download links:

Performance of downloaded models

The SciERC model gives slightly better test set performance than reported in the paper:

2019-11-20 16:03:12,692 - INFO - allennlp.commands.evaluate - Finished evaluating.
...
2019-11-20 16:03:12,693 - INFO - allennlp.commands.evaluate - _ner_f1: 0.6855290303565666
...
2019-11-20 16:03:12,693 - INFO - allennlp.commands.evaluate - rel_f1: 0.4867781975175391

Similarly for GENIA:

2019-11-21 14:45:44,505 - INFO - allennlp.commands.evaluate - ner_f1: 0.7818707451272466
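
When metrics are printed to the console as above, they can be pulled out of the log text with a regular expression. The sketch below assumes the line layout shown in these excerpts; `METRIC_LINE` and `parse_metrics` are hypothetical names.

```python
import re

# Matches lines such as
#   "... - INFO - allennlp.commands.evaluate - rel_f1: 0.4867781975175391"
METRIC_LINE = re.compile(
    r"allennlp\.commands\.evaluate - (\S+): (-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
)


def parse_metrics(log_text):
    """Return {metric_name: value} for every metric line found in log_text."""
    metrics = {}
    for line in log_text.splitlines():
        match = METRIC_LINE.search(line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics
```

Lines that do not carry a `name: value` pair, such as "Finished evaluating.", are skipped.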

Making predictions

To make a prediction, you can use allennlp predict. For example, to make a prediction with the pretrained SciERC model, you can do:

allennlp predict pretrained/scierc.tar.gz \
    data/scierc/processed_data/json/test.json \
    --predictor dygie \
    --include-package dygie \
    --use-dataset-reader \
    --output-file predictions/scierc-test.jsonl \
    --cuda-device 0

Caveat: Models trained to predict coreference clusters will make predictions on a whole document at once. This can cause memory issues. If you just need entity and relation extraction, re-train a model with the coref loss weight set to 0. During evaluation, models with no coref objective will predict a sentence at a time, mitigating the memory issues.

See the docs for more prediction options.
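
To inspect the prediction file written by the command above, you can read it back line by line. This sketch assumes, from the .jsonl extension, that the file holds one JSON object per line; the field names are not described here, so the example only lists whatever keys are present. `read_jsonl` is a hypothetical helper name.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of the file at `path`."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Example:
#   for doc in read_jsonl("predictions/scierc-test.jsonl"):
#       print(sorted(doc))  # the keys available in each prediction
```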

Relation extraction evaluation metric

Following Li and Ji (2014), we consider a predicted relation to be correct if "its relation type is correct, and the head offsets of two entity mention arguments are both correct".

In particular, we do not require the types of the entity mention arguments to be correct, as is done in some work (e.g. Zhang et al. (2017)). We welcome a pull request that implements this alternative evaluation metric. Please open an issue if you're interested in this.
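
The criterion above can be sketched as set matching over relation tuples. This is an illustration, not the evaluation code in this repo: `relation_prf` is a hypothetical name, and each relation is assumed to be represented as a pair of argument offsets plus a relation type, with no entity types involved.

```python
def relation_prf(gold, predicted):
    """Precision, recall, and F1 over relations.

    Each relation is (arg1_offsets, arg2_offsets, relation_type). A prediction
    counts as correct only if all three parts match a gold relation exactly.
    Entity types are not part of the tuple, so they cannot affect the score.
    """
    gold_set = set(gold)
    pred_set = set(predicted)
    n_correct = len(gold_set & pred_set)
    precision = n_correct / len(pred_set) if pred_set else 0.0
    recall = n_correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, a prediction with the right argument offsets but the wrong relation type counts as both a false positive and a missed gold relation.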
