OneIE v0.4.5

Requirements

Python 3.5+ and the following Python packages:

  • PyTorch 1.0+ (Install the CPU version if you use this tool on a machine without GPUs)
  • transformers
  • tqdm
  • lxml
  • nltk

This project uses Poetry to manage library versions.

  1. Get the latest versions of the dependencies and update the poetry.lock file.
poetry update
  2. Install the libraries.
poetry install
  3. Spawn a shell within the virtual environment.
poetry shell
  4. Ensure that the required libraries listed in pyproject.toml are installed in the virtual environment.
pip list
  5. You're good to go! :)
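
With the environment active (for example inside poetry shell), a quick sanity check along the lines below confirms that the packages listed under Requirements are importable. This is only a convenience sketch, not a script shipped with this repository.

# check_env.py -- confirm that the required packages are importable
# (package names taken from the Requirements list above)
import importlib

for name in ["torch", "transformers", "tqdm", "lxml", "nltk"]:
    try:
        module = importlib.import_module(name)
        print(name, getattr(module, "__version__", "(no __version__ attribute)"))
    except ImportError as exc:
        print(name, "MISSING:", exc)

# On a GPU machine, also confirm that PyTorch can see a CUDA device.
try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    pass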

Creating venv for Jupyter notebooks

  1. Ensure that both jupyter and ipykernel are in the development dependencies.
poetry add -D jupyter ipykernel
  2. Register the virtual environment as an IPython kernel.
poetry run ipython kernel install --user --name=oneie-venv
  3. Start up Jupyter in the virtual environment.
poetry run jupyter notebook
  4. In the notebook, go to Kernel > Change Kernel > oneie-venv, then restart the kernel.

How to Run

Pre-processing

Downloading RAMS

Pull the existing RAMS data, already parsed into the DyGIE++ format, from the toolkit:

eai data pull dsta.eventextraction.datasets@latest data/

Preprocess RAMS data

OneIE requires the data to be in its own format. Data in the DyGIE++ format can be transformed into OneIE's format with:

python oneie/preprocessing/process_dygiepp.py -i data/rams/collated-data/default-settings/json/dev.json -o data/oneie/rams/collated-data/default-settings/json/dev.json

Arguments:

  • -i, --input: Path to the input file.
  • -o, --output: Path to the output file.
  • -b, --bert: Name of BERT model used for tokenization (default: bert-large-cased).
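
The same conversion can be applied to every split in one go. A minimal batch sketch that shells out to the script above (the train/dev/test file layout is an assumption based on the example command):

# convert_rams.py -- run process_dygiepp.py over all three RAMS splits
# (directory layout is an assumption based on the example command above)
import subprocess

SRC = "data/rams/collated-data/default-settings/json"
DST = "data/oneie/rams/collated-data/default-settings/json"

for split in ("train", "dev", "test"):
    subprocess.run(
        [
            "python", "oneie/preprocessing/process_dygiepp.py",
            "-i", f"{SRC}/{split}.json",
            "-o", f"{DST}/{split}.json",
            "-b", "bert-large-cased",  # the default tokenizer noted above
        ],
        check=True,
    )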

A sample DyGIE++ format:

{"doc_key":"nw_RC00e90a0209cf7c63e3faf5008f034002ef61cea93a159a31aa33e18e","sentences":[["Three","specific","points","illustrate","why","Americans","see","Trump","as","the","problem",":","1",")","Trump","has","trouble","working","with","people","beyond","his","base",".","In","Saddam","Hussein","'s","Iraq","that","might","work","when","opponents","can","be","thrown","in","jail","or","exterminated",".","In","the","United","States","that","wo","n't","fly",":","presidents","must","build","bridges","within","and","beyond","their","core","support","to","resolve","challenges",".","Without","alliances",",","a","president","ca","n't","get","approval","to","get","things","done","."]],"events":[[[[40,"life.die.n\/a"],[33,33,"victim"],[28,28,"place"]]]],"ner":[[[33,33,"victim"],[28,28,"place"]]],"relations":[[]],"_sentence_start":[0],"dataset":"rams"}

A sample ONEIE format after using preprocessing/process_dygiepp.py:

{"doc_id": "nw_RC00e90a0209cf7c63e3faf5008f034002ef61cea93a159a31aa33e18e", "sent_id": "nw_RC00e90a0209cf7c63e3faf5008f034002ef61cea93a159a31aa33e18e-0", "entity_mentions": [{"id": "nw_RC00e90a0209cf7c63e3faf5008f034002ef61cea93a159a31aa33e18e-0-E0", "start": 33, "end": 34, "entity_type": "victim", "mention_type": "UNK", "text": "opponents"}, {"id": "nw_RC00e90a0209cf7c63e3faf5008f034002ef61cea93a159a31aa33e18e-0-E1", "start": 28, "end": 29, "entity_type": "place", "mention_type": "UNK", "text": "Iraq"}], "relation_mentions": [], "event_mentions": [{"event_type": "life:die:n/a", "id": "nw_RC00e90a0209cf7c63e3faf5008f034002ef61cea93a159a31aa33e18e-0-EV0", "trigger": {"start": 40, "end": 41, "text": "exterminated"}, "arguments": [{"entity_id": "nw_RC00e90a0209cf7c63e3faf5008f034002ef61cea93a159a31aa33e18e-0-E0", "text": "opponents", "role": "victim"}, {"entity_id": "nw_RC00e90a0209cf7c63e3faf5008f034002ef61cea93a159a31aa33e18e-0-E1", "text": "Iraq", "role": "place"}]}], "tokens": ["Three", "specific", "points", "illustrate", "why", "Americans", "see", "Trump", "as", "the", "problem", ":", "1", ")", "Trump", "has", "trouble", "working", "with", "people", "beyond", "his", "base", ".", "In", "Saddam", "Hussein", "'s", "Iraq", "that", "might", "work", "when", "opponents", "can", "be", "thrown", "in", "jail", "or", "exterminated", ".", "In", "the", "United", "States", "that", "wo", "n't", "fly", ":", "presidents", "must", "build", "bridges", "within", "and", "beyond", "their", "core", "support", "to", "resolve", "challenges", ".", "Without", "alliances", ",", "a", "president", "ca", "n't", "get", "approval", "to", "get", "things", "done", "."], "pieces": ["Three", "specific", "points", "illustrate", "why", "Americans", "see", "Trump", "as", "the", "problem", ":", "1", ")", "Trump", "has", "trouble", "working", "with", "people", "beyond", "his", "base", ".", "In", "Saddam", "Hussein", "'", "s", "Iraq", "that", "might", "work", "when", "opponents", "can", "be", "thrown", "in", "jail", "or", "ex", "##ter", "##minated", ".", "In", "the", "United", "States", "that", "w", "##o", "n", "'", "t", "fly", ":", "presidents", "must", "build", "bridges", "within", "and", "beyond", "their", "core", "support", "to", "resolve", "challenges", ".", "Without", "alliances", ",", "a", "president", "ca", "n", "'", "t", "get", "approval", "to", "get", "things", "done", "."], "token_lens": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1], "sentence": "Three specific points illustrate why Americans see Trump as the problem : 1 ) Trump has trouble working with people beyond his base . In Saddam Hussein 's Iraq that might work when opponents can be thrown in jail or exterminated . In the United States that wo n't fly : presidents must build bridges within and beyond their core support to resolve challenges . Without alliances , a president ca n't get approval to get things done ."}

Note: Some RAMS documents contain no events (events are extracted from gold_evt_links). For example:

{"rel_triggers": [], "gold_rel_links": [], "doc_key": "nw_RC013008ad72d04b5e4cca4706cad5cc71c88b0df4615bd597df0e3cf0", "ent_spans": [], "language_id": "eng", "source_url": "http://www.huffingtonpost.com/entry/why-trump-should-peacefully-protest-clintons-victory_us_5809d9b7e4b0b1bd89fdb0bc", "evt_triggers": [[103, 104, [["personnel.elect.winelection", 1.0]]]], "split": "dev", "sentences": [["In", "addition", "to", "working", "alongside", "super", "-", "PACs", ",", "there", "\u2019s", "the", "latest", "saga", "of", "two", "Democratic", "operatives", "losing", "their", "posts", "because", "of", "a", "leaked", "video", "."], ["The", "Chicago", "Tribune", "explains", "the", "impact", "of", "this", "video", "in", "a", "piece", "titled", "Two", "local", "Democratic", "operatives", "lose", "jobs", "after", "video", "sting", "on", "voter", "fraud", ":"], ["Robert", "Creamer", ",", "husband", "of", "Rep.", "Jan", "Schakowsky", ",", "D", "-", "Ill", ".", ",", "and", "Scott", "Foval", "--", "two", "little", "-", "known", "but", "influential", "Democratic", "political", "operatives", "--", "have", "left", "their", "jobs", "after", "video", "investigations", "by", "James", "O'Keefe", "'s", "Project", "Veritas", "Action", "found", "them", "entertaining", "dark", "notions", "about", "how", "to", "win", "elections", "."], ["Foval", "was", "laid", "off", "on", "Monday", "by", "Americans", "United", "for", "Change", ",", "where", "he", "had", "been", "national", "field", "director", "."], ["Creamer", "announced", "Tuesday", "night", "that", "he", "was", "\"", "stepping", "back", "\"", "from", "the", "work", "he", "was", "doing", "for", "the", "unified", "Democratic", "campaign", "for", "Hillary", "Clinton", "."]], "gold_evt_links": []}

The resultant dataset sizes after conversion are:

Split   RAMS   DyGIE++   OneIE
train   7329   7046      7046
dev      924    909       909
test     871    851       851
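
Since both the DyGIE++ and OneIE files store one document per line, the converted counts above can be reproduced with a simple line count (paths are assumptions based on the earlier commands):

# count_docs.py -- sanity-check the split sizes after conversion
DYGIEPP_DIR = "data/rams/collated-data/default-settings/json"
ONEIE_DIR = "data/oneie/rams/collated-data/default-settings/json"

for split in ("train", "dev", "test"):
    for name, root in (("DyGIE++", DYGIEPP_DIR), ("OneIE", ONEIE_DIR)):
        with open(f"{root}/{split}.json") as f:
            print(split, name, sum(1 for _ in f))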

ACE2005 to OneIE format

The preprocessing/process_ace.py script converts raw ACE2005 datasets to the format used by OneIE. Example:

python preprocessing/process_ace.py -i <INPUT_DIR>/LDC2006T06/data -o <OUTPUT_DIR> \
  -s resource/splits/ACE05-E -b bert-large-cased -c <BERT_CACHE_DIR> -l english

Another example, using a different transformers model and the --time_and_val flag:

python preprocessing/process_ace.py -i <PATH_TO_ACE>/data -o input/preprocessed_ace -s <PATH_TO_SPLITS> -l english -b <transformers model, eg. albert-xxlarge-v2> --time_and_val

Arguments:

  • -i, --input: Path to the input directory (data folder in your LDC2006T06 package).
  • -o, --output: Path to the output directory.
  • -b, --bert: Bert model name.
  • -c, --bert_cache_dir: Path to the BERT cache directory.
  • -s, --split: Path to the split directory. We provide document id lists for all datasets used in our paper in resource/splits.
  • -l, --lang: Language (options: english, chinese).

Training

  • cd to the root directory of this package
  • Set the environment variable PYTHONPATH to the current directory. For example, if you unpack this package to ~/oneie_v0.4.5, run: export PYTHONPATH=~/oneie_v0.4.5
  • Run this commandline to train a model: python train.py -c <CONFIG_FILE_PATH>.
  • We provide an example configuration file config/example.json. Fill in the following paths in the configuration file (a sketch for filling them in programmatically follows this list):
    • BERT_CACHE_DIR: Pre-trained BERT models, configs, and tokenizers will be downloaded to this directory.
    • TRAIN_FILE_PATH, DEV_FILE_PATH, TEST_FILE_PATH: Path to the training/dev/test files.
    • OUTPUT_DIR: The model will be saved to subfolders in this directory.
    • VALID_PATTERN_DIR: Valid patterns created based on the annotation guidelines or training set. Example files are provided in resource/valid_patterns.
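
One way to fill in these placeholders is a small script like the sketch below; the JSON key names are assumptions based on the config snippet shown further down, and the data paths follow the RAMS preprocessing example.

# make_config.py -- fill the placeholder paths in config/example.json
import json

with open("config/example.json") as f:
    config = json.load(f)

config.update({
    "train_file": "data/oneie/rams/collated-data/default-settings/json/train.json",
    "dev_file": "data/oneie/rams/collated-data/default-settings/json/dev.json",
    "test_file": "data/oneie/rams/collated-data/default-settings/json/test.json",
    "log_path": "output/rams",                        # OUTPUT_DIR (assumed location)
    "valid_pattern_path": "resource/valid_patterns",  # VALID_PATTERN_DIR
})

with open("config/train_rams.json", "w") as f:
    json.dump(config, f, indent=2)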

We may train a new model with train.py.

python train.py -c ./config/train_rams.json

Training takes many arguments, all of which should be contained in a JSON file.
Here are two key arguments in the JSON file to take note of:

  1. Location of the train / dev / test data files and the log file.
    "train_file": "<TRAIN_FILE_PATH>",
    "dev_file": "<DEV_FILE_PATH>",
    "test_file": "<TEST_FILE_PATH>",
    "log_path": "<OUTPUT_DIR>",
  2. Path of the valid patterns.
    "valid_pattern_path": "<VALID_PATTERN_DIR>",

The files in this path should define specifics of entity / relation / event extraction.
For instance, event_role.json will specify what arguments come with the Movement:Transport event.

  "Movement:Transport": [
    "Vehicle",
    "Artifact",
    "Agent",
    "Origin",
    "Destination"
  ],
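
A quick way to check which roles are valid for a given event type is to load this file directly; a minimal sketch, assuming event_role.json sits in resource/valid_patterns as in the provided example files:

# show_valid_roles.py -- list the argument roles allowed for an event type
import json

with open("resource/valid_patterns/event_role.json") as f:
    event_role = json.load(f)

print(event_role.get("Movement:Transport", []))
# expected, per the snippet above: ['Vehicle', 'Artifact', 'Agent', 'Origin', 'Destination']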

Training in EAI toolkit

In the root eai-dsta directory, run

./send_jobs <NAME OF JOB> <TYPE OF JOB>

Valid values for TYPE OF JOB: train, test_acepp

Evaluation

  • cd to the root directory of this package
  • Set the environment variable PYTHONPATH to the current directory. For example, if you unpack this package to ~/oneie_v0.4.5, run: export PYTHONPATH=~/oneie_v0.4.5
  • Example commandline to use OneIE: python predict.py -m best.role.mdl -i input -o output -c output_cs --format ltf
    • Arguments:
      • -m, --model_path: Path to the trained model.
      • -i, --input_dir: Path to the input directory. LTF format sample files can be found in the input directory.
      • -o, --output_dir: Path to the output directory (json format). Output files are in the JSON format. Sample files can be found in the output directory.
      • -c, --cs_dir: (optional) Path to the output directory (cs format). Sample files can be found in the output_cs directory.
      • -l, --log_path: (optional) Path to the log file. A sample file log.json can be found in output.
      • --gpu: (optional) Use GPU
      • -d, --device: (optional) GPU device index (for multi-GPU machines).
      • -b, --batch_size: (optional) Batch size. For a 16GB GPU, a batch size of 10~15 is a reasonable value.
      • --max_len: (optional) Max sentence length. Sentences longer than this value will be ignored. You may need to decrease batch_size if you set max_len to a larger number.
      • --beam_size: (optional) Beam size of the decoder. Increasing this value may improve the results but slows down decoding.
      • --lang: (optional) Model language.
      • --format: Input file format (txt or ltf).

Inference

We can then pass the preprocessed data to the pretrained model with predict.py.
Note that the data needs to be located at the specified paths.

python predict.py -m models/english.role.v0.3.mdl -i test_train/data/input -o data/output --format json

Post-processing

To analyse errors from trained models:

python main_formatter.py \
   -test R11/test.oneie.json \
   -preds R11/predictions.oneie.json \
   --from_oneie \
   --filter_classes

python eventspecific.py \
   -preds T4/oneie_formatted_results.jsonl \
   -gold T4/test.json

Output Format

OneIE saves results in JSON format. Each line is a JSON object for a sentence, containing the following fields:

  • doc_id (string): Document ID
  • sent_id (string): Sentence ID
  • tokens (list): A list of tokens
  • token_ids (list): A list of token IDs (doc_id:start_offset-end_offset)
  • graph (object): Information graph predicted by the model
    • entities (list): A list of predicted entities. Each item in the list has exactly five values: start_token_index, end_token_index, entity_type, mention_type, score. For example, [3, 5, "GPE", "NAM", 1.0] means the index of the start token is 3, the index of the end token is 4 (5 - 1), the entity type is GPE, the mention type is NAM, and the local score is 1.0.
    • triggers (list): A list of predicted triggers. It is similar to entities, but each item has four values: start_token_index, end_token_index, event_type, score.
    • relations (list): A list of predicted relations. Each item in the list has four values: arg1_entity_index, arg2_entity_index, relation_type, score. In the following example, [1, 0, "ORG-AFF", 1.0] means there is an ORG-AFF relation between entity 1 ("leader") and entity 0 ("North Korean") with a local score of 1.0. The order of arg1 and arg2 can be ignored for PER-SOC as this relation is symmetric.
    • roles (list): A list of predicted argument roles. Each item has four values: trigger_index, entity_index, role, score. In the following example, [0, 2, "Attacker", 0.46] means entity 2 ("Kim Jong Un") is the Attacker argument of event 0 ("detonate": Conflict:Attack), and the local score is about 0.46.

Output example:

{"doc_id": "HC0003PYD", "sent_id": "HC0003PYD-16", "token_ids": ["HC0003PYD:2295-2296", "HC0003PYD:2298-2304", "HC0003PYD:2305-2305", "HC0003PYD:2307-2311", "HC0003PYD:2313-2318", "HC0003PYD:2320-2325", "HC0003PYD:2327-2329", "HC0003PYD:2331-2334", "HC0003PYD:2336-2337", "HC0003PYD:2339-2348", "HC0003PYD:2350-2351", "HC0003PYD:2353-2360", "HC0003PYD:2362-2362", "HC0003PYD:2364-2367", "HC0003PYD:2369-2376", "HC0003PYD:2378-2383", "HC0003PYD:2385-2386", "HC0003PYD:2388-2390", "HC0003PYD:2392-2397", "HC0003PYD:2399-2401", "HC0003PYD:2403-2408", "HC0003PYD:2410-2412", "HC0003PYD:2414-2415", "HC0003PYD:2417-2425", "HC0003PYD:2427-2428", "HC0003PYD:2430-2432", "HC0003PYD:2434-2437", "HC0003PYD:2439-2441", "HC0003PYD:2443-2447", "HC0003PYD:2449-2450", "HC0003PYD:2452-2454", "HC0003PYD:2456-2464", "HC0003PYD:2466-2472", "HC0003PYD:2474-2480", "HC0003PYD:2481-2481", "HC0003PYD:2483-2485", "HC0003PYD:2487-2491", "HC0003PYD:2493-2502", "HC0003PYD:2504-2509", "HC0003PYD:2511-2514", "HC0003PYD:2516-2523", "HC0003PYD:2524-2524"], "tokens": ["On", "Tuesday", ",", "North", "Korean", "leader", "Kim", "Jong", "Un", "threatened", "to", "detonate", "a", "more", "powerful", "H-bomb", "in", "the", "future", "and", "called", "for", "an", "expansion", "of", "the", "size", "and", "power", "of", "his", "country's", "nuclear", "arsenal", ",", "the", "state", "television", "agency", "KCNA", "reported", "."], "graph": {"entities": [[3, 5, "GPE", "NAM", 1.0], [5, 6, "PER", "NOM", 0.2], [6, 9, "PER", "NAM", 0.5060472888322202], [15, 16, "WEA", "NOM", 0.5332313915378754], [30, 31, "PER", "PRO", 1.0], [32, 33, "WEA", "NOM", 1.0], [33, 34, "WEA", "NOM", 0.5212696155645499], [36, 37, "GPE", "NOM", 0.4998288792916457], [38, 39, "ORG", "NOM", 1.0], [39, 40, "ORG", "NAM", 0.5294904130032032]], "triggers": [[11, 12, "Conflict:Attack", 1.0]], "relations": [[1, 0, "ORG-AFF", 1.0]], "roles": [[0, 2, "Attacker", 0.4597024700555278], [0, 3, "Instrument", 1.0]]}}
