
gen-arg's Introduction

Argument Extraction by Generation

Code for the paper "Document-Level Event Argument Extraction by Conditional Generation", NAACL 2021.

Dependencies

  • pytorch=1.6
  • transformers=3.1.0
  • pytorch-lightning=1.0.6
  • spacy=2.3.2

Model Checkpoints

Checkpoints trained from this repo for the WikiEvents and ACE datasets are available at s3://gen-arg-data/checkpoints/.

You can download all the contents of the S3 bucket using the AWS CLI: aws s3 cp s3://gen-arg-data/checkpoints/ ./ --recursive

Model Predictions

The model predictions on WikiEvents are provided in outputs/wikievents-pointer-pred. Running this file through the scorer.py script should give you exactly the numbers reported in Table 5.

Datasets

You can download the data using the AWS CLI or the AWS console. Alternatively, you can download individual files as follows (a scripted version is sketched below):

  • wget https://gen-arg-data.s3.us-east-2.amazonaws.com/wikievents/data/<split>.jsonl for split = {train, dev, test}.
  • wget https://gen-arg-data.s3.us-east-2.amazonaws.com/wikievents/data/coref/<split>.jsonlines for split = {train, dev, test}.
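If you prefer scripting the downloads, here is a minimal Python sketch equivalent to the wget commands above; the local directory layout (data/wikievents) is an assumption and can be adjusted:

    import urllib.request
    from pathlib import Path

    BASE = "https://gen-arg-data.s3.us-east-2.amazonaws.com/wikievents/data"

    # Local layout is an assumption; adjust it to match your --train_file paths.
    Path("data/wikievents/coref").mkdir(parents=True, exist_ok=True)

    for split in ("train", "dev", "test"):
        urllib.request.urlretrieve(f"{BASE}/{split}.jsonl",
                                   f"data/wikievents/{split}.jsonl")
        urllib.request.urlretrieve(f"{BASE}/coref/{split}.jsonlines",
                                   f"data/wikievents/coref/{split}.jsonlines")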

Additional processed test files for RAMS can be downloaded with:

  • wget https://gen-arg-data.s3.us-east-2.amazonaws.com/RAMS/test_head_coref.jsonlines
  • wget https://gen-arg-data.s3.us-east-2.amazonaws.com/RAMS/test_head.jsonlines

gen-arg's People

Contributors

raspberryice


gen-arg's Issues

The checkpoints directory is empty

Hello Sha Li, following your code, I found that the checkpoints directory is still empty after the code finishes running, which makes it impossible to test the model on the test set. Have you encountered this problem, and could you give me a clue on how to fix it?

'BartConstrainedGen' object has no attribute 'postprocess_next_token_scores'

Thanks for sharing the code!

I was able to train the model successfully on the WikiEvents dataset, but I get the error "'BartConstrainedGen' object has no attribute 'postprocess_next_token_scores'" when running scripts/test_kairos.sh.

It seems the postprocess_next_token_scores method is not included in the BartConstrainedGen class (src/genie/constrained_gen.py). Is something missing from the code, or is there another cause? Looking forward to your reply!
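For context, postprocess_next_token_scores belongs to the generation utilities of the transformers 3.x series and was removed in later releases, so an error like this usually points to a version mismatch with the pinned dependencies above. A minimal sanity check, assuming an environment that follows the Dependencies section:

    import transformers
    from transformers import PreTrainedModel

    # In the transformers 3.x series, PreTrainedModel inherits the generation
    # mixin that defines postprocess_next_token_scores; the method is gone in 4.x.
    assert transformers.__version__.startswith("3."), transformers.__version__
    assert hasattr(PreTrainedModel, "postprocess_next_token_scores")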

Missing script `visualize_output.py`

I tried to run scripts/train_rams.sh and noticed that the Python script visualize_output.py appears to be missing from the master branch.

The head word F1

Hi, I have a question about head word F1: is the head word the first word of the argument?
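For context, the head word of a span is usually taken to be its syntactic head rather than its first token. Below is a minimal sketch of extracting it with spaCy (the repo pins spacy=2.3.2); the en_core_web_sm model is an assumption and must be installed separately:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model; install it separately

    def head_word(span_text):
        # The root of the parsed span is the token whose syntactic parent
        # lies outside the span, i.e. the head word, not simply the first word.
        doc = nlp(span_text)
        return doc[:].root.text

    print(head_word("the president of the United States"))  # -> president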

Only an F1 score of around 10 on the WikiEvents dataset

Hi,
I tried to follow scripts/train_kairos.sh and scripts/test_kairos.sh, but only got the following low performance:

Role identification: P: 16.88, R: 4.456, F: 7.18
Role: P: 15.58, R: 4.21, F: 6.63
Coref Role identification: P: 19.48, R: 5.26, F: 8.29
Coref Role: P: 15.58, R: 4.21, F: 6.63

Even when I trained for more epochs, I could only get an F1 score of around 10. Is anything going wrong?

By the way, I failed to download the checkpoints you shared on S3 due to a network error; is there any other way to acquire these files?

Thanks.

139 event types in paper vs 149 in csv file

Hi,

Great paper 👍
Quick question: why do you report 139 event types in your paper, while the cleaned ontology CSV file contains 149 event types?
And how did you create the aida_cleaned_ontology.csv file?

Thanks in advance! 🤗

tgr-pred-file

Hi, thank you for this amazing work. When we try to test the model with pipeline_scorer, it requires a tgr-pred-file. How can we obtain this file for the KAIROS dataset?

Clarification needed on the implementation of Equation 4 of the paper

Hi,

Thank you for sharing your amazing work! I need some clarification regarding the implementation of Equation 4 in the paper. From an initial reading of the code, it appears to me that the type clarification statements were not used in the implementation.

Am I missing anything? Any help would be much appreciated!

The event argument annotation

Since there are coreferential entity mentions in a document, what is the principle for deciding which entity mention should be annotated as the argument of an event?
Besides, I found entity coreference clusters stored in <split>.jsonlines, but I did not find the event coreference clusters. Where are they stored?

Data processing

Is the data you provide correct? When I use the preprocessed_KAIROS file, I cannot find the following items, which are used in the dataloader (a sketch of the expected record shape follows below):
'input_token_ids': input_token_ids,
'input_attn_mask': input_attn_mask,
'tgt_token_ids': tgt_token_ids,
'tgt_attn_mask': tgt_attn_mask,
'doc_key': doc_keys,
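For reference, the collate function in src/genie/data.py (quoted in a later issue) indexes examples by these keys, so each preprocessed record should be a dict of this shape; the field values below are illustrative assumptions, not real data:

    # Illustrative shape of one preprocessed example; the actual ids, lengths
    # and padding are produced by the repo's preprocessing, not by this sketch.
    example = {
        "doc_key": "wiki_doc_0",              # document identifier (hypothetical)
        "input_token_ids": [0, 9064, 16, 2],  # tokenized context + unfilled template
        "input_attn_mask": [1, 1, 1, 1],      # 1 for real tokens, 0 for padding
        "tgt_token_ids": [0, 9064, 2],        # tokenized filled template (target)
        "tgt_attn_mask": [1, 1, 1],
    }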

Share pretrained class vectors and tagger checkpoints

Thank you for the great work.
Could you please share the pretrained class vectors and tagger checkpoints, for example all_class_vec_KAIROS.pt?
Also, I cannot quite figure out how to reproduce zero-shot event extraction (Section 4.4 of your paper). What would be the process (command line with arguments) to extract a new event? Alternatively, if I add more events to the KAIROS ontology, how could I fine-tune for them?
As I understand it, the tagger checkpoint is created in train_tagger.py. Then I'd need to use it to classify documents and add event types to the training/test/dev data. Are there scripts for that?

About Appendices A, B, C and D in "Document-Level Event Argument Extraction by Conditional Generation"

Hi @raspberryice. I am a loyal reader of the paper "Document-Level Event Argument Extraction by Conditional Generation". I like this paper very much and have benefited a lot from it.
Thanks a lot for your work. I am very interested in your research, but I hit a barrier while reading: I tried to find Appendices A, B, C and D of the paper, but I couldn't find them. Could you point me to where the appendices are available?
Thank you again for your work and for sharing the code.

On the constrained vocabulary during generation

Why is the vocabulary restricted to tokens from the input text only when decoding on WikiEvents (MODEL=constrained-gen), while no vocabulary constraint is used on RAMS (MODEL=gen)?
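To make the question concrete, here is a minimal sketch of the general idea behind vocabulary-constrained decoding; it is illustrative only and is not the repo's BartConstrainedGen implementation:

    import torch

    def constrain_to_input(next_token_logits, input_ids, special_ids):
        # Keep scores only for tokens that occur in the input document (plus
        # special tokens such as EOS and the <arg> placeholder) and mask out
        # the rest of the vocabulary before picking the next token.
        allowed = torch.zeros_like(next_token_logits, dtype=torch.bool)
        allowed[:, special_ids] = True          # always-permitted token ids
        allowed.scatter_(1, input_ids, True)    # token ids seen in the input
        return next_token_logits.masked_fill(~allowed, float("-inf"))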

RAMS dataset missing test files

In scripts/test_rams.sh, in head eval mode, the input file "data/RAMS_1.0/data/test_head.jsonlines" is missing from the downloaded RAMS dataset. The file "data/RAMS_1.0/data/test_head_coref.jsonlines" is also missing.

Could you upload the two missing files? Thanks.

About the evaluation set of ACE

Hello, thank you for your excellent work.

As mentioned in your paper: "We used two settings for selecting known types: 10 most frequent event types and 8 event types, one from each parent type of the event ontology" and "The evaluation is done on the complete set of event types."

Could you please tell me which ACE evaluation set is used for the results of the rows "10 most frequent" and "1 per general type" in Table 8?
Does the evaluation set for these experimental settings contain seen event types?

Downloading checkpoint from s3

The link given in the repo, s3://gen-arg-data/checkpoints, doesn't seem to work for me, and neither does the AWS CLI command. Can you please point me to the latest version of the checkpoints?

Missing RAMS dataset files?

Hi.

I tried to run the code on the RAMS dataset, and I changed the 'test-file' argument in convert_gen_to_output.py to 'data/RAMS_1.0/data/test.jsonlines' when evaluating the head score, but only got low performance:

Precision: 33.5640 Recall: 36.2133 F1: 34.8384

I didn't find the files 'test_head_coref.jsonlines' and 'test_head.jsonlines'. Are they included in RAMS, or were they created by yourselves?

This puzzles me. Did I download the wrong version? My version: RAMS_1.0c.tar.gz [current as of March 2023, 4.5 MB].

Thanks for your reply.

TapKey Model Missing

Hello,
I really like this project! I think the TapKey model for event trigger detection is not included in the repository. I would like to use it, though I could just be missing it. Let me know. Thanks so much!

Tuning the model to handle imbalanced data

Love the paper.

I've tried it on my own closed-domain dataset and achieved poor recall.

Role identification: P: 49.30, R: 28.43, F: 36.06
Role: P: 44.41, R: 25.60, F: 32.48
Coref Role identification: P: 69.93, R: 40.32, F: 51.15
Coref Role: P: 48.60, R: 28.02, F: 35.55

I believe the low recall is due to imbalanced labels, but I value recall over precision.
Is there some way to tune the model to increase recall at the cost of precision?

Hello, a question about the task and the data

Hello! I'd like to ask about the meaning of role_type in the file event_role_KAIROS.json. Is it the set of possible entity types for each role? The meanings of types such as bal, com and mon are not clear to me; is there a reference where I can look up what they stand for?

Access denied when downloading checkpoints from S3

Hi!

When executing
$ aws s3 ls s3://gen-arg-data/checkpoints/
or
$ aws s3 cp s3://gen-arg-data/checkpoints/ ./ --recursive,
I get the following error:

"An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied".

I am not sure whether the issue is on my side.

Testing the checkpoint on WikiEvents dataset

Hi,
could you quickly summarize the steps required to test the downloaded checkpoint on the WikiEvents dataset?

As I have observed, the WikiEvents dataset is referred to as KAIROS in some parts of the code; it also uses the KAIROS data module, which requires, for example, the test file to be located at preprocessed_KAIROS/test.jsonl.

I did the following steps:

  1. I downloaded the WikiEvents dataset from S3 and stored it at data/wikievents.
  2. I downloaded the checkpoints from S3, which are stored at checkpoints/Wikievents/ (note that the directory contains both epoch=1-v0.ckpt and epoch=2-v0.ckpt).
  3. I had to add the "--coref_dir" argument to scripts/test_KAIROS.sh, as by default it refers to some other (non-existent) directory.
  4. The command for train.py is the following:
python train.py --model=constrained-gen --ckpt_name=WikiEvents-pred \
    --load_ckpt=checkpoints/WikiEvents/epoch=2-v0.ckpt \
    --dataset=KAIROS \
    --eval_only \
    --train_file=data/wikievents/train.jsonl \
    --val_file=data/wikievents/dev.jsonl \
    --test_file=data/wikievents/test.jsonl \
    --coref_dir=data/wikievents/coref \
    --train_batch_size=4 \
    --eval_batch_size=4 \
    --learning_rate=3e-5 \
    --accumulate_grad_batches=4 \
    --num_train_epochs=3

Note that this throws an error, as it still tries to load the test file from "preprocessed_KAIROS/test.jsonl".
5. Hoping to fix the issue, I copied data/wikievents/ to ./preprocessed_KAIROS/. Unfortunately, I get the following error:

  File "/home/patrik/gen-arg/src/genie/data.py", line 15, in my_collate                                                
    doc_keys = [ex['doc_key'] for ex in batch]                                                                         
  File "/home/patrik/gen-arg/src/genie/data.py", line 15, in <listcomp>                                                
    doc_keys = [ex['doc_key'] for ex in batch]                                                                         
KeyError: 'doc_key'

Do you maybe have an idea about what I am doing wrong?

Best, Patrik

Multiple arguments of the same argument role

Hi. I'm dealing with a dataset that has several argument roles, and each role might have multiple arguments. For example: '<arg1>, <arg2>, <arg3>, <arg4>, <arg5> participated in a military activity on <arg6> ...'. In this sentence, arg1 to arg5 are different arguments of the same role; they are all 'countries'.

The problem is that each example in my dataset might have a different number of countries: in my example there are five (arg1 to arg5), while another example might have only three (arg1 to arg3) or just one.

When I use the template '<arg1>, <arg2>, <arg3>, <arg4>, <arg5> participated in a military activity on <arg6> ...', I get pretty bad results, because the model predicts a lot of '<arg> <arg> <arg>' and the generated sentence looks absurd. If I instead use the template '<arg1> participated in a military activity on <arg2> ...', the results are normal and acceptable, but then I can only predict a single country (arg1). Is there a way to deal with this? Thanks.

The Length of Document

Hi, thanks for your nice work.
When analyzing the data, I found that many documents are very long.
I see that in your code MAX_LENGTH is 424. However, Figure 4 of the paper shows that the distance spanned by many informative arguments is greater than MAX_LENGTH, so many arguments are not even fed to the encoder?
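To illustrate the concern, here is a minimal sketch assuming a BART tokenizer and the MAX_LENGTH of 424 cited above; any token past the truncation point never reaches the encoder:

    from transformers import BartTokenizer

    MAX_LENGTH = 424  # value cited from the repo's code

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
    document = " ".join(["word"] * 1000)  # stand-in for a long document

    enc = tokenizer(document, max_length=MAX_LENGTH, truncation=True)
    print(len(enc["input_ids"]))  # capped at 424; everything later is dropped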
