
tplinker-joint-extraction's Introduction

TPLinker

TPLinker: Single-stage Joint Extraction of Entities and Relations Through Token Pair Linking

This repository contains the official implementation of the paper TPLinker: Single-stage Joint Extraction of Entities and Relations Through Token Pair Linking. The paper has been accepted to appear at COLING 2020. [slides] [poster]

TPLinker is a joint extraction model that resolves the problems of overlapping relations and nested entities, is immune to the influence of exposure bias, and achieves SOTA performance on NYT (TPLinker: 91.9, TPLinkerPlus: 92.6 (+3.0)) and WebNLG (TPLinker: 91.9, TPLinkerPlus: 92.3 (+0.5)). Note that the details of TPLinkerPlus will be published in an extended paper, which is still in progress.

Note: Please check the Q&A section and the closed issues for your question before opening a new issue.

Update

  • 2020.11.01: Fixed bugs and added comments in BuildData.ipynb and build_data_config.yaml; TPLinkerPlus now supports entity classification, see build data for the data format; updated the datasets (added entity_list for TPLinkerPlus).
  • 2020.12.04: The original default parameters in build_data_config.yaml were all for Chinese datasets, which could be misleading when reproducing the results, so I changed them back to the ones for the English datasets. Note that you must set ignore_subword to true for English datasets, otherwise performance will suffer and the scores reported in the paper cannot be reached.
  • 2020.12.09: Published trained model states for quick tests; see Hyper-parameters.
  • 2021.03.22: Added a Q&A section to the README.

Model

[Figure: framework]

Results

[Figure: data_statistics]

[Figure: main_res_plus]

[Figure: split_res_plus]

Usage

Prerequisites

Our experiments were conducted with Python 3.6 and PyTorch 1.6.0. The main requirements are:

  • tqdm
  • glove-python-binary==0.2.0
  • transformers==3.0.2
  • wandb # for logging the results
  • yaml

In the root directory, run

pip install -e .

Data

download data

Get and preprocess NYT* and WebNLG* following CasRel (note: they are named NYT and WebNLG in CasRel). Taking NYT* as an example: rename train_triples.json and dev_triples.json to train_data.json and valid_data.json and move them to ori_data/nyt_star, then put all test*.json files under ori_data/nyt_star/test_data. The same process applies to WebNLG*.

Get the raw NYT dataset from CopyRE: rename raw_train.json and raw_valid.json to train_data.json and valid_data.json and move them to ori_data/nyt; rename raw_test.json to test_data.json and put it under ori_data/nyt/test_data.

Get WebNLG from ETL-Span: rename train.json and dev.json to train_data.json and valid_data.json and move them to ori_data/webnlg; rename test.json to test_data.json and put it under ori_data/webnlg/test_data.

If you would rather not prepare the data yourself, you can download our preprocessed datasets.

build data

Build the data with preprocess/BuildData.ipynb. Set the configuration in preprocess/build_data_config.yaml. In the configuration file, set exp_name to the corresponding directory name and set ori_data_format to the name of the project the data comes from. E.g., to build NYT*, set exp_name to nyt_star and ori_data_format to casrel. See build_data_config.yaml for more details. If you want to run on other datasets, convert them into the standard format for TPLinker, then set exp_name to <your folder name> and ori_data_format to tplinker:

[{
    "id": <text_id>,
    "text": <text>,
    "relation_list": [{
        "subject": <subject>,
        "subj_char_span": <character-level span of the subject>, # e.g. [3, 10]; optional. If this key is missing, set "add_char_span" to true in "build_data_config.yaml" when you build the data
        "object": <object>,
        "obj_char_span": <character-level span of the object>, # optional
        "predicate": <predicate>,
    }],
    "entity_list": [{ # optional, only for TPLinkerPlus. If this key is missing, BuildData.ipynb will automatically generate an entity list based on the relation list.
        "text": <entity>,
        "type": <entity_type>,
        "char_span": <character-level span of the entity>, # relies on subj_char_span and obj_char_span in relation_list; if those are not given, set "add_char_span" to true in "build_data_config.yaml".
    }],
}]
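
For illustration, here is a minimal conversion sketch into the format above, assuming a hypothetical source of (text, [(subject, predicate, object), ...]) pairs; character spans are located with a naive str.find, so in practice you can instead omit the *_char_span keys and set add_char_span to true in build_data_config.yaml:

import json

def to_tplinker_format(samples):
    """Convert (text, triples) pairs into the TPLinker input format sketched above.

    `samples` is assumed to be a list of (text, [(subj, predicate, obj), ...]) tuples.
    Character spans are found naively with str.find (first occurrence only); if your
    data needs more careful alignment, drop the *_char_span keys and let BuildData
    add them via add_char_span: true.
    """
    out = []
    for idx, (text, triples) in enumerate(samples):
        relation_list = []
        for subj, predicate, obj in triples:
            subj_start, obj_start = text.find(subj), text.find(obj)
            relation_list.append({
                "subject": subj,
                "subj_char_span": [subj_start, subj_start + len(subj)],
                "object": obj,
                "obj_char_span": [obj_start, obj_start + len(obj)],
                "predicate": predicate,
            })
        out.append({"id": str(idx), "text": text, "relation_list": relation_list})
    return out

samples = [("New York is in the United States.",
            [("United States", "/location/location/contains", "New York")])]
with open("train_data.json", "w", encoding="utf-8") as f:
    json.dump(to_tplinker_format(samples), f, ensure_ascii=False, indent=2)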

Pretrained Model and Word Embeddings

Download BERT-BASE-CASED and put it under ../pretrained_models. Pretrain word embeddings with preprocess/Pretrain_Word_Embedding.ipynb and put the resulting models under ../pretrained_emb.

If you would rather not train the word embeddings yourself, you can use ours directly.
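
For reference, a minimal glove-python-binary training sketch; the toy sentences, window size, epoch count, and output path below are placeholders rather than the settings used in Pretrain_Word_Embedding.ipynb, and only the 300-dimension size matches the glove_300_nyt.emb naming used in config.py:

from glove import Corpus, Glove  # provided by glove-python-binary==0.2.0

# Toy tokenized corpus: in practice, read and tokenize your training texts instead.
sentences = [["new", "york", "is", "in", "the", "united", "states"],
             ["paris", "is", "the", "capital", "of", "france"]]

corpus = Corpus()
corpus.fit(sentences, window=10)                       # build the co-occurrence matrix

glove = Glove(no_components=300, learning_rate=0.05)   # 300-dim vectors
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)

glove.save("../pretrained_emb/glove_300.emb")          # placeholder output path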

Train

Set configuration in tplinker/config.py as follows:

common["exp_name"] = nyt_star # webnlg_star, nyt, webnlg
common["device_num"] = 0 # 1, 2, 3 ...
common["encoder"] = "BERT" # BiLSTM
train_config["hyper_parameters"]["batch_size"] = 24 # 6 for webnlg and webnlg_star
train_config["hyper_parameters"]["match_pattern"] = "only_head_text" # "only_head_text" for webnlg_star and nyt_star; "whole_text" for webnlg and nyt.

# if the encoder is set to BiLSTM
bilstm_config["pretrained_word_embedding_path"] = ""../pretrained_word_emb/glove_300_nyt.emb""

# Leave the rest as default

Start training

cd tplinker
python train.py

Hyper-parameters

TPLinker

# NYT*
T_mult: 1
batch_size: 24
dist_emb_size: -1
ent_add_dist: false
epochs: 100
inner_enc_type: lstm
log_interval: 10
loss_weight_recover_steps: 12000
lr: 0.00005
match_pattern: only_head_text
max_seq_len: 100
rel_add_dist: false
rewarm_epoch_num: 2
scheduler: CAWR
seed: 2333
shaking_type: cat
sliding_len: 20

# NYT
match_pattern: whole_text
...(the rest is the same as above)

# WebNLG*
batch_size: 6
loss_weight_recover_steps: 6000
match_pattern: only_head_text
...

# WebNLG
batch_size: 6
loss_weight_recover_steps: 6000
match_pattern: whole_text
...

We also provide trained model states for quick tests. You can download them here! You should get results like the following:

Before you run, make sure that:
1. transformers==3.0.2 is installed
2. "max_test_seq_len" is set to 512
3. "match_pattern" is the same as in training: `whole_text` for NYT/WebNLG, `only_head_text` for NYT*/WebNLG*.

# test_triples and test_data are the complete test sets. Each tuple is (precision, recall, f1).
# The random seed was not set correctly in the code, so you might get slightly different scores (higher or lower), but they should be very close to the results in the paper.

# NYT*
{'2og80ub4': {'test_triples': (0.9118290017002562,
                               0.9257706535141687,
                               0.9187469407233642),
              'test_triples_1': (0.8641055045871312,
                                 0.929099876695409,
                                 0.8954248365513463),
              'test_triples_2': (0.9435444280804642,
                                 0.9196172248803387,
                                 0.9314271867684752),
              'test_triples_3': (0.9550056242968554,
                                 0.9070512820511851,
                                 0.9304109588540408),
              'test_triples_4': (0.9635099913118189,
                                 0.9527491408933889,
                                 0.9580993520017547),
              'test_triples_5': (0.9177877428997133,
                                 0.9082840236685046,
                                 0.9130111523662223),
              'test_triples_epo': (0.9497520661156711,
                                   0.932489451476763,
                                   0.9410415983777494),
              'test_triples_normal': (0.8659532526048757,
                                      0.9281617869000626,
                                      0.895979020929055),
              'test_triples_seo': (0.9476190476190225,
                                   0.9258206254845974,
                                   0.9365930186452366)}}
                                   
# NYT
{'y84trnyf': {'test_data': (0.916494217894085,
                            0.9272167487684615,
                            0.9218243035924758)}}

# WebNLG*
{'2i4808qi': {'test_triples': (0.921855146124465,
                               0.91777356103726,
                               0.919809825623476),
              'test_triples_1': (0.8759398496237308,
                                 0.8759398496237308,
                                 0.8759398495737308),
              'test_triples_2': (0.9075144508667897,
                                 0.9235294117644343,
                                 0.9154518949934687),
              'test_triples_3': (0.9509043927646122,
                                 0.9460154241642813,
                                 0.9484536081971787),
              'test_triples_4': (0.9297752808986153,
                                 0.9271708683470793,
                                 0.928471248196584),
              'test_triples_5': (0.9360730593603032,
                                 0.8951965065498275,
                                 0.9151785713781879),
              'test_triples_epo': (0.9764705882341453,
                                   0.9431818181807463,
                                   0.959537572203241),
              'test_triples_normal': (0.8780487804874479,
                                      0.8780487804874479,
                                      0.8780487804374479),
              'test_triples_seo': (0.9299698795180023,
                                   0.9250936329587321,
                                   0.9275253473025405)}}
                                   
# WebNLG
{'1g7ehpsl': {'test': (0.8862619808306142,
                       0.8630989421281354,
                       0.8745271121819839)}}

TPLinkerPlus

# NYT*/NYT
# The best F1: 0.931/0.934 (on validation set), 0.926/0.926 (on test set)
T_mult: 1
batch_size: 24
epochs: 250
log_interval: 10
lr: 0.00001
max_seq_len: 100
rewarm_epoch_num: 2
scheduler: CAWR
seed: 2333
shaking_type: cln
sliding_len: 20
tok_pair_sample_rate: 1

# WebNLG*/WebNLG
# The best F1: 0.934/0.889 (on validation set), 0.923/0.882 (on test set)
T_mult: 1 
batch_size: 6 
epochs: 250
log_interval: 10
lr: 0.00001
max_seq_len: 100
rewarm_epoch_num: 2
scheduler: CAWR
seed: 2333
shaking_type: cln
sliding_len: 20
tok_pair_sample_rate: 1

Evaluation

Set configuration in tplinker/config.py as follows:

eval_config["model_state_dict_dir"] = "./wandb" # if use wandb, set "./wandb"; if you use default logger, set "./default_log_dir" 
eval_config["run_ids"] = ["46qer3r9", ] # If you use default logger, run id is shown in the output and recorded in the log (see train_config["log_path"]); If you use wandb, it is logged on the platform, check the details of the running projects.
eval_config["last_k_model"] = 1 # only use the last k models in to output results
# Leave the rest as the same as the training

Start evaluation by running tplinker/Evaluation.ipynb

Citation

@inproceedings{wang-etal-2020-tplinker,
    title = "{TPL}inker: Single-stage Joint Extraction of Entities and Relations Through Token Pair Linking",
    author = "Wang, Yucheng and Yu, Bowen and Zhang, Yueyang and Liu, Tingwen and Zhu, Hongsong and Sun, Limin",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "International Committee on Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.coling-main.138",
    pages = "1572--1582"
}

Frequently Asked Questions

  1. Why did you make all entities the "DEFAULT" type? Can TPLinker not recognize entity types?

Because recognizing entity types is not necessary for the relation extraction task: a predefined relation usually has fixed types for its subject and object, e.g. ("place", "contains", "place") and ("country", "capital", "city"). If you need this feature, you can redefine the output tags of the EH-to-ET sequence, or use TPLinkerPlus, which already supports it. If you use TPLinkerPlus, just set a specific entity type instead of "DEFAULT" in the entity_list.
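
For example, an entity_list entry with a concrete type might look like this (the "LOC" label here is just an illustrative choice; use whatever type inventory your data defines):

"entity_list": [{
    "text": "New York",
    "type": "LOC", # instead of "DEFAULT"
    "char_span": [0, 8],
}]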

  2. What is the difference between <dataset_name>_star and <dataset_name>?

For a fair comparison, we use the preprocessed data from previous works. NYT is from CopyRE (the raw version); WebNLG is from ETL-Span; NYT* and WebNLG* are from CasRel. For a detailed description of these datasets, please refer to the data description part of our paper.

  3. Previous works claim that WebNLG has 246 relations. Why do you use versions of WebNLG and WebNLG* that have fewer relations?

We directly use the datasets preprocessed by previous SOTA models, taken from their GitHub repositories. We recounted the relations in these datasets and found that the real numbers of relations in WebNLG and WebNLG* are smaller than claimed in the papers. In fact, they use a subset of WebNLG (6000+ samples) instead of the original WebNLG (10000+ samples), but report the statistics of the original one. If you re-count the relations in their datasets, you will find the same problem.

  4. My training process is far slower than the 24 hours you report. Do you have any suggestions?

Please use the hyper-parameters provided in this README to reproduce the results. If you want to change them, keep the training max_seq_len less than or equal to 100. In my experience, increasing max_seq_len beyond 100 brings no obvious improvement but slows training down considerably. A smaller batch_size also speeds up training, but it may hurt performance if it is set too small.

  5. I see that you split long texts into short ones. You might miss some gold entities and relations by splitting. How do you handle this?

We split the samples with a sliding window, which preserves the vast majority of gold entities and relations, and we use different max sequence lengths for training and inference: 100 for training and 512 for inference. Losing one or two entities or relations during training is acceptable and does not affect training much. A minimal splitting sketch is shown below.
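
A minimal token-level sliding-window sketch (the input here is an already-tokenized list; the actual preprocessing code also remaps the character/token spans of the annotations into each window):

def split_into_windows(tokens, max_seq_len=100, sliding_len=20):
    """Yield (start, end) windows over a token list.

    The window start advances by sliding_len each step, so consecutive windows
    overlap by (max_seq_len - sliding_len) tokens and most entities and relations
    fall completely inside at least one window.
    """
    windows = []
    start = 0
    while True:
        end = min(start + max_seq_len, len(tokens))
        windows.append((start, end))
        if end == len(tokens):
            break
        start += sliding_len
    return windows

# e.g. a 250-token text with the training settings from this README
print(split_into_windows(list(range(250)), max_seq_len=100, sliding_len=20))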

  6. How can this model be used for Chinese datasets?

Please refer to issue #15.

  7. My F1 score stays at 0 for a very long time. Do you have any idea why?

Please refer to issue #24 and #25.


tplinker-joint-extraction's Issues

Chinese datasets

Have you run TPLinker in a Chinese setting? In BuildData.ipynb I see that you handle Baidu's open information extraction dataset, so I assume you have experimented with it. Following the README, I converted the DuIE 2.0 data and ran TPLinker, but training is very slow. My GPU is a V100.

What are the causes of predicting weird entity char spans?

Hi there, I notice some really weird entity char spans in my model predictions. E.g., the text is only 250 characters long, but some entities are given a span of [3510, 3512], which clearly makes no sense; the char span of an entity can also be predicted as [0, 0], which represents nothing. The ent_f1 and rel_f1 for my model are 0.81 and 0.78.

What can be the causes of predicting these weird entity spans? Thank you.

#############################################
There are no token span errors in the preprocessing and training phases.
The config for evaluation is as follows:

eval_config = {
    "model_state_dict_dir": "./wandb", # if use wandb, set "./wandb", or set "./default_log_dir" if you use the default logger
    "run_ids": ["2n26hvto", ],
    "last_k_model": 1,
    "test_data": "test.json",

    "save_res": True,
    "save_res_dir": "../datasets/result_data",

    "score": True,

    "hyper_parameters": {
        "batch_size": 32,
        "force_split": False,
        "max_seq_len": 240,
        "sliding_len": 50,
    },
}

Can TPLinker be used for event extraction?

Hi, I see some event-extraction-related code in the tplinker-plus code. Can event extraction be done with TPLinker Plus?
What would the input data format look like? Thanks!

Unable to save checkpoint with TPLinkerPlus

Description
The checkpoints were saved successfully with my previous datasets and NYT_star, which contain thousands of entities and relations. However, last week I tried to apply TPLinkerPlus to a new Chinese dataset that contains no relations and whose texts are all shorter than 20 characters; although the scores kept improving, no checkpoint files appeared in the wandb folder.

My debugging

  1. Initially I thought it was caused by wandb, so I switched to the default logger, but still no checkpoints were saved.
  2. After that, I guessed the bug was caused by my dataset containing no relations, so I randomly added two or three relations to the training set; sadly that did nothing and still no checkpoint files were saved.
  3. When I switched back to my previous datasets, checkpoints were saved normally as performance improved during training.

Training parameters

"hyper_parameters": {
    "batch_size": 28, # 32
    "epochs": 1000,
    "seed": 2333,
    "log_interval": 10,
    "max_seq_len": 80, # 128
    "sliding_len": 20, #
    "scheduler": "CAWR", # Step
    "ghm": False, # set True if you want to use GHM to adjust the weights of gradients; this will speed up training and might improve the results. (Note that ghm in the current version is unstable and may hurt the results)
    "tok_pair_sample_rate": 1, # (0, 1] what percentage of token pairs to sample for training; this slows down training if set to less than 1. It is only helpful when your GPU memory is not enough for training.
},

About the loss weights

Hi, thanks for sharing the code.

I see that during training the two losses have different weights: loss_weights = {"ent": w_ent, "rel": w_rel}. At the beginning w_ent is very large and it gradually decreases.

How were these weights chosen: by experience, or is there some underlying principle? Thanks!

TPLinker training accuracy does not improve

Is there a training log I could use as a reference? I trained TPLinker on the NYT dataset for 100 epochs; the entity accuracy is very high, but t_head_rel_sample_acc and t_tail_rel_sample_acc never go up. Roughly after how many epochs should I see an obvious change? Also, during tplinker-plus training I can no longer see t_head_rel_sample_acc, only a single t_sample_acc; is that expected?

Problems when using BuildData with Chinese samples

I converted my Chinese samples into the format required by the model, imitating the NYT_star data format, but the model cannot tag any entities or relations, and both ent2id.json and rel2id.json are empty ({}). My data format is as follows:

[
    {
        "text": "急诊胸部CT:临床提示:胸闷头痛3天扫描层厚:5mm影像所示:两下肺少许渗出,两侧胸腔微量积液。无 明显气管、支气管异物;无 明显食管异物;无 气胸、液气胸征象;无 明显纵隔气肿、占位;无 明显心脏、大血管形态改变,无 明显心包积液。(所示肋骨)无 明显肋骨错位性骨折。",
        "triple_list": [
            ["微量", "修饰", "积液"],
            ["两侧", "修饰", "胸腔"],
            ["胸腔", "修饰", "积液"],
            ["少许", "修饰", "两下肺"],
            ["两下肺", "修饰", "渗出"]
        ]
    },
    ......
]

My settings in the BuildData configuration file build_data_config.yaml are:

exp_name: deepwise # nyt_star, nyt, webnlg_star, webnlg, ace05_lu
data_in_dir: ../datasets/ori_data
ori_data_format: casrel # casrel (webnlg_star, nyt_star), etl_span (webnlg), raw_nyt (nyt), tplinker (see readme)

encoder: BERT
bert_path: ../../pretrained_models/chinese-bert-wwm-ext-hit-pytorch-huggingface
data_out_dir: ../datasets/train_data/debugging

add_char_span: true
ignore_subword: true
separate_char_by_white: false
check_tok_span: true

The Chinese BERT model I used is HIT's chinese-bert-wwm-ext (whole word masking).

Download link: https://huggingface.co/hfl/chinese-bert-wwm-ext

TPLinkerPlus: CUDA error at training stage when adding special tokens for BERT

Hi author,
I am trying to add some special tokens to the BERT tokenizer. It works fine with the DataBuilder, but a CUDA error occurs at the training stage. I modified the BERT tokenizer like this:


if config["encoder"] == "BERT":
tokenizer = BertTokenizerFast.from_pretrained(config["bert_path"], add_special_tokens = True, do_lower_case = False)
# The special tokens are added here.....
tokenizer.add_special_tokens({'additional_special_tokens':[
"mm³","cm³","mm²","cm²","mm3","cm3","mm2","cm2",
"cm", "mm", "ml", "CM", "MM", "ML", "x", "*",
"Hu", "hu", "HU", "Se", "se", "SE",
"Image", "image", "IMAGE", "Im", "im", "IM"
]})
data_maker = DataMaker4Bert(tokenizer, handshaking_tagger)


Here is the CUDA error info:
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [127,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [127,0,0], thread: [65,0,0] Assertion srcIndex < srcSelectDimSize failed
........
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [73,0,0], thread: [94,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [73,0,0], thread: [95,0,0] Assertion srcIndex < srcSelectDimSize failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered

implementation questions

Hi,
Thanks for your good idea and paper; I have a few questions about TPLinkerPlus:

  1. I see you use logsum to calculate the loss; is it better than BCE?
  2. Is there any reference/paper for the "conditional" layer norm used for the "cln" shaking_type?

Thanks.

the eval result

'time': 34.05615592002869,
'val_ent_seq_acc': 0.8174603375650588,
'val_f1': 0.7436676797886321,
'val_head_rel_acc': 0.39087302432883353,
'val_prec': 0.8524970963994364,
'val_recall': 0.659478885893921,
'val_tail_rel_acc': 0.40277778587880586

Current avf_f1: 0.7436676797886321, Best f1: 0.7436676797886321
I would like to know how the precision and recall are computed.
Are ent_seq_acc, head_rel_acc, and tail_rel_acc taken into account?
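
For reference, one common way to compute such micro precision/recall/F1 over predicted vs. gold triple sets is sketched below; this is a generic illustration, not necessarily the exact implementation used in this repository:

def prf1(pred_triples, gold_triples):
    """Micro precision/recall/F1 over sets of (subject, predicate, object) triples."""
    pred, gold = set(pred_triples), set(gold_triples)
    correct = len(pred & gold)
    precision = correct / (len(pred) + 1e-12)
    recall = correct / (len(gold) + 1e-12)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    return precision, recall, f1

print(prf1({("A", "r1", "B")}, {("A", "r1", "B"), ("C", "r2", "D")}))  # (1.0, 0.5, ~0.67)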

Discussion of TPLinker's performance on Chinese datasets

Hi, I took part in the 2021 Language and Intelligence Challenge. I really like the design of the TPLinker model, so I tested TPLinker Plus on the DuIE relation extraction task.
There are about 170k training samples and I used a bert-base model, but the final result is only 86.67/55.61/67.75. The training log is as follows:

{'val_shaking_tag_acc': 0.44333228247162676, 'val_rel_prec': 0.7269814343605692, 'val_rel_recall': 0.6981153419760628, 'val_rel_f1': 0.7122560391486524, 'val_ent_prec': 0.7853231106243155, 'val_ent_recall': 0.8241971405991809, 'val_ent_f1': 0.8042906719944524, 'time': 1675.9016613960266}

I have a few questions.

1. Precision is much higher than recall, and val_shaking_tag_acc is also not high. What do you think the cause might be, and is there any way to improve it? Thanks.

My own guess is that it is a matrix sparsity problem: a sentence of length 100 is expanded into 5050 spans, but only about 10% of those spans are valid, so it feels like the other 90% of invalid spans hurt the model's learning.

2. I see you have made some attempts in this direction, for example the parameter:

"tok_pair_sample_rate": 1, # (0, 1] what percentage of token pairs to sample for training

Since the default is 1, I did not tune it. How did the model perform when you tuned this parameter? Thanks.

3. Would you be interested in teaming up? Our team is currently in the top 10, and we would be honored to have you join us. Thanks!

Question about the relation categories.

I notice that you use argmax in your paper, but in the dataset some entity pairs are labelled with multiple relations; for example, Iceland and Reykjavik are labelled with both "contains" and "capital". How do you handle that if you only choose the relation with the largest probability?

Roughly how many epochs does TPLinker need to converge on WebNLG?

Hello, I am experimenting with TPLinker (not the plus version) on WebNLG, but after 100+ epochs head_rel_acc and tail_rel_acc are still 0. I saw in another issue that you replied with batch_size 6 and 200 epochs; roughly after how many epochs should head_rel and tail_rel show an obvious improvement?
Apart from batch_size, all my other parameters seem to be the same as the ones you provided; the difference in batch_size should not matter this much, should it? My parameters are below.
Issue #24 is about NYT, but my results on NYT are quite good; it is only WebNLG that is off, which is why I opened a separate issue.

common = {
    "exp_name": "webnlg",
    "rel2id": "rel2id.json",
    "device_num": 0,
#     "encoder": "BiLSTM",
    "encoder": "BERT", 
    "hyper_parameters": {
        "shaking_type": "cat", # cat, cat_plus, cln, cln_plus; Experiments show that cat/cat_plus work better with BiLSTM, while cln/cln_plus work better with BERT. The results in the paper are produced by "cat". So, if you want to reproduce the results, "cat" is enough, no matter for BERT or BiLSTM.
        "inner_enc_type": "lstm", # valid only if cat_plus or cln_plus is set. It is the way how to encode inner tokens between each token pairs. If you only want to reproduce the results, just leave it alone.
        "dist_emb_size": -1, # -1: do not use distance embedding; other number: need to be larger than the max_seq_len of the inputs. set -1 if you only want to reproduce the results in the paper.
        "ent_add_dist": False, # set true if you want add distance embeddings for each token pairs. (for entity decoder)
        "rel_add_dist": False, # the same as above (for relation decoder)
        "match_pattern": "whole_text", # only_head_text (nyt_star, webnlg_star), whole_text (nyt, webnlg), only_head_index, whole_span
    },
}
common["run_name"] = "{}+{}+{}".format("TP1", common["hyper_parameters"]["shaking_type"], common["encoder"]) + ""

run_id = ''.join(random.sample(string.ascii_letters + string.digits, 8))
train_config = {
    "train_data": "train_data.json",
    "valid_data": "valid_data.json",
    "rel2id": "rel2id.json",
    "logger": "wandb", # if wandb, comment the following four lines
    
#     # if logger is set as default, uncomment the following four lines
#     "logger": "default", 
#     "run_id": run_id,
#     "log_path": "./default_log_dir/default.log",
#     "path_to_save_model": "./default_log_dir/{}".format(run_id),

    # only save the model state dict if F1 score surpasses <f1_2_save>
    "f1_2_save": 0, 
    # whether train_config from scratch
    "fr_scratch": True,
    # write down notes here if you want, it will be logged 
    "note": "start from scratch",
    # if not fr scratch, set a model_state_dict
    "model_state_dict_path": "",
    "hyper_parameters": {
        "batch_size": 24,
        "epochs": 200,
        "seed": 2333,
        "log_interval": 10,
        "max_seq_len": 100,
        "sliding_len": 20,
        "loss_weight_recover_steps": 6000, # to speed up the training process, the loss of EH-to-ET sequence is set higher than other sequences at the beginning, but it will recover in <loss_weight_recover_steps> steps.
        "scheduler": "CAWR", # Step
    },
}

eval_config = {
    "model_state_dict_dir": "./default_log_dir", # if use wandb, set "./wandb", or set "./default_log_dir" if you use default logger
    "run_ids": ["DGKhEFlH", ],
    "last_k_model": 1,
    "test_data": "*test*.json", # "*test*.json"
    
    # where to save results
    "save_res": False,
    "save_res_dir": "../results",
    
    # score: set true only if test set is annotated with ground truth
    "score": True,
    
    "hyper_parameters": {
        "batch_size": 32,
        "force_split": False,
        "max_test_seq_len": 512,
        "sliding_len": 50,
    },
}

What is the purpose of char_span in the datasets?

Hi, I downloaded your cleaned datasets directly for debugging.

It looks like training, evaluation, and prediction all operate on tokens, and sub_char_span and obj_char_span do not seem to play any role. Can they be dropped?

DuIE

Great work! How does this approach perform on the DuIE dataset? Have you experimented with it?

About the number of relations in the WebNLG dataset

I deduplicated the relations in the training, validation, and test sets with a set and only got 216 relations. How was the figure of 246 relations obtained?

Question about layernorm?

line 86 in components.py:
std = (variance + self.epsilon) **2
Is this a bug? I think it should be:
std = (variance + self.epsilon) ** 0.5

A few questions

Many thanks for open-sourcing this; it is a great model architecture.

1. The first question is about the dataset.
The training set contains the following sample:

'text': '李治即位后,萧淑妃受宠,王皇后为了排挤萧淑妃,答应李治让身在感业寺的武则天续起头发,重新纳入后宫'

One of its SPO triples is (李治, 妻子, 萧淑妃), where 妻子 ("wife") is the relation. Both 萧淑妃 and 李治 appear twice in the text.

My question is: when building the training set, is it necessary to consider the subject and object at every position? As listed below, 4 SPO instances can be constructed. I am not sure whether training on all of them is a problem: if all 4 SPOs are used for training, the model also needs to be able to predict 4 SPOs at inference time, which might make prediction harder.

Or would it be better to keep only the first occurrence of each entity?

{'predicate': '妻子', 'subject': '李治', 'subj_tok_span': [0, 2], 'object': '萧淑妃', 'obj_tok_span': [6, 9]},
{'predicate': '妻子', 'subject': '李治', 'subj_tok_span': [0, 2], 'object': '萧淑妃', 'obj_tok_span': [19, 22]},
{'predicate': '妻子', 'subject': '李治', 'subj_tok_span': [25, 27], 'object': '萧淑妃', 'obj_tok_span': [6, 9]},
{'predicate': '妻子', 'subject': '李治', 'subj_tok_span': [25, 27], 'object': '萧淑妃', 'obj_tok_span': [19, 22]}

2. Tokenizer for Chinese datasets

I adapted your English-dataset code to Chinese. For simplicity, I removed the wordpiece tokenization and used one character per token.
For example, the sentence "●1981年2月27日" is split into single characters: ['●', '1', '9', '8', '1', '年', '2', '月', '2', '7', '日'].
This approach seems fine to me; do you see any problem with it?

3. About training complexity.

I see that self.shaking_ind2matrix_ind enumerates all combinations over the sample length, so with the recommended sample length of 100 there are 5050 spans; the complexity is n^2.

Because of the training time, I would like to prune the spans, e.g. limit the entity span length to at most 20. The total number of spans would then be much smaller and I could safely increase the sample length. Have you tried this? Would such pruning hurt performance?
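
(Purely as an illustration of the pruning idea described in question 3; this is not code from the repository, and the index layout differs from the actual shaking_ind2matrix_ind:)

def pruned_token_pairs(seq_len, max_span_len=20):
    """Enumerate (start, end) token pairs with end - start < max_span_len.

    Full upper-triangular enumeration yields seq_len * (seq_len + 1) / 2 pairs
    (5050 for seq_len = 100); pruning bounds it by roughly seq_len * max_span_len.
    """
    return [(i, j) for i in range(seq_len)
            for j in range(i, min(i + max_span_len, seq_len))]

print(len(pruned_token_pairs(100, max_span_len=100)))  # 5050, the full enumeration
print(len(pruned_token_pairs(100, max_span_len=20)))   # 1810, after pruning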

testing on NYT

Hi! Since the NYT data is built through Distant Supervision, I guess that it has to be noisy, and using it as a test set does not lead to a correct evaluation. Then, how do you evaluate on NYT? Is there a part of NYT that is labeled manually?

Low relation accuracy with tplinker-plus

I tried both tplinker and plus on the same Chinese dataset. TPLinker reaches an F1 of 74, but the rel F1 of plus is only 54: its precision is quite high at 86, but recall is only 39. What might be causing this?

Did you apply multiple EH-to-ETs in TPLinker-plus?

Hi dear author, your project is helping us a lot, thanks again! TPLinker Plus is a really strong model, especially for texts with lots of nested entities and complicated relations.

I notice that a single word or word group can be predicted as two entities at the same time, e.g. "orange" -> fruit and "orange" -> color simultaneously.

What makes it possible to put more than one entity label on the same char_span? As I understand it, in TPLinker's EH-to-ET sequence different entity labels would correspond to different tag values, which means the same span could not have more than one entity type (it would have to be orange -> fruit or orange -> color, not both at the same time).

Did you apply multiple EH-to-ET sequences in TPLinker-plus (one EH-to-ET per entity type)?

GPU memory issues when using tplinker-plus

When I use the tplinker-plus model code I run out of CUDA memory, even on a GeForce RTX 3090 (the largest GPU available to me), whereas the tplinker model runs without errors. Is there any way to work around this with the tplinker-plus model?

Segmentation fault

[image]

Thank you very much for sharing!
When I run the code, it crashes directly with a segmentation fault and no other error message; see the image above.

BuildData runs out of memory when preprocessing long samples

I noticed that BuildData uses very little memory when preprocessing the NYT_star dataset, but yesterday, when preprocessing a Chinese corpus of my own, add_token_span ran out of memory (20 GB of RAM were used up). An overview of my corpus:

Minimum text length in the corpus: 600 chars
Maximum text length within the first 10% of samples: 5092 chars
Maximum text length within the first 20% of samples: 6216 chars
Maximum text length within the first 30% of samples: 6852 chars
Maximum text length within the first 40% of samples: 7476 chars
Maximum text length within the first 50% of samples: 8074 chars
Maximum text length within the first 60% of samples: 9075 chars
Maximum text length within the first 70% of samples: 10171 chars
Maximum text length within the first 80% of samples: 11265 chars
Maximum text length within the first 90% of samples: 12293 chars
Maximum text length in the corpus: 18453 chars

Each sample has several hundred relations and several hundred entities, and there are a little under 400 such samples in total.
Could the memory overflow be related to the sliding window? When this happens, is there a better approach than splitting the corpus myself, running BuildData on the split samples, and then concatenating train, valid, test, rel2id, ent2id and data_statistics? Thanks a lot!

Difference in max_seq_len between training and testing

Hello,

In your code, max_seq_len is set to 128 in train_config and to 512 in eval_config.
I would therefore like to ask:
1. During training, long sentences are split into short ones no longer than 128 tokens; at test time, sentences longer than 128 are not split. Could this be a problem?
2. Is a model trained this way only suitable for short sentences and not for long ones?

Thanks

Imbalanced relation class distribution

Hello, and thanks again for sharing such great work.
I have a question: I looked at the relation classes in the NYT training data and did a simple count, and the class distribution is quite imbalanced, with large differences between some classes. Did you do anything special about this, and does the imbalance affect the model's practical performance?
If I am not mistaken, the evaluation during both training and testing uses the average accuracy over all classes, i.e. the relation class imbalance is not taken into account.

        {'/business/company/advisors': 44,
         '/business/company/founders': 812,
         '/business/company/industry': 2,
         '/business/company/major_shareholders': 283,
         '/business/company/place_founded': 409,
         '/business/company_shareholder/major_shareholder_of': 283,
         '/business/person/company': 5614,
         '/location/administrative_division/country': 7303,
         '/location/country/administrative_divisions': 7303,
         '/location/country/capital': 8366,
         '/location/location/contains': 54669,
         '/location/neighborhood/neighborhood_of': 6018,
         '/people/deceased_person/place_of_death': 1964,
         '/people/ethnicity/geographic_distribution': 58,
         '/people/ethnicity/people': 22,
         '/people/person/children': 453,
         '/people/person/ethnicity': 22,
         '/people/person/nationality': 8599,
         '/people/person/place_lived': 6887,
         '/people/person/place_of_birth': 3098,
         '/people/person/profession': 2,
         '/people/person/religion': 69,
         '/sports/sports_team/location': 328,
         '/sports/sports_team_location/teams': 328})

softmax

In the paper you use softmax for the tag classification, but in the code I did not find a softmax function; a linear layer directly outputs the two values.

What is the difference between the [dataset_name]_star and [dataset_name] datasets, and why do both exist?

Hello, thanks for sharing such excellent work! I am new to this task and have a few questions about the datasets; I would appreciate your help.
1. What are the differences between nyt and nyt_star, and between webnlg and webnlg_star?
2. Is [dataset_name]_star converted from [dataset_name]? If so, what is the purpose of the converted datasets?
3. If this method (with BERT as the encoder) is applied to Chinese datasets (where a sentence may also contain digits, English text, etc.), is there anything to pay attention to?
4. Neither dataset provides entity types. Is there a related dataset that provides this information?

Looking forward to your reply!

Evaluating other datasets

Hi @131250208 !

I was interested in evaluating other datasets (such as conll04 and scierc, as evaluated by SpERT). I was wondering which other settings I would have to change in your model to try them. So far, for conll04, the only change I made was setting match_pattern to "whole_text". I wrote a script to convert the conll04 dataset to CasRel's format so I could use your build_data script to convert it to your format. As far as I can see, everything looks fine with this procedure.
Since conll04 is a much smaller dataset (931 training sentences compared to NYT's 56k, and around 200 validation/test sentences), I also considered changing loss_weight_recover_steps to 100. With a batch size of 12, there are only 79 training steps per epoch.
For nyt_star I was able to get good results with your model without changing anything. However, for conll04 I had no success; after 20 epochs the scores are still 0. For scierc I get similar results, except that the sample accuracies (ent_seq, head_rel, tail_rel) stay around 30%.
What else do you recommend changing for these benchmarks?

Thanks!

The metrics at the training stage are not the same as at the evaluation stage

I get a rel_f1 of 0.52 and an ent_f1 of 0.73 at the training stage, where a checkpoint was saved; however, the F1 values on my test set are only 0.25 and 0.5 (with that saved checkpoint).

I have tried replacing the test set with the same validation set that I used for training, but the evaluation scores are still much lower than the training metrics (rel_f1 = 0.3, ent_f1 = 0.55).

What's more, I noticed that the file creation time of the checkpoint is exactly the same as the wandb wall time at which the checkpoint was saved, so it is unlikely that I am using the wrong checkpoint.

How do you get accurate precision, recall, and F1 scores at the evaluation stage? Thanks.

Is it possible to incorporate external knowledge?

Thanks again for sharing this work!
Here is an example: the following is a single prediction from a model trained on NYT:
{'rels': [{'id': '0', 'text': 'US Election 2020: Trump COVID-19 positive: What next?', 'relation_list': [{'subject': 'US', 'object': 'Trump', 'subj_tok_span': [0, 1], 'obj_tok_span': [4, 5], 'subj_char_span': [0, 2], 'obj_char_span': [18, 23], 'predicate': '/location/location/contains'}]}]}
From this example, the model seems to treat Trump as a place name. Even if the model correctly recognized Trump as a person before predicting relations, it would still be hard to predict a "president" relation.
My question is: is there any way to let the model learn some external knowledge? For example, telling the model that Trump is the US president. Of course, matching the results against an external knowledge base afterwards is one option!
