Question about the relation categories.

I noticed that your paper uses argmax. In the dataset, however, some entity pairs are labelled with multiple relations; for example, Iceland and Reykjavik are labelled with both "contain" and "capital". How do you handle this if you only choose the relation with the largest probability?
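
One way argmax and multi-label pairs can coexist, sketched under the assumption that every relation has its own tag sequence and that argmax is taken over the tag classes inside one relation, not across relations. The names `decode_relations` and `pair_scores` are illustrative and do not come from the repository.

```python
def decode_relations(tag_scores):
    """tag_scores maps a relation name to scores over tag classes,
    where index 0 means "no link". Argmax runs per relation, so several
    relations can fire for the same entity pair."""
    fired = []
    for relation, scores in tag_scores.items():
        best_tag = max(range(len(scores)), key=scores.__getitem__)
        if best_tag != 0:
            fired.append(relation)
    return fired


# Hypothetical scores for the pair (Iceland, Reykjavik).
pair_scores = {
    "contain": [0.1, 0.9, 0.0],
    "capital": [0.2, 0.7, 0.1],
    "nationality": [0.8, 0.1, 0.1],
}
print(decode_relations(pair_scores))
```

Under this reading, choosing the largest probability only decides the tag within a relation, so "contain" and "capital" do not compete with each other.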



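Minimum text length in the sample set: 600 CHAR
Maximum text length within the first 10% of the sample set: 5092 CHAR
Maximum text length within the first 20% of the sample set: 6216 CHAR
Maximum text length within the first 30% of the sample set: 6852 CHAR
Maximum text length within the first 40% of the sample set: 7476 CHAR
Maximum text length within the first 50% of the sample set: 8074 CHAR
Maximum text length within the first 60% of the sample set: 9075 CHAR
Maximum text length within the first 70% of the sample set: 10171 CHAR
Maximum text length within the first 80% of the sample set: 11265 CHAR
Maximum text length within the first 90% of the sample set: 12293 CHAR
Maximum text length in the sample set: 18453 CHAR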
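
Statistics of this shape could be produced with a few lines like the following. This is a sketch that assumes "the first p%" means the shortest p% of texts after sorting by length; `length_deciles` is a hypothetical helper, not part of the repository.

```python
def length_deciles(texts):
    """Report min, max, and the longest text inside each cumulative decile."""
    lengths = sorted(len(t) for t in texts)
    report = {"min": lengths[0], "max": lengths[-1]}
    for pct in range(10, 100, 10):
        # Index of the last sample that falls inside the shortest pct%.
        last = max(1, len(lengths) * pct // 100) - 1
        report[f"p{pct}"] = lengths[last]
    return report


samples = ["a" * n for n in range(1, 21)]  # toy texts with lengths 1..20
report = length_deciles(samples)
print(report)
```

A table like this is mainly useful for choosing max_seq_len and sliding_len, since texts far longer than max_seq_len will be split into many windows.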



Hello author, I see that the tplinker-plus code contains event-extraction-related code. Can TPLinker Plus be used for event extraction?



TPLinkerPlus: CUDA error at the training stage after adding special tokens to the BERT tokenizer

Hi author,
I'm trying to add some special tokens to the BERT tokenizer. It works fine with DataBuilder, but a CUDA error occurs at the training stage. I modified the BERT tokenizer like this:

if config["encoder"] == "BERT":
tokenizer = BertTokenizerFast.from_pretrained(config["bert_path"], add_special_tokens = True, do_lower_case = False)
# The special tokens are added here.....
"cm", "mm", "ml", "CM", "MM", "ML", "x", "*",
"Hu", "hu", "HU", "Se", "se", "SE",
"Image", "image", "IMAGE", "Im", "im", "IM"
data_maker = DataMaker4Bert(tokenizer, handshaking_tagger)

Here is the CUDA error output:
/pytorch/aten/src/THC/ indexSelectLargeIndex: block: [127,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/THC/ indexSelectLargeIndex: block: [127,0,0], thread: [65,0,0] Assertion srcIndex < srcSelectDimSize failed
/pytorch/aten/src/THC/ indexSelectLargeIndex: block: [73,0,0], thread: [94,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/THC/ indexSelectLargeIndex: block: [73,0,0], thread: [95,0,0] Assertion srcIndex < srcSelectDimSize failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
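
The `srcIndex < srcSelectDimSize` assertion is typically what an embedding lookup reports on the GPU when a token id is larger than the embedding table. A plausible cause here, stated as an assumption, is that tokens were added to the tokenizer while the encoder's embedding matrix kept its original size. With Hugging Face Transformers the usual remedy is `model.resize_token_embeddings(len(tokenizer))`; where the encoder object is reachable in this code base is not confirmed here. The pure-Python check below (hypothetical helper and toy numbers) shows how the problem could be detected on the CPU before training:

```python
def out_of_range_ids(batch_input_ids, num_embeddings):
    """Return the sorted token ids that would overflow an embedding table
    with `num_embeddings` rows."""
    return sorted({tok for seq in batch_input_ids for tok in seq
                   if tok < 0 or tok >= num_embeddings})


# Toy numbers: a table with 21128 rows (hypothetical vocabulary size) and a
# batch in which two ids come from newly added tokens.
vocab_rows = 21128
batch = [[101, 2769, 21128, 102], [101, 21130, 102]]
print(out_of_range_ids(batch, vocab_rows))
```

If the list is non-empty, the ids it contains point to the tokens the embedding layer cannot look up.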

implementation questions

Thanks for the good idea and paper. I have some questions about TPLinkerPlus:

  1. I see you use logsum to calculate the loss, is it better than BCE?
  2. Is there any reference/paper for "conditional" layernorm which is used for "cln" shaking_type?
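
For comparison, here is a small sketch of the two losses, assuming that "logsum" refers to the log-sum-exp style multi-label loss, log(1 + Σ_neg e^{s_i}) + log(1 + Σ_pos e^{-s_j}). The function names are illustrative, and the code makes no claim about which loss performs better.

```python
import math


def logsumexp_multilabel_loss(scores, labels):
    """log(1 + sum over negatives of e^s) + log(1 + sum over positives of e^-s)."""
    neg_terms = [math.exp(s) for s, y in zip(scores, labels) if y == 0]
    pos_terms = [math.exp(-s) for s, y in zip(scores, labels) if y == 1]
    return math.log(1.0 + sum(neg_terms)) + math.log(1.0 + sum(pos_terms))


def bce_loss(scores, labels):
    """Mean binary cross-entropy over independent sigmoid outputs."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)


labels = [1, 0, 0, 0, 0]          # one positive among several negatives
scores = [2.0, -3.0, -3.0, -3.0, -3.0]
print(logsumexp_multilabel_loss(scores, labels), bce_loss(scores, labels))
```

The log-sum-exp form compares all positive scores against all negative scores jointly, whereas BCE treats each label independently; that difference is often given as the motivation when positives are rare.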







Also, it seems to me that training, evaluation, and prediction all operate on tokens, and that sub_char_span and obj_char_span play no role. Can they be dropped?

Evaluating other datasets

Hi @131250208 !

I am interested in evaluating other datasets (such as conll04 and scierc, as evaluated by SpERT) and was wondering which other settings I would have to change in your model to try them. So far, for conll04, the only change I made was setting match_pattern to "whole_text". I wrote a script to convert the conll04 dataset to CasRel's format, so that I could use your build_data script to convert it to your format. As far as I can see, everything looks good with this procedure.
Since conll04 is a much smaller dataset (931 training sentences compared to nyt's 56k sentences, and around 200 validation/test sentences), I also considered changing loss_weight_recover_steps to 100. With a batch size of 12, there are only 79 training steps per epoch.
For nyt_star I was able to get good results with your model without changing anything. For conll04, however, I had no success: after 20 epochs the scores are still 0. For scierc I get similar results, with the difference that the sample accuracies (ent_seq, head_rel, tail_rel) stay around 30%.
What else do you recommend changing for these benchmarks?




I see that the two losses are weighted differently during training: loss_weights = {"ent": w_ent, "rel": w_rel}. w_ent is large at the beginning and then gradually decreases.
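
A schedule with this behaviour can be sketched as follows. The formula is a reconstruction and should be treated as an assumption; the training script is the authority on the exact expression. `num_rels=24` is used only as an illustrative value.

```python
def loss_weights(current_step, recover_steps, num_rels):
    """Entity weight starts high and decays; relation weight ramps up.
    Both settle at their steady-state share after `recover_steps` steps."""
    z = 2 * num_rels + 1             # one entity sequence plus two per relation
    total_steps = recover_steps + 1  # +1 avoids division by zero
    progress = current_step / total_steps
    w_ent = max(1 / z + 1 - progress, 1 / z)
    w_rel = min((num_rels / z) * progress, num_rels / z)
    return {"ent": w_ent, "rel": w_rel}


for step in (0, 3000, 6001):
    print(step, loss_weights(step, recover_steps=6000, num_rels=24))
```

On a small dataset with few steps per epoch, a large loss_weight_recover_steps means the relation loss stays down-weighted for many epochs, which is one reason the value may need to be lowered.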





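Hello, I am using your TPLinker (not Plus) to run experiments on WebNLG, but after more than 100 epochs the head and tail rel acc are still 0. I saw in your replies to other issues that the batch_size is 6 and the number of epochs is 200. Roughly how many epochs does it take on this dataset before head_rel and tail_rel improve noticeably?
#24 is about NYT, but my NYT results are quite good; only WebNLG looks wrong, so I opened a separate issue.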

common = {
    "exp_name": "webnlg",
    "rel2id": "rel2id.json",
    "device_num": 0,
#     "encoder": "BiLSTM",
    "encoder": "BERT", 
    "hyper_parameters": {
        "shaking_type": "cat", # cat, cat_plus, cln, cln_plus; Experiments show that cat/cat_plus work better with BiLSTM, while cln/cln_plus work better with BERT. The results in the paper are produced by "cat". So, if you want to reproduce the results, "cat" is enough, no matter for BERT or BiLSTM.
        "inner_enc_type": "lstm", # valid only if cat_plus or cln_plus is set. It is the way how to encode inner tokens between each token pairs. If you only want to reproduce the results, just leave it alone.
        "dist_emb_size": -1, # -1: do not use distance embedding; other number: need to be larger than the max_seq_len of the inputs. set -1 if you only want to reproduce the results in the paper.
        "ent_add_dist": False, # set true if you want add distance embeddings for each token pairs. (for entity decoder)
        "rel_add_dist": False, # the same as above (for relation decoder)
        "match_pattern": "whole_text", # only_head_text (nyt_star, webnlg_star), whole_text (nyt, webnlg), only_head_index, whole_span
common["run_name"] = "{}+{}+{}".format("TP1", common["hyper_parameters"]["shaking_type"], common["encoder"]) + ""

run_id = ''.join(random.sample(string.ascii_letters + string.digits, 8))
train_config = {
    "train_data": "train_data.json",
    "valid_data": "valid_data.json",
    "rel2id": "rel2id.json",
    "logger": "wandb", # if wandb, comment the following four lines
#     # if logger is set as default, uncomment the following four lines
#     "logger": "default", 
#     "run_id": run_id,
#     "log_path": "./default_log_dir/default.log",
#     "path_to_save_model": "./default_log_dir/{}".format(run_id),

    # only save the model state dict if F1 score surpasses <f1_2_save>
    "f1_2_save": 0, 
    # whether to train from scratch
    "fr_scratch": True,
    # write down notes here if you want, it will be logged 
    "note": "start from scratch",
    # if not fr scratch, set a model_state_dict
    "model_state_dict_path": "",
    "hyper_parameters": {
        "batch_size": 24,
        "epochs": 200,
        "seed": 2333,
        "log_interval": 10,
        "max_seq_len": 100,
        "sliding_len": 20,
        "loss_weight_recover_steps": 6000, # to speed up the training process, the loss of EH-to-ET sequence is set higher than other sequences at the beginning, but it will recover in <loss_weight_recover_steps> steps.
        "scheduler": "CAWR", # Step

eval_config = {
    "model_state_dict_dir": "./default_log_dir", # if use wandb, set "./wandb", or set "./default_log_dir" if you use default logger
    "run_ids": ["DGKhEFlH", ],
    "last_k_model": 1,
    "test_data": "*test*.json", # "*test*.json"
    # where to save results
    "save_res": False,
    "save_res_dir": "../results",
    # score: set true only if test set is annotated with ground truth
    "score": True,
    "hyper_parameters": {
        "batch_size": 32,
        "force_split": False,
        "max_test_seq_len": 512,
        "sliding_len": 50,


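Is there a training log I could refer to? I trained TPLinker on the NYT dataset for 100 epochs; entity accuracy is high, but t_head_rel_sample_acc and t_tail_rel_sample_acc never improve. Roughly how many epochs does it take to see a significant change? Also, is t_head_rel_sample_acc no longer shown when training tplinker-plus? There is only a single t_sample_acc.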




Testing on NYT

Hi! Since the NYT data is built through distant supervision, I guess it must be noisy, so using it as a test set would not give a reliable evaluation. How, then, do you evaluate on NYT? Is part of NYT labelled manually?



The metrics at the training stage differ from those at the evaluation stage

At the training stage I get a rel_f1 of 0.52 and an ent_f1 of 0.73 at the point where a checkpoint was saved; however, with that saved checkpoint the f1 values on my test set are only 0.25 and 0.5.

I tried replacing the test set with the same valid set that I used during training, but the evaluation scores are still much lower than the training metrics (rel_f1 = 0.3, ent_f1 = 0.55).

What's more, the file creation time of the checkpoint exactly matches the wandb wall time at which this checkpoint was saved, so it is unlikely that I'm using the wrong checkpoint.

How do you get accurate precision, recall, and F1 scores at the evaluation stage? Thanks.

Question about layernorm?

line 86 in
std = (variance + self.epsilon) **2
Is this a bug? I think it should be:
std = (variance + self.epsilon) ** 0.5
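
A quick numeric check of the two exponents on a toy vector, written in plain Python. The function names are illustrative; only the `(variance + epsilon) ** exponent` expression mirrors the line quoted above.

```python
def layer_norm(values, epsilon=1e-12, exponent=0.5):
    """Normalize `values` by (variance + epsilon) ** exponent."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = (variance + epsilon) ** exponent
    return [(v - mean) / std for v in values]


def variance_of(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


x = [1.0, 2.0, 3.0, 4.0, 10.0]
var_sqrt = variance_of(layer_norm(x, exponent=0.5))
var_squared = variance_of(layer_norm(x, exponent=2))
print(var_sqrt, var_squared)
```

Only the square-root version yields outputs with unit variance, which is what layer normalization is meant to produce; the squared version rescales the outputs by a factor that depends on the input variance.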

the eval result

'time': 34.05615592002869,
'val_ent_seq_acc': 0.8174603375650588,
'val_f1': 0.7436676797886321,
'val_head_rel_acc': 0.39087302432883353,
'val_prec': 0.8524970963994364,
'val_recall': 0.659478885893921,
'val_tail_rel_acc': 0.40277778587880586

Current avf_f1: 0.7436676797886321, Best f1: 0.7436676797886321
I would like to know how prec and recall are computed.
Are ent_seq_acc, head_rel_acc, and tail_rel_acc taken into account?
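
Metrics of this kind are commonly computed by exact matching between the set of predicted triples and the set of gold triples. The sketch below assumes that convention (the function `prf` and the toy triples are illustrative) and also checks that the reported val_f1 is consistent with the harmonic mean of the reported val_prec and val_recall.

```python
def prf(pred_triples, gold_triples):
    """Exact-match precision, recall, and F1 over sets of triples."""
    pred, gold = set(pred_triples), set(gold_triples)
    correct = len(pred & gold)
    prec = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * prec * recall / (prec + recall) if prec + recall else 0.0
    return prec, recall, f1


gold = {("Iceland", "capital", "Reykjavik"), ("Iceland", "contain", "Reykjavik")}
pred = {("Iceland", "capital", "Reykjavik"), ("Iceland", "capital", "Oslo")}
print(prf(pred, gold))

# Harmonic mean of the val_prec and val_recall values reported above.
p, r = 0.8524970963994364, 0.659478885893921
f1_from_log = 2 * p * r / (p + r)
print(f1_from_log)
```

Under this convention the per-sequence accuracies (ent_seq_acc, head_rel_acc, tail_rel_acc) would be separate diagnostics that do not enter prec, recall, or F1.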



"text": "急诊胸部CT:临床提示:胸闷头痛3天扫描层厚:5mm影像所示:两下肺少许渗出,两侧胸腔微量积液。无 明显气管、支气管异物;无 明显食管异物;无 气胸、液气胸征象;无 明显纵隔气肿、占位;无 明显心脏、大血管形态改变,无 明显心包积液。(所示肋骨)无 明显肋骨错位性骨折。",
"triple_list": [


exp_name: deepwise # nyt_star, nyt, webnlg_star, webnlg, ace05_lu
data_in_dir: ../datasets/ori_data
ori_data_format: casrel # casrel (webnlg_star, nyt_star), etl_span (webnlg), raw_nyt (nyt), tplinker (see readme)

encoder: BERT
bert_path: ../../pretrained_models/chinese-bert-wwm-ext-hit-pytorch-huggingface
data_out_dir: ../datasets/train_data/debugging

add_char_span: true
ignore_subword: true
separate_char_by_white: false
check_tok_span: true



Unable to save checkpoint with TPLinkerPlus

Checkpoints were saved successfully with my previous datasets and with NYT_star, which contain thousands of entities and relations. However, last week I tried to apply TPLinkerPlus to a new Chinese dataset that contains no relations and in which every text is shorter than 20 chars. Although the scores kept improving, no checkpoint files appeared in the wandb folder.

My debugging

  1. Initially I thought it was caused by wandb, then I moved to the default logger, but still no checkpoints were saved.
  2. After that, I guessed the bug was caused by my dataset that contains no relations; therefore, I randomly added two or three relations into my training set, sadly it did nothing and I got no checkpoint files saved.
  3. I switched the dataset to my previous ones, the checkpoints are normally saved as the performance is improved while training.

Training parameters
"hyper_parameters": {
"batch_size": 28, # 32
"epochs": 1000,
"seed": 2333,
"log_interval": 10,
"max_seq_len": 80, # 128
"sliding_len": 20, #
"scheduler": "CAWR", # Step
"ghm": False, # set True if you want to use GHM to adjust the weights of gradients; this will speed up the training process and might improve the results. (Note that ghm in the current version is unstable and may hurt the results)
"tok_pair_sample_rate": 1, # (0, 1] The fraction of token pairs you want to sample for training; this would slow down the training if set to less than 1. It is only helpful when your GPU memory is not enough for the training.


{'rels': [{'id': '0', 'text': 'US Election 2020: Trump COVID-19 positive: What next?', 'relation_list': [{'subject': 'US', 'object': 'Trump', 'subj_tok_span': [0, 1], 'obj_tok_span': [4, 5], 'subj_char_span': [0, 2], 'obj_char_span': [18, 23], 'predicate': '/location/location/contains'}]}]}
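Judging from this example, the model apparently treats Trump as a place name. Even if the model correctly recognized Trump as a person before predicting the relation, it would still be hard for it to predict a president relation.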


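I tried both tplinker and plus on the same Chinese dataset. tplinker reaches an F1 of 74, but the rel F1 of plus is only 54; its acc is high at 86, while recall is only 39. What might be causing this?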


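When I use the tplinker-plus model code, CUDA runs out of memory even though I am using the largest GPU I have, a GeForce RTX 3090. The tplinker model does not raise this error. Is there any way to solve this problem with the tplinker-plus model?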

Did you apply multiple EH-ETs in TPLinker-plus?

Hi dear author, your project is helping us a lot, thanks again! TPLinker Plus is a really strong model, especially for data with many nested entities and complicated relations.

I notice that a single word or word group can be predicted as two entities at the same time, e.g. "orange" -> fruit and "orange" -> color.

What makes it possible to tag more than one entity label on the same char_span? As I understand it, different entity labels are represented by different numbers in TPLinker's EH-ET, which means it is not possible to have more than one entity type on the same span (it must be orange -> fruit or orange -> color, but not both at the same time).

Did you apply multiple EH-ETs in TPLinker-plus, with each EH-ET representing one entity type?


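Hello author, I took part in the 2021 Language and Intelligence Challenge. I really like the design of the tplinker model, so I tested tplinker plus on the duie relation extraction task.
There are about 170k training samples and I used the bert base model, but the final result was only 86.67/55.61/67.75. The training log is as follows: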

{'val_shaking_tag_acc': 0.44333228247162676, 'val_rel_prec': 0.7269814343605692, 'val_rel_recall': 0.6981153419760628, 'val_rel_f1': 0.7122560391486524, 'val_ent_prec': 0.7853231106243155, 'val_ent_recall': 0.8241971405991809, 'val_ent_f1': 0.8042906719944524, 'time': 1675.9016613960266}


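1. I found that precision is much higher than recall, and val_shaking_tag_acc is not high either. What do you think the cause might be, and is there any way to improve it? Thanks.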

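My guess is that the problem is matrix sparsity: a sentence of length 100 unfolds into 5050 spans, but only about 10% of them are valid, and the other 90% of invalid spans seem to interfere with the model's learning.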
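
The 5050 figure is n(n+1)/2 for n = 100, that is, the number of token pairs (i, j) with i <= j. A minimal sketch that enumerates them (`handshaking_pairs` is an illustrative name):

```python
def handshaking_pairs(seq_len):
    """All (start, end) token pairs with start <= end."""
    return [(i, j) for i in range(seq_len) for j in range(i, seq_len)]


n = 100
pairs = handshaking_pairs(n)
print(len(pairs), n * (n + 1) // 2)
```

Because the count grows quadratically with sequence length, both the sparsity of positive tags and the memory footprint grow quickly as max_seq_len increases.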


"tok_pair_sample_rate": 1, # (0, 1] The fraction of token pairs you want to sample for training,






'text': '李治即位后,萧淑妃受宠,王皇后为了排挤萧淑妃,答应李治让身在感业寺的武则天续起头发,重新纳入后宫'


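My question is whether, when building the training set, it is necessary to consider every position of s and o. As shown below, four spo triples can be built, and I am not sure whether training this way causes problems: if all four are used in training, the model also needs to be able to predict all four at inference time, which might make prediction harder.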


{'predicate': '妻子', 'subject': '李治', 'subj_tok_span': [0, 2], 'object': '萧淑妃', 'obj_tok_span': [6, 9]},
{'predicate': '妻子', 'subject': '李治', 'subj_tok_span': [0, 2], 'object': '萧淑妃', 'obj_tok_span': [19, 22]},
{'predicate': '妻子', 'subject': '李治', 'subj_tok_span': [25, 27], 'object': '萧淑妃', 'obj_tok_span': [6, 9]},
{'predicate': '妻子', 'subject': '李治', 'subj_tok_span': [25, 27], 'object': '萧淑妃', 'obj_tok_span': [19, 22]}
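
The four combinations above can be enumerated mechanically as the Cartesian product of all subject occurrences and all object occurrences. The sketch assumes one token per character, which matches the spans listed above; `find_spans` and `all_spo` are hypothetical helpers.

```python
from itertools import product


def find_spans(text, mention):
    """Return [start, end) spans of every occurrence of `mention` in `text`."""
    spans, start = [], text.find(mention)
    while start != -1:
        spans.append([start, start + len(mention)])
        start = text.find(mention, start + 1)
    return spans


def all_spo(text, subject, predicate, obj):
    """Pair every subject occurrence with every object occurrence."""
    return [
        {"predicate": predicate, "subject": subject, "subj_tok_span": s,
         "object": obj, "obj_tok_span": o}
        for s, o in product(find_spans(text, subject), find_spans(text, obj))
    ]


text = "李治即位后,萧淑妃受宠,王皇后为了排挤萧淑妃,答应李治让身在感业寺的武则天续起头发,重新纳入后宫"
spo_list = all_spo(text, "李治", "妻子", "萧淑妃")
for spo in spo_list:
    print(spo)
```

Whether to keep all four or only one combination (for example, the closest subject and object pair) is a labelling decision; the code only shows how the full set is obtained.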


Original sentence: "●1981年2月27日", split directly into single characters: ['●', '1', '9', '8', '1', '年', '2', '月', '2', '7', '日']





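I have a question. I looked at the relation categories in the nyt training data and did some simple counting. The number of samples per relation is imbalanced, in some cases by a large margin. Did you handle this in any particular way, and does it affect the model's actual performance?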

        {'/business/company/advisors': 44,
         '/business/company/founders': 812,
         '/business/company/industry': 2,
         '/business/company/major_shareholders': 283,
         '/business/company/place_founded': 409,
         '/business/company_shareholder/major_shareholder_of': 283,
         '/business/person/company': 5614,
         '/location/administrative_division/country': 7303,
         '/location/country/administrative_divisions': 7303,
         '/location/country/capital': 8366,
         '/location/location/contains': 54669,
         '/location/neighborhood/neighborhood_of': 6018,
         '/people/deceased_person/place_of_death': 1964,
         '/people/ethnicity/geographic_distribution': 58,
         '/people/ethnicity/people': 22,
         '/people/person/children': 453,
         '/people/person/ethnicity': 22,
         '/people/person/nationality': 8599,
         '/people/person/place_lived': 6887,
         '/people/person/place_of_birth': 3098,
         '/people/person/profession': 2,
         '/people/person/religion': 69,
         '/sports/sports_team/location': 328,
         '/sports/sports_team_location/teams': 328})

What causes weird entity char spans in predictions?

Hi there, I notice some really weird entity char spans in my model's predictions. For example, the text is only 250 characters long, but some entities have a span of [3510, 3512], which makes no sense; an entity's char span can also be predicted as [0, 0], which covers nothing. The ent_f1 and rel_f1 of my model are 0.81 and 0.78.

What could cause these weird entity spans? Thank you.

There are no token span errors in the preprocessing and training phases.
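
To locate such predictions, a post-hoc check like the one below could be run over the saved results. It does not explain the cause; it only flags entities whose span falls outside the text or does not match the entity string. The field names `text` and `char_span` are assumptions about the result format.

```python
def invalid_entity_spans(text, entities):
    """Return the entities whose char_span is empty, out of bounds,
    or inconsistent with the entity text."""
    bad = []
    for ent in entities:
        start, end = ent["char_span"]
        in_bounds = 0 <= start < end <= len(text)
        if not in_bounds or text[start:end] != ent["text"]:
            bad.append(ent)
    return bad


sample_text = "Reykjavik is the capital of Iceland."
predicted = [
    {"text": "Iceland", "char_span": [28, 35]},
    {"text": "", "char_span": [0, 0]},
    {"text": "??", "char_span": [3510, 3512]},
]
flagged = invalid_entity_spans(sample_text, predicted)
print(flagged)
```

Grouping the flagged entities by sample id can show whether they cluster in texts that were split by the sliding window, which would narrow down where the offsets go wrong.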
The evaluation config is as follows:

eval_config = {
"model_state_dict_dir": "./wandb", # if use wandb, set "./wandb", or set "./default_log_dir" if you use default logger
"run_ids": ["2n26hvto", ],
"last_k_model": 1,
"test_data": "test.json",

"save_res": True,  
"save_res_dir": "../datasets/result_data",

"score": True,

"hyper_parameters": {
    "batch_size": 32, 
    "force_split": False,
    "max_seq_len": 240,
    "sliding_len": 50,


