
tener's Introduction

TENER: Adapting Transformer Encoder for Named Entity Recognition

This is the code for the paper TENER.

TENER (Transformer Encoder for Named Entity Recognition) is a Transformer-based model for the NER task. Compared with the naive Transformer, we found that relative position embeddings are quite important for NER. Experiments on English and Chinese NER datasets demonstrate its effectiveness.

Requirements

This project requires the natural language processing Python package fastNLP. You can install it with the following command:

pip install fastNLP

Run the code

(1) Prepare the English dataset.

Conll2003

Your file should look like the following (the first token on a line is the word, the last token is the NER tag):

LONDON NNP B-NP B-LOC
1996-08-30 CD I-NP O

West NNP B-NP B-MISC
Indian NNP I-NP I-MISC
all-rounder NN I-NP O
Phil NNP I-NP B-PER

OntoNotes

We suggest using the following code to prepare your data: OntoNotes-5.0-NER. Alternatively, you can prepare the data in the Conll2003 style and then replace OntoNotesNERPipe with Conll2003NERPipe in the code.
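For reference, a minimal sketch of how such a file can be loaded with fastNLP (the path is a placeholder and the exact keyword arguments follow fastNLP's documentation; the training scripts do the equivalent internally):

from fastNLP.io import Conll2003NERPipe  # use OntoNotesNERPipe for OntoNotes-style data

# Hedged sketch: load a CoNLL-2003-style file into a fastNLP DataBundle.
# 'path/to/conll2003' is a placeholder for your own data directory.
data_bundle = Conll2003NERPipe(encoding_type='bioes').process_from_file('path/to/conll2003')
print(data_bundle)  # shows the train/dev/test datasets and the vocabularies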

For the English datasets, we use the GloVe 100d pretrained embeddings; fastNLP will download them automatically.
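A hedged sketch of what this looks like with fastNLP's StaticEmbedding (it assumes the data_bundle from the snippet above and that the word vocabulary is stored under 'words'; the pretrained-model name follows fastNLP's naming and may need to be checked against its documentation):

from fastNLP.embeddings import StaticEmbedding

# GloVe 100d embedding; fastNLP downloads the weights on first use.
embed = StaticEmbedding(data_bundle.get_vocab('words'), model_dir_or_name='en-glove-6b-100d')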

You can run the models with the following commands (make sure you have changed the data path):

python train_tener_en.py --dataset conll2003

or

python train_tener_en.py --dataset en-ontonotes

Although we tried hard to make sure you can reproduce our results, the results may still disappoint you. This is usually because the best dev performance does not correlate well with the test performance. Several runs should help.

The ELMo version (fastNLP will download the ELMo weights automatically; you only need to change the data path in train_elmo_en.py):

python train_elmo_en.py --dataset en-ontonotes

(2) Prepare the Chinese datasets: MSRA, OntoNotes 4.0, Weibo, Resume

Your data should have only two columns: the first is the character, the second is the tag, as in the following example:

口 O
腔 O
溃 O
疡 O
加 O
上 O

For the Chinese datasets, you can download the pretrained unigram and bigram embeddings from Baidu Cloud: download 'gigaword_chn.all.a2b.uni.iter50.vec' and 'gigaword_chn.all.a2b.bi.iter50.vec', then replace the embedding paths in train_tener_cn.py.
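A hedged sketch of what "replace the embedding paths" amounts to (the file paths are placeholders; the data bundle built in train_tener_cn.py provides the 'chars' and 'bigrams' vocabularies):

from fastNLP.embeddings import StaticEmbedding

# Point StaticEmbedding at the downloaded .vec files instead of a pretrained name.
unigram_embed = StaticEmbedding(data_bundle.get_vocab('chars'),
                                model_dir_or_name='path/to/gigaword_chn.all.a2b.uni.iter50.vec')
bigram_embed = StaticEmbedding(data_bundle.get_vocab('bigrams'),
                               model_dir_or_name='path/to/gigaword_chn.all.a2b.bi.iter50.vec')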

You can run the code with the following command:

python train_tener_cn.py --dataset ontonotes


tener's Issues

CUDA out-of-memory problem

Using a Chinese dataset (train.txt is about 50 MB, BIO tagging scheme), I only changed the data path, the model path, and BMES to BIO; nothing else was modified. Running python train_tener_cn.py --dataset ontonotes reports an error even with batch_size=1:
RuntimeError: CUDA out of memory. Tried to allocate 57.83 GiB (GPU 0; 8.00 GiB total capacity
The same problem occurs on both Windows and Linux.
torch=1.5.1
fastnlp=0.5.5
prettytable=0.7.2

Infinite recursion when reading data through the batch module

Hello, author!

While trying to reproduce the results on the Weibo data, I ran into infinite recursion in DataSetGetter's __getattr__; rewriting it with a dict-based lookup did not help either. Has the program been tested on Windows? I am using fastNLP 0.5.5 and have tried both torch 1.1 and 1.5.1, neither of which solves the problem.

D:\Anaconda\envs\torch_env\python.exe E:/BoXiao/TENER-master/train_tener_cn.py --dataset weibo
Read cache from caches/weibo_transformer_bmeso_True.pkl.
In total 3 datasets:
train has 1350 instances.
dev has 270 instances.
test has 270 instances.
In total 3 vocabs:
chars has 3356 entries.
bigrams has 42184 entries.
target has 29 entries.

input fields after batch(if batch size is 2):
target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
chars: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
bigrams: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
target fields after batch(if batch size is 2):
target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])

training epochs started 2020-12-07-18-47-38-732204
Epoch 1/100: 0%| | 0/8500 [00:00<?, ?it/s, loss:{0:<6.5f}]Read cache from caches/weibo_transformer_bmeso_True.pkl.
In total 3 datasets:
train has 1350 instances.
dev has 270 instances.
test has 270 instances.
In total 3 vocabs:
chars has 3356 entries.
bigrams has 42184 entries.
target has 29 entries.

Traceback (most recent call last):
File "", line 1, in
File "D:\Anaconda\envs\torch_env\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\Anaconda\envs\torch_env\lib\multiprocessing\spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 100, in getattr
if hasattr(self.dataset, item):
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 100, in getattr
if hasattr(self.dataset, item):
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 100, in getattr
if hasattr(self.dataset, item):
[Previous line repeated 662 more times]
RecursionError: maximum recursion depth exceeded
Read cache from caches/weibo_transformer_bmeso_True.pkl.
In total 3 datasets:
train has 1350 instances.
dev has 270 instances.
test has 270 instances.
In total 3 vocabs:
chars has 3356 entries.
bigrams has 42184 entries.
target has 29 entries.

Traceback (most recent call last):
File "E:/BoXiao/TENER-master/train_tener_cn.py", line 130, in
trainer.train(load_best_model=False)
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\trainer.py", line 618, in train
raise e
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\trainer.py", line 611, in train
self._train()
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\trainer.py", line 658, in _train
for batch_x, batch_y in self.data_iterator:
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 267, in iter
for indices, batch_x, batch_y in self.dataiter:
File "D:\Anaconda\envs\torch_env\lib\site-packages\torch\utils\data\dataloader.py", line 193, in iter
return _DataLoaderIter(self)
File "D:\Anaconda\envs\torch_env\lib\site-packages\torch\utils\data\dataloader.py", line 493, in init
Traceback (most recent call last):
File "", line 1, in
self._put_indices()
File "D:\Anaconda\envs\torch_env\lib\site-packages\torch\utils\data\dataloader.py", line 591, in _put_indices
File "D:\Anaconda\envs\torch_env\lib\multiprocessing\spawn.py", line 105, in spawn_main
indices = next(self.sample_iter, None)
File "D:\Anaconda\envs\torch_env\lib\site-packages\torch\utils\data\sampler.py", line 172, in iter
exitcode = _main(fd)
File "D:\Anaconda\envs\torch_env\lib\multiprocessing\spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 100, in getattr
if hasattr(self.dataset, item):
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 100, in getattr
if hasattr(self.dataset, item):
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 100, in getattr
if hasattr(self.dataset, item):
[Previous line repeated 662 more times]
RecursionError: maximum recursion depth exceeded
for idx in self.sampler:
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 121, in iter
return iter(self.sampler(self.dataset))
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\sampler.py", line 79, in call
seq_lens = data_set.get_all_fields()[self.seq_len_field_name].content
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 101, in getattr
return object.getattr(self.dataset, item)
AttributeError: type object 'object' has no attribute 'getattr'

Looking forward to your reply. Many thanks.

Question about the attention score computation

How should I understand E_ at line 127 of relative_transformer.py? The attention score computation in the paper does not seem to include this term.

How to understand the function _transpose_shift

A few questions about the _transpose_shift function in relative_transformer.py:
(1) What transformation of the matrix does it implement (e.g., a shift, a rotation)?
(2) How does it achieve this transformation?
(3) Why does indice in the third-to-last line select only the odd-numbered rows?
(4) How does _transpose_shift differ from _shift, and how are the two related?
Many thanks.


def _transpose_shift(self, E):
    # E: [bsz, n_head, max_len, 2*max_len], e.g. [2, 4, 68, 136]
    bsz, n_head, max_len, _ = E.size()
    zero_pad = E.new_zeros(bsz, n_head, max_len, 1)
    # pad one zero column, then reinterpret the flattened rows: [bsz, n_head, 2*max_len+1, max_len]
    E = torch.cat([E, zero_pad], dim=-1).view(bsz, n_head, -1, max_len)
    indice = (torch.arange(max_len) * 2 + 1).to(E.device)  # the odd-numbered rows: [1, 3, 5, ..., 2*max_len-1]
    E = E.index_select(index=indice, dim=-2).transpose(-1, -2)
    return E
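Not an answer from the repository, but one way to see what the function does is to run a standalone copy of it on a tiny tensor whose entries encode their own (row, column) position; the comments below only describe the observed behaviour.

import torch

# Standalone copy of _transpose_shift (taken from relative_transformer.py above),
# with self removed so it can be run in isolation.
def transpose_shift(E):
    bsz, n_head, max_len, _ = E.size()
    zero_pad = E.new_zeros(bsz, n_head, max_len, 1)
    E = torch.cat([E, zero_pad], dim=-1).view(bsz, n_head, -1, max_len)
    indice = (torch.arange(max_len) * 2 + 1).to(E.device)
    return E.index_select(index=indice, dim=-2).transpose(-1, -2)

# Encode the position into each entry: E[0, 0, i, j] = 10*i + j.
max_len = 4
rows = torch.arange(max_len).view(1, 1, max_len, 1).float()
cols = torch.arange(2 * max_len).view(1, 1, 1, 2 * max_len).float()
E = 10 * rows + cols  # shape [1, 1, 4, 8]

out = transpose_shift(E)
print(out[0, 0])
# Each output entry (i, j) holds the input value from row j, column max_len + i - j,
# i.e. for every pair (i, j) the column corresponding to the relative offset i - j is picked out.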

Can this be used on a non-CONLL-2003 data format?

As above, can the TENER preprocessing be done on a dataset that does not follow the CONLL-2003 format? My dataset does not use the BIO tagging scheme; the sentences look like this (a possible conversion is sketched after the example):

sentence = ['Hi', 'I', 'study', 'in', 'China', 'and', 'work' , 'in', 'ABC']
tag = ['O', 'O', 'O', 'O', 'Country', 'O', 'O', 'O', 'Company']
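One option (a hedged sketch, not part of the repository) is to convert such token/tag lists into the two-column BIO format shown in the README before running the preprocessing:

# Hypothetical converter: parallel token/tag lists, with a bare entity type per
# token and no BIO prefixes, are turned into "word B-/I-/O" lines.
def to_bio_lines(tokens, tags):
    lines, prev = [], 'O'
    for tok, tag in zip(tokens, tags):
        if tag == 'O':
            bio = 'O'
        elif tag == prev:
            bio = 'I-' + tag   # continuation of the current entity
        else:
            bio = 'B-' + tag   # start of a new entity
        lines.append(tok + ' ' + bio)
        prev = tag
    return lines

sentence = ['Hi', 'I', 'study', 'in', 'China', 'and', 'work', 'in', 'ABC']
tag = ['O', 'O', 'O', 'O', 'Country', 'O', 'O', 'O', 'Company']
print('\n'.join(to_bio_lines(sentence, tag)))  # one sentence; separate sentences with a blank line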

How to split the MSRA dataset?

Thanks very much for sharing the code! When I tried to reproduce the MSRA result, I found the MSRA dataset only contains train and test sets. How do you split the train set into train and dev sets? The paper does not seem to mention it. I would be grateful for a reply.

How to convert a Chinese dataset to the train.char.bmes format

Hello, my dataset has already been processed into the following form:
口 O
腔 O
溃 O
疡 O
加 O
上 O
How can I further process it into the 'train.char.bmes' format so that the model can run?
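A hedged sketch (not from the repository) of one way to turn BIO tags into the BMES(O) scheme used by files such as train.char.bmes; the converted tag goes into the second column of the same two-column file:

# Convert a list of BIO tags for one sentence into BMES(O) tags:
# single-token entities become S-, entity starts B-, insides M-, ends E-.
def bio_to_bmes(tags):
    bmes = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            bmes.append('O')
            continue
        prefix, label = tag.split('-', 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else 'O'
        ends_here = nxt != 'I-' + label
        if prefix == 'B':
            bmes.append(('S-' if ends_here else 'B-') + label)
        else:  # 'I'
            bmes.append(('E-' if ends_here else 'M-') + label)
    return bmes

print(bio_to_bmes(['B-LOC', 'I-LOC', 'O', 'B-PER']))
# ['B-LOC', 'E-LOC', 'O', 'S-PER']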

Data download

from fastNLP.io.loader import WeiboNERLoader
save_path = WeiboNERLoader().download()
print(save_path)  # show where the downloaded files are stored locally

Found a small bug

In transformer.py under the modules folder, the linear layer that produces the q, k, v matrices should have bias=False, shouldn't it?

class MultiHeadAttn(nn.Module):
    def __init__(self, d_model, n_head, dropout=0.1, scale=False):
        super().__init__()
        assert d_model%n_head==0

        self.n_head = n_head
        self.qkv_linear = nn.Linear(d_model, 3*d_model)
        self.fc = nn.Linear(d_model, d_model)

In relative_transformer.py, the qv projection does use bias=False:

class RelativeMultiHeadAttn(nn.Module):
    def __init__(self, d_model, n_head, dropout, r_w_bias=None, r_r_bias=None, scale=False):
        super().__init__()
        self.qv_linear = nn.Linear(d_model, d_model * 2, bias=False)
        self.n_head = n_head
        self.head_dim = d_model // n_head
        self.dropout_layer = nn.Dropout(dropout)

TENER vs BERT

Hi, have you compared the adapted Transformer with BERT, where pre-trained knowledge might make up for the drawbacks of the vanilla Transformer?

How to map indices back to labels with fastNLP

Hello, I have been reading the code over the last couple of days and saw where the characters and tags are converted to indices. If I want to obtain the predicted BIO labels, how do I map the indices back?
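A hedged, self-contained sketch of how this can be done with fastNLP's Vocabulary (in the training scripts the tag vocabulary comes from data_bundle.get_vocab('target'); the tags and indices below are made up for illustration):

from fastNLP import Vocabulary

# A vocabulary keeps the mapping in both directions, so predicted indices can
# be turned back into tag strings with to_word().
tag_vocab = Vocabulary(padding=None, unknown=None)
tag_vocab.add_word_lst(['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC'])

pred_ids = [1, 2, 0, 3]  # example model output
print([tag_vocab.to_word(i) for i in pred_ids])  # e.g. ['B-PER', 'I-PER', 'O', 'B-LOC'] with this insertion order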

Where can I find files in the train.all.bmes format?

Hello, I am studying your TENER code and noticed that it expects files such as WeiboNER/train.all.bmes, which I am missing. I looked at the Weibo annotations but could not find anything labeled this way.
The annotation shown in the introduction looks like this:
科 O
技 O
全 O
方 O
位 O
资 O
讯 O
智 O

But what is BMES? How do I obtain it, or how do I annotate data in this scheme?
Thanks.

Hello, your work is very nice. I want to cite it.

@misc{yan2019tener,
    title={TENER: Adapting Transformer Encoder for Named Entity Recognition},
    author={Hang Yan and Bocao Deng and Xiaonan Li and Xipeng Qiu},
    year={2019},
    eprint={1911.04474},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Is it right?

For the unigram and bigram embeddings for the Chinese datasets, what if I want to train my own embeddings?

Hi,

Thanks for your amazing work.

For the following unigram and bigram embeddings, if I want to use my own set of .vec files, what should I do?

For the Chinese datasets, you can download the pretrained unigram and bigram embeddings in Baidu Cloud. Download the 'gigaword_chn.all.a2b.uni.iter50.vec' and 'gigaword_chn.all.a2b.bi.iter50.vec'. Then replace the embedding path in train_tener_cn.py

Thanks a lot,

Marcus

Test results

Thank you for your work! A question about running the code: are the results reported in the paper the ones that 'may not correspond to the best dev performance' or the ones that 'correspond to the best dev performance'? Thanks!

A small question about computing P/R/F1 on the test set

How are test sentences that are longer than the maximum training length handled when computing P/R/F1?
Are they truncated into two sentences that are fed to the model separately?
Or is the part of a test sentence that exceeds the training length simply discarded and excluded from the P/R/F1 computation?
For example, if the maximum training length is 128 and a test sentence has length 200, how are P, R and F1 computed?
Is the test sentence truncated to 128 and compared against the 128-length prediction?

Reproducing the results in the paper

Thanks for sharing the code.
Could you please share the arguments used for training? In particular, the paper proposes a modified version of the positional encoding, but I see that it is set to None in the file.
Thanks for your help.

Testing and predicting with TENER

Hello, first and foremost I hope this message finds you well and healthy. I need to load the model and test it on specific data. The data type is str. Can you please send me the demo? Thank you.
I look forward to your reply.
Sincerely,

Which result is reported in the paper (evaluation question)

Hello, during a run two results are printed, one on the test data and one on the dev data. Which of the two is reported in the paper for the comparison with other models on the six datasets? Isn't the test result usually the one reported? Why does it say 'get best dev' at the end? Looking forward to your answer!

BrokenPipeError: [Errno 32] Broken pipe

On both Linux and Windows, when training on the CPU you can set num_workers=0:

trainer = Trainer(data_bundle.get_dataset('train'), model, optimizer,
                  batch_size=batch_size, sampler=BucketSampler(), num_workers=0,
                  n_epochs=n_epochs, dev_data=data_bundle.get_dataset('dev'),
                  metrics=SpanFPreRecMetric(tag_vocab=data_bundle.get_vocab('target'),
                                            encoding_type=encoding_type),
                  dev_batch_size=batch_size, callbacks=callbacks, device=device,
                  test_use_tqdm=False, use_tqdm=True, print_every=300, save_path=None)

Question about training results

Hello, I trained on my own dataset. The loss decreases very little, and after 100 epochs the score is SpanFPreRecMetric: f=0.0, pre=0.0, rec=0.0, with a loss of about 113. What could the problem be? Is it the dataset? The dataset is in BIO format.
