
tener's Introduction

TENER: Adapting Transformer Encoder for Named Entity Recognition

This is the code for the paper TENER.

TENER (Transformer Encoder for Named Entity Recognition) is a Transformer-based model for the NER task. Compared with the naive Transformer, we found that relative position embeddings are quite important for NER. Experiments on English and Chinese NER datasets demonstrate its effectiveness.

Requirements

This project requires the natural language processing Python package fastNLP. You can install it with the following command:

pip install fastNLP

Run the code

(1) Prepare the English dataset.

Conll2003

Your file should look like the following (the first token on a line is the word, the last token is the NER tag):

LONDON NNP B-NP B-LOC
1996-08-30 CD I-NP O

West NNP B-NP B-MISC
Indian NNP I-NP I-MISC
all-rounder NN I-NP O
Phil NNP I-NP B-PER

OntoNotes

We suggest using the following code to prepare your data: OntoNotes-5.0-NER. Alternatively, you can prepare the data in the Conll2003 style and then replace OntoNotesNERPipe with Conll2003NERPipe in the code.
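For reference, a minimal sketch of how such a file can be loaded with fastNLP (the path is a placeholder and the exact keyword arguments follow fastNLP's documentation; the training scripts do the equivalent internally):

from fastNLP.io import Conll2003NERPipe  # use OntoNotesNERPipe for OntoNotes-style data

# Hedged sketch: load a CoNLL-2003-style file into a fastNLP DataBundle.
# 'path/to/conll2003' is a placeholder for your own data directory.
data_bundle = Conll2003NERPipe(encoding_type='bioes').process_from_file('path/to/conll2003')
print(data_bundle)  # shows the train/dev/test datasets and the vocabularies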

For the English datasets, we use the GloVe 100d pretrained embeddings; fastNLP will download them automatically.
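A hedged sketch of what this looks like with fastNLP's StaticEmbedding (it assumes the data_bundle from the snippet above and that the word vocabulary is stored under 'words'; the pretrained-model name follows fastNLP's naming and may need to be checked against its documentation):

from fastNLP.embeddings import StaticEmbedding

# GloVe 100d embedding; fastNLP downloads the weights on first use.
embed = StaticEmbedding(data_bundle.get_vocab('words'), model_dir_or_name='en-glove-6b-100d')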

You can run the models with the following commands (make sure you have changed the data path):

python train_tener_en.py --dataset conll2003

or

python train_tener_en.py --dataset en-ontonotes

Although we tried hard to make sure you can reproduce our results, the results may still disappoint you. This is usually because the best dev performance does not correlate well with the test performance. Several runs should help.

The ELMo version (fastNLP will download the ELMo weights automatically; you only need to change the data path in train_elmo_en.py):

python train_elmo_en.py --dataset en-ontonotes

(2) Prepare the Chinese datasets: MSRA, OntoNotes 4.0, Weibo, Resume

Your data should have only two columns: the first is the character, the second is the tag, as in the following example:

口 O
腔 O
溃 O
疡 O
加 O
上 O

For the Chinese datasets, you can download the pretrained unigram and bigram embeddings from Baidu Cloud: download 'gigaword_chn.all.a2b.uni.iter50.vec' and 'gigaword_chn.all.a2b.bi.iter50.vec', then replace the embedding paths in train_tener_cn.py.
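A hedged sketch of what "replace the embedding paths" amounts to (the file paths are placeholders; the data bundle built in train_tener_cn.py provides the 'chars' and 'bigrams' vocabularies):

from fastNLP.embeddings import StaticEmbedding

# Point StaticEmbedding at the downloaded .vec files instead of a pretrained name.
unigram_embed = StaticEmbedding(data_bundle.get_vocab('chars'),
                                model_dir_or_name='path/to/gigaword_chn.all.a2b.uni.iter50.vec')
bigram_embed = StaticEmbedding(data_bundle.get_vocab('bigrams'),
                               model_dir_or_name='path/to/gigaword_chn.all.a2b.bi.iter50.vec')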

You can run the code with the following command:

python train_tener_cn.py --dataset ontonotes


tener's Issues

CUDA out-of-memory problem

Using a Chinese dataset (train.txt is about 50 MB, BIO tagging scheme), I only changed the data path, the model path, and BMES to BIO; nothing else was modified. Running python train_tener_cn.py --dataset ontonotes reports an error even with batch_size=1:
RuntimeError: CUDA out of memory. Tried to allocate 57.83 GiB (GPU 0; 8.00 GiB total capacity
The same problem occurs on both Windows and Linux.
torch=1.5.1
fastnlp=0.5.5
prettytable=0.7.2

Infinite recursion when reading data through the batch module

Hello, author!

While trying to reproduce the results on the Weibo data, I ran into infinite recursion in DataSetGetter's __getattr__; rewriting it with a dict-based lookup did not help either. Has the program been tested on Windows? I am using fastNLP 0.5.5 and have tried both torch 1.1 and 1.5.1, neither of which solves the problem.

D:\Anaconda\envs\torch_env\python.exe E:/BoXiao/TENER-master/train_tener_cn.py --dataset weibo
Read cache from caches/weibo_transformer_bmeso_True.pkl.
In total 3 datasets:
train has 1350 instances.
dev has 270 instances.
test has 270 instances.
In total 3 vocabs:
chars has 3356 entries.
bigrams has 42184 entries.
target has 29 entries.

input fields after batch(if batch size is 2):
target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
chars: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
bigrams: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
target fields after batch(if batch size is 2):
target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])

training epochs started 2020-12-07-18-47-38-732204
Epoch 1/100: 0%| | 0/8500 [00:00<?, ?it/s, loss:{0:<6.5f}]Read cache from caches/weibo_transformer_bmeso_True.pkl.
In total 3 datasets:
train has 1350 instances.
dev has 270 instances.
test has 270 instances.
In total 3 vocabs:
chars has 3356 entries.
bigrams has 42184 entries.
target has 29 entries.

Traceback (most recent call last):
File "", line 1, in
File "D:\Anaconda\envs\torch_env\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\Anaconda\envs\torch_env\lib\multiprocessing\spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 100, in getattr
if hasattr(self.dataset, item):
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 100, in getattr
if hasattr(self.dataset, item):
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 100, in getattr
if hasattr(self.dataset, item):
[Previous line repeated 662 more times]
RecursionError: maximum recursion depth exceeded
Read cache from caches/weibo_transformer_bmeso_True.pkl.
In total 3 datasets:
train has 1350 instances.
dev has 270 instances.
test has 270 instances.
In total 3 vocabs:
chars has 3356 entries.
bigrams has 42184 entries.
target has 29 entries.

Traceback (most recent call last):
File "E:/BoXiao/TENER-master/train_tener_cn.py", line 130, in
trainer.train(load_best_model=False)
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\trainer.py", line 618, in train
raise e
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\trainer.py", line 611, in train
self._train()
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\trainer.py", line 658, in _train
for batch_x, batch_y in self.data_iterator:
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 267, in iter
for indices, batch_x, batch_y in self.dataiter:
File "D:\Anaconda\envs\torch_env\lib\site-packages\torch\utils\data\dataloader.py", line 193, in iter
return _DataLoaderIter(self)
File "D:\Anaconda\envs\torch_env\lib\site-packages\torch\utils\data\dataloader.py", line 493, in init
Traceback (most recent call last):
File "", line 1, in
self._put_indices()
File "D:\Anaconda\envs\torch_env\lib\site-packages\torch\utils\data\dataloader.py", line 591, in _put_indices
File "D:\Anaconda\envs\torch_env\lib\multiprocessing\spawn.py", line 105, in spawn_main
indices = next(self.sample_iter, None)
File "D:\Anaconda\envs\torch_env\lib\site-packages\torch\utils\data\sampler.py", line 172, in iter
exitcode = _main(fd)
File "D:\Anaconda\envs\torch_env\lib\multiprocessing\spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 100, in getattr
if hasattr(self.dataset, item):
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 100, in getattr
if hasattr(self.dataset, item):
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 100, in getattr
if hasattr(self.dataset, item):
[Previous line repeated 662 more times]
RecursionError: maximum recursion depth exceeded
for idx in self.sampler:
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 121, in iter
return iter(self.sampler(self.dataset))
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\sampler.py", line 79, in call
seq_lens = data_set.get_all_fields()[self.seq_len_field_name].content
File "D:\Anaconda\envs\torch_env\lib\site-packages\fastNLP\core\batch.py", line 101, in getattr
return object.getattr(self.dataset, item)
AttributeError: type object 'object' has no attribute 'getattr'

Looking forward to your reply. Many thanks.

Question about the attention score computation

How should I understand E_ at line 127 of relative_transformer.py? The attention score computation in the paper does not seem to include this term.

How to understand the function _transpose_shift

A few questions about the _transpose_shift function in relative_transformer.py:
(1) What transformation of the matrix does it implement (e.g., a shift, a rotation)?
(2) How does it achieve this transformation?
(3) Why does indice in the third-to-last line select only the odd-numbered rows?
(4) How does _transpose_shift differ from _shift, and how are the two related?
Many thanks.


def _transpose_shift(self, E):
    # E: [bsz, n_head, max_len, 2*max_len], e.g. [2, 4, 68, 136]
    bsz, n_head, max_len, _ = E.size()
    zero_pad = E.new_zeros(bsz, n_head, max_len, 1)
    # pad one zero column, then reinterpret the flattened rows: [bsz, n_head, 2*max_len+1, max_len]
    E = torch.cat([E, zero_pad], dim=-1).view(bsz, n_head, -1, max_len)
    indice = (torch.arange(max_len) * 2 + 1).to(E.device)  # the odd-numbered rows: [1, 3, 5, ..., 2*max_len-1]
    E = E.index_select(index=indice, dim=-2).transpose(-1, -2)
    return E
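Not an answer from the repository, but one way to see what the function does is to run a standalone copy of it on a tiny tensor whose entries encode their own (row, column) position; the comments below only describe the observed behaviour.

import torch

# Standalone copy of _transpose_shift (taken from relative_transformer.py above),
# with self removed so it can be run in isolation.
def transpose_shift(E):
    bsz, n_head, max_len, _ = E.size()
    zero_pad = E.new_zeros(bsz, n_head, max_len, 1)
    E = torch.cat([E, zero_pad], dim=-1).view(bsz, n_head, -1, max_len)
    indice = (torch.arange(max_len) * 2 + 1).to(E.device)
    return E.index_select(index=indice, dim=-2).transpose(-1, -2)

# Encode the position into each entry: E[0, 0, i, j] = 10*i + j.
max_len = 4
rows = torch.arange(max_len).view(1, 1, max_len, 1).float()
cols = torch.arange(2 * max_len).view(1, 1, 1, 2 * max_len).float()
E = 10 * rows + cols  # shape [1, 1, 4, 8]

out = transpose_shift(E)
print(out[0, 0])
# Each output entry (i, j) holds the input value from row j, column max_len + i - j,
# i.e. for every pair (i, j) the column corresponding to the relative offset i - j is picked out.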

Can this be used on a non-CONLL-2003 data format?

As above, can the TENER preprocessing be done on a dataset that does not follow the CONLL-2003 format? My dataset does not use the BIO tagging scheme; the sentences look like this (a possible conversion is sketched after the example):

sentence = ['Hi', 'I', 'study', 'in', 'China', 'and', 'work' , 'in', 'ABC']
tag = ['O', 'O', 'O', 'O', 'Country', 'O', 'O', 'O', 'Company']
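One option (a hedged sketch, not part of the repository) is to convert such token/tag lists into the two-column BIO format shown in the README before running the preprocessing:

# Hypothetical converter: parallel token/tag lists, with a bare entity type per
# token and no BIO prefixes, are turned into "word B-/I-/O" lines.
def to_bio_lines(tokens, tags):
    lines, prev = [], 'O'
    for tok, tag in zip(tokens, tags):
        if tag == 'O':
            bio = 'O'
        elif tag == prev:
            bio = 'I-' + tag   # continuation of the current entity
        else:
            bio = 'B-' + tag   # start of a new entity
        lines.append(tok + ' ' + bio)
        prev = tag
    return lines

sentence = ['Hi', 'I', 'study', 'in', 'China', 'and', 'work', 'in', 'ABC']
tag = ['O', 'O', 'O', 'O', 'Country', 'O', 'O', 'O', 'Company']
print('\n'.join(to_bio_lines(sentence, tag)))  # one sentence; separate sentences with a blank line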

How to split the MSRA dataset?

Thanks very much for sharing the code! When I tried to reproduce the MSRA result, I found the MSRA dataset only contains train and test sets. How do you split the train set into train and dev sets? The paper does not seem to mention it. I would be grateful for a reply.

How to convert a Chinese dataset to the train.char.bmes format

Hello, my dataset has already been processed into the following form:
口 O
腔 O
溃 O
疡 O
加 O
上 O
How can I further process it into the 'train.char.bmes' format so that the model can run?
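A hedged sketch (not from the repository) of one way to turn BIO tags into the BMES(O) scheme used by files such as train.char.bmes; the converted tag goes into the second column of the same two-column file:

# Convert a list of BIO tags for one sentence into BMES(O) tags:
# single-token entities become S-, entity starts B-, insides M-, ends E-.
def bio_to_bmes(tags):
    bmes = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            bmes.append('O')
            continue
        prefix, label = tag.split('-', 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else 'O'
        ends_here = nxt != 'I-' + label
        if prefix == 'B':
            bmes.append(('S-' if ends_here else 'B-') + label)
        else:  # 'I'
            bmes.append(('E-' if ends_here else 'M-') + label)
    return bmes

print(bio_to_bmes(['B-LOC', 'I-LOC', 'O', 'B-PER']))
# ['B-LOC', 'E-LOC', 'O', 'S-PER']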

Data download

from fastNLP.io.loader import WeiboNERLoader
save_path = WeiboNERLoader().download()
print(save_path)  # show where the downloaded files are stored locally

Found a small bug

In transformer.py under the modules folder, the linear layer that produces the q, k, v matrices should have bias=False, shouldn't it?

class MultiHeadAttn(nn.Module):
    def __init__(self, d_model, n_head, dropout=0.1, scale=False):
        super().__init__()
        assert d_model%n_head==0

        self.n_head = n_head
        self.qkv_linear = nn.Linear(d_model, 3*d_model)
        self.fc = nn.Linear(d_model, d_model)

In relative_transformer.py, the qv projection does use bias=False:

class RelativeMultiHeadAttn(nn.Module):
    def __init__(self, d_model, n_head, dropout, r_w_bias=None, r_r_bias=None, scale=False):
        super().__init__()
        self.qv_linear = nn.Linear(d_model, d_model * 2, bias=False)
        self.n_head = n_head
        self.head_dim = d_model // n_head
        self.dropout_layer = nn.Dropout(dropout)

TENER vs BERT

Hi, have you compared the adapted Transformer with BERT, where pre-trained knowledge might make up for the drawbacks of the vanilla Transformer?

How to map indices back to labels with fastNLP

Hello, I have been reading the code over the last couple of days and saw where the characters and tags are converted to indices. If I want to obtain the predicted BIO labels, how do I map the indices back?
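A hedged, self-contained sketch of how this can be done with fastNLP's Vocabulary (in the training scripts the tag vocabulary comes from data_bundle.get_vocab('target'); the tags and indices below are made up for illustration):

from fastNLP import Vocabulary

# A vocabulary keeps the mapping in both directions, so predicted indices can
# be turned back into tag strings with to_word().
tag_vocab = Vocabulary(padding=None, unknown=None)
tag_vocab.add_word_lst(['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC'])

pred_ids = [1, 2, 0, 3]  # example model output
print([tag_vocab.to_word(i) for i in pred_ids])  # e.g. ['B-PER', 'I-PER', 'O', 'B-LOC'] with this insertion order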

Where can I find files in the train.all.bmes format?

Hello, I am studying your TENER code and noticed that it expects files such as WeiboNER/train.all.bmes, which I am missing. I looked at the Weibo annotations but could not find anything labeled this way.
The annotation shown in the introduction looks like this:
科 O
技 O
全 O
方 O
位 O
资 O
讯 O
智 O

But what is BMES? How do I obtain it, or how do I annotate data in this scheme?
Thanks.

Hello, your work is very nice. I want to cite it.

@misc{yan2019tener,
    title={TENER: Adapting Transformer Encoder for Named Entity Recognition},
    author={Hang Yan and Bocao Deng and Xiaonan Li and Xipeng Qiu},
    year={2019},
    eprint={1911.04474},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Is it right?

For the unigram and bigram embeddings for the Chinese datasets, what if I want to train my own embeddings?

Hi,

Thanks for your amazing work.

For the following unigram and bigram embeddings, if I want to use my own set of .vec files, what should I do?

For the Chinese datasets, you can download the pretrained unigram and bigram embeddings in Baidu Cloud. Download the 'gigaword_chn.all.a2b.uni.iter50.vec' and 'gigaword_chn.all.a2b.bi.iter50.vec'. Then replace the embedding path in train_tener_cn.py

Thanks a lot,

Marcus

Test results

Thank you for your work! A question about running the code: are the results reported in the paper the ones that 'may not correspond to the best dev performance' or the ones that 'correspond to the best dev performance'? Thanks!

A small question about computing P/R/F1 on the test set

How are test sentences that are longer than the maximum training length handled when computing P/R/F1?
Are they truncated into two sentences that are fed to the model separately?
Or is the part of a test sentence that exceeds the training length simply discarded and excluded from the P/R/F1 computation?
For example, if the maximum training length is 128 and a test sentence has length 200, how are P, R and F1 computed?
Is the test sentence truncated to 128 and compared against the 128-length prediction?

Reproducing the results in the paper

Thanks for sharing the code.
Could you please share the arguments used for training? In particular, the paper proposes a modified version of the positional encoding, but I see that it is set to None in the file.
Thanks for your help.

Testing and predicting with TENER

Hello, first and foremost I hope this message finds you well and healthy. I need to load the model and test it on specific data. The data type is str. Can you please send me the demo? Thank you.
I look forward to your reply.
Sincerely,

Which result is reported in the paper (evaluation question)

Hello, during a run two results are printed, one on the test data and one on the dev data. Which of the two is reported in the paper for the comparison with other models on the six datasets? Isn't the test result usually the one reported? Why does it say 'get best dev' at the end? Looking forward to your answer!

BrokenPipeError: [Errno 32] Broken pipe

On both Linux and Windows, when training on the CPU you can set num_workers=0:

trainer = Trainer(data_bundle.get_dataset('train'), model, optimizer,
                  batch_size=batch_size, sampler=BucketSampler(), num_workers=0,
                  n_epochs=n_epochs, dev_data=data_bundle.get_dataset('dev'),
                  metrics=SpanFPreRecMetric(tag_vocab=data_bundle.get_vocab('target'),
                                            encoding_type=encoding_type),
                  dev_batch_size=batch_size, callbacks=callbacks, device=device,
                  test_use_tqdm=False, use_tqdm=True, print_every=300, save_path=None)

Question about training results

Hello, I trained on my own dataset. The loss decreases very little, and after 100 epochs the score is SpanFPreRecMetric: f=0.0, pre=0.0, rec=0.0, with a loss of about 113. What could the problem be? Is it the dataset? The dataset is in BIO format.
