
bilstm-lan's Introduction

  • 👋 Hi, I’m @Nealcly, a senior researcher at the Natural Language Processing Center, Tencent AI Lab.
  • 👀 Please email me ([email protected]) if you would like to work with us.

Nealcly's github stats

bilstm-lan's People

Contributors

abhi1nandy2, nealcly


bilstm-lan's Issues

Performance on OntoNotes v5.0

Hi

First of all, thanks for your last reply.
Following your command, I ran the model on OntoNotes v5.0.
Although the official F1-score you report is 88.16%, I always get about 85%.
When I run your model on UD I get very good performance, so I think I must be making a mistake somewhere.

Here is my command:
python main.py --learning_rate 0.01 --lr_decay 0.035 --dropout 0.5 --hidden_dim 400 --lstm_layer 4 --momentum 0.9 --whether_clip_grad True --clip_grad 5.0 --train_dir 'data/onto.train.txt' --dev_dir 'data/onto.development.txt' --test_dir 'data/onto.test.txt' --model_dir 'model/' --word_emb_dir 'glove.6B.100d.txt'

Here is the data summary:
DATA SUMMARY START:
I/O:
Tag scheme: BIO
MAX SENTENCE LENGTH: 250
MAX WORD LENGTH: -1
Number normalized: False
Word alphabet size: 69812
Char alphabet size: 119
Label alphabet size: 38
Word embedding dir: glove.6B.100d.txt
Char embedding dir: None
Word embedding size: 100
Char embedding size: 30
Norm word emb: False
Norm char emb: False
Train file directory: data/onto.train.txt
Dev file directory: data/onto.development.txt
Test file directory: data/onto.test.txt
Raw file directory: None
Dset file directory: None
Model file directory: model/
Loadmodel directory: None
Decode file directory: None
Train instance number: 115812
Dev instance number: 15679
Test instance number: 12217
Raw instance number: 0
FEATURE num: 0
++++++++++++++++++++++++++++++++++++++++
Model Network:
Model use_crf: False
Model word extractor: LSTM
Model use_char: True
Model char extractor: LSTM
Model char_hidden_dim: 50
++++++++++++++++++++++++++++++++++++++++
Training:
Optimizer: SGD
Iteration: 100
BatchSize: 10
Average batch loss: False
++++++++++++++++++++++++++++++++++++++++
Hyperparameters:
Hyper lr: 0.01
Hyper lr_decay: 0.035
Hyper HP_clip: 5.0
Hyper momentum: 0.9
Hyper l2: 1e-08
Hyper hidden_dim: 400
Hyper dropout: 0.5
Hyper lstm_layer: 4
Hyper bilstm: True
Hyper GPU: True
DATA SUMMARY END.

I think I have followed the hyperparameters given in your paper. Is there any mistake?

Thanks for reading.

Randomly initialize the word embedding

I have switched to a Lao-language dataset and want to do POS tagging, but there is no pretrained embedding for this language. How can I change the code to randomly initialize the embedding?
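If the code does not already fall back to random initialization when --word_emb_dir is omitted, a minimal sketch of one way to build such a table yourself is below (the helper name random_embedding and the uniform scale are my own assumptions, not this repository's code):

import numpy as np
import torch
import torch.nn as nn

def random_embedding(vocab_size, embedding_dim, seed=42):
    # Uniform range +/- sqrt(3/dim) gives each row roughly unit variance;
    # any small symmetric range works in practice.
    rng = np.random.RandomState(seed)
    scale = np.sqrt(3.0 / embedding_dim)
    weights = rng.uniform(-scale, scale, (vocab_size, embedding_dim)).astype("float32")
    weights[0] = 0.0  # keep the padding index at zero
    return weights

# Plug the table into nn.Embedding instead of loading GloVe vectors.
vocab_size, emb_dim = 20000, 100
emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
emb.weight.data.copy_(torch.from_numpy(random_embedding(vocab_size, emb_dim)))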

About OntoNotes 5.0

Hello.

I have a question about the results when I use OntoNotes 5.0 as the dataset for the NER task.
According to the BiLSTM-LAN results you report, the expected score is 88.16%.
But when I ran the model, I got 91.85% accuracy at epoch 1.
Is there any mistake in my command?
My command is here:

python main.py --learning_rate 0.01 --lr_decay 0.035 --dropout 0.5 --hidden_dim 400 --lstm_layer 3 --momentum 0.9 --whether_clip_grad True --clip_grad 5.0 --train_dir 'data/onto.train.txt' --dev_dir 'data/onto.development.txt' --test_dir 'data/onto.test.txt' --model_dir 'model/' --word_emb_dir 'glove.6B.100d.txt'

And this is the summary:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DATA SUMMARY START:
I/O:
Tag scheme: BIO
MAX SENTENCE LENGTH: 250
MAX WORD LENGTH: -1
Number normalized: False
Word alphabet size: 69812
Char alphabet size: 119
Label alphabet size: 38
Word embedding dir: glove.6B.100d.txt
Char embedding dir: None
Word embedding size: 100
Char embedding size: 30
Norm word emb: False
Norm char emb: False
Train file directory: data/onto.train.txt
Dev file directory: data/onto.development.txt
Test file directory: data/onto.test.txt
Raw file directory: None
Dset file directory: None
Model file directory: model/
Loadmodel directory: None
Decode file directory: None
Train instance number: 2200752
Dev instance number: 304684
Test instance number: 230111
Raw instance number: 0
FEATURE num: 0
++++++++++++++++++++++++++++++++++++++++
Model Network:
Model use_crf: False
Model word extractor: LSTM
Model use_char: True
Model char extractor: LSTM
Model char_hidden_dim: 50
++++++++++++++++++++++++++++++++++++++++
Training:
Optimizer: SGD
Iteration: 100
BatchSize: 4
Average batch loss: False
++++++++++++++++++++++++++++++++++++++++
Hyperparameters:
Hyper lr: 0.01
Hyper lr_decay: 0.035
Hyper HP_clip: None
Hyper momentum: 0.9
Hyper l2: 1e-08
Hyper hidden_dim: 400
Hyper dropout: 0.5
Hyper lstm_layer: 3
Hyper bilstm: True
Hyper GPU: True
DATA SUMMARY END.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This is a sample of the dataset. I converted the data to this txt format:

Just O
now O
we O
were O
primarily O
talking O
about O
those O
produced O
by O
counterfeiting O

Thank you.

Does the concatenated size of the last layer equal the label alphabet size?

  1. As listed in the parameters in utils/data.py, label_dim is equal to HP_hidden_dim, which is set to 200.
  2. The output size of the LAN layer in the last layer is [HP_hidden_dim, label_dim], i.e. 2*HP_hidden_dim.
  3. If we set use_crf = False, does the concatenated size of the last layer equal the label alphabet size?
    Or, if we set use_crf = True, how is the output tensor of size (batch, seq_len, HP_hidden_dim) turned into emission probabilities? (See the sketch below.)

Could anyone give me some advice? Thanks!
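For item 3 with use_crf = True, a common pattern (not necessarily what this repository does) is to project the (batch, seq_len, HP_hidden_dim) output to the label alphabet size with a linear layer and hand that to the CRF as emission scores. A hedged sketch, with all sizes made up for illustration:

import torch
import torch.nn as nn

batch, seq_len, hidden_dim, num_labels = 10, 50, 400, 18

# Hypothetical output of the last BiLSTM/LAN layer.
features = torch.randn(batch, seq_len, hidden_dim)

# A linear projection turns hidden states into per-label emission scores
# that a CRF layer can consume; this is the usual BiLSTM-CRF recipe.
hidden2tag = nn.Linear(hidden_dim, num_labels)
emissions = hidden2tag(features)   # (batch, seq_len, num_labels)
print(emissions.shape)             # torch.Size([10, 50, 18])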

A question about the dimension of the label embedding L

When embedding the labels, does the dimension (dh) need to be as large as the number of labels L (label_num)?
Otherwise, consider this (say the dimension is 512): attention = QK^T: (length, 512) x (label_num, 512)^T => (length, label_num); continuing, attention x V = (length, label_num) x (label_num, 512) => (length, 512). This is the LAN output. If this result is then mapped to labels and there are only, say, 128 labels, surely mapping 512 dimensions onto 128 labels is a problem?

So must the dimension in the paper, i.e. the 512, equal the number of labels L (label_num)? Otherwise it cannot be matched to the output.
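A hedged shape check of the arithmetic above (illustrative only; how the final prediction is actually read out should be confirmed against lstm_attention.py):

import torch

length, label_num, d = 30, 128, 512

H = torch.randn(length, d)      # word representations (queries)
L = torch.randn(label_num, d)   # label embeddings (keys and values)

scores = H @ L.t()              # (length, label_num): one attention score per label
alpha = torch.softmax(scores, dim=-1)
label_summary = alpha @ L       # (length, d): attention-weighted label embedding

print(scores.shape, label_summary.shape)
# torch.Size([30, 128]) torch.Size([30, 512])

If the last layer takes its prediction from the (length, label_num) attention scores rather than from the d-dimensional weighted sum, then dh does not have to equal the number of labels; only the intermediate layers would use the 512-dimensional output.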

Experiments on CoNLL03 NER

Hello!
I tried to run your code on the CoNLL03 NER dataset, but the performance I get is not as good as BiLSTM-CRF. Could you help me find the bug? Thanks.
Here is part of my experiment log:

True
Seed num: 42
MODEL: train
Load pretrained word embedding, norm: False, dir: ../Data/pretrain_emb/glove.6B.100d.txt
Embedding:
pretrain word:400000, prefect match:11415, case_match:11656, oov:2234, oov%:0.08827945941673912
Training model...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DATA SUMMARY START:
I/O:
Tag scheme: BMES
MAX SENTENCE LENGTH: 250
MAX WORD LENGTH: -1
Number normalized: True
Word alphabet size: 25306
Char alphabet size: 78
Label alphabet size: 18
Word embedding dir: ../Data/pretrain_emb/glove.6B.100d.txt
Char embedding dir: None
Word embedding size: 100
Char embedding size: 30
Norm word emb: False
Norm char emb: False
Train file directory: ../Data/conll03/conll03.train.bmes
Dev file directory: ../Data/conll03/conll03.dev.bmes
Test file directory: ../Data/conll03/conll03.test.bmes
Raw file directory: None
Dset file directory: None
Model file directory: save/label_embedding
Loadmodel directory: None
Decode file directory: None
Train instance number: 14987
Dev instance number: 3466
Test instance number: 3684
Raw instance number: 0
FEATURE num: 0
++++++++++++++++++++++++++++++++++++++++
Model Network:
Model use_crf: False
Model word extractor: LSTM
Model use_char: True
Model char extractor: LSTM
Model char_hidden_dim: 50
++++++++++++++++++++++++++++++++++++++++
Training:
Optimizer: SGD
Iteration: 100
BatchSize: 10
Average batch loss: False
++++++++++++++++++++++++++++++++++++++++
Hyperparameters:
Hyper lr: 0.01
Hyper lr_decay: 0.04
Hyper HP_clip: None
Hyper momentum: 0.9
Hyper l2: 1e-08
Hyper hidden_dim: 400
Hyper dropout: 0.5
Hyper lstm_layer: 4
Hyper bilstm: True
Hyper GPU: True
DATA SUMMARY END.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
build network...
use_char: True
char feature extractor: LSTM
word feature extractor: LSTM
build word sequence feature extractor: LSTM...
build word representation...
build char sequence feature extractor: LSTM ...
--------pytorch total params--------
9849140
Epoch: 0/100
Learning rate is set as: 0.01
Instance: 14987; Time: 125.29s; loss: 2452.2396; acc: 172887.0/204567.0=0.8451
Epoch: 0 training finished. Time: 125.29s, speed: 119.62st/s, total loss: 126550.64305019379
totalloss: 126550.64305019379
gold_num = 5942 pred_num = 6508 right_num = 2556
Dev: time: 11.00s, speed: 317.98st/s; acc: 0.9036, p: 0.3927, r: 0.4302, f: 0.4106
gold_num = 5648 pred_num = 6351 right_num = 2261
Test: time: 10.95s, speed: 339.54st/s; acc: 0.8919, p: 0.3560, r: 0.4003, f: 0.3769
Exceed previous best f score: -10

inference

A question: how do I run prediction on data that has no labels? Thanks.

Why does query masking exist?

In lstm_attention.py there is the following code:

# Query Masking
query_masks = torch.sign(torch.abs(torch.sum(queries, dim=-1)))  # (N, T_q)
query_masks = query_masks.repeat(self.num_heads, 1)  # (h*N, T_q)
query_masks = torch.unsqueeze(query_masks, 2).repeat(1, 1, keys.size()[1])  # (h*N, T_q, T_k)
outputs = outputs * query_masks

query_masks seems to be a tensor containing 0s and 1s. What is its effect, and why does it exist?
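My reading (not an authoritative answer): the mask marks which query positions are real tokens versus zero padding, and zeroes out the attention outputs for the padded positions. A small standalone demo of the same arithmetic, ignoring the multi-head repeat for brevity:

import torch

# Two "sentences" of queries; the second position of the first one is padding (all zeros).
queries = torch.tensor([[[0.3, -0.2], [0.0, 0.0]],
                        [[0.5,  0.1], [0.4, 0.7]]])  # (N=2, T_q=2, d=2)

# sign(|sum over the feature dim|) is 1 for real tokens and 0 for zero padding.
query_masks = torch.sign(torch.abs(torch.sum(queries, dim=-1)))  # (N, T_q)
print(query_masks)  # tensor([[1., 0.], [1., 1.]])

# Broadcasting over the key dimension wipes out every attention output
# produced by a padded query position.
T_k = 3
outputs = torch.ones(2, 2, T_k)
outputs = outputs * query_masks.unsqueeze(2)
print(outputs[0])  # first row stays ones, second (padded) row becomes zeros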

Confused about the effect of the label embedding in the model

As described in Fig. 2 of the paper, the label embedding is concatenated with the BiLSTM output of Layer 1 and Layer 2, as well as with the output of the Label Attention Inference layer. However, how does the label embedding actually take effect in Layer 1 and Layer 2? In addition, why is the label embedding not concatenated into the input of the final prediction?
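My reading of the figure, offered as a hedged sketch rather than the authors' answer: each intermediate layer concatenates the BiLSTM hidden states with the attention-weighted sum of label embeddings, so the next BiLSTM layer sees both the word features and a soft label hypothesis, while the last layer can predict directly from the attention distribution over labels, so no further concatenation is needed there. A shape-only illustration:

import torch

seq_len, hidden_dim, label_dim = 30, 400, 400

h = torch.randn(seq_len, hidden_dim)             # BiLSTM output of an intermediate layer
label_summary = torch.randn(seq_len, label_dim)  # attention-weighted sum of label embeddings
                                                 # (produced by the label attention sub-layer)

# The concatenation is what lets label information influence the next BiLSTM layer.
next_input = torch.cat([h, label_summary], dim=-1)  # (seq_len, hidden_dim + label_dim)
print(next_input.shape)                             # torch.Size([30, 800])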

data and word embedding

Could you please provide links for downloading them? If anything requires a licence, could you please also add a note saying so?

Why is the model so slow on a CPU platform?

I run the model on Windows 10 with a CPU, but it takes 4 hours per epoch; that is, 100 epochs need 400 hours to run the whole model. The paper claims it is faster than BiLSTM-CRF, but in practice it is not.
I also ran BERT+BiLSTM+CRF in the same environment (Windows 10 with a CPU); it only took 10 hours in total, and its accuracy was 0.92.
Can you tell me why?

Train/decode speed comparison with CRF

Hello, your work is great, but I have a question about the speed comparison.
In the published code, when the CRF is used, the CRF layer in seqlab.py is only used to calculate the loss in the neg_log_liklihood_loss function, but in the forward function there is nothing involving the CRF layer. Did I miss something somewhere? I hope to hear your reply, thank you very much.
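Not an answer from the authors, but for context: the speed argument usually contrasts a per-token argmax with Viterbi decoding over a label transition matrix. A hedged illustration of the CRF-free decode path (all sizes made up):

import torch

batch, seq_len, num_labels = 10, 50, 18
scores = torch.randn(batch, seq_len, num_labels)  # per-token label scores from the network

# Without a CRF, decoding is a per-token argmax: O(seq_len * num_labels)
# and trivially parallel across the sequence.
pred = scores.argmax(dim=-1)  # (batch, seq_len)
print(pred.shape)

# With a CRF, decoding instead runs Viterbi over a (num_labels x num_labels)
# transition matrix, a sequential O(seq_len * num_labels^2) dynamic program,
# which is where the extra decode cost comes from.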
