ner-bert-pytorch's People

Contributors

lemonhu

ner-bert-pytorch's Issues

bert-lstm-crf_run.py

Traceback (most recent call last):
File "run.py", line 138, in
run()
File "run.py", line 134, in run
train(train_loader, dev_loader, model, optimizer, scheduler, config.model_dir)
File "/home/ym/test2/CLU/BERT-LSTM-CRF/train.py", line 47, in train
train_epoch(train_loader, model, optimizer, scheduler, epoch)
File "/home/ym/test2/CLU/BERT-LSTM-CRF/train.py", line 22, in train_epoch
token_type_ids=None, attention_mask=batch_masks, labels=batch_labels)[0]
File "/home/ym/anaconda3/envs/test/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ym/test2/CLU/BERT-LSTM-CRF/model.py", line 52, in forward
logits = self.classifier(lstm_output)
File "/home/ym/anaconda3/envs/test/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ym/anaconda3/envs/test/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/home/ym/anaconda3/envs/test/lib/python3.7/site-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1600x2048 and 1024x31)

Please help me 😬
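The error says the classifier received 2048-dimensional features while its weight expects 1024-dimensional input, which is the classic bidirectional-LSTM mismatch: with bidirectional=True, the LSTM outputs 2 * hidden_size features per token. A minimal sketch of the mismatch and the fix; the sizes below are inferred from the error message rather than taken from the BERT-LSTM-CRF code in the traceback:

```python
import torch
import torch.nn as nn

# Hedged sketch: hidden_size=1024 and 31 labels come from the error message;
# the 768-dim input just stands in for the BERT hidden states.
hidden_size, num_labels = 1024, 31
lstm = nn.LSTM(input_size=768, hidden_size=hidden_size,
               batch_first=True, bidirectional=True)

# A bidirectional LSTM emits 2 * hidden_size features per token, so the classifier
# must take 2 * hidden_size inputs; nn.Linear(hidden_size, num_labels) is what
# produces "(1600x2048 and 1024x31)" when batch * seq_len = 1600.
classifier = nn.Linear(2 * hidden_size, num_labels)

bert_output = torch.randn(8, 200, 768)   # (batch, seq_len, bert_hidden)
lstm_output, _ = lstm(bert_output)       # (8, 200, 2048)
logits = classifier(lstm_output)         # (8, 200, 31), no shape error
```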

tokenizer

Hi, first of all, thank you for open-sourcing this code; it is written very elegantly. I do have a question, though: for tokenization you use BERT's built-in tokenizer, i.e. the WordpieceTokenizer, but that can merge characters. For a task like NER, doesn't this require extra handling to keep the tokenized sequence the same length as the original sequence?
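One common workaround for character-level Chinese NER is to tokenize each character on its own, so the token sequence keeps exactly the same length as the label sequence. A minimal sketch, assuming the pytorch-pretrained-bert tokenizer and the local bert-base-chinese-pytorch directory used elsewhere in this project; it is not the repository's own data_loader code:

```python
from pytorch_pretrained_bert import BertTokenizer

# Hedged sketch: tokenize character by character and keep exactly one piece per
# character, falling back to [UNK], so len(tokens) == len(labels) always holds.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese-pytorch",
                                          do_lower_case=True)

def tokenize_per_char(chars):
    tokens = []
    for ch in chars:
        pieces = tokenizer.tokenize(ch)
        tokens.append(pieces[0] if pieces else "[UNK]")
    return tokens  # same length as the input, so the tags still line up
```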

Cannot reproduce the results

batch size: 32
GPU: RTX 2080Ti
torch: 1.7.1
Everything else is the same; I used the default parameters in params.json. Could you share your hyperparameters? Thanks.

Predicting on sentences read from a txt file (roughly 100–200 thousand of them): the results go wrong...

In interactive.py I read the txt file one line at a time and tag each sentence. The results are correct at first, so why do they gradually start labeling empty strings as MISC, LOC and so on (e.g. ('', 'MISC'))?
When I rerun it, I still get the same wrong results as before.
Only after re-downloading the folder and feeding single sentences as input do the results become good again...
Why is that?
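One thing worth checking is whether blank or whitespace-only lines in the txt file are being passed to the model, since an empty input can only produce meaningless pairs like ('', 'MISC'). A minimal sketch of guarding against that when reading the file; the file name and the prediction call are placeholders, not this project's API:

```python
# Hedged sketch: read one sentence per line and skip lines that are empty after
# stripping, instead of asking the model to tag an empty string.
with open("sentences.txt", encoding="utf-8") as f:   # hypothetical input file
    for line in f:
        text = line.strip()
        if not text:            # blank line: nothing to tag, skip it
            continue
        # hand `text` to interactive.py's prediction routine here
        print(text)
```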

Segmentation fault (core dumped)

Hi, I ran into this problem when running python evaluation.py for evaluation. What could be causing it? The full output is:
Loading the dataset...
loading vocabulary file bert-base-chinese-pytorch/vocab.txt
done.
Starting evaluation...
Segmentation fault (core dumped)

Something wrong with the total number of entities being evaluated?

For the msra dataset, I noticed that the total number of entities being evaluated does not match the actual number.
As you can see, for the test data, the support (true entities) is:
Screenshot 2019-10-10 at 18 02 40
But the true entity counts are as follows (I also checked the dataset you created, and it matches the counts below):
WeChatWorkScreenshot_b622d7c5-a023-4de6-821e-af26c6e718df

Could you please look into this problem?

Project dependencies may have API risk issues

Hi, in NER-BERT-pytorch, inappropriate dependency version constraints can introduce risk.

Below are the dependencies and version constraints the project currently uses:

tensorflow>=1.11.0
torch>=0.4.1
tqdm
pytorch-pretrained-bert==0.4.0
apex

The == constraint introduces a risk of dependency conflicts because it pins the dependency too strictly.
Constraints with no upper bound (or *) introduce a risk of missing-API errors, because the latest version of a dependency may remove some APIs.

After further analysis, in this project,
The version constraint of dependency tqdm can be changed to >=4.36.0,<=4.64.0.
The version constraint of dependency pytorch-pretrained-bert can be changed to >=0.3.0,<=0.6.2.

The suggestions above reduce dependency conflicts as much as possible while still allowing reasonably recent versions, without introducing call errors in the project.

The project currently invokes all of the following methods.

Methods called from tqdm:
tqdm.trange
Methods called from pytorch-pretrained-bert:
pytorch_pretrained_bert.BertForTokenClassification
All other method calls:
torch.nn.DataParallel.to
sentences.append
metrics.classification_report
print
numpy.argmax
list
str
loss.mean.backward
self.__dict__.update
collections.defaultdict.items
tqdm.trange
file.write
ps.append
data_loader.DataLoader.data_iterator
zip
json.dump
chunks.append
torch.nn.DataParallel.train
filter
set
torch.nn.DataParallel
any
model
tqdm.trange.set_postfix
torch.nn.DataParallel.parameters
rs.append
logging.info
self.load_tags.append
self.tag2idx.get
logging.getLogger.addHandler
torch.tensor.to
file_sentences.write
metrics.items
batch_output.detach.cpu.numpy.detach
logging.getLogger.setLevel
self.load_tags
s.append
logging.StreamHandler
isinstance
collections.defaultdict
utils.load_checkpoint
numpy.average
train_and_evaluate
torch.optim.Adam
file_tags.write
torch.nn.DataParallel.half
torch.save
get_entities
train
hasattr
float
words.append
format
enumerate
utils.RunningAverage.update
sum
utils.RunningAverage
logging.StreamHandler.setFormatter
set.append
pytorch_pretrained_bert.BertTokenizer.from_pretrained
torch.cuda.device_count
open
batch_output.detach.cpu.numpy
range
utils.Params
pytorch_pretrained_bert.BertConfig.from_json_file
line.strip.split
self.tokenizer.tokenize
utils.set_logger
model.load_state_dict
torch.cuda.manual_seed_all
pytorch_pretrained_bert.BertForTokenClassification
pred_tags.extend
ImportError
e.d2.add
torch.optim.lr_scheduler.LambdaLR
build_tags
batch_tags.to.numpy.to
row_fmt.format
torch.optim.Adam.backward
min
logging.getLogger
evaluate.evaluate
logging.FileHandler
apex.optimizers.FusedAdam
join
json.load
max
shutil.copyfile
chunk.split
dataset.append
random.seed
model.classifier.named_parameters
loss_avg
argparse.ArgumentParser.parse_args
hasattr.state_dict
data_loader.DataLoader
torch.nn.DataParallel.named_parameters
apex.optimizers.FP16_Optimizer
e.d1.add
ValueError
numpy.ones
data_loader.DataLoader.load_data
true_tags.extend
logging.FileHandler.setFormatter
argparse.ArgumentParser.add_argument
self.tokenizer.convert_tokens_to_ids
torch.load
torch.optim.lr_scheduler.LambdaLR.step
next
torch.nn.DataParallel.zero_grad
load_dataset
save_dataset
end_of_chunk
torch.nn.DataParallel.eval
line.strip.strip
start_of_chunk
optimizer_to_save.state_dict
torch.manual_seed
optimizer.load_state_dict
f1s.append
os.path.join
pytorch_pretrained_bert.BertForTokenClassification.from_pretrained
batch_tags.to.numpy
self.load_sentences_tags
logging.Formatter
numpy.sum
len.format
torch.device
line.strip
os.path.isfile
batch_data.gt
set.update
idx2tag.get
os.mkdir
evaluate
utils.save_checkpoint
argparse.ArgumentParser
tag.strip
loss.mean.mean
loss.mean.item
torch.nn.utils.clip_grad_norm_
random.shuffle
torch.optim.Adam.step
os.path.exists
metrics.f1_score
len
torch.cuda.is_available
torch.tensor
batch_output.detach.cpu
os.makedirs

@developer
Could you please help me check this issue?
May I open a pull request to fix it?
Thank you very much.

AttributeError: 'tuple' object has no attribute 'backward'

Running train.py, the following appears during training:

loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-pytorch_model.bin from cache at /home/xinliu/.cache/torch/pytorch_transformers/b1b5e295889f2d0979ede9a78ad9cb5dc6a0e25ab7f9417b315f0a2c22f4683d
Weights of BertForTokenClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
Weights from pretrained model not used in BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
Starting training for 20 epoch(s)
Epoch 1/20
  0%|          | 0/1400 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 220, in
train_and_evaluate(model, train_data, val_data, optimizer, scheduler, params, args.model_dir, args.restore_file)
File "train.py", line 99, in train_and_evaluate
train(model, train_data_iterator, optimizer, scheduler, params)
File "train.py", line 64, in train
loss.backward()
AttributeError: 'tuple' object has no attribute 'backward'

There seem to be two problems here:
(1) The log says the BertForTokenClassification weights were not initialized from the pretrained model, even though the pretrained model was downloaded successfully.
(2) AttributeError: 'tuple' object has no attribute 'backward'
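The first message is expected: only the new classifier head ('classifier.weight', 'classifier.bias') is freshly initialized, while the encoder weights do come from the pretrained file. The 'tuple' error matches the cache path in the log (pytorch_transformers): newer pytorch-transformers / transformers models return a tuple from forward() instead of a bare loss tensor, unlike pytorch-pretrained-bert 0.4.0. A minimal sketch of the unpacking fix; the surrounding variable names are assumed to follow the project's train.py:

```python
# Hedged sketch: with newer pytorch-transformers / transformers, the first
# element of the returned tuple is the loss when labels are supplied.
outputs = model(batch_data, token_type_ids=None,
                attention_mask=batch_masks, labels=batch_tags)
loss = outputs[0] if isinstance(outputs, tuple) else outputs
loss.backward()
```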

Cannot reproduce the results; accuracy is low

After running with the default parameters, the accuracy is low. The results are as follows:

Starting evaluation...

  • Test metrics: loss: 00.12; f1: 68.16

                precision    recall  f1-score   support

         ORG        46.25     65.73     54.30      1287
         LOC        66.21     81.61     73.11      2790
         PER        68.62     76.68     72.43      1372

 avg / total        62.11     76.62     68.50      5449

What could be the problem?

Some questions about the performance

Hi, I'd like to ask whether the training run behind the train.log in experiments/base_model used the parameters in params.json, or whether they were adjusted.
In your train.log, both the train F1 and the val F1 are already above 0.9 at epoch 1, but when I run your code exactly as-is (apart from possibly different args), my metrics in the first few epochs are only around 0.4.

an epoch cannot finish

When I run the project with python train.py, it seems to freeze as an epoch comes to an end:

Epoch 1/10
/Users/xxx/opt/anaconda3/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:100: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
100%|█████████████████████████| 1400/1400 [4:48:03<00:00, 12.35s/it, loss=0.051]

and then nothing happens for four or more hours, while CPU and memory consumption remain very high.
How can I solve this problem?
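The UserWarning in the log is only about call order and is unlikely to be the cause of the hang, but for reference, a minimal, self-contained sketch of the order PyTorch >= 1.1 expects (toy model and optimizer, not this project's objects):

```python
import torch

# Hedged sketch with a toy model: step the optimizer first, then the scheduler,
# which is the order the PyTorch >= 1.1 warning asks for.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer,
                                              lr_lambda=lambda step: 0.95 ** step)

for _ in range(3):
    loss = model(torch.randn(4, 10)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()       # update the weights first ...
    scheduler.step()       # ... then advance the learning-rate schedule
```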

The F1 score cannot be reproduced

I used the original data but newer versions of PyTorch and transformers, and the F1 score only reaches about 50%.
Also, is this value micro or macro?
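On the micro vs. macro question, a minimal sketch of the difference using sklearn on flat tag sequences; this is only an illustration of the two averaging modes, not how this project's metrics.f1_score computes its score:

```python
from sklearn.metrics import f1_score

# Hedged sketch with toy tag sequences: micro pools all decisions, macro averages
# the per-class scores without weighting; they can differ a lot on imbalanced tags.
true_tags = ["B-PER", "O", "B-LOC", "I-LOC", "O",     "B-ORG"]
pred_tags = ["B-PER", "O", "B-LOC", "I-LOC", "B-ORG", "O"]

print(f1_score(true_tags, pred_tags, average="micro"))  # pooled over all tags
print(f1_score(true_tags, pred_tags, average="macro"))  # unweighted mean per class
```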

F1 is not ideal even without the newer versions

Your project itself is very good, and the documentation and instructions are very detailed.

With pytorch-pretrained-bert 0.4.0, I get an F1 of around 80 on the project's msra data, but only around 10-something on my own data.

May I also ask for a pointer to the official documentation of pytorch-pretrained-bert, and what its relationship to huggingface is?

How to handle the increase in tokens after tokenization?

Hi,
First I want to thank you for releasing this amazing work.
I am trying to use your work to train another model.

However, I have to deal with the following problem.
In the data_loader process, when using the BERT tokenizer, a word in my training sentence gets split into 4 tokens:
Courtaulds => ['court', '##au', '##ld', '##s']
Can you suggest how to represent the corresponding tag for this word?
Because the tag file has only one tag per word, the assertion that checks that the tag list and the token list have the same length raises an error.
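A common convention, shown here as a minimal sketch rather than this repository's own solution, is to keep the word's tag on the first WordPiece and give the remaining pieces a placeholder tag that is masked out of the loss and the evaluation:

```python
# Hedged sketch using the example from the question; 'B-ORG' is a hypothetical
# tag chosen only for illustration.
word_pieces = ['court', '##au', '##ld', '##s']   # tokenizer output for "Courtaulds"
word_tag = 'B-ORG'

piece_tags = [word_tag] + ['X'] * (len(word_pieces) - 1)
assert len(piece_tags) == len(word_pieces)       # the length check now passes
print(list(zip(word_pieces, piece_tags)))
# [('court', 'B-ORG'), ('##au', 'X'), ('##ld', 'X'), ('##s', 'X')]
```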

Can I continue fine-tuning on top of an already trained model?

I have two datasets whose sample counts are roughly in a 100:1 ratio,
but the two datasets' label types are not the same.
Is it possible to train on the large dataset first and then fine-tune on the small one?
If so, how should it be done?
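One standard approach is to reuse the BERT encoder weights from the large-dataset run and re-initialize only the classifier head for the new tag set. A minimal sketch; the checkpoint path, the 'state_dict' key, and the label count are assumptions rather than values taken from this repository:

```python
import torch
from pytorch_pretrained_bert import BertForTokenClassification

# Hedged sketch: build the model with the small dataset's label count, then load
# everything from the big-dataset checkpoint except the old classifier head.
num_new_labels = 7                                       # hypothetical
model = BertForTokenClassification.from_pretrained(
    'bert-base-chinese-pytorch', num_labels=num_new_labels)

checkpoint = torch.load('experiments/base_model/best.pth.tar', map_location='cpu')
state_dict = checkpoint.get('state_dict', checkpoint)
# If the checkpoint was saved from a torch.nn.DataParallel model, the keys may
# carry a leading 'module.' prefix that would need to be stripped first.
state_dict = {k: v for k, v in state_dict.items() if not k.startswith('classifier')}
model.load_state_dict(state_dict, strict=False)          # tolerate the missing head
# Fine-tune on the small dataset as usual from here.
```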

Can it work with very little data?

Hi, I have also been using BERT recently and ran into a problem. In one experiment my training set has over a million characters, but with another training set of only a few tens of thousands of characters (CCKS) it cannot recognize a single entity. When using BERT, is the problem I am hitting the size of the training set?

chinese word or chinese character

Hi lemonhu,

I am confused: does the Chinese BERT model label each Chinese word or each Chinese character?
And is there Chinese WordPiece processing?

F1 is normal on train and val during training, but extremely low on the test data

Training scores

2019-11-19 10:20:56,050:INFO: Epoch 14/20
2019-11-19 11:34:31,893:INFO: - Train metrics: loss: 00.00; f1: 96.55
2019-11-19 11:42:04,513:INFO: - Val metrics: loss: 00.01; f1: 95.67
2019-11-19 11:42:11,501:INFO: Best val f1: 98.54

Test scores

2019-11-19 14:29:30,180:INFO: Starting evaluation...
2019-11-19 14:36:44,370:INFO: - Test metrics: loss: 14.14; f1: 00.02
2019-11-19 14:36:53,070:INFO:              precision    recall  f1-score   support

       prop       0.01      0.63      0.02     42707
      parts       0.04      0.01      0.01     96256
      model       0.01      0.03      0.01     23660

avg / total       0.03      0.18      0.02    162623

Both training and testing use evaluate.py; why does this happen?

Can't reproduce F1 scores

After training the model using the default parameters:
python train.py --data_dir data/msra --bert_model_dir bert-base-chinese-pytorch --model_dir experiments/base_model
I ran evaluation with
python evaluate.py --data_dir data/msra --bert_model_dir bert-base-chinese-pytorch --model_dir experiments/base_model
and got the following results, which are much lower than those described in the "detail results on the test set" section of the repo README:

  • Test metrics: loss: 00.05; f1: 58.29

                precision    recall  f1-score   support

         PER        35.36     90.09     50.79      1372
         ORG        35.24     85.70     49.94      1287
         LOC        53.90     92.11     68.01      2790

 avg / total        44.83     90.09     59.41      5449
Any advice on how to reproduce the expected scores would be greatly appreciated.

Train on custom data

Hi,
Thanks for sharing your work. Could you please give some guidance on how to train on custom data?

Thanks in advance.
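A minimal sketch of preparing custom data in the sentences.txt / tags.txt layout that the MSRA preprocessing appears to produce (one example per line, tokens and tags space-separated and aligned one-to-one); the directory names and the toy example are assumptions for illustration:

```python
import os

# Hedged sketch: toy example, hypothetical output directory.
examples = [
    (["我", "爱", "北", "京"], ["O", "O", "B-LOC", "I-LOC"]),
]

os.makedirs("data/custom/train", exist_ok=True)
with open("data/custom/train/sentences.txt", "w", encoding="utf-8") as fs, \
     open("data/custom/train/tags.txt", "w", encoding="utf-8") as ft:
    for tokens, tags in examples:
        assert len(tokens) == len(tags)      # tokens and tags must stay aligned
        fs.write(" ".join(tokens) + "\n")
        ft.write(" ".join(tags) + "\n")
```

Training would then point --data_dir at data/custom, assuming the same train/val/test sub-directory split as data/msra.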
