lemonhu / ner-bert-pytorch
PyTorch solution of the named entity recognition task using Google AI's pre-trained BERT model.
License: MIT License
Hello, when I run `python evaluation.py` to evaluate, I hit the following problem. What could be causing it? The full output is:
Loading the dataset...
loading vocabulary file bert-base-chinese-pytorch/vocab.txt
done.
Starting evaluation...
Segmentation fault (core dumped)
In interactive.py I read a txt file one line at a time and tag each line as a sentence. The results are fairly correct at first, but why does it gradually start tagging empty strings as MISC, LOC and so on later (e.g. ('', 'MISC'))?
When I rerun it, I still get the same wrong results.
Only after re-downloading the folder and feeding single sentences as input do the results become good again...
Why is that?
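Blank or whitespace-only lines are a common cause of spurious ('', 'MISC')-style outputs. Below is a minimal sketch of filtering them out before tagging; `tag_sentence` is a hypothetical helper standing in for the model call inside interactive.py:

```python
# Minimal sketch: skip blank/whitespace-only lines before tagging each sentence.
# `tag_sentence` is a hypothetical helper standing in for the model call in interactive.py.
def tag_file(path, tag_sentence):
    results = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            sentence = line.strip()
            if not sentence:          # empty lines would otherwise be fed to the model
                continue              # and can come back tagged as MISC/LOC, etc.
            results.append(tag_sentence(sentence))
    return results
```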
Hello, developers:
The PyTorch dump of the bert-base-chinese-pytorch model cannot be downloaded. Could you share it separately? Thanks.
Training scores
2019-11-19 10:20:56,050:INFO: Epoch 14/20
2019-11-19 11:34:31,893:INFO: - Train metrics: loss: 00.00; f1: 96.55
2019-11-19 11:42:04,513:INFO: - Val metrics: loss: 00.01; f1: 95.67
2019-11-19 11:42:11,501:INFO: Best val f1: 98.54
Test scores
2019-11-19 14:29:30,180:INFO: Starting evaluation...
2019-11-19 14:36:44,370:INFO: - Test metrics: loss: 14.14; f1: 00.02
2019-11-19 14:36:53,070:INFO: precision recall f1-score support
prop 0.01 0.63 0.02 42707
parts 0.04 0.01 0.01 96256
model 0.01 0.03 0.01 23660
avg / total 0.03 0.18 0.02 162623
Both training and testing use evaluate.py. Why does this happen?
Hello, how should I apply your code in practice? For example, given new text, how do I extract the NER entities? Thanks!
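For reference, here is a minimal inference sketch assuming the repo's pytorch-pretrained-bert setup; the `data/msra/tags.txt` path is an assumption, and the fine-tuned weights would still need to be restored from the experiment checkpoint (e.g. via `utils.load_checkpoint`):

```python
# Minimal inference sketch (assumptions: tags.txt path, local bert-base-chinese-pytorch dir).
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForTokenClassification

bert_dir = "bert-base-chinese-pytorch"                              # path used in the README
tags = [line.strip() for line in open("data/msra/tags.txt", encoding="utf-8")]
idx2tag = dict(enumerate(tags))

tokenizer = BertTokenizer.from_pretrained(bert_dir)
model = BertForTokenClassification.from_pretrained(bert_dir, num_labels=len(idx2tag))
# fine-tuned weights would still need to be restored here, e.g. via utils.load_checkpoint
model.eval()

text = "北京大学位于海淀区"                                            # any new sentence
tokens = tokenizer.tokenize(text)
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    # without labels, BertForTokenClassification returns logits of shape (1, seq_len, num_labels)
    logits = model(input_ids, token_type_ids=None, attention_mask=input_ids.gt(0))
pred_ids = logits.argmax(-1)[0].tolist()
print(list(zip(tokens, [idx2tag[i] for i in pred_ids])))
```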
I currently have two datasets, with a sample-count ratio of roughly 100:1.
However, the two datasets do not use the same label types.
Can I train on the large dataset first and then fine-tune on the small one?
If so, how should I do it?
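One possible approach (a sketch, not a built-in workflow of this repo) is to keep the BERT encoder trained on the large dataset and attach a fresh classification head sized for the small dataset's label set; `NUM_LABELS_BIG` and `NUM_LABELS_SMALL` below are placeholders:

```python
# Sketch: reuse the encoder trained on the large dataset, new head for the small label set.
from pytorch_pretrained_bert import BertForTokenClassification

NUM_LABELS_BIG, NUM_LABELS_SMALL = 7, 5             # placeholders for the two tag sets

big_model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese-pytorch", num_labels=NUM_LABELS_BIG)
# ... train big_model on the large dataset first ...

small_model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese-pytorch", num_labels=NUM_LABELS_SMALL)   # new label set
small_model.bert.load_state_dict(big_model.bert.state_dict())   # reuse only the encoder
# then fine-tune small_model on the small dataset; its classifier is trained from scratch
```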
Hi
Thanks for sharing your work. Could you please give some guidance on how to train on custom data?
Thanks in advance.
When I ran the project with `python train.py`, it seemed to freeze when an epoch came to an end:
Epoch 1/10
/Users/xxx/opt/anaconda3/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:100: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
100%|█████████████████████████| 1400/1400 [4:48:03<00:00, 12.35s/it, loss=0.051]
and then nothing happened for four or more hours, with CPU and memory consumption still very high.
How can I solve this problem?
I used the original data, but with a newer PyTorch and transformers, and the f1 only reaches about 50%.
Also, is this value micro or macro?
Download Google's BERT-Base model for Chinese from BERT-Base, Chinese (Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters), and decompress it.
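For reference, here is a minimal conversion sketch, assuming the conversion helper shipped with pytorch-pretrained-bert (it can also be run as a module/CLI); the directory names follow the official download and this repo's README, and vocab.txt plus bert_config.json still need to be copied next to the dump:

```python
# Sketch: convert the downloaded TF checkpoint into the PyTorch dump this repo expects.
from pytorch_pretrained_bert.convert_tf_checkpoint_to_pytorch import (
    convert_tf_checkpoint_to_pytorch,
)

convert_tf_checkpoint_to_pytorch(
    "chinese_L-12_H-768_A-12/bert_model.ckpt",      # TF checkpoint from the download
    "chinese_L-12_H-768_A-12/bert_config.json",     # its config file
    "bert-base-chinese-pytorch/pytorch_model.bin",  # PyTorch dump used by this repo
)
```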
After running the program with the default parameters, the accuracy is low. The results are as follows:
Starting evaluation...
Test metrics: loss: 00.12; f1: 68.16
precision recall f1-score support
ORG 46.25 65.73 54.30 1287
LOC 66.21 81.61 73.11 2790
PER 68.62 76.68 72.43 1372
avg / total 62.11 76.62 68.50 5449
What could the problem be?
Hi lemonhu,
I am confused: does the Chinese BERT model apply labels to each Chinese word or to each Chinese character?
And is there any Chinese wordpiece processing?
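A quick way to check (a sketch): BERT's basic tokenization splits CJK text into single characters before wordpiece, so for Chinese the labels effectively apply per character:

```python
# Sketch: inspect how the Chinese BERT tokenizer splits a sentence.
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese-pytorch")
print(tokenizer.tokenize("我爱北京天安门"))
# expected (roughly): ['我', '爱', '北', '京', '天', '安', '门']
```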
Hello, I'd like to ask: for the train.log in experiments/base_model, did that training run use the parameters in params.json, or were they adjusted?
I ask because in your train.log both train_F1 and val_F1 are already above 0.9 at epoch 1, while I ran your code exactly as-is (apart from possible differences in args) and my metrics in the first few epochs are only around 0.4.
batch size: 32
GPU: RTX 2080Ti
torch: 1.7.1
Everything else is the same, using the defaults in params.json. Could you share your hyperparameters? Thanks.
How is wordpiece handled? I didn't see any related code.
Your project itself is very good, and the documentation is very detailed.
With pytorch-pretrained-bert 0.4.0, I get an f1 of about 80 on the project's MSRA data, but only around ten-something on my own data.
Also, could you point me to the official documentation of pytorch-pretrained-bert, and explain its relationship to Hugging Face?
Thanks for your contribution.
Hello, first of all thank you for open-sourcing the code; it is written very elegantly. I do have one question: I see that for tokenization you use BERT's built-in tokenizer, i.e. the WordpieceTokenizer, but that can merge characters. For a task like NER, don't we need some handling to ensure the tokenized sequence keeps the same length as the original sequence?
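A minimal sketch of one common workaround (not taken from this repo): tokenize each original character separately so you always know how many wordpieces it produced, and keep one label per original position:

```python
# Sketch: per-character tokenization so labels can stay aligned with the original sequence.
def tokenize_aligned(chars, tokenizer):
    """Tokenize per original character and record piece counts (illustrative helper)."""
    pieces, piece_counts = [], []
    for ch in chars:
        wp = tokenizer.tokenize(ch) or ["[UNK]"]   # never drop a position entirely
        pieces.extend(wp)
        piece_counts.append(len(wp))               # how many pieces this character became
    return pieces, piece_counts                    # labels can be re-aligned via piece_counts
```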
After training the model using the default parameters:
python train.py --data_dir data/msra --bert_model_dir bert-base-chinese-pytorch --model_dir experiments/base_model
I ran evaluation with
python evaluate.py --data_dir data/msra --bert_model_dir bert-base-chinese-pytorch --model_dir experiments/base_model
and got the following results, which are much lower than the "detailed results on the test set" reported in the repo README:
Test metrics: loss: 00.05; f1: 58.29
precision recall f1-score support
PER 35.36 90.09 50.79 1372
ORG 35.24 85.70 49.94 1287
LOC 53.90 92.11 68.01 2790
avg / total 44.83 90.09 59.41 5449
Any advice on how to reproduce the expected scores would be greatly appreciated.
For the MSRA dataset, I realised that the total number of entities being evaluated does not match the actual number.
As you can see, for the test data the support (true entities) is:
But the true entity counts are (I also checked the dataset you created, which also matches the counts below):
Can you please take a look at this problem?
How do you deal with the very long sentences in the dataset?
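One simple option (a sketch under the assumption of a fixed `max_len` limit, not necessarily what this repo does) is to split over-long token/tag sequences into chunks before feeding them to BERT:

```python
# Sketch: chunk long sentences (and their tags) to a fixed maximum length.
def split_long(tokens, tags, max_len=128):          # max_len is an assumed limit
    """Yield (tokens, tags) chunks no longer than max_len (illustrative helper)."""
    for start in range(0, len(tokens), max_len):
        yield tokens[start:start + max_len], tags[start:start + max_len]
```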
Hello, I've also been using BERT recently and ran into a problem. One of my experiments uses a training set of over a million characters, but when I use another training set of only a few tens of thousands of characters (CCKS), not a single entity is recognized. When using BERT, is the problem I'm hitting simply the training set size?
Hi,
First I want to thank you for releasing this amazing work.
I am trying to use your work to train another model.
However, I have to deal with the following problem.
In the data_loader process, when using the BERT tokenizer, a word in my training sentence is split into 4 tokens:
Courtaulds => ['court', '##au', '##ld', '##s']
Can you suggest how to represent the corresponding tag of this word?
Because in the tag file we only have one tag per word, the assertion that checks the lengths of the tag list and the token list raises an error.
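One common convention (a sketch, not this repo's fixed behavior): give the real tag to the first wordpiece and a padding tag such as 'X' (masked out of the loss and metrics) to the remaining pieces, so the token and tag lists stay the same length:

```python
# Sketch: expand word-level tags onto wordpiece tokens with a padding tag.
def expand_tags(words, tags, tokenizer, pad_tag="X"):
    """Align one-tag-per-word data with wordpiece tokens (illustrative helper)."""
    tokens, expanded = [], []
    for word, tag in zip(words, tags):
        pieces = tokenizer.tokenize(word) or ["[UNK]"]  # e.g. Courtaulds -> court ##au ##ld ##s
        tokens.extend(pieces)
        expanded.extend([tag] + [pad_tag] * (len(pieces) - 1))
    assert len(tokens) == len(expanded)                 # lengths now match by construction
    return tokens, expanded
```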
Hi, in NER-BERT-pytorch, inappropriate dependency version constraints can introduce risks.
Below are the dependencies and version constraints that the project is using:
tensorflow>=1.11.0
torch>=0.4.1
tqdm
pytorch-pretrained-bert==0.4.0
apex
The version constraint == introduces a risk of dependency conflicts because the dependency scope is too strict.
Constraints with no upper bound (or *) introduce a risk of missing-API errors, because the latest versions of the dependencies may remove some APIs.
After further analysis, in this project,
The version constraint of dependency tqdm can be changed to >=4.36.0,<=4.64.0.
The version constraint of dependency pytorch-pretrained-bert can be changed to >=0.3.0,<=0.6.2.
The above suggestions reduce dependency conflicts as much as possible while allowing the latest compatible versions to be used without introducing call errors in the project.
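A sketch of what requirements.txt would look like with the suggested ranges (an illustration of the proposal above, not a tested configuration):

```
tensorflow>=1.11.0
torch>=0.4.1
tqdm>=4.36.0,<=4.64.0
pytorch-pretrained-bert>=0.3.0,<=0.6.2
apex
```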
The current project invokes all of the following methods:
tqdm.trange
pytorch_pretrained_bert.BertForTokenClassification
torch.nn.DataParallel.to sentences.append metrics.classification_report print numpy.argmax list str loss.mean.backward self.__dict__.update collections.defaultdict.items tqdm.trange file.write ps.append data_loader.DataLoader.data_iterator zip json.dump chunks.append torch.nn.DataParallel.train filter set torch.nn.DataParallel any model tqdm.trange.set_postfix torch.nn.DataParallel.parameters rs.append logging.info self.load_tags.append self.tag2idx.get logging.getLogger.addHandler torch.tensor.to file_sentences.write metrics.items batch_output.detach.cpu.numpy.detach logging.getLogger.setLevel self.load_tags s.append logging.StreamHandler isinstance collections.defaultdict utils.load_checkpoint numpy.average train_and_evaluate torch.optim.Adam file_tags.write torch.nn.DataParallel.half torch.save get_entities train hasattr float words.append format enumerate utils.RunningAverage.update sum utils.RunningAverage logging.StreamHandler.setFormatter set.append pytorch_pretrained_bert.BertTokenizer.from_pretrained torch.cuda.device_count open batch_output.detach.cpu.numpy range utils.Params pytorch_pretrained_bert.BertConfig.from_json_file line.strip.split self.tokenizer.tokenize utils.set_logger model.load_state_dict torch.cuda.manual_seed_all pytorch_pretrained_bert.BertForTokenClassification pred_tags.extend ImportError e.d2.add torch.optim.lr_scheduler.LambdaLR build_tags batch_tags.to.numpy.to row_fmt.format torch.optim.Adam.backward min logging.getLogger evaluate.evaluate logging.FileHandler apex.optimizers.FusedAdam join json.load max shutil.copyfile chunk.split dataset.append random.seed model.classifier.named_parameters loss_avg argparse.ArgumentParser.parse_args hasattr.state_dict data_loader.DataLoader torch.nn.DataParallel.named_parameters apex.optimizers.FP16_Optimizer e.d1.add ValueError numpy.ones data_loader.DataLoader.load_data true_tags.extend logging.FileHandler.setFormatter argparse.ArgumentParser.add_argument self.tokenizer.convert_tokens_to_ids torch.load torch.optim.lr_scheduler.LambdaLR.step next torch.nn.DataParallel.zero_grad load_dataset save_dataset end_of_chunk torch.nn.DataParallel.eval line.strip.strip start_of_chunk optimizer_to_save.state_dict torch.manual_seed optimizer.load_state_dict f1s.append os.path.join pytorch_pretrained_bert.BertForTokenClassification.from_pretrained batch_tags.to.numpy self.load_sentences_tags logging.Formatter numpy.sum len.format torch.device line.strip os.path.isfile batch_data.gt set.update idx2tag.get os.mkdir evaluate utils.save_checkpoint argparse.ArgumentParser tag.strip loss.mean.mean loss.mean.item torch.nn.utils.clip_grad_norm_ random.shuffle torch.optim.Adam.step os.path.exists metrics.f1_score len torch.cuda.is_available torch.tensor batch_output.detach.cpu os.makedirs
@developer
Could you please help me check this issue?
May I open a pull request to fix it?
Thank you very much.
Traceback (most recent call last):
File "run.py", line 138, in
run()
File "run.py", line 134, in run
train(train_loader, dev_loader, model, optimizer, scheduler, config.model_dir)
File "/home/ym/test2/CLU/BERT-LSTM-CRF/train.py", line 47, in train
train_epoch(train_loader, model, optimizer, scheduler, epoch)
File "/home/ym/test2/CLU/BERT-LSTM-CRF/train.py", line 22, in train_epoch
token_type_ids=None, attention_mask=batch_masks, labels=batch_labels)[0]
File "/home/ym/anaconda3/envs/test/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ym/test2/CLU/BERT-LSTM-CRF/model.py", line 52, in forward
logits = self.classifier(lstm_output)
File "/home/ym/anaconda3/envs/test/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ym/anaconda3/envs/test/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/home/ym/anaconda3/envs/test/lib/python3.7/site-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1600x2048 and 1024x31)
please help me😬
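A likely cause (a sketch, not the exact model.py code): with a bidirectional LSTM the output feature size is 2 * hidden_size (here 2 * 1024 = 2048), so the classifier must take 2048 inputs rather than 1024. The names below are illustrative:

```python
# Sketch: size the classifier for a bidirectional LSTM's concatenated hidden states.
import torch.nn as nn

hidden_size, num_labels = 1024, 31                   # inferred from the error message
lstm = nn.LSTM(input_size=768, hidden_size=hidden_size,
               batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * hidden_size, num_labels)  # 2x because the LSTM is bidirectional
```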
When running train.py, the following appears during training:
loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-pytorch_model.bin from cache at /home/xinliu/.cache/torch/pytorch_transformers/b1b5e295889f2d0979ede9a78ad9cb5dc6a0e25ab7f9417b315f0a2c22f4683d
Weights of BertForTokenClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
Weights from pretrained model not used in BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
Starting training for 20 epoch(s)
Epoch 1/20
0%| | 0/1400 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 220, in
train_and_evaluate(model, train_data, val_data, optimizer, scheduler, params, args.model_dir, args.restore_file)
File "train.py", line 99, in train_and_evaluate
train(model, train_data_iterator, optimizer, scheduler, params)
File "train.py", line 64, in train
loss.backward()
AttributeError: 'tuple' object has no attribute 'backward'
There seem to be two problems:
(1) The BertForTokenClassification model's weights were not initialized from the pretrained model, even though the pretrained model was downloaded successfully.
(2) AttributeError: 'tuple' object has no attribute 'backward'.
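Regarding (1), the warning is expected: the token-classification head (classifier.weight, classifier.bias) is newly created, so it cannot come from the pretrained checkpoint. For (2), here is a minimal sketch of the usual fix, assuming a newer pytorch-transformers/transformers where the forward pass returns a tuple with the loss first; the variable names mirror train.py and are assumptions here:

```python
# Sketch: handle models whose forward() returns a tuple (loss, logits, ...) instead of a bare loss.
def training_step(model, batch_data, batch_tags):
    """One training step; works whether the model returns a tuple or a loss tensor."""
    outputs = model(batch_data, token_type_ids=None,
                    attention_mask=batch_data.gt(0), labels=batch_tags)
    loss = outputs[0] if isinstance(outputs, tuple) else outputs
    loss.backward()
    return loss.item()
```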