lemonhu / ner-bert-pytorch
PyTorch solution of the named entity recognition task using Google AI's pre-trained BERT model.
License: MIT License
Hello, when I run `python evaluation.py` to evaluate, I hit the following problem. What could be causing it? The full output is:
Loading the dataset...
loading vocabulary file bert-base-chinese-pytorch/vocab.txt
done.
Starting evaluation...
Segmentation fault (core dumped)
In interactive.py I read a txt file one line at a time and tag each line as a sentence. The results are fairly correct at first, but why does it gradually start tagging empty strings as MISC, LOC and so on later (e.g. ('', 'MISC'))?
When I rerun it, I still get the same wrong results.
Only after re-downloading the folder and feeding single sentences as input do the results become good again...
Why is that?
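Blank or whitespace-only lines are a common cause of spurious ('', 'MISC')-style outputs. Below is a minimal sketch of filtering them out before tagging; `tag_sentence` is a hypothetical helper standing in for the model call inside interactive.py:

```python
# Minimal sketch: skip blank/whitespace-only lines before tagging each sentence.
# `tag_sentence` is a hypothetical helper standing in for the model call in interactive.py.
def tag_file(path, tag_sentence):
    results = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            sentence = line.strip()
            if not sentence:          # empty lines would otherwise be fed to the model
                continue              # and can come back tagged as MISC/LOC, etc.
            results.append(tag_sentence(sentence))
    return results
```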
Hello, developers:
The PyTorch dump of the bert-base-chinese-pytorch model cannot be downloaded. Could you share it separately? Thanks.
Training scores
2019-11-19 10:20:56,050:INFO: Epoch 14/20
2019-11-19 11:34:31,893:INFO: - Train metrics: loss: 00.00; f1: 96.55
2019-11-19 11:42:04,513:INFO: - Val metrics: loss: 00.01; f1: 95.67
2019-11-19 11:42:11,501:INFO: Best val f1: 98.54
Test scores
2019-11-19 14:29:30,180:INFO: Starting evaluation...
2019-11-19 14:36:44,370:INFO: - Test metrics: loss: 14.14; f1: 00.02
2019-11-19 14:36:53,070:INFO: precision recall f1-score support
prop 0.01 0.63 0.02 42707
parts 0.04 0.01 0.01 96256
model 0.01 0.03 0.01 23660
avg / total 0.03 0.18 0.02 162623
Both training and testing use evaluate.py. Why does this happen?
Hello, how should I apply your code in practice? For example, given new text, how do I extract the NER entities? Thanks!
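For reference, here is a minimal inference sketch assuming the repo's pytorch-pretrained-bert setup; the `data/msra/tags.txt` path is an assumption, and the fine-tuned weights would still need to be restored from the experiment checkpoint (e.g. via `utils.load_checkpoint`):

```python
# Minimal inference sketch (assumptions: tags.txt path, local bert-base-chinese-pytorch dir).
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForTokenClassification

bert_dir = "bert-base-chinese-pytorch"                              # path used in the README
tags = [line.strip() for line in open("data/msra/tags.txt", encoding="utf-8")]
idx2tag = dict(enumerate(tags))

tokenizer = BertTokenizer.from_pretrained(bert_dir)
model = BertForTokenClassification.from_pretrained(bert_dir, num_labels=len(idx2tag))
# fine-tuned weights would still need to be restored here, e.g. via utils.load_checkpoint
model.eval()

text = "北京大学位于海淀区"                                            # any new sentence
tokens = tokenizer.tokenize(text)
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    # without labels, BertForTokenClassification returns logits of shape (1, seq_len, num_labels)
    logits = model(input_ids, token_type_ids=None, attention_mask=input_ids.gt(0))
pred_ids = logits.argmax(-1)[0].tolist()
print(list(zip(tokens, [idx2tag[i] for i in pred_ids])))
```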
I currently have two datasets, with a sample-count ratio of roughly 100:1.
However, the two datasets do not use the same label types.
Can I train on the large dataset first and then fine-tune on the small one?
If so, how should I do it?
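One possible approach (a sketch, not a built-in workflow of this repo) is to keep the BERT encoder trained on the large dataset and attach a fresh classification head sized for the small dataset's label set; `NUM_LABELS_BIG` and `NUM_LABELS_SMALL` below are placeholders:

```python
# Sketch: reuse the encoder trained on the large dataset, new head for the small label set.
from pytorch_pretrained_bert import BertForTokenClassification

NUM_LABELS_BIG, NUM_LABELS_SMALL = 7, 5             # placeholders for the two tag sets

big_model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese-pytorch", num_labels=NUM_LABELS_BIG)
# ... train big_model on the large dataset first ...

small_model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese-pytorch", num_labels=NUM_LABELS_SMALL)   # new label set
small_model.bert.load_state_dict(big_model.bert.state_dict())   # reuse only the encoder
# then fine-tune small_model on the small dataset; its classifier is trained from scratch
```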
Hi
Thanks for sharing your work. Could you please give some guidance on how to train on custom data?
Thanks in advance.
When I ran the project with `python train.py`, it seemed to freeze when an epoch came to an end:
Epoch 1/10
/Users/xxx/opt/anaconda3/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:100: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
100%|█████████████████████████| 1400/1400 [4:48:03<00:00, 12.35s/it, loss=0.051]
and then nothing happened for four or more hours, with CPU and memory consumption still very high.
How can I solve this problem?
I used the original data, but with a newer PyTorch and transformers, and the f1 only reaches about 50%.
Also, is this value micro or macro?
Download Google's BERT-Base model for Chinese from BERT-Base, Chinese (Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters), and decompress it.
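For reference, here is a minimal conversion sketch, assuming the conversion helper shipped with pytorch-pretrained-bert (it can also be run as a module/CLI); the directory names follow the official download and this repo's README, and vocab.txt plus bert_config.json still need to be copied next to the dump:

```python
# Sketch: convert the downloaded TF checkpoint into the PyTorch dump this repo expects.
from pytorch_pretrained_bert.convert_tf_checkpoint_to_pytorch import (
    convert_tf_checkpoint_to_pytorch,
)

convert_tf_checkpoint_to_pytorch(
    "chinese_L-12_H-768_A-12/bert_model.ckpt",      # TF checkpoint from the download
    "chinese_L-12_H-768_A-12/bert_config.json",     # its config file
    "bert-base-chinese-pytorch/pytorch_model.bin",  # PyTorch dump used by this repo
)
```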
After running the program with the default parameters, the accuracy is low. The results are as follows:
Starting evaluation...
Test metrics: loss: 00.12; f1: 68.16
precision recall f1-score support
ORG 46.25 65.73 54.30 1287
LOC 66.21 81.61 73.11 2790
PER 68.62 76.68 72.43 1372
avg / total 62.11 76.62 68.50 5449
What could the problem be?
Hi lemonhu,
I am confused: does the Chinese BERT model apply labels to each Chinese word or to each Chinese character?
And is there any Chinese wordpiece processing?
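A quick way to check (a sketch): BERT's basic tokenization splits CJK text into single characters before wordpiece, so for Chinese the labels effectively apply per character:

```python
# Sketch: inspect how the Chinese BERT tokenizer splits a sentence.
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese-pytorch")
print(tokenizer.tokenize("我爱北京天安门"))
# expected (roughly): ['我', '爱', '北', '京', '天', '安', '门']
```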
Hello, I'd like to ask: for the train.log in experiments/base_model, did that training run use the parameters in params.json, or were they adjusted?
I ask because in your train.log both train_F1 and val_F1 are already above 0.9 at epoch 1, while I ran your code exactly as-is (apart from possible differences in args) and my metrics in the first few epochs are only around 0.4.
batch size: 32
GPU: RTX 2080Ti
torch: 1.7.1
Everything else is the same, using the defaults in params.json. Could you share your hyperparameters? Thanks.
How is wordpiece handled? I didn't see any related code.
Your project itself is very good, and the documentation is very detailed.
With pytorch-pretrained-bert 0.4.0, I get an f1 of about 80 on the project's MSRA data, but only around ten-something on my own data.
Also, could you point me to the official documentation of pytorch-pretrained-bert, and explain its relationship to Hugging Face?
Thanks for your contribution.
Hello, first of all thank you for open-sourcing the code; it is written very elegantly. I do have one question: I see that for tokenization you use BERT's built-in tokenizer, i.e. the WordpieceTokenizer, but that can merge characters. For a task like NER, don't we need some handling to ensure the tokenized sequence keeps the same length as the original sequence?
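A minimal sketch of one common workaround (not taken from this repo): tokenize each original character separately so you always know how many wordpieces it produced, and keep one label per original position:

```python
# Sketch: per-character tokenization so labels can stay aligned with the original sequence.
def tokenize_aligned(chars, tokenizer):
    """Tokenize per original character and record piece counts (illustrative helper)."""
    pieces, piece_counts = [], []
    for ch in chars:
        wp = tokenizer.tokenize(ch) or ["[UNK]"]   # never drop a position entirely
        pieces.extend(wp)
        piece_counts.append(len(wp))               # how many pieces this character became
    return pieces, piece_counts                    # labels can be re-aligned via piece_counts
```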
After training the model using the default parameters:
python train.py --data_dir data/msra --bert_model_dir bert-base-chinese-pytorch --model_dir experiments/base_model
I ran evaluation with
python evaluate.py --data_dir data/msra --bert_model_dir bert-base-chinese-pytorch --model_dir experiments/base_model
and got the following results, which are much lower than the "detailed results on the test set" reported in the repo README:
Test metrics: loss: 00.05; f1: 58.29
precision recall f1-score support
PER 35.36 90.09 50.79 1372
ORG 35.24 85.70 49.94 1287
LOC 53.90 92.11 68.01 2790
avg / total 44.83 90.09 59.41 5449
Any advice on how to reproduce the expected scores would be greatly appreciated.
For the MSRA dataset, I realised that the total number of entities being evaluated does not match the actual number.
As you can see, for the test data the support (true entities) is:
But the true entity counts are (I also checked the dataset you created, which also matches the counts below):
Can you please take a look at this problem?
How do you deal with the very long sentences in the dataset?
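One simple option (a sketch under the assumption of a fixed `max_len` limit, not necessarily what this repo does) is to split over-long token/tag sequences into chunks before feeding them to BERT:

```python
# Sketch: chunk long sentences (and their tags) to a fixed maximum length.
def split_long(tokens, tags, max_len=128):          # max_len is an assumed limit
    """Yield (tokens, tags) chunks no longer than max_len (illustrative helper)."""
    for start in range(0, len(tokens), max_len):
        yield tokens[start:start + max_len], tags[start:start + max_len]
```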
Hello, I've also been using BERT recently and ran into a problem. One of my experiments uses a training set of over a million characters, but when I use another training set of only a few tens of thousands of characters (CCKS), not a single entity is recognized. When using BERT, is the problem I'm hitting simply the training set size?
Hi,
First I want to thank you for releasing this amazing work.
I am trying to use your work to train another model.
However, I have to deal with the following problem.
In the data_loader process, when using the BERT tokenizer, a word in my training sentence is split into 4 tokens:
Courtaulds => ['court', '##au', '##ld', '##s']
Can you suggest how to represent the corresponding tag of this word?
Because in the tag file we only have one tag per word, the assertion that checks the lengths of the tag list and the token list raises an error.
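One common convention (a sketch, not this repo's fixed behavior): give the real tag to the first wordpiece and a padding tag such as 'X' (masked out of the loss and metrics) to the remaining pieces, so the token and tag lists stay the same length:

```python
# Sketch: expand word-level tags onto wordpiece tokens with a padding tag.
def expand_tags(words, tags, tokenizer, pad_tag="X"):
    """Align one-tag-per-word data with wordpiece tokens (illustrative helper)."""
    tokens, expanded = [], []
    for word, tag in zip(words, tags):
        pieces = tokenizer.tokenize(word) or ["[UNK]"]  # e.g. Courtaulds -> court ##au ##ld ##s
        tokens.extend(pieces)
        expanded.extend([tag] + [pad_tag] * (len(pieces) - 1))
    assert len(tokens) == len(expanded)                 # lengths now match by construction
    return tokens, expanded
```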
Hi, in NER-BERT-pytorch, inappropriate dependency version constraints can introduce risks.
Below are the dependencies and version constraints that the project is using:
tensorflow>=1.11.0
torch>=0.4.1
tqdm
pytorch-pretrained-bert==0.4.0
apex
The version constraint == introduces a risk of dependency conflicts because the dependency scope is too strict.
Constraints with no upper bound (or *) introduce a risk of missing-API errors, because the latest versions of the dependencies may remove some APIs.
After further analysis, in this project,
The version constraint of dependency tqdm can be changed to >=4.36.0,<=4.64.0.
The version constraint of dependency pytorch-pretrained-bert can be changed to >=0.3.0,<=0.6.2.
The above suggestions reduce dependency conflicts as much as possible while allowing the latest compatible versions to be used without introducing call errors in the project.
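A sketch of what requirements.txt would look like with the suggested ranges (an illustration of the proposal above, not a tested configuration):

```
tensorflow>=1.11.0
torch>=0.4.1
tqdm>=4.36.0,<=4.64.0
pytorch-pretrained-bert>=0.3.0,<=0.6.2
apex
```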
The current project invokes all of the following methods:
tqdm.trange
pytorch_pretrained_bert.BertForTokenClassification
torch.nn.DataParallel.to sentences.append metrics.classification_report print numpy.argmax list str loss.mean.backward self.__dict__.update collections.defaultdict.items tqdm.trange file.write ps.append data_loader.DataLoader.data_iterator zip json.dump chunks.append torch.nn.DataParallel.train filter set torch.nn.DataParallel any model tqdm.trange.set_postfix torch.nn.DataParallel.parameters rs.append logging.info self.load_tags.append self.tag2idx.get logging.getLogger.addHandler torch.tensor.to file_sentences.write metrics.items batch_output.detach.cpu.numpy.detach logging.getLogger.setLevel self.load_tags s.append logging.StreamHandler isinstance collections.defaultdict utils.load_checkpoint numpy.average train_and_evaluate torch.optim.Adam file_tags.write torch.nn.DataParallel.half torch.save get_entities train hasattr float words.append format enumerate utils.RunningAverage.update sum utils.RunningAverage logging.StreamHandler.setFormatter set.append pytorch_pretrained_bert.BertTokenizer.from_pretrained torch.cuda.device_count open batch_output.detach.cpu.numpy range utils.Params pytorch_pretrained_bert.BertConfig.from_json_file line.strip.split self.tokenizer.tokenize utils.set_logger model.load_state_dict torch.cuda.manual_seed_all pytorch_pretrained_bert.BertForTokenClassification pred_tags.extend ImportError e.d2.add torch.optim.lr_scheduler.LambdaLR build_tags batch_tags.to.numpy.to row_fmt.format torch.optim.Adam.backward min logging.getLogger evaluate.evaluate logging.FileHandler apex.optimizers.FusedAdam join json.load max shutil.copyfile chunk.split dataset.append random.seed model.classifier.named_parameters loss_avg argparse.ArgumentParser.parse_args hasattr.state_dict data_loader.DataLoader torch.nn.DataParallel.named_parameters apex.optimizers.FP16_Optimizer e.d1.add ValueError numpy.ones data_loader.DataLoader.load_data true_tags.extend logging.FileHandler.setFormatter argparse.ArgumentParser.add_argument self.tokenizer.convert_tokens_to_ids torch.load torch.optim.lr_scheduler.LambdaLR.step next torch.nn.DataParallel.zero_grad load_dataset save_dataset end_of_chunk torch.nn.DataParallel.eval line.strip.strip start_of_chunk optimizer_to_save.state_dict torch.manual_seed optimizer.load_state_dict f1s.append os.path.join pytorch_pretrained_bert.BertForTokenClassification.from_pretrained batch_tags.to.numpy self.load_sentences_tags logging.Formatter numpy.sum len.format torch.device line.strip os.path.isfile batch_data.gt set.update idx2tag.get os.mkdir evaluate utils.save_checkpoint argparse.ArgumentParser tag.strip loss.mean.mean loss.mean.item torch.nn.utils.clip_grad_norm_ random.shuffle torch.optim.Adam.step os.path.exists metrics.f1_score len torch.cuda.is_available torch.tensor batch_output.detach.cpu os.makedirs
@developer
Could you please help me check this issue?
May I open a pull request to fix it?
Thank you very much.
Traceback (most recent call last):
File "run.py", line 138, in
run()
File "run.py", line 134, in run
train(train_loader, dev_loader, model, optimizer, scheduler, config.model_dir)
File "/home/ym/test2/CLU/BERT-LSTM-CRF/train.py", line 47, in train
train_epoch(train_loader, model, optimizer, scheduler, epoch)
File "/home/ym/test2/CLU/BERT-LSTM-CRF/train.py", line 22, in train_epoch
token_type_ids=None, attention_mask=batch_masks, labels=batch_labels)[0]
File "/home/ym/anaconda3/envs/test/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ym/test2/CLU/BERT-LSTM-CRF/model.py", line 52, in forward
logits = self.classifier(lstm_output)
File "/home/ym/anaconda3/envs/test/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ym/anaconda3/envs/test/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/home/ym/anaconda3/envs/test/lib/python3.7/site-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1600x2048 and 1024x31)
please help me😬
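A likely cause (a sketch, not the exact model.py code): with a bidirectional LSTM the output feature size is 2 * hidden_size (here 2 * 1024 = 2048), so the classifier must take 2048 inputs rather than 1024. The names below are illustrative:

```python
# Sketch: size the classifier for a bidirectional LSTM's concatenated hidden states.
import torch.nn as nn

hidden_size, num_labels = 1024, 31                   # inferred from the error message
lstm = nn.LSTM(input_size=768, hidden_size=hidden_size,
               batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * hidden_size, num_labels)  # 2x because the LSTM is bidirectional
```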
When running train.py, the following appears during training:
loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-pytorch_model.bin from cache at /home/xinliu/.cache/torch/pytorch_transformers/b1b5e295889f2d0979ede9a78ad9cb5dc6a0e25ab7f9417b315f0a2c22f4683d
Weights of BertForTokenClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
Weights from pretrained model not used in BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
Starting training for 20 epoch(s)
Epoch 1/20
0%| | 0/1400 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 220, in
train_and_evaluate(model, train_data, val_data, optimizer, scheduler, params, args.model_dir, args.restore_file)
File "train.py", line 99, in train_and_evaluate
train(model, train_data_iterator, optimizer, scheduler, params)
File "train.py", line 64, in train
loss.backward()
AttributeError: 'tuple' object has no attribute 'backward'
There seem to be two problems:
(1) The BertForTokenClassification model's weights were not initialized from the pretrained model, even though the pretrained model was downloaded successfully.
(2) AttributeError: 'tuple' object has no attribute 'backward'.
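Regarding (1), the warning is expected: the token-classification head (classifier.weight, classifier.bias) is newly created, so it cannot come from the pretrained checkpoint. For (2), here is a minimal sketch of the usual fix, assuming a newer pytorch-transformers/transformers where the forward pass returns a tuple with the loss first; the variable names mirror train.py and are assumptions here:

```python
# Sketch: handle models whose forward() returns a tuple (loss, logits, ...) instead of a bare loss.
def training_step(model, batch_data, batch_tags):
    """One training step; works whether the model returns a tuple or a loss tensor."""
    outputs = model(batch_data, token_type_ids=None,
                    attention_mask=batch_data.gt(0), labels=batch_tags)
    loss = outputs[0] if isinstance(outputs, tuple) else outputs
    loss.backward()
    return loss.item()
```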