guotong1988 / bert-gpu

Multi-GPU pre-training on one machine for BERT from scratch, without Horovod (data parallelism)

License: Apache License 2.0

Python 100.00%

tensorflow bert nlp multi-gpu

bert-gpu's Issues

This is just for pretraining BERT?

I want to fine-tune a pretrained BERT on multiple GPUs with the Estimator API. Can your code do this? Thanks.

《How To Pre-train BERT In GPUs》

ValueError: You must specify an aggregation method to update a MirroredVariable in Tower Context.

Has anyone encountered this error?

ValueError: You must specify an aggregation method to update a MirroredVariable in Tower Context.

My TensorFlow version is 1.10.0. I would like to know whether the TensorFlow version causes this error.

Error when running run_pretraining_gpu_v2 with init_checkpoint

Loading bert_model as init_checkpoint when running run_pretraining_gpu_v2 gives an error:
File "/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 368, in _collect_partitioned_variable
if name + "/part_0" in all_vars:
TypeError: unsupported operand type(s) for +: 'PerReplica' and 'str'
Running run_pretraining.py with bert_model as init_checkpoint works fine.

OOM error

Could I ask which of the official pretrained BERT models you used? I am using the wwm_uncased_L-24_H-1024_A-16 model and easily run into an OOM error.

Slower than single GPU

Hi, thanks for your work. I think there is an error in the time measurement: time.time() returns seconds, not milliseconds.

https://github.com/guotong1988/BERT-multi-gpu/blob/14c3ca9aaafe71e1bc816677ca36a12c81530b1d/run_pretraining_gpu.py#L726
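
The unit issue can be illustrated with a minimal sketch. It assumes a loop that reports elapsed time every 100 steps; `train_step`, `seconds_per_interval` and `to_milliseconds` are hypothetical names for illustration, not code from the repository.

```python
import time


def seconds_per_interval(train_step, num_steps, log_every=100):
    """Call train_step num_steps times and return the elapsed time, in
    seconds, of each completed interval of log_every steps."""
    durations = []
    start = time.time()  # time.time() returns seconds as a float
    for step in range(1, num_steps + 1):
        train_step()
        if step % log_every == 0:
            now = time.time()
            durations.append(now - start)  # the difference is in seconds
            start = now
    return durations


def to_milliseconds(seconds):
    """Convert explicitly if milliseconds are what should be printed."""
    return seconds * 1000.0
```

For interval measurements, `time.perf_counter()` is usually a better choice than `time.time()` because it is monotonic; both report seconds.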

Second, this naive MirroredStrategy setup appears to be slower than a single GPU. In my experiments on 4 Tesla V100 GPUs, this implementation takes around 88 seconds per 100 steps.

88.0555248260498  s
loss  22828.4058984375
------------
88.03012156486511  s
loss  17133.7074609375
------------
88.0766191482544  s
loss  13693.772515625
------------
88.04167246818542  s
loss  11408.907552083334

By contrast, the official single-GPU BERT reports 2.9 steps per second, i.e. around 34 seconds per 100 steps. Part of the log is as follows.

INFO:tensorflow:global_step/sec: 2.90046
INFO:tensorflow:examples/sec: 46.4073
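
The two measurements can be put on a common scale with a little arithmetic. The sketch below is only illustrative: the single-GPU numbers come from the log above, while the multi-GPU figure ASSUMES that train_batch_size=16 is applied per GPU on 4 GPUs, which the report does not confirm. The function names are made up for this example.

```python
def seconds_per_100_steps(steps_per_sec):
    """Convert a global_step/sec reading into seconds per 100 steps."""
    return 100.0 / steps_per_sec


def examples_per_sec(examples_per_step, steps_per_sec):
    """Throughput in examples per second."""
    return examples_per_step * steps_per_sec


# Single GPU, from the log: 2.90046 steps/sec with a batch of 16.
single_seconds = seconds_per_100_steps(2.90046)   # about 34.5 s
single_throughput = examples_per_sec(16, 2.90046)  # about 46.4, as logged

# Multi-GPU: about 88 s per 100 steps.
multi_steps_per_sec = 100.0 / 88.0
# ASSUMPTION: 4 replicas, each consuming its own batch of 16 per step.
multi_throughput = examples_per_sec(16 * 4, multi_steps_per_sec)
```

Under that assumption the 4-GPU run handles roughly 72.7 examples per second even though each step is slower. If the batch of 16 were instead split across the GPUs, throughput would be about 18.2 examples per second, so the comparison depends on how the batch size is interpreted.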

You can insert a hook to record performance:

hook = tf.train.ProfilerHook(save_steps=100,
                             output_dir=FLAGS.output_dir,
                             show_memory=True)
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps, hooks=[hook])

My experiment configuration is:

python3 run_pretraining_gpu.py \
  --input_file=data/tfrecord/sample.tfrecord \
  --output_dir=data/wiki_uncased_L-12_H-768_A-12 \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=${BERT_BASE_DIR}/bert_config.json \
  --init_checkpoint=${BERT_BASE_DIR}/bert_model.ckpt \
  --train_batch_size=16 \
  --max_seq_length=256 \
  --max_predictions_per_seq=38 \
  --num_train_steps=1000 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5

During eval, getting "ValueError: model_fn should return an EstimatorSpec". During training, OK

Thank you for BERT-multi-gpu.

I'm running run_pretraining_gpu_v2.py on the provided dataset sample_text.txt.
The only change I made was to the n_gpus flag (in my case, 3).

Training was fine. But I also have --do_eval=True (as below).

CUDA_VISIBLE_DEVICES=0,1,2 python run_pretraining_gpu_v2.py \
  --input_file=/tmp/tf_examples.tfrecord \
  --output_dir=/tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=./bert_config.json \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5

The error below occurs on TF 1.14.0:

I0705 09:37:15.292903 140495488808704 estimator.py:1147] Done calling model_fn.
I0705 09:37:15.293055 140495488808704 coordinator.py:219] Error reported to Coordinator: model_fn should return an EstimatorSpec.
Traceback (most recent call last):
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 911, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1150, in _call_model_fn
    raise ValueError('model_fn should return an EstimatorSpec.')
ValueError: model_fn should return an EstimatorSpec.
Traceback (most recent call last):
  File "run_pretraining_gpu_v2.py", line 501, in <module>
    tf.app.run()
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_pretraining_gpu_v2.py", line 487, in main
    input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 477, in evaluate
    name=name)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 517, in _actual_eval
    return _evaluate()
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 501, in _evaluate
    self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1498, in _evaluate_build_graph
    self._call_model_fn_eval_distributed(input_fn, self.config))
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1586, in _call_model_fn_eval_distributed
    args=(features, labels, ModeKeys.EVAL, config)))
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1555, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 693, in _call_for_each_replica
    fn, args, kwargs)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 195, in _call_for_each_replica
    coord.join(threads)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 911, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1150, in _call_model_fn
    raise ValueError('model_fn should return an EstimatorSpec.')
ValueError: model_fn should return an EstimatorSpec.

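Error when running create_pretraining_data.py

The model does not learn anything

After training for 500,000 steps on 6 GPUs with run_pretraining_gpu.py, the official eval interface reports an mlm_acc of 0. This suggests the model has not learned anything. Using it on downstream tasks brings no improvement either; its feature extraction ability is about the same as that of a randomly initialized BERT.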

Cannot reload pre-trained model

I got this error "Not found: Key bert/embeddings/LayerNorm/beta not found in checkpoint".

Experiment result

Eval on the Quora pairs dataset (text similarity task):
my BERT with loss 2.2: AUC 96.1, ACC 89.5
official BERT: AUC 96.9, ACC 91.2

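Does num_train_steps count steps for one GPU or for all GPUs?

I observed that steps/sec on a single GPU is even faster than on multiple GPUs, so I wonder whether I have misunderstood num_train_steps.

For example, if num_train_steps is set to 1000 and there are 4 GPUs, does the run actually execute 1000 steps in total, or 4000?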
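
A rough way to reason about this question, assuming synchronous data parallelism in which the global step advances once per synchronized update and every GPU consumes its own batch in that update. Whether this repository behaves exactly that way is an assumption here, and `training_totals` and the per-GPU batch size of 16 are purely illustrative.

```python
def training_totals(num_train_steps, per_gpu_batch_size, n_gpus):
    """Totals for a synchronous data-parallel run.

    ASSUMPTION: one global step = one synchronized update in which each
    of the n_gpus replicas processes per_gpu_batch_size examples.
    """
    global_batch_size = per_gpu_batch_size * n_gpus
    return {
        "global_steps": num_train_steps,
        "global_batch_size": global_batch_size,
        "examples_seen": num_train_steps * global_batch_size,
    }


totals = training_totals(num_train_steps=1000, per_gpu_batch_size=16, n_gpus=4)
```

Under these assumptions the run still executes 1000 global steps, but each step covers 4 times as many examples, which is also why steps/sec alone can look worse on multiple GPUs.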

Are the output model files compatible with the official BERT pre-trained models?

Hi, I tried to pre-train a BERT model with this project. I find that the output of the model is not compatible with the official BERT pre-trained models. Is it easy to make it compatible?

For example, I can use pytorch_transformers to read the official BERT pre-trained models, but when I do the same for the model trained by this project, I get shape mismatch errors:

RuntimeError: Error(s) in loading state_dict for BertForMultiLabelSequenceClassification:
	size mismatch for bert.encoder.layer.0.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
	size mismatch for bert.encoder.layer.0.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
	size mismatch for bert.encoder.layer.1.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
	size mismatch for bert.encoder.layer.1.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
	size mismatch for bert.encoder.layer.2.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
	size mismatch for bert.encoder.layer.2.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
	size mismatch for bert.encoder.layer.3.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
	size mismatch for bert.encoder.layer.3.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
	size mismatch for bert.encoder.layer.4.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
	size mismatch for bert.encoder.layer.4.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
	size mismatch for bert.encoder.layer.5.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
	size mismatch for bert.encoder.layer.5.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
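
Every pair of shapes in this error is an exact transpose. TensorFlow dense kernels are stored as [in_features, out_features], while a PyTorch Linear weight is [out_features, in_features], so one possible cause (an assumption, not confirmed in this thread) is that the dense kernels were not transposed during conversion. A minimal pure-Python sketch of the layout difference, with made-up helper names and dummy values:

```python
def shape(matrix):
    """Return (rows, cols) of a rectangular list-of-lists matrix."""
    return (len(matrix), len(matrix[0]))


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


# Hypothetical TF-style kernel for intermediate.dense with
# hidden_size=128 and intermediate_size=256: layout is [in, out].
tf_kernel = [[0.0] * 256 for _ in range(128)]

# Layout a PyTorch Linear(128, 256) expects for its weight: [out, in].
torch_weight = transpose(tf_kernel)
```

Here `shape(tf_kernel)` is (128, 256), matching the checkpoint side of the error, and `shape(torch_weight)` is (256, 128), matching what the PyTorch model expects.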

Should lines 74-75 in optimization_gpu.py be commented out?

This issue is about the global step calculation. Should lines 74-75 in optimization_gpu.py be commented out?

I don't think they should be commented out, since the global step update is still not done inside 'apply_gradients', right?

run_pretraining_gpu.py not working

Hi, could anyone help me with this problem? After I changed the parameters above, I ran run_pretraining_gpu.py and this is what happens. As you can see, it just stops here, even after a long time. Please help, thanks.

train_batch_size and time required to pretrain

When I set train_batch_size to 8 and run on 8 GPUs, the overall batch size is 64.
I therefore expected training to be faster than with a batch size of 8 on a single GPU, but in practice it takes a similar amount of time.
Is something going wrong?

XLNet support

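From zihangdai/xlnet#32

XLNet is more of a method for improving pre-training efficiency and saving pre-training time.

Since BERT-style models all get better the longer they are pre-trained, and changing or adding losses does not seem to make much difference, XLNet is still worth trying.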

Loss does not decrease...

I wonder why the reshaping is necessary?

I think this line is unnecessary and can itself trigger errors. Am I missing something?

model_fn should return an EstimatorSpec.

File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "run_pretraining_gpu_v2.py", line 494, in main
input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 480, in evaluate
name=name)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 520, in _actual_eval
return _evaluate()
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 504, in _evaluate
self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1508, in _evaluate_build_graph
self._call_model_fn_eval_distributed(input_fn, self.config))
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1596, in _call_model_fn_eval_distributed
args=(features, labels, ModeKeys.EVAL, config)))
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1810, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 662, in _call_for_each_replica
fn, args, kwargs)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 196, in _call_for_each_replica
coord.join(threads)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/six.py", line 719, in reraise
raise value
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 880, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1153, in _call_model_fn
raise ValueError('model_fn should return an EstimatorSpec.')
ValueError: model_fn should return an EstimatorSpec.

I get this error at the end of training, and the "eval_results.txt" file cannot be found.

TensorFlow2 support

Difference between run_pretraining_v2.py and run_pretraining.py

I want to know the difference between run_pretraining_v2.py and run_pretraining.py. Could you please explain it?

Cannot train on multiple GPUs

Although I edited n_gpus in the code and everything seems fine,
when I check the GPU status, I find that only one card is being used.

I would greatly appreciate any help with this issue.

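Some questions about multi-GPU training

1. I read through the closed issues and got confused. I am currently running this code on 3 GPUs, and by my calculation the training time is not reduced: it still takes about 11 hours, the same as on a single GPU.

Does multi-GPU training here mean that the loss decreases faster? In other settings, using multiple GPUs has always reduced my training time.

2. With num_train_steps=10000, is that 10000 steps in total across the 3 GPUs, or 10000 steps for each GPU?

Error in the do_eval stage after training for 100,000 steps

After do_train finishes, this error occurs when do_eval starts:

Segmentation fault (core dumped). Is the configuration wrong?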

GPT support

Error: tensorflow.python.framework.errors_impl.InvalidArgumentError

When I run gpu v2 with num_gpu = 8, using the default sample.tfrecord, the following error is raised:

tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node NcclAllReduce (defined at data/shuxiaobo/BERT-multi-gpu/run_pretraining_gpu_v2.py:480) with these attrs: [num_devices=6, reduction="sum", shared_name="c0", T=DT_FLOAT]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

	 [[NcclAllReduce]]

Does anyone have any idea about this?

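How can gradient accumulation be implemented?

Thank you for your contribution; I have benefited a lot. Pre-training large models may need a larger batch_size (RoBERTa's batch_size reaches 8000), which even multiple GPUs struggle to provide. So how should gradient accumulation be implemented in BERT-GPU?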
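
As a framework-agnostic illustration of the idea (not code from this repository, and not a TensorFlow implementation), gradient accumulation sums the gradients of several micro-batches and applies a single update with their average. The toy sketch below uses a one-parameter model y_hat = w * x with a mean-squared-error loss; all names and data are made up for the example.

```python
def mean_gradient(w, batch):
    """Gradient of the mean squared error of y_hat = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def sgd_step_with_accumulation(w, micro_batches, learning_rate):
    """Accumulate gradients over micro-batches, then apply ONE update."""
    accumulated = 0.0
    for micro_batch in micro_batches:
        accumulated += mean_gradient(w, micro_batch)  # no update yet
    averaged = accumulated / len(micro_batches)       # average, not sum
    return w - learning_rate * averaged


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
micro_batches = [data[:2], data[2:]]  # equal-sized micro-batches

w_accumulated = sgd_step_with_accumulation(0.0, micro_batches, 0.01)
w_full_batch = 0.0 - 0.01 * mean_gradient(0.0, data)
```

With equal-sized micro-batches the accumulated update matches a single step on the full batch, which is the property that lets a small per-step batch emulate a larger one. In a real optimizer the same pattern would also need the global step, learning-rate schedule and any gradient clipping to be applied once per accumulated update rather than once per micro-batch.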

So many bugs in run_pretraining.py and run_pretraining_v2.py

There are so many bugs in run_pretraining.py and run_pretraining_v2.py. Could you please provide code that runs successfully?

Question about "init_checkpoint" and the "output_dir" checkpoint

Dear Author @guotong1988:

I have a question about the difference between "init_checkpoint" and the model checkpoint saved in "output_dir". I want to continue training a model that was fine-tuned from a BERT model, and I am confused about the two: I found that the code initializes the model from "init_checkpoint" and then restores the model from "output_dir".
Could you please help me understand the difference between them?
Thanks very much!

[Try] Pre-train on 1 GPU with a large learning rate for 1,000,000 steps, then pre-train on 1 GPU with a small learning rate for another 1,000,000 steps.

Compare with the 4-GPU model on downstream tasks.
