guotong1988 / bert-gpu

multi-gpu pre-training in one machine for BERT from scratch without horovod (Data Parallelism)

License: Apache License 2.0

Python 100.00%

tensorflow bert nlp multi-gpu

bert-gpu's Introduction

BERT MULTI-GPU PRE-TRAIN ON ONE MACHINE WITHOUT HOROVOD (Data Parallelism)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

RATIONALE / PRINCIPLE

More GPUs mean more data per batch (a larger batch size), and the gradients of all examples in a batch are averaged for back-propagation.

If the total learning rate of a batch is fixed, the learning rate per example becomes smaller as the batch size grows.

If the learning rate per example is fixed, the total learning rate of a batch becomes larger as the batch size grows.

Conclusion: more GPUs --> larger total learning rate per batch --> faster training.
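
The averaging argument can be checked numerically. The following is a minimal sketch, not code from this repository: it uses a hypothetical one-parameter least-squares model, and it assumes every GPU receives a shard of equal size. Under that assumption, averaging the per-GPU mean gradients gives the same value as the mean gradient over the combined batch.

```python
# Sketch only: a toy model y = w * x with squared-error loss.
# All names here are illustrative and do not come from the repository.

def mean_gradient(w, examples):
    """Mean of d/dw (w*x - y)^2 over a list of (x, y) pairs."""
    grads = [2.0 * (w * x - y) * x for x, y in examples]
    return sum(grads) / len(grads)


def data_parallel_gradient(w, shards):
    """Average the per-shard mean gradients, one replica per shard."""
    per_replica = [mean_gradient(w, shard) for shard in shards]
    return sum(per_replica) / len(per_replica)


w = 0.5
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
shards = [batch[:2], batch[2:]]  # two "GPUs", two examples each

full = mean_gradient(w, batch)
parallel = data_parallel_gradient(w, shards)
print(full, parallel)
```

If the shards had unequal sizes, a plain average of per-replica means would no longer match the full-batch mean, which is why the equal-shard assumption matters.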

WHAT'S NEW

Training on 1 GPU (batch size 100) versus 4 GPUs (batch size 400) with the same learning rate (0.00001) and the same number of pre-training steps (1,000,000) makes a difference of less than 0.1% in downstream task accuracy.

REQUIREMENT

python 3

tensorflow 1.14 - 1.15

TRAINING

0, edit the input and output file names in create_pretraining_data.py and run_pretraining_gpu.py

1, run create_pretraining_data.py

2, run run_pretraining_gpu_v2.py

PARAMETERS

Edit n_gpus in run_pretraining_gpu_v2.py

DATA

In sample_text.txt, each sentence ends with \n, and paragraphs are separated by an empty line.
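
To make the layout concrete, here is a small sketch of a reader for text in that format. It is not the parsing code used by create_pretraining_data.py; the function name split_paragraphs and the sample string are made up for illustration.

```python
# Sketch only: parse text laid out as one sentence per line,
# with an empty line between paragraphs.

def split_paragraphs(text):
    """Return a list of paragraphs, each a list of sentence strings."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        sentence = line.strip()
        if sentence:
            current.append(sentence)
        elif current:
            # An empty line closes the paragraph collected so far.
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


sample = "First sentence.\nSecond sentence.\n\nNew paragraph here.\n"
paragraphs = split_paragraphs(sample)
print(paragraphs)
```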

EXPERIMENT RESULT ON DOWNSTREAM TASKS

Quora question pairs English dataset,

Official BERT: ACC 91.2, AUC 96.9

This BERT with pretrain loss 2.05: ACC 90.1, AUC 96.3

NOTE

1)

With MirroredStrategy and HierarchicalCopyAllReduce, global_step/sec reports the sum of the steps across all GPUs.

2)

batch_size is the batch size per GPU, not the global batch size.
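
The relation between the per-GPU and global batch size is simple arithmetic. The sketch below uses hypothetical helper names and assumes that every training step consumes one per-GPU batch on each GPU, as in ordinary synchronous data parallelism; it is not taken from the repository code.

```python
# Sketch only: helper names are illustrative.

def global_batch_size(per_gpu_batch_size, n_gpus):
    """Examples consumed by one synchronous step across all GPUs."""
    return per_gpu_batch_size * n_gpus


def examples_seen(per_gpu_batch_size, n_gpus, num_steps):
    """Total examples consumed after num_steps synchronous steps."""
    return global_batch_size(per_gpu_batch_size, n_gpus) * num_steps


# Numbers from the README: batch size 100 per GPU on 4 GPUs.
print(global_batch_size(100, 4))
print(examples_seen(100, 4, 1000))
```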

bert-gpu's Issues

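Does num_train_steps count steps for one GPU or for all GPUs?

I noticed that steps/sec on a single GPU is higher than on multiple GPUs, so I wonder whether I have misunderstood num_train_steps.

For example, if num_train_steps is set to 1000 and there are 4 GPUs, does training run 1000 steps in total or 4000 steps?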

Cannot train on multiple GPUs

Although I edited n_gpus in the code and everything seemed fine, when I checked GPU usage I found that only one card was being used.

I would really appreciate any help with this issue.

Output model files compatible with Official Bert's pre-trained models?

Hi, I tried to pre-train a BERT model with this project. I find that the model it outputs is not compatible with the official BERT pre-trained models. Is it easy to make it compatible?

For example, I can use pytorch_transformers to read the official BERT pre-trained models, but when I do the same for the model trained by this project, I get shape-mismatch errors:

RuntimeError: Error(s) in loading state_dict for BertForMultiLabelSequenceClassification:
	size mismatch for bert.encoder.layer.0.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
	size mismatch for bert.encoder.layer.0.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
	size mismatch for bert.encoder.layer.1.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
	size mismatch for bert.encoder.layer.1.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
	size mismatch for bert.encoder.layer.2.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
	size mismatch for bert.encoder.layer.2.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
	size mismatch for bert.encoder.layer.3.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
	size mismatch for bert.encoder.layer.3.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
	size mismatch for bert.encoder.layer.4.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
	size mismatch for bert.encoder.layer.4.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
	size mismatch for bert.encoder.layer.5.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
	size mismatch for bert.encoder.layer.5.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
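
One possible cause, offered as an assumption rather than a diagnosis: TensorFlow dense layers store their kernel as [input_dim, output_dim], while PyTorch's nn.Linear stores its weight as [output_dim, input_dim], so a kernel copied without transposition arrives with its two dimensions swapped. A mismatch between the bert_config.json used for training and the one used for loading would produce the same message. The sketch below only illustrates the transposition on a placeholder matrix with the shapes from the log; the helper names are hypothetical.

```python
# Sketch only: plain nested lists stand in for real tensors.

def shape(matrix):
    """Return (rows, cols) of a rectangular nested list."""
    return (len(matrix), len(matrix[0]))


def transpose(matrix):
    """Swap rows and columns."""
    return [list(row) for row in zip(*matrix)]


# A kernel stored as [input_dim=128, output_dim=256].
tf_kernel = [[0.0] * 256 for _ in range(128)]

# After transposition it has the [256, 128] shape the error message asks for.
torch_weight = transpose(tf_kernel)
print(shape(tf_kernel), shape(torch_weight))
```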

TensorFlow2 support

XLNet support

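From zihangdai/xlnet#32

XLNet is more of a way to improve pre-training efficiency and save pre-training time.

Since BERT-style models keep improving the longer they are pre-trained, and changing or adding losses does not seem to make much difference, XLNet is worth trying.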

I wonder why is the reshaping necessary?

I think this line is unnecessary and can itself trigger errors. Am I missing something?

Error when running create_pretraining_data.py

ImportError: No module named 'tensorflow.python.distribute.cross_device_ops'

I used TensorFlow-gpu-1.12.0 to run run_pretraining_gpu_v2.py and got this error. Am I using the wrong version?
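
The README lists TensorFlow 1.14 - 1.15 as the requirement, so 1.12.0 falls outside that range. A small guard like the following could fail early with a clearer message. It is a sketch with hypothetical function names; in a real script the string would come from tf.__version__.

```python
# Sketch only: compare a version string against the range named in the README.

def parse_major_minor(version):
    """Turn a string such as '1.14.0' into the tuple (1, 14)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_supported(version, lowest=(1, 14), highest=(1, 15)):
    """True if major.minor lies within the inclusive range."""
    return lowest <= parse_major_minor(version) <= highest


for candidate in ["1.12.0", "1.14.0", "1.15.2", "2.0.0"]:
    print(candidate, is_supported(candidate))
```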

Cannot reload pre-trained model

I got this error "Not found: Key bert/embeddings/LayerNorm/beta not found in checkpoint".

OOM error

Could I ask which official BERT pretrained model you used? I used the wwm_uncased_L-24_H-1024_A-16 model and quickly ran into an OOM error.

model_fn should return an EstimatorSpec.

File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "run_pretraining_gpu_v2.py", line 494, in main
input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 480, in evaluate
name=name)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 520, in _actual_eval
return _evaluate()
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 504, in _evaluate
self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1508, in _evaluate_build_graph
self._call_model_fn_eval_distributed(input_fn, self.config))
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1596, in _call_model_fn_eval_distributed
args=(features, labels, ModeKeys.EVAL, config)))
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1810, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 662, in _call_for_each_replica
fn, args, kwargs)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 196, in _call_for_each_replica
coord.join(threads)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/six.py", line 719, in reraise
raise value
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 880, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1153, in _call_model_fn
raise ValueError('model_fn should return an EstimatorSpec.')
ValueError: model_fn should return an EstimatorSpec.

I get the error above at the end of training and cannot find the "eval_results.txt" file.

Slower than single GPU

Hi, thanks for your work. I'd say that there is an error when measuring time. The unit of time.time() is second, not ms.

https://github.com/guotong1988/BERT-multi-gpu/blob/14c3ca9aaafe71e1bc816677ca36a12c81530b1d/run_pretraining_gpu.py#L726

Then, I would say that this naive MirroredStrategy is slower than single GPU. According to my experiments on 4 Tesla V100 GPUs, this implementation takes around 88 seconds per 100 steps.

88.0555248260498  s
loss  22828.4058984375
------------
88.03012156486511  s
loss  17133.7074609375
------------
88.0766191482544  s
loss  13693.772515625
------------
88.04167246818542  s
loss  11408.907552083334

While the official BERT single-gpu version reports 2.9 steps per second, which means it takes around 34 seconds for 100 steps. Part of the log is as follows.

INFO:tensorflow:global_step/sec: 2.90046
INFO:tensorflow:examples/sec: 46.4073
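
The two measurements use different units, so it can help to convert both to examples per second. The sketch below does only that arithmetic with the figures quoted in this issue. It assumes, without confirmation from the issue, that in the 4-GPU run each step processes train_batch_size=16 examples on every GPU; if that assumption does not hold, the multi-GPU figure changes accordingly. The function names are hypothetical.

```python
# Sketch only: unit conversions for the timings reported above.

def seconds_per_100_steps(steps_per_sec):
    """Convert a global_step/sec reading into seconds per 100 steps."""
    return 100.0 / steps_per_sec


def examples_per_sec(global_batch, steps, seconds):
    """Throughput in examples per second."""
    return global_batch * steps / seconds


# Single GPU: 2.90046 steps/sec at batch size 16.
single_seconds = seconds_per_100_steps(2.90046)
single_rate = examples_per_sec(16, 100, single_seconds)

# 4 GPUs: 88 seconds per 100 steps, ASSUMING 16 examples per GPU per step.
multi_rate = examples_per_sec(16 * 4, 100, 88.0)

print(round(single_seconds, 1), round(single_rate, 1), round(multi_rate, 1))
```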

You can insert a hook to record performance:

hook = tf.train.ProfilerHook(save_steps=100,
                             output_dir=FLAGS.output_dir,
                             show_memory=True)
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps, hooks=[hook])

My experiment configuration is:

python3 run_pretraining_gpu.py \
  --input_file=data/tfrecord/sample.tfrecord \
  --output_dir=data/wiki_uncased_L-12_H-768_A-12 \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=${BERT_BASE_DIR}/bert_config.json \
  --init_checkpoint=${BERT_BASE_DIR}/bert_model.ckpt \
  --train_batch_size=16 \
  --max_seq_length=256 \
  --max_predictions_per_seq=38 \
  --num_train_steps=1000 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5

Should lines 74-75 in optimization_gpu.py be commented out?

This issue is about the global step calculation. Should lines 74-75 in optimization_gpu.py be commented out?

I don't think they should be commented out, since the global step update is still not done inside 'apply_gradients', right?

During eval, getting "ValueError: model_fn should return an EstimatorSpec". During training, OK

Thank you for BERT-multi-gpu.

I'm running run_pretraining_gpu_v2.py on the provided dataset sample_text.txt.
The only change I made was to the n_gpus flag (in my case, 3).

Training was fine. But I also have --do_eval=True (as below).

CUDA_VISIBLE_DEVICES=0,1,2 python run_pretraining_gpu_v2.py \
  --input_file=/tmp/tf_examples.tfrecord \
  --output_dir=/tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=./bert_config.json \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5

The error below occurs on TF 1.14.0:

I0705 09:37:15.292903 140495488808704 estimator.py:1147] Done calling model_fn.
I0705 09:37:15.293055 140495488808704 coordinator.py:219] Error reported to Coordinator: model_fn should return an EstimatorSpec.
Traceback (most recent call last):
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 911, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1150, in _call_model_fn
    raise ValueError('model_fn should return an EstimatorSpec.')
ValueError: model_fn should return an EstimatorSpec.
Traceback (most recent call last):
  File "run_pretraining_gpu_v2.py", line 501, in <module>
    tf.app.run()
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_pretraining_gpu_v2.py", line 487, in main
    input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 477, in evaluate
    name=name)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 517, in _actual_eval
    return _evaluate()
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 501, in _evaluate
    self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1498, in _evaluate_build_graph
    self._call_model_fn_eval_distributed(input_fn, self.config))
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1586, in _call_model_fn_eval_distributed
    args=(features, labels, ModeKeys.EVAL, config)))
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1555, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 693, in _call_for_each_replica
    fn, args, kwargs)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 195, in _call_for_each_replica
    coord.join(threads)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 911, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1150, in _call_model_fn
    raise ValueError('model_fn should return an EstimatorSpec.')
ValueError: model_fn should return an EstimatorSpec.

train_batch_size and time required to pretrain

When I set train_batch_size to 8 and run on 8 GPUs, the overall batch size is 64.
I expected training to be faster than with batch size 8 on a single GPU, but in practice it takes about the same amount of time.
Is something going wrong?

Is this just for pretraining BERT?

I want to fine-tune a pretrained BERT on multiple GPUs with the Estimator API. Can your code do this? Thanks.

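The model learns nothing

After training for 500,000 steps on 6 GPUs with run_pretraining_gpu.py, the official eval interface reports an mlm_acc of 0. This suggests the model has not learned anything. It also brings no improvement on downstream tasks; its feature-extraction ability is about the same as that of a randomly initialized BERT.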

ModuleNotFoundError: No module named 'tensorflow.python.distribute.cross_device_ops'

With TensorFlow 1.12.0, the import fails: ModuleNotFoundError: No module named 'tensorflow.python.distribute.cross_device_ops'

Error in the do_eval stage after 100,000 training steps finish

After do_train completes, this error appears when do_eval starts.

Question about "init_checkpoint" and the "output_dir" checkpoint

Dear author @guotong1988:

I have a question about the difference between "init_checkpoint" and the model checkpoint saved in "output_dir". When I wanted to continue training a model that was fine-tuned from a BERT model, I was confused by the two: I found that the code initializes the model from "init_checkpoint" and then restores the model from "output_dir".
Could you please help me understand the difference between them?
Thanks very much!

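How can gradient accumulation be implemented?

Thank you for your contribution; I have learned a lot from it. Pre-training large models may need a larger batch_size (RoBERTa's batch_size reaches 8000), which even multiple GPUs can hardly provide, so I would like to know how gradient accumulation should be implemented in BERT-GPU.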
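
As a general illustration of gradient accumulation, and not an answer from the repository author or code that plugs into BERT-GPU, the sketch below sums gradients over several micro-batches and applies a single update with their average. It uses a hypothetical one-parameter least-squares model and assumes equal-size micro-batches, in which case the update matches one step on the combined batch.

```python
# Sketch only: toy model y = w * x with squared-error loss and plain SGD.
# All names are illustrative.

def mean_gradient(w, examples):
    """Mean of d/dw (w*x - y)^2 over a list of (x, y) pairs."""
    grads = [2.0 * (w * x - y) * x for x, y in examples]
    return sum(grads) / len(grads)


def accumulated_step(w, micro_batches, learning_rate):
    """One SGD update from gradients accumulated over several micro-batches."""
    accumulated = 0.0
    for micro_batch in micro_batches:
        # Only the running sum is kept between micro-batches.
        accumulated += mean_gradient(w, micro_batch)
    averaged = accumulated / len(micro_batches)
    return w - learning_rate * averaged


w = 0.5
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
micro_batches = [batch[:2], batch[2:]]

w_accumulated = accumulated_step(w, micro_batches, learning_rate=0.01)
w_big_batch = w - 0.01 * mean_gradient(w, batch)
print(w_accumulated, w_big_batch)
```

In a real TensorFlow 1.x training loop the running sum would live in non-trainable variables and the optimizer update would run only every k-th step; that wiring is left out here.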

run_pretraining_gpu.py not working

Hi, could anyone help me with this problem? After I changed the parameters above, I ran run_pretraining_gpu.py and got the following. As you can see, it just stops here, even after a long time. Please help, thanks.

Loss does not decrease...

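Some questions about multi-GPU training

1. I read through the closed issues and ended up confused. I am currently running this code on 3 GPUs, and by my calculation it does not reduce training time: it still takes about 11 hours, the same as on a single GPU.

Does multi-GPU training here mean that the loss decreases faster? In other settings, using multiple GPUs has always reduced my training time.

2. With num_train_steps=10000, do the 3 GPUs train 10000 steps in total, or is that the step count for each GPU?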

ValueError: You must specify an aggregation method to update a MirroredVariable in Tower Context.

Has anyone else hit this error?

ValueError: You must specify an aggregation method to update a MirroredVariable in Tower Context.

My TensorFlow version is 1.10.0. I would like to know whether the TensorFlow version causes this error.

Error: tensorflow.python.framework.errors_impl.InvalidArgumentError

When I run the GPU v2 script with num_gpu = 8, using the default sample.tfrecord, the following error is raised:

tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node NcclAllReduce (defined at data/shuxiaobo/BERT-multi-gpu/run_pretraining_gpu_v2.py:480) with these attrs: [num_devices=6, reduction="sum", shared_name="c0", T=DT_FLOAT]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

	 [[NcclAllReduce]]

Does anyone have any idea about this?

Error when running run_pretraining_gpu_v2 with init_checkpoint

Loading bert_model as init_checkpoint when running run_pretraining_gpu_v2 gives an error:
File "/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 368, in _collect_partitioned_variable
if name + "/part_0" in all_vars:
TypeError: unsupported operand type(s) for +: 'PerReplica' and 'str'
Running run_pretraining.py with bert_model as init_checkpoint works fine.

[Try] Pre-train on 1 GPU with a large learning rate for 1,000,000 steps, then pre-train on 1 GPU with a small learning rate for another 1,000,000 steps.

Compare with 4-GPU training on downstream tasks.