guotong1988 / bert-gpu
multi-gpu pre-training in one machine for BERT from scratch without horovod (Data Parallelism)
License: Apache License 2.0
I want to fine-tune a pretrained BERT model on multiple GPUs using the Estimator API. Can your code handle this? Thanks.
Has anyone else hit this error?
ValueError: You must specify an aggregation method to update a MirroredVariable in Tower Context.
My TensorFlow version is 1.10.0. Is the TensorFlow version the cause of this error?
Loading bert_model as init_checkpoint when running run_pretraining_gpu_v2 raises an error:
File "/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 368, in _collect_partitioned_variable
if name + "/part_0" in all_vars:
TypeError: unsupported operand type(s) for +: 'PerReplica' and 'str'
Running run_pretraining.py with the same bert_model as init_checkpoint works fine.
Could I ask which official pretrained BERT model you used? I used the wwm_uncased_L-24_H-1024_A-16 model and quickly ran into an OOM error.
Hi, thanks for your work. I think there is an error in the time measurement: time.time() returns seconds, not milliseconds.
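A minimal sketch of the unit issue described above: `time.time()` returns seconds as a float, so a millisecond figure needs an explicit factor of 1000. The helper name `timed` is hypothetical and is not part of the repository.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds, elapsed milliseconds)."""
    start = time.time()              # seconds since the epoch, as a float
    result = fn(*args, **kwargs)
    elapsed_s = time.time() - start  # still in seconds
    return result, elapsed_s, elapsed_s * 1000.0

# Stand-in workload; in the training script this would wrap a block of steps.
result, seconds, millis = timed(sum, range(1000))
print("%.6f s (%.3f ms)" % (seconds, millis))
```

Printing both units side by side makes it harder to mislabel the log line.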
Also, this naive MirroredStrategy setup appears slower than a single GPU. In my experiments on 4 Tesla V100 GPUs, this implementation takes around 88 seconds per 100 steps:
88.0555248260498 s
loss 22828.4058984375
------------
88.03012156486511 s
loss 17133.7074609375
------------
88.0766191482544 s
loss 13693.772515625
------------
88.04167246818542 s
loss 11408.907552083334
By contrast, the official single-GPU BERT reports 2.9 steps per second, which is around 34 seconds per 100 steps. Part of the log follows.
INFO:tensorflow:global_step/sec: 2.90046
INFO:tensorflow:examples/sec: 46.4073
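The two logs above use different units, so a small conversion helps when comparing them. The sketch below uses only the numbers quoted in this report. It assumes that `train_batch_size=16` is the per-GPU batch in the multi-GPU script, which the logs shown here do not confirm; the function names are illustrative.

```python
def seconds_per_steps(steps_per_sec, n_steps=100):
    """Convert a global_step/sec reading into seconds per n_steps."""
    return n_steps / steps_per_sec

def examples_per_sec(n_steps, seconds, per_gpu_batch, n_gpus):
    """Throughput in examples, ASSUMING every GPU consumes per_gpu_batch per step."""
    return n_steps * per_gpu_batch * n_gpus / seconds

single_gpu_seconds = seconds_per_steps(2.90046)             # seconds per 100 steps
single_gpu_rate = examples_per_sec(100, single_gpu_seconds, 16, 1)
multi_gpu_rate = examples_per_sec(100, 88.0555, 16, 4)      # only under the assumption

print("single GPU: %.1f s per 100 steps, %.1f examples/sec"
      % (single_gpu_seconds, single_gpu_rate))
print("4 GPUs (if batch is per GPU): %.1f examples/sec" % multi_gpu_rate)
```

If the batch size is instead split across the GPUs, the multi-GPU figure drops by a factor of four, so steps/sec alone does not settle which setup is faster.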
You can insert a hook to record performance:
hook = tf.train.ProfilerHook(save_steps=100,
                             output_dir=FLAGS.output_dir,
                             show_memory=True)
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps, hooks=[hook])
My experiment configuration is:
python3 run_pretraining_gpu.py \
--input_file=data/tfrecord/sample.tfrecord \
--output_dir=data/wiki_uncased_L-12_H-768_A-12 \
--do_train=True \
--do_eval=True \
--bert_config_file=${BERT_BASE_DIR}/bert_config.json \
--init_checkpoint=${BERT_BASE_DIR}/bert_model.ckpt \
--train_batch_size=16 \
--max_seq_length=256 \
--max_predictions_per_seq=38 \
--num_train_steps=1000 \
--num_warmup_steps=10 \
--learning_rate=2e-5
Thank you for BERT-multi-gpu.
I'm running run_pretraining_gpu_v2.py on the provided dataset sample_text.txt.
The only change I made was to the n_gpus flag (in my case, 3).
Training was fine, but I also set --do_eval=True (as below).
CUDA_VISIBLE_DEVICES=0,1,2 python run_pretraining_gpu_v2.py \
--input_file=/tmp/tf_examples.tfrecord \
--output_dir=/tmp/pretraining_output \
--do_train=True \
--do_eval=True \
--bert_config_file=./bert_config.json \
--train_batch_size=32 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=20 \
--num_warmup_steps=10 \
--learning_rate=2e-5
This produces the error below on TF 1.14.0:
I0705 09:37:15.292903 140495488808704 estimator.py:1147] Done calling model_fn.
I0705 09:37:15.293055 140495488808704 coordinator.py:219] Error reported to Coordinator: model_fn should return an EstimatorSpec.
Traceback (most recent call last):
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 911, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1150, in _call_model_fn
raise ValueError('model_fn should return an EstimatorSpec.')
ValueError: model_fn should return an EstimatorSpec.
Traceback (most recent call last):
File "run_pretraining_gpu_v2.py", line 501, in <module>
tf.app.run()
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "run_pretraining_gpu_v2.py", line 487, in main
input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 477, in evaluate
name=name)
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 517, in _actual_eval
return _evaluate()
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 501, in _evaluate
self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1498, in _evaluate_build_graph
self._call_model_fn_eval_distributed(input_fn, self.config))
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1586, in _call_model_fn_eval_distributed
args=(features, labels, ModeKeys.EVAL, config)))
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1555, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 693, in _call_for_each_replica
fn, args, kwargs)
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 195, in _call_for_each_replica
coord.join(threads)
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 911, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/home/auro/anaconda3/envs/tf-py2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1150, in _call_model_fn
raise ValueError('model_fn should return an EstimatorSpec.')
ValueError: model_fn should return an EstimatorSpec.
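After training for 500,000 iterations on 6 GPUs with run_pretraining_gpu.py, the official eval interface reports an mlm_acc of 0. This suggests the model has not learned anything: it gives no improvement on downstream tasks, and its feature-extraction ability is about the same as that of a randomly initialized BERT.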
I got this error "Not found: Key bert/embeddings/LayerNorm/beta not found in checkpoint".
Evaluation on the Quora pairs dataset (text similarity task):
my BERT with loss 2.2: AUC 96.1, ACC 89.5
official BERT: AUC 96.9, ACC 91.2
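I noticed that steps/sec on a single GPU is actually higher than on multiple GPUs, so I wonder whether I have misunderstood num_train_steps.
For example, if num_train_steps is set to 1000 and there are 4 GPUs, does training actually run 1000 steps in total, or 4000 steps?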
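One way to reason about the question above is to separate optimizer steps from examples processed. The sketch below assumes synchronous data parallelism in which the global step advances once per update and `train_batch_size` is the per-GPU batch. Neither assumption is confirmed by this thread, and the function name is hypothetical.

```python
def training_totals(num_train_steps, per_gpu_batch, n_gpus):
    """Bookkeeping for synchronous data-parallel training (assumptions above)."""
    global_batch = per_gpu_batch * n_gpus        # examples consumed per update
    return {
        "optimizer_steps": num_train_steps,      # one update per global step
        "global_batch": global_batch,
        "examples_seen": num_train_steps * global_batch,
    }

totals = training_totals(num_train_steps=1000, per_gpu_batch=16, n_gpus=4)
print(totals)
```

Under these assumptions, 1000 steps on 4 GPUs is still 1000 updates, each computed from four times as many examples, which is why steps/sec can fall while examples/sec rises.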
Hi, I tried to pre-train a BERT model with this project, and I find that the resulting model is not compatible with the official pre-trained BERT models. Is it easy to make it compatible?
For example, I can use pytorch_transformers to load the official pre-trained BERT models, but when I do the same for the model trained by this project, I get errors about mismatched shapes:
RuntimeError: Error(s) in loading state_dict for BertForMultiLabelSequenceClassification:
size mismatch for bert.encoder.layer.0.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
size mismatch for bert.encoder.layer.0.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
size mismatch for bert.encoder.layer.1.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
size mismatch for bert.encoder.layer.1.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
size mismatch for bert.encoder.layer.2.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
size mismatch for bert.encoder.layer.2.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
size mismatch for bert.encoder.layer.3.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
size mismatch for bert.encoder.layer.3.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
size mismatch for bert.encoder.layer.4.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
size mismatch for bert.encoder.layer.4.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
size mismatch for bert.encoder.layer.5.intermediate.dense.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
size mismatch for bert.encoder.layer.5.output.dense.weight: copying a param with shape torch.Size([256, 128]) from checkpoint, the shape in current model is torch.Size([128, 256]).
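Every mismatch in the log above is a pair of swapped dimensions ([128, 256] versus [256, 128]). That pattern is what appears when a TensorFlow dense kernel, stored as [in_features, out_features], is copied into a PyTorch `nn.Linear` weight, stored as [out_features, in_features], without a transpose. Whether this is the actual cause here is an assumption. The sketch uses plain Python lists in place of real tensors, and the sizes are read off the error message.

```python
def shape(matrix):
    """Return (rows, cols) of a rectangular list-of-lists matrix."""
    return (len(matrix), len(matrix[0]))

def transpose(matrix):
    """Swap rows and columns, as a TF-to-PyTorch converter does for dense kernels."""
    return [list(column) for column in zip(*matrix)]

hidden_size, intermediate_size = 128, 256   # taken from the shapes in the log

# TensorFlow layout for the intermediate dense layer: [in_features, out_features]
tf_kernel = [[0.0] * intermediate_size for _ in range(hidden_size)]

# PyTorch nn.Linear layout: [out_features, in_features]
torch_weight = transpose(tf_kernel)

print(shape(tf_kernel), "->", shape(torch_weight))
```

If the checkpoint was converted by a script, checking that it transposes the dense kernels is a reasonable first step.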
This issue is about the global step calculation. Should lines 74-75 in optimization_gpu.py be commented out?
I don't think they should be commented out, since the global step update is still not done inside 'apply_gradients', right?
When I set train_batch_size to 8 and train on 8 GPUs, the overall batch size is 64.
I expected training to be faster than with batch size 8 on a single GPU, but in practice it takes a similar amount of time.
Is something going wrong?
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "run_pretraining_gpu_v2.py", line 494, in main
input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 480, in evaluate
name=name)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 520, in _actual_eval
return _evaluate()
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 504, in _evaluate
self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1508, in _evaluate_build_graph
self._call_model_fn_eval_distributed(input_fn, self.config))
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1596, in _call_model_fn_eval_distributed
args=(features, labels, ModeKeys.EVAL, config)))
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1810, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 662, in _call_for_each_replica
fn, args, kwargs)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 196, in _call_for_each_replica
coord.join(threads)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/six.py", line 719, in reraise
raise value
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 880, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/home/GaoGuanglai/Namuhan_81/anaconda3/envs/bert-gpu-tf1.15-py3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1153, in _call_model_fn
raise ValueError('model_fn should return an EstimatorSpec.')
ValueError: model_fn should return an EstimatorSpec.
I get an error like this at the end of training, and I can't find the "eval_results.txt" file.
I want to know the difference between run_pretraining_v2.py and run_pretraining.py. Could you please explain it?
Although I edited n_gpus in the code and the run seems fine, when I check GPU usage I find that only one card is being used.
I would greatly appreciate any help with this issue.
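One thing worth ruling out in a report like this is that fewer GPUs are visible to the process than `n_gpus` requests. The report does not show the launch command, so this is only a guess at the cause. The sketch inspects only the `CUDA_VISIBLE_DEVICES` environment variable and does not query the driver; both helper names are hypothetical.

```python
import os

def visible_gpu_count(env=None):
    """Count device IDs listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None                      # unset: visibility is not restricted
    return len([d for d in value.split(",") if d.strip()])

def n_gpus_fits(n_gpus, env=None):
    """True when the requested GPU count does not exceed the listed devices."""
    visible = visible_gpu_count(env)
    return visible is None or n_gpus <= visible

print(visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1,2"}))   # 3
print(n_gpus_fits(3, {"CUDA_VISIBLE_DEVICES": "0"}))          # False
```

An empty value hides every GPU, which is why the empty string counts as zero devices here.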
When I run the GPU v2 script with num_gpu = 8, using the default sample.tfrecord, the following error is raised:
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node NcclAllReduce (defined at data/shuxiaobo/BERT-multi-gpu/run_pretraining_gpu_v2.py:480) with these attrs: [num_devices=6, reduction="sum", shared_name="c0", T=DT_FLOAT]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
<no registered kernels>
[[NcclAllReduce]]
Does anyone have any idea about this?
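Thank you for your contribution; I have learned a lot from it. Pre-training large models may need a larger batch_size (RoBERTa's batch_size reaches 8000), which is hard to reach even with multiple GPUs. How should gradient accumulation be implemented in BERT-GPU?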
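The idea behind gradient accumulation can be sketched independently of TensorFlow: compute gradients on several micro-batches, average them, and apply a single update. This is a toy illustration and not the repository's implementation. It uses plain SGD in place of the AdamWeightDecayOptimizer, and every name and number in it is made up for the example.

```python
def accumulate_gradients(micro_batch_grads):
    """Average per-parameter gradients over several micro-batches."""
    n = len(micro_batch_grads)
    summed = [0.0] * len(micro_batch_grads[0])
    for grads in micro_batch_grads:
        for i, g in enumerate(grads):
            summed[i] += g
    return [s / n for s in summed]

def sgd_step(weights, grads, lr):
    """One plain SGD update (a stand-in for the real optimizer)."""
    return [w - lr * g for w, g in zip(weights, grads)]

weights = [1.0, -2.0]
# Gradients from four micro-batches that together form one large batch.
micro_batch_grads = [[0.2, 0.4], [0.6, 0.0], [0.1, 0.2], [0.3, 0.2]]

avg = accumulate_gradients(micro_batch_grads)
weights = sgd_step(weights, avg, lr=0.1)   # a single update for all four
print(avg, weights)
```

With equal-sized micro-batches, the averaged gradient matches the gradient of the mean loss over the combined batch, so memory use stays at the micro-batch level while the effective batch grows.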
There are many bugs in run_pretraining.py and run_pretraining_v2.py. Could you please provide code that runs successfully?
Dear author @guotong1988:
I have a question about the difference between "init_checkpoint" and the model checkpoints saved in "output_dir". I want to continue training a model that was fine-tuned from a BERT model, and I am confused about how the two interact: I found that the code initializes the model from "init_checkpoint" and then restores the model from "output_dir".
Could you please help me understand the difference between them?
Thanks very much!
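The behaviour described in this question matches how a TensorFlow Estimator generally treats its model directory: if the directory already holds a checkpoint, training resumes from it, and values loaded from an initial checkpoint are overwritten by that restore. The sketch below models only that precedence. It checks for the small index file named `checkpoint` that TensorFlow writes into a checkpoint directory; the function name and the returned strings are hypothetical.

```python
import os
import tempfile

def resolve_start_point(output_dir, init_checkpoint):
    """Return which weights a run would effectively start from."""
    # A "checkpoint" index file signals that earlier checkpoints exist.
    if os.path.exists(os.path.join(output_dir, "checkpoint")):
        return "resume from output_dir"
    if init_checkpoint:
        return "initialize from init_checkpoint"
    return "random initialization"

with tempfile.TemporaryDirectory() as output_dir:
    # Fresh directory: init_checkpoint decides the starting weights.
    print(resolve_start_point(output_dir, "bert_model.ckpt"))
    # Simulate an earlier run having saved a checkpoint.
    open(os.path.join(output_dir, "checkpoint"), "w").close()
    print(resolve_start_point(output_dir, "bert_model.ckpt"))
```

So to continue an interrupted run, keep the same output_dir; to start again from the fine-tuned weights, point init_checkpoint at them and use a new, empty output_dir.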
You didn't override the apply_gradients function in class AdamWeightDecayOptimizer. Normally the global step update is done inside of apply_gradients.
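The point about where the global step gets incremented can be shown with a toy model. This is not the repository's optimizer: `ToyOptimizer` and `train_step` are hypothetical stand-ins that only track a counter, so the effect of a missing or duplicated increment is easy to see.

```python
class ToyOptimizer:
    """Stand-in optimizer; only the global-step bookkeeping is modelled."""

    def __init__(self, increments_global_step):
        self.increments_global_step = increments_global_step

    def apply_gradients(self, grads_and_vars, state):
        # The weight update itself is omitted in this sketch.
        if self.increments_global_step:
            state["global_step"] += 1

def train_step(optimizer, state, manual_increment):
    """Run one step and return the resulting global step."""
    optimizer.apply_gradients([], state)
    if manual_increment:                 # the extra update done by the caller
        state["global_step"] += 1
    return state["global_step"]

for inside, manual in [(False, True), (True, False), (True, True), (False, False)]:
    step = train_step(ToyOptimizer(inside), {"global_step": 0}, manual)
    print("optimizer increments=%s, manual=%s -> global_step=%d"
          % (inside, manual, step))
```

Exactly one of the two places should advance the counter: with neither, training never reaches num_train_steps, and with both, it stops after half as many updates.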
I used tensorflow-gpu 1.12.0 to run run_pretraining_gpu_v2.py, but ran into this problem. Is the version I am using wrong?
compare to 4-GPU for downstream tasks.