Comments (4)
We do have some local patches for multi-card training, but even the current TOT should not have overhead this big.
You can probably run a profiler to see what is bottlenecking it.
We can also help investigate the problem if you provide more information.
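For example, a profile like the one you would want to share can be captured with Python's built-in cProfile. This is a rough sketch, not gc-dpr code; run_two_steps here is a hypothetical stand-in for loading the data and running two training steps:
import cProfile
import pstats

def run_two_steps():
    # placeholder: load the training data and run two forward/backward/optimizer steps
    pass

# dump the stats to a file, then print the functions with the largest total time
cProfile.run("run_two_steps()", "train.prof")
pstats.Stats("train.prof").sort_stats("tottime").print_stats(25)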
from gc-dpr.
Hi @luyug, Thank you for your help.
I loaded data and then ran two steps to see the time.
This is the head of the profile when using two GPUs (two 2080 Ti, 11 GB each).
ncalls tottime percall cumtime percall filename:lineno(function)
3 25.258 8.419 25.258 8.419 decoder.py:343(raw_decode)
82 9.238 0.113 9.238 0.113 :0(run_backward)
1156049 9.164 0.000 15.516 0.000 module.py:774(__setattr__)
2898444/367168 6.542 0.000 7.415 0.000 module.py:1215(named_modules)
2114 4.500 0.002 4.500 0.002 :0(acquire)
3265133/3265132 3.398 0.000 3.397 0.000 :0(get)
883 3.255 0.004 5.865 0.007 :0(read)
310 3.243 0.010 3.243 0.010 :0(normal_)
65 2.610 0.040 2.610 0.040 :0(utf_8_decode)
2471319 2.546 0.000 2.548 0.000 :0(isinstance)
2092 2.455 0.001 2.457 0.001 :0(to)
504 2.263 0.004 2.263 0.004 :0(_scatter)
168 1.831 0.011 29.583 0.176 replicate.py:78(replicate)
145338 1.607 0.000 12.096 0.000 module.py:1376(_replicate_for_data_parallel)
187 1.333 0.007 1.333 0.007 :0(_cuda_isDriverSufficient)
955800 1.077 0.000 1.077 0.000 :0(items)
136794 1.036 0.000 8.640 0.000 module.py:1048(_named_members)
vs. the profile when running on a single GPU:
ncalls tottime percall cumtime percall filename:lineno(function)
3 25.166 8.389 25.166 8.389 decoder.py:343(raw_decode)
82 3.695 0.045 3.695 0.045 :0(run_backward)
857 3.231 0.004 5.836 0.007 :0(read)
310 3.229 0.010 3.229 0.010 :0(normal_)
74 2.605 0.035 2.605 0.035 :0(utf_8_decode)
1838 2.414 0.001 2.415 0.001 :0(to)
172 1.585 0.009 1.585 0.009 :0(_cuda_isDriverSufficient)
292 0.869 0.003 0.869 0.003 :0(uniform_)
15362 0.724 0.000 0.724 0.000 :0(matmul)
36480/160 0.507 0.000 3.585 0.022 module.py:710(_call_impl)
398 0.387 0.001 0.387 0.001 :0(copy_)
412 0.387 0.001 0.387 0.001 :0(_set_from_file)
11680 0.272 0.000 1.047 0.000 functional.py:1655(linear)
1 0.235 0.235 31.222 31.222 __init__.py:274(load)
27907 0.199 0.000 0.350 0.000 module.py:774(__setattr__)
**************** CONFIGURATION ****************
adam_betas --> (0.9, 0.999)
adam_eps --> 1e-08
batch_size --> 128
checkpoint_file_name --> dpr_biencoder
ctx_chunk_size --> 8
dev_batch_size --> 16
dev_file --> data/retriever/nq-dev.json
device --> cuda
distributed_world_size --> 1
do_lower_case --> True
dropout --> 0.1
encoder_model_type --> hf_bert
eval_per_epoch --> 1
fix_ctx_encoder --> False
fp16 --> True
fp16_opt_level --> O1
global_loss_buf_sz --> 2097152
grad_cache --> True
gradient_accumulation_steps --> 1
hard_negatives --> 1
learning_rate --> 2e-05
local_rank --> -1
log_batch_step --> 100
max_grad_norm --> 2.0
model_file --> None
n_gpu --> 1
no_cuda --> False
num_train_epochs --> 40.0
other_negatives --> 0
output_dir --> model
pretrained_file --> None
pretrained_model_cfg --> bert-base-uncased
projection_dim --> 0
q_chunk_size --> 16
seed --> 12345
sequence_length --> 256
shuffle_positive_ctx --> False
train_file --> data/retriever/nq-train.json
train_files_upsample_rates --> None
train_rolling_loss_step --> 100
val_av_rank_bsz --> 128
val_av_rank_hard_neg --> 30
val_av_rank_max_qs --> 1000
val_av_rank_other_neg --> 30
val_av_rank_start_epoch --> 30
warmup_steps --> 1237
weight_decay --> 0.0
from gc-dpr.
A few things:
- This is the Python profiler, I suppose? Can you run the PyTorch profiler? That gives CUDA kernel time as well (a rough sketch is included after this list).
- How did you launch the script? Are you using DDP? The code was adjusted assuming DDP, since PyTorch generally discourages DP in all cases. I am not sure what will happen if you use DP.
- Try turning off AMP. You will definitely see slower single-card training, but maybe we can learn more from the multi-/single-card time ratio.
- Based on the numbers you have here, it seems backward is consuming quite a lot more time. (We could tell more with a more detailed profile.) I do have a local patch that reduces the number of gradient syncs in backward, but again, a 6x slowdown is definitely not expected. I have no idea what that __setattr__ is doing.
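On the first point, a minimal sketch of profiling two steps with torch.profiler (assuming PyTorch >= 1.8.1; the model, data, and optimizer below are toy stand-ins, not gc-dpr code):
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(768, 768).to(device)  # toy stand-in for the bi-encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(2):  # two training steps, as in the report above
        x = torch.randn(128, 768, device=device)
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# sort by CUDA time to see where the GPU kernels actually spend their time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))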
from gc-dpr.
Ah, I launched with DP. Running with DDP works!
Thanks for your help!
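For reference, a minimal PyTorch sketch of the difference (not gc-dpr's actual code): DataParallel re-replicates the module onto every GPU at each forward call, which is most likely where the replicate / _replicate_for_data_parallel / __setattr__ entries in the two-GPU profile above come from, while DistributedDataParallel runs one process per GPU, copies the parameters once, and only synchronizes gradients.
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(768, 768).to(device)  # toy stand-in for the bi-encoder

# DP: single process; the module is copied onto every visible GPU on every forward call
dp_model = nn.DataParallel(model)

# DDP: one process per GPU (e.g. launched with
# python -m torch.distributed.launch --nproc_per_node=2 ...);
# each process wraps its own copy once and only gradients are synchronized in backward:
# torch.distributed.init_process_group(backend="nccl")
# ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])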
from gc-dpr.
Related Issues (11)
- Question about using the GC technique in the Reader_train part
- Error after running pip install .
- How to see the difference between DPR and GC-DPR?
- This is awesome
- surrogate = surrogate * (trainer.distributed_factor / 8.)
- TPU support?
- multilingual-bert issue?
- Multiply by distributed_factor/8.
- fine tuning existing dpr model
- coCondenser hyperparameter