Comments (4)
We do have some local patches for multi-card training, but even the current TOT should not have overhead this big.
You can probably run a profiler to see what is bottlenecking it.
We can also help investigate the problem if you provide more information.
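For example, a profile like the one you would want to share can be captured with Python's built-in cProfile. This is a rough sketch, not gc-dpr code; run_two_steps here is a hypothetical stand-in for loading the data and running two training steps:
import cProfile
import pstats

def run_two_steps():
    # placeholder: load the training data and run two forward/backward/optimizer steps
    pass

# dump the stats to a file, then print the functions with the largest total time
cProfile.run("run_two_steps()", "train.prof")
pstats.Stats("train.prof").sort_stats("tottime").print_stats(25)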
from gc-dpr.
Hi @luyug, Thank you for your help.
I loaded data and then ran two steps to see the time.
This is the head of the profile when using two GPUs (two 2080 Ti, 11 GB each).
ncalls tottime percall cumtime percall filename:lineno(function)
3 25.258 8.419 25.258 8.419 decoder.py:343(raw_decode)
82 9.238 0.113 9.238 0.113 :0(run_backward)
1156049 9.164 0.000 15.516 0.000 module.py:774(__setattr__)
2898444/367168 6.542 0.000 7.415 0.000 module.py:1215(named_modules)
2114 4.500 0.002 4.500 0.002 :0(acquire)
3265133/3265132 3.398 0.000 3.397 0.000 :0(get)
883 3.255 0.004 5.865 0.007 :0(read)
310 3.243 0.010 3.243 0.010 :0(normal_)
65 2.610 0.040 2.610 0.040 :0(utf_8_decode)
2471319 2.546 0.000 2.548 0.000 :0(isinstance)
2092 2.455 0.001 2.457 0.001 :0(to)
504 2.263 0.004 2.263 0.004 :0(_scatter)
168 1.831 0.011 29.583 0.176 replicate.py:78(replicate)
145338 1.607 0.000 12.096 0.000 module.py:1376(_replicate_for_data_parallel)
187 1.333 0.007 1.333 0.007 :0(_cuda_isDriverSufficient)
955800 1.077 0.000 1.077 0.000 :0(items)
136794 1.036 0.000 8.640 0.000 module.py:1048(_named_members)
vs. the profile when running on a single GPU:
ncalls tottime percall cumtime percall filename:lineno(function)
3 25.166 8.389 25.166 8.389 decoder.py:343(raw_decode)
82 3.695 0.045 3.695 0.045 :0(run_backward)
857 3.231 0.004 5.836 0.007 :0(read)
310 3.229 0.010 3.229 0.010 :0(normal_)
74 2.605 0.035 2.605 0.035 :0(utf_8_decode)
1838 2.414 0.001 2.415 0.001 :0(to)
172 1.585 0.009 1.585 0.009 :0(_cuda_isDriverSufficient)
292 0.869 0.003 0.869 0.003 :0(uniform_)
15362 0.724 0.000 0.724 0.000 :0(matmul)
36480/160 0.507 0.000 3.585 0.022 module.py:710(_call_impl)
398 0.387 0.001 0.387 0.001 :0(copy_)
412 0.387 0.001 0.387 0.001 :0(_set_from_file)
11680 0.272 0.000 1.047 0.000 functional.py:1655(linear)
1 0.235 0.235 31.222 31.222 __init__.py:274(load)
27907 0.199 0.000 0.350 0.000 module.py:774(__setattr__)
**************** CONFIGURATION ****************
adam_betas --> (0.9, 0.999)
adam_eps --> 1e-08
batch_size --> 128
checkpoint_file_name --> dpr_biencoder
ctx_chunk_size --> 8
dev_batch_size --> 16
dev_file --> data/retriever/nq-dev.json
device --> cuda
distributed_world_size --> 1
do_lower_case --> True
dropout --> 0.1
encoder_model_type --> hf_bert
eval_per_epoch --> 1
fix_ctx_encoder --> False
fp16 --> True
fp16_opt_level --> O1
global_loss_buf_sz --> 2097152
grad_cache --> True
gradient_accumulation_steps --> 1
hard_negatives --> 1
learning_rate --> 2e-05
local_rank --> -1
log_batch_step --> 100
max_grad_norm --> 2.0
model_file --> None
n_gpu --> 1
no_cuda --> False
num_train_epochs --> 40.0
other_negatives --> 0
output_dir --> model
pretrained_file --> None
pretrained_model_cfg --> bert-base-uncased
projection_dim --> 0
q_chunk_size --> 16
seed --> 12345
sequence_length --> 256
shuffle_positive_ctx --> False
train_file --> data/retriever/nq-train.json
train_files_upsample_rates --> None
train_rolling_loss_step --> 100
val_av_rank_bsz --> 128
val_av_rank_hard_neg --> 30
val_av_rank_max_qs --> 1000
val_av_rank_other_neg --> 30
val_av_rank_start_epoch --> 30
warmup_steps --> 1237
weight_decay --> 0.0
from gc-dpr.
A few things:
- This is the Python profiler, I suppose? Can you run the PyTorch profiler? That gives CUDA kernel time as well (a rough sketch is included after this list).
- How did you launch the script? Are you using DDP? The code was adjusted assuming DDP, since PyTorch generally discourages DP in all cases. I am not sure what will happen if you use DP.
- Try turning off AMP. You will definitely see slower single-card training, but maybe we can learn more from the multi-/single-card time ratio.
- Based on the numbers you have here, it seems backward is consuming quite a lot more time. (We could tell more with a more detailed profile.) I do have a local patch that reduces the number of gradient syncs in backward, but again, a 6x slowdown is definitely not expected. I have no idea what that __setattr__ is doing.
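On the first point, a minimal sketch of profiling two steps with torch.profiler (assuming PyTorch >= 1.8.1; the model, data, and optimizer below are toy stand-ins, not gc-dpr code):
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(768, 768).to(device)  # toy stand-in for the bi-encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(2):  # two training steps, as in the report above
        x = torch.randn(128, 768, device=device)
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# sort by CUDA time to see where the GPU kernels actually spend their time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))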
from gc-dpr.
Ah, I launched with DP. Running with DDP works!
Thanks for your help!
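For reference, a minimal PyTorch sketch of the difference (not gc-dpr's actual code): DataParallel re-replicates the module onto every GPU at each forward call, which is most likely where the replicate / _replicate_for_data_parallel / __setattr__ entries in the two-GPU profile above come from, while DistributedDataParallel runs one process per GPU, copies the parameters once, and only synchronizes gradients.
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(768, 768).to(device)  # toy stand-in for the bi-encoder

# DP: single process; the module is copied onto every visible GPU on every forward call
dp_model = nn.DataParallel(model)

# DDP: one process per GPU (e.g. launched with
# python -m torch.distributed.launch --nproc_per_node=2 ...);
# each process wraps its own copy once and only gradients are synchronized in backward:
# torch.distributed.init_process_group(backend="nccl")
# ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])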
from gc-dpr.
Related Issues (11)
- Question about using the GC technique in the Reader_train part
- Error after running pip install .
- How to see the difference between DPR and GC-DPR?
- This is awesome
- surrogate = surrogate * (trainer.distributed_factor / 8.)
- TPU support?
- multilingual-bert issue?
- Multiply by distributed_factor/8.
- fine tuning existing dpr model
- coCondenser hyperparameter