Comments (4)

luyug commented on May 29, 2024

We do have some local patches for multi-card training, but even the current top of tree (ToT) should not have overhead this big.

You can probably run a profiler to see what is bottlenecking it.

We can also help investigate the problem if you provide more information.

MXueguang commented on May 29, 2024

Hi @luyug, thank you for your help.
I loaded the data and then ran two steps to see the timing.
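
Roughly how I collected it with Python's built-in cProfile; run_two_steps is a hypothetical stand-in for the actual training code:

    import cProfile
    import pstats

    def run_two_steps():
        # Hypothetical stand-in for two gc-dpr training steps.
        pass

    cProfile.run("run_two_steps()", "train.prof")

    # Print the head of the profile, sorted by total time per function.
    pstats.Stats("train.prof").sort_stats("tottime").print_stats(20)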

This is the head of the profile when using two GPUs (two 2080 Ti, 11 GB each):

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3   25.258    8.419   25.258    8.419 decoder.py:343(raw_decode)
       82    9.238    0.113    9.238    0.113 :0(run_backward)
  1156049    9.164    0.000   15.516    0.000 module.py:774(__setattr__)
2898444/367168    6.542    0.000    7.415    0.000 module.py:1215(named_modules)
     2114    4.500    0.002    4.500    0.002 :0(acquire)
3265133/3265132    3.398    0.000    3.397    0.000 :0(get)
      883    3.255    0.004    5.865    0.007 :0(read)
      310    3.243    0.010    3.243    0.010 :0(normal_)
       65    2.610    0.040    2.610    0.040 :0(utf_8_decode)
  2471319    2.546    0.000    2.548    0.000 :0(isinstance)
     2092    2.455    0.001    2.457    0.001 :0(to)
      504    2.263    0.004    2.263    0.004 :0(_scatter)
      168    1.831    0.011   29.583    0.176 replicate.py:78(replicate)
   145338    1.607    0.000   12.096    0.000 module.py:1376(_replicate_for_data_parallel)
      187    1.333    0.007    1.333    0.007 :0(_cuda_isDriverSufficient)
   955800    1.077    0.000    1.077    0.000 :0(items)
   136794    1.036    0.000    8.640    0.000 module.py:1048(_named_members)

Versus the profile from running on a single GPU:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3   25.166    8.389   25.166    8.389 decoder.py:343(raw_decode)
       82    3.695    0.045    3.695    0.045 :0(run_backward)
      857    3.231    0.004    5.836    0.007 :0(read)
      310    3.229    0.010    3.229    0.010 :0(normal_)
       74    2.605    0.035    2.605    0.035 :0(utf_8_decode)
     1838    2.414    0.001    2.415    0.001 :0(to)
      172    1.585    0.009    1.585    0.009 :0(_cuda_isDriverSufficient)
      292    0.869    0.003    0.869    0.003 :0(uniform_)
    15362    0.724    0.000    0.724    0.000 :0(matmul)
36480/160    0.507    0.000    3.585    0.022 module.py:710(_call_impl)
      398    0.387    0.001    0.387    0.001 :0(copy_)
      412    0.387    0.001    0.387    0.001 :0(_set_from_file)
    11680    0.272    0.000    1.047    0.000 functional.py:1655(linear)
        1    0.235    0.235   31.222   31.222 __init__.py:274(load)
    27907    0.199    0.000    0.350    0.000 module.py:774(__setattr__)

**************** CONFIGURATION **************** 
adam_betas                     -->   (0.9, 0.999)
adam_eps                       -->   1e-08
batch_size                     -->   128
checkpoint_file_name           -->   dpr_biencoder
ctx_chunk_size                 -->   8
dev_batch_size                 -->   16
dev_file                       -->   data/retriever/nq-dev.json
device                         -->   cuda
distributed_world_size         -->   1
do_lower_case                  -->   True
dropout                        -->   0.1
encoder_model_type             -->   hf_bert
eval_per_epoch                 -->   1
fix_ctx_encoder                -->   False
fp16                           -->   True
fp16_opt_level                 -->   O1
global_loss_buf_sz             -->   2097152
grad_cache                     -->   True
gradient_accumulation_steps    -->   1
hard_negatives                 -->   1
learning_rate                  -->   2e-05
local_rank                     -->   -1
log_batch_step                 -->   100
max_grad_norm                  -->   2.0
model_file                     -->   None
n_gpu                          -->   1
no_cuda                        -->   False
num_train_epochs               -->   40.0
other_negatives                -->   0
output_dir                     -->   model
pretrained_file                -->   None
pretrained_model_cfg           -->   bert-base-uncased
projection_dim                 -->   0
q_chunk_size                   -->   16
seed                           -->   12345
sequence_length                -->   256
shuffle_positive_ctx           -->   False
train_file                     -->   data/retriever/nq-train.json
train_files_upsample_rates     -->   None
train_rolling_loss_step        -->   100
val_av_rank_bsz                -->   128
val_av_rank_hard_neg           -->   30
val_av_rank_max_qs             -->   1000
val_av_rank_other_neg          -->   30
val_av_rank_start_epoch        -->   30
warmup_steps                   -->   1237
weight_decay                   -->   0.0

luyug commented on May 29, 2024

A few things:

  • This is the Python profiler, I suppose? Can you run the PyTorch profiler? That gives CUDA kernel time as well (see the first sketch after this list).
  • How did you launch the script? Are you using DDP? The code was adjusted assuming DDP, since DP is generally discouraged by PyTorch. I am not sure what will happen if you use DP.
  • Try turning off AMP. You will definitely see slower single-card training, but maybe we can learn more from the multi/single-card time ratio.
  • Based on the numbers you have here, it seems backward was consuming quite a lot more time. (We can know better with a more detailed profile.) I do have a local patch that reduces the number of gradient syncs in backward (the second sketch below shows the standard mechanism), but again, 6x time is definitely not expected. I have no idea what that __setattr__ is doing.
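
A minimal sketch of the PyTorch profiler suggestion; train_step and batches are hypothetical stand-ins for the gc-dpr training loop:

    from torch.profiler import profile, ProfilerActivity

    def profile_two_steps(train_step, batches):
        # Record both CPU and CUDA activity so kernel time shows up.
        with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
            for batch in batches[:2]:  # two steps, matching the report above
                train_step(batch)
        # Sort by total CUDA time to see where GPU time actually goes.
        print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))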
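
On the gradient-sync point, that patch is local to gc-dpr, but the standard PyTorch mechanism is DDP's no_sync() context manager; a rough sketch, with chunks and compute_loss as hypothetical stand-ins:

    def chunked_backward(ddp_model, chunks, compute_loss):
        # DDP all-reduces gradients on every backward call, and gradient
        # caching runs one backward per chunk. Skipping the sync on all
        # but the last chunk cuts communication to once per step.
        for chunk in chunks[:-1]:
            with ddp_model.no_sync():
                compute_loss(ddp_model, chunk).backward()
        # Final chunk: sync enabled, so accumulated gradients all-reduce.
        compute_loss(ddp_model, chunks[-1]).backward()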

MXueguang commented on May 29, 2024

Ah, I launched with DP. Running with DDP works!
Thanks for your help!
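
For anyone hitting the same slowdown, a toy sketch of the DP vs. DDP distinction (a stand-in nn.Linear replaces the actual bi-encoder, and launch details are simplified):

    import os
    import torch
    import torch.nn as nn

    model = nn.Linear(768, 768)  # toy stand-in for the bi-encoder

    # DataParallel: a single process drives all GPUs and re-replicates the
    # model on every forward pass -- the replicate.py cost profiled above.
    dp_model = nn.DataParallel(model.cuda())

    # DistributedDataParallel: one process per GPU, launched externally,
    # e.g. `torchrun --nproc_per_node=2 train.py`; the model is copied
    # once and gradients are all-reduced during backward.
    torch.distributed.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    ddp_model = nn.parallel.DistributedDataParallel(
        model.to(local_rank), device_ids=[local_rank]
    )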
