I ran the code but only got 90+ TFLOPS. INFO train.py:317 in record

How can I get a training throughput of over 180 TFLOPS?

internlm commented on May 11, 2024

How can I get a training throughput of over 180 TFLOPS?

from internlm.

Comments (4)

sunpengsdu commented on May 11, 2024

Hi @crazyofapple, can you provide more details about your platform? On our platform, we use up to 128 GPU nodes connected by 4×100 Gbps RoCE, and each node has 8 GPUs connected by NVLink.

crazyofapple commented on May 11, 2024

Inter-node: 2× HDR100 InfiniBand (200 Gbps total). Intra-node: 8 GPUs connected via PCIe.

sunpengsdu commented on May 11, 2024

The main performance bottleneck is the intra-node communication over PCIe. We ran two experiments:

On a single GPU node with NVLink, the training log is as follows:

2023-07-10 14:26:28,977 INFO train.py:317 in record_current_batch_training_metrics -- tflops=188.02533140299252,step=36,loss=5.459033012390137,tgs (tokens/gpu/second)=4233.73,lr=7.6e-06,loss_scale=65536.0,grad_norm=12.540833573326264,micro_num=4,num_consumed_tokens=4849664,inf_nan_skip_batches=0,num_samples_in_batch=15,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.72

On a single GPU node without NVLink, the training log is as follows:

2023-07-10 14:34:49,024 INFO train.py:317 in record_current_batch_training_metrics -- tflops=99.1021732624673,step=18,loss=6.766777038574219,tgs (tokens/gpu/second)=2231.46,lr=4.000000000000001e-06,loss_scale=65536.0,grad_norm=12.957902089555239,micro_num=4,num_consumed_tokens=2490368,inf_nan_skip_batches=0,num_samples_in_batch=15,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=5.76
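
To compare the two runs numerically, the log lines can be parsed into key/value pairs. The following is a minimal sketch in Python, not part of the InternLM code base: the helper `parse_metrics` is a hypothetical name, and the two strings are shortened excerpts of the log lines above (only the `tflops`, `tgs` and `fwd_bwd_time` fields are kept).

```python
import re


def parse_metrics(line: str) -> dict:
    """Parse the comma-separated key=value pairs that follow ' -- ' in a log line."""
    _, _, payload = line.partition(" -- ")
    metrics = {}
    for key, value in re.findall(r"([^,=]+)=([^,]+)", payload):
        key = key.strip()
        try:
            metrics[key] = float(value)
        except ValueError:
            metrics[key] = value  # keep non-numeric values as strings
    return metrics


# Shortened excerpts of the two log lines quoted in this thread.
PREFIX = "INFO train.py:317 in record_current_batch_training_metrics -- "
nvlink = parse_metrics(
    PREFIX + "tflops=188.02533140299252,tgs (tokens/gpu/second)=4233.73,fwd_bwd_time=3.72"
)
pcie = parse_metrics(
    PREFIX + "tflops=99.1021732624673,tgs (tokens/gpu/second)=2231.46,fwd_bwd_time=5.76"
)

tflops_ratio = nvlink["tflops"] / pcie["tflops"]
tgs_ratio = nvlink["tgs (tokens/gpu/second)"] / pcie["tgs (tokens/gpu/second)"]
fwd_bwd_ratio = pcie["fwd_bwd_time"] / nvlink["fwd_bwd_time"]

print(f"tflops ratio (NVLink / no NVLink):       {tflops_ratio:.2f}")
print(f"tgs ratio (NVLink / no NVLink):          {tgs_ratio:.2f}")
print(f"fwd_bwd_time ratio (no NVLink / NVLink): {fwd_bwd_ratio:.2f}")
```

With these two lines, throughput differs by roughly 1.9x while `fwd_bwd_time` differs by roughly 1.55x, which suggests that part of the slowdown happens outside the forward/backward pass.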

Since the optimizer needs a lot of allreduce/broadcast communication, it is quite important to ensure high communication bandwidth between GPUs in a node.
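
To illustrate why link bandwidth matters for all-reduce, here is a rough, bandwidth-only cost model for a ring all-reduce, in which each GPU transfers 2·(n−1)/n times the message size. This is a sketch under stated assumptions: it ignores latency and overlap with compute, the function name is hypothetical, and the message size and bandwidth figures are placeholders, not measurements from either platform in this thread.

```python
def ring_allreduce_seconds(message_bytes: float, num_gpus: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of ring all-reduce time.

    Each GPU sends and receives 2 * (n - 1) / n times the message size.
    Latency and compute/communication overlap are ignored.
    """
    if num_gpus < 2:
        return 0.0
    traffic = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic / link_bytes_per_s


# Placeholder values for illustration only (assumptions, not measured).
message_bytes = 1e9      # hypothetical 1 GB of gradients or parameters
slow_link = 10e9         # hypothetical 10 GB/s effective per-GPU bandwidth
fast_link = 100e9        # hypothetical 100 GB/s effective per-GPU bandwidth

t_slow = ring_allreduce_seconds(message_bytes, 8, slow_link)
t_fast = ring_allreduce_seconds(message_bytes, 8, fast_link)
print(f"slow link: {t_slow * 1e3:.1f} ms, fast link: {t_fast * 1e3:.1f} ms")
```

In this model the communication time scales inversely with the effective link bandwidth, so measuring the real intra-node bandwidth on your own machine is the first step before drawing conclusions.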

crazyofapple commented on May 11, 2024

thx
