8*A100-80G: Traceback (most recent call last):[02:06<01:20, 13.44s/it, pg=-.011

PPO OOM (openrlhf issue, closed)

openllmai commented on June 11, 2024

PPO OOM

Comments (4)

hijkzzz commented on June 11, 2024

It runs fine for me inside nvidia-docker:

Train epoch [1/1]: 100%|██████████████████████████████████████████| 256/256 [12:13<00:00,  2.87s/it, pg=-.107, cri=0.0423, vals=0.761, rm=0.649, ret=0.648, glen=405, tlen=499, kl=0.000267]
{'policy_loss': -0.10136487577983644, 'critic_loss': 0.05111422556001344, 'values': 0.7335318821569672, 'kl': 3.989041533495717e-05, 'reward': 0.7770038396120071, 'return': 0.7769353888725163, 'response_length': 250.283203125, 'total_length': 385.9375}
Episode [1/1]:   3%|███▎                                                                                                                           | 664/25000 [1:50:43<54:13:56,  8.02s/it]

catqaq commented on June 11, 2024

I tried it twice and it failed at the same place both times.
First attempt:
Train epoch [1/1]: 100%|█| 16/16 [03:45<00:00, 14.10s/it, pg=-.479, cri=0.0654, vals=0.262, kl=0, rm=0.328, ret=0.328, glen=203,
{'pg': -0.32111476035788655, 'cri': 0.059775998117402196, 'vals': 0.15955006331205368, 'kl': 0.0, 'rm': 0.28977665305137634, 'ret': 0.28977665305137634, 'glen': 143.6103515625, 'tlen': 288.83203125, 'k_coef': 0.01}
Episode [1/1]: 2%|█▍ | 255/12500 [29:52<17:39:40, 5.19s/it]

Retry:
Train epoch [1/1]: 100%|█| 16/16 [03:45<00:00, 14.09s/it, pg=-.557, cri=0.0692, vals=0.104, kl=0, rm=0.391, ret=0.391, glen=121,
{'pg': -0.3215101254172623, 'cri': 0.05977673456072807, 'vals': 0.1602994620334357, 'kl': 0.0, 'rm': 0.28977665305137634, 'ret': 0.28977665305137634, 'glen': 143.6103515625, 'tlen': 288.83203125, 'k_coef': 0.01}
Episode [1/1]: 2%|█▍ | 255/12500 [29:46<15:52:26, 4.67s/it]

dabney777 commented on June 11, 2024

Could you try skipping this batch/episode?
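
To make the suggestion concrete, here is a minimal sketch of a loop that skips any batch whose step runs out of memory. It is not OpenRLHF code: `run_skipping_oom`, `step`, and `on_skip` are hypothetical names, and the default exception type is Python's built-in `MemoryError` so the snippet runs without a GPU.

```python
from typing import Callable, Iterable, List, Optional, Type


def run_skipping_oom(
    batches: Iterable[object],
    step: Callable[[object], None],
    oom_error: Type[BaseException] = MemoryError,
    on_skip: Optional[Callable[[], None]] = None,
) -> List[int]:
    """Run `step` on every batch and return the indices of batches that were skipped."""
    skipped: List[int] = []
    for index, batch in enumerate(batches):
        try:
            step(batch)
        except oom_error:
            # Record the failing batch and move on instead of aborting the run.
            skipped.append(index)
            if on_skip is not None:
                on_skip()  # e.g. release cached memory before the next batch
    return skipped


if __name__ == "__main__":
    def fake_step(batch: object) -> None:
        # Hypothetical stand-in for a training step: batch 2 "runs out of memory".
        if batch == 2:
            raise MemoryError("simulated OOM")

    print(run_skipping_oom([1, 2, 3], fake_step))
```

With PyTorch you would presumably pass `oom_error=torch.cuda.OutOfMemoryError` and `on_skip=torch.cuda.empty_cache`. In a multi-GPU run, skipping a step on only one rank can leave the other ranks waiting in a collective call, so treat this as a debugging aid rather than a fix.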

catqaq commented on June 11, 2024

Solved by setting PYTORCH_CUDA_ALLOC_CONF.
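
The comment above does not say which value was used, so the snippet below is only a sketch under an assumption: it uses `expandable_segments:True`, an allocator option in recent PyTorch releases that can reduce fragmentation. The helper name `configure_cuda_allocator` is hypothetical. The variable should be set before the first CUDA allocation, and setting it before `import torch` is the safe choice.

```python
import os


def configure_cuda_allocator(options: str = "expandable_segments:True") -> str:
    """Set PYTORCH_CUDA_ALLOC_CONF unless the launcher already exported a value.

    The default option string is an assumption, not the value used in this thread.
    """
    # setdefault leaves an existing value untouched and returns whatever is in effect.
    return os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", options)


applied = configure_cuda_allocator()
print(f"PYTORCH_CUDA_ALLOC_CONF={applied}")

# Import torch only after this point so the allocator sees the setting.
```

The same setting can be exported in the shell before launching training. When the trainer spawns worker processes, the variable may need to be set in their environment too.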
