8*A100-80G:
[02:06<01:20, 13.44s/it, pg=-.0119, cri=0.0702, vals=-.0352, kl=0, rm=0.0909, ret=0.0909, glen=1
Traceback (most recent call last):
  File "../train_ppo.py", line 239, in <module>
    train(args)
  File "../train_ppo.py", line 164, in train
    trainer.fit(prompts_dataloader,
  File "/opt/conda/envs/llama2/lib/python3.8/site-packages/openllama2/trainer/ppo_trainer.py", line 143, in fit
    status = self.ppo_train()
  File "/opt/conda/envs/llama2/lib/python3.8/site-packages/openllama2/trainer/ppo_trainer.py", line 166, in ppo_train
    status = self.training_step(experience)
  File "/opt/conda/envs/llama2/lib/python3.8/site-packages/openllama2/trainer/ppo_trainer.py", line 209, in training_step
    self.strategy.backward(actor_loss, self.actor, self.actor_optim)
  File "/opt/conda/envs/llama2/lib/python3.8/site-packages/openllama2/utils/deepspeed.py", line 81, in backward
    model.backward(loss)
  File "/opt/conda/envs/llama2/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llama2/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1895, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/opt/conda/envs/llama2/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1902, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/conda/envs/llama2/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/conda/envs/llama2/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/opt/conda/envs/llama2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/envs/llama2/lib/python3.8/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/opt/conda/envs/llama2/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/opt/conda/envs/llama2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 2; 79.35 GiB total capacity; 66.47 GiB already allocated; 3.87 GiB free; 72.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-08-31 02:44:16,573] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10230
[2023-08-31 02:44:22,676] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10231
[2023-08-31 02:44:29,025] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10232
[2023-08-31 02:44:29,025] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10233
[2023-08-31 02:44:37,723] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10235
[2023-08-31 02:44:45,725] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10237
[2023-08-31 02:44:54,060] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10239
[2023-08-31 02:45:01,505] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10241
[2023-08-31 02:45:08,532] [ERROR] [launch.py:321:sigkill_handler]
['/opt/conda/envs/llama2/bin/python3', '-u', '../train_ppo.py', '--local_rank=7', '--pretrain', './models/Llama-2-7b-hf', '--critic_pretrain', './models/Llama-2-7b-hf', '--reward_model_path', './ckpt/7b_llama/rm_model.pt', '--sft_model_path', './ckpt/7b_llama/sft_model.pt', '--save_path', './ckpt/7b_llama', '--micro_train_batch_size', '1', '--train_batch_size', '128', '--micro_rollout_batch_size', '1', '--rollout_batch_size', '1024', '--max_epochs', '1', '--prompt_max_len', '1024', '--generate_max_len', '1024', '--zero_stage', '2', '--bf16', '--actor_learning_rate', '5e-7', '--critic_learning_rate', '9e-6', '--inference_tp_size', '1', '--init_kl_coef', '0.01', '--prompt_data', 'yahma/alpaca-cleaned,Dahoas/full-hh-rlhf,tasksource/oasst1_pairwise_rlhf_reward', '--prompt_data_probs', '0.3,0.6,0.1', '--normalize_reward', '--adam_offload', '--gradient_checkpointing'] exits with return code = 1
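The OOM message itself hints at one mitigation: reserved memory (72.93 GiB) is well above allocated memory (66.47 GiB), which suggests allocator fragmentation, and PyTorch recommends trying `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch of setting this before relaunching the run (the 128 MiB value is an illustrative assumption, not a tuned setting):

```shell
# Limit the size of cached allocator blocks that can be split, to reduce
# fragmentation when reserved memory greatly exceeds allocated memory.
# 128 MiB is a common starting point; it is an assumption here, not tuned.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

This only trades fragmentation for allocation speed; if the run is genuinely over capacity, reducing `--generate_max_len`/`--prompt_max_len` or moving to `--zero_stage 3` may be needed instead.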