Comments (4)
[2023-12-01 05:53:04,949] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
[2023-12-01 05:53:15,433] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-12-01 05:53:15,436] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-12-01 05:53:15,436] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2023-12-01 05:53:15,453] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-12-01 05:53:15,453] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-12-01 05:53:15,453] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2023-12-01 05:53:15,453] [INFO] [stage_1_and_2.py:146:__init__] Reduce bucket size 500,000,000
[2023-12-01 05:53:15,453] [INFO] [stage_1_and_2.py:147:__init__] Allgather bucket size 500,000,000
[2023-12-01 05:53:15,453] [INFO] [stage_1_and_2.py:148:__init__] CPU Offload: True
[2023-12-01 05:53:15,453] [INFO] [stage_1_and_2.py:149:__init__] Round robin gradient partitioning: False
[2023-12-01 05:53:49,783] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2023-12-01 05:53:49,784] [INFO] [utils.py:803:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 12.86 GB Max_CA 13 GB
[2023-12-01 05:53:49,784] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 52.58 GB, percent = 41.9%
[2023-12-01 05:54:12,066] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2440
[2023-12-01 05:54:12,099] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python', '-u', '../train_sft.py', '--local_rank=0', '--max_len', '200', '--dataset', 'OpenOrca', '--dataset_probs', '1.0', '--train_batch_size', '50', '--micro_train_batch_size', '1', '--max_samples', '500', '--pretrain', 'Llama-2-7b-hf', '--save_path', './ckpt/7b_llama', '--save_steps', '-1', '--logging_steps', '1', '--eval_steps', '-1', '--ckpt_path', './ckpt/7b_llama/checkpoints_sft', '--zero_stage', '2', '--max_epochs', '1', '--bf16', '--flash_attn', '--learning_rate', '5e-6', '--adam_offload', '--gradient_checkpointing'] exits with return code = -9
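A return code of -9 means the launcher's subprocess received SIGKILL (signal 9), which on Linux is most often the kernel OOM killer rather than a crash in the training code itself. A minimal sketch of how a SIGKILLed process reports its status (the `sleep` stand-in is purely illustrative):

```shell
# deepspeed's launcher reports "-9" for a subprocess killed by SIGKILL.
# A shell sees the same death as exit status 128 + 9 = 137.
sleep 30 &
pid=$!
kill -9 "$pid"          # simulate what the kernel OOM killer does
status=0
wait "$pid" || status=$?
echo "exit status: $status"   # 137 = 128 + 9 (SIGKILL)
```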
from openrlhf.
I'm using the provided nvidia-docker image, and my modified train_sft_llama.sh is:
set -x
read -r -d '' training_commands <<EOF
../train_sft.py \
--max_len 200 \
--dataset OpenOrca \
--dataset_probs 1.0 \
--train_batch_size 50 \
--micro_train_batch_size 1 \
--max_samples 500 \
--pretrain Llama-2-7b-hf \
--save_path ./ckpt/7b_llama \
--save_steps -1 \
--logging_steps 1 \
--eval_steps -1 \
--ckpt_path ./ckpt/7b_llama/checkpoints_sft \
--zero_stage 2 \
--max_epochs 1 \
--bf16 \
--flash_attn \
--learning_rate 5e-6 \
--adam_offload \
--gradient_checkpointing
EOF
# --wandb [WANDB_TOKENS]
if [[ ${1} != "slurm" ]]; then
export PATH=$HOME/.local/bin/:$PATH
CUDA_VISIBLE_DEVICES=0 deepspeed $training_commands
fi
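Since the -9 exit points at host memory exhaustion, it can help to check available CPU RAM before launching. A Linux-only sketch reading `/proc/meminfo`:

```shell
# Print available host memory in GB before starting training.
# MemAvailable is the kernel's estimate of memory usable without swapping.
awk '/MemAvailable/ {printf "Available: %.1f GB\n", $2/1024/1024}' /proc/meminfo
```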
I have downloaded the dataset and model locally.
I have tested it, and it works well on my side; perhaps your CPU memory is too small.
Ideally your CPU should have at least 128 GB of RAM.
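The 128 GB figure is consistent with a back-of-envelope estimate: with `--adam_offload`, ZeRO stage 2 keeps the fp32 master weights plus Adam's momentum and variance in host memory, roughly 12 bytes per parameter. A rough sketch (the 7B parameter count is approximate, and activations, buffers, and the OS need headroom on top):

```shell
# Rough CPU-RAM estimate for ZeRO-2 with --adam_offload on a ~7B model.
PARAMS=7000000000
BYTES_PER_PARAM=12   # 4 (fp32 master copy) + 4 (momentum) + 4 (variance)
OPT_GB=$(( PARAMS * BYTES_PER_PARAM / 1024 / 1024 / 1024 ))
echo "Optimizer states alone: ~${OPT_GB} GB of CPU RAM"
```

On top of those ~78 GB, DeepSpeed also offloads gradients and needs pinned buffers, so a 64 GB host runs out well before the optimizer states are initialized, matching the OOM kill in the log above.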
thx for your reply :)
Related Issues (20)
- PPO produces a timeout error after switching to ZeRO stage 3 HOT 2
- maybe data bug with dpo trainer HOT 1
- QLORA model loading error HOT 5
- We are comparing the performance of DSChat and OpenRLHF for technology selection; could you provide the fixed DSChat code so we can reproduce the community's performance comparison results HOT 7
- Avoid monkey patching vLLM HOT 1
- Claim your paper on HF HOT 1
- [Question] EOS in reward model dataset HOT 3
- action_log_probs is computed twice HOT 2
- Support Llama-3 models HOT 1
- Incompatibility with Qwen HOT 2
- Suggestion on the configurations HOT 1
- Strange Kill of Critic Model HOT 5
- Training DPO with Deepseek-lite reports "expected mat1 and mat2 to have the same type, but got: float != c10::BFloat16" HOT 2
- Will 2 x GPU setups be supported HOT 1
- Dummy token for prompts in HH datasets HOT 2
- Does this codebase consider using "torch.compile"? HOT 2
- wrong action_log_probs returned? HOT 1
- Can support for SimPO be added?
- zero3 training error HOT 1
- Failed to update weights to vLLM