
Comments (4)

callanwu avatar callanwu commented on June 5, 2024
[2023-12-01 05:53:04,949] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
[2023-12-01 05:53:15,433] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-12-01 05:53:15,436] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-12-01 05:53:15,436] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2023-12-01 05:53:15,453] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-12-01 05:53:15,453] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-12-01 05:53:15,453] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2023-12-01 05:53:15,453] [INFO] [stage_1_and_2.py:146:__init__] Reduce bucket size 500,000,000
[2023-12-01 05:53:15,453] [INFO] [stage_1_and_2.py:147:__init__] Allgather bucket size 500,000,000
[2023-12-01 05:53:15,453] [INFO] [stage_1_and_2.py:148:__init__] CPU Offload: True
[2023-12-01 05:53:15,453] [INFO] [stage_1_and_2.py:149:__init__] Round robin gradient partitioning: False
[2023-12-01 05:53:49,783] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2023-12-01 05:53:49,784] [INFO] [utils.py:803:see_memory_usage] MA 12.86 GB         Max_MA 12.86 GB         CA 12.86 GB         Max_CA 13 GB
[2023-12-01 05:53:49,784] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 52.58 GB, percent = 41.9%
[2023-12-01 05:54:12,066] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2440
[2023-12-01 05:54:12,099] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python', '-u', '../train_sft.py', '--local_rank=0', '--max_len', '200', '--dataset', 'OpenOrca', '--dataset_probs', '1.0', '--train_batch_size', '50', '--micro_train_batch_size', '1', '--max_samples', '500', '--pretrain', 'Llama-2-7b-hf', '--save_path', './ckpt/7b_llama', '--save_steps', '-1', '--logging_steps', '1', '--eval_steps', '-1', '--ckpt_path', './ckpt/7b_llama/checkpoints_sft', '--zero_stage', '2', '--max_epochs', '1', '--bf16', '--flash_attn', '--learning_rate', '5e-6', '--adam_offload', '--gradient_checkpointing'] exits with return code = -9
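Return code -9 in the launcher log above is the negated signal number: the training subprocess was killed with SIGKILL, which on Linux most often means the kernel OOM killer reclaimed memory. A minimal Python check of that mapping:

```python
# launch.py reports the subprocess's exit status; a negative value is the
# negated number of the signal that terminated it.
import signal

returncode = -9  # value from the launcher log above
sig = signal.Signals(-returncode)
print(sig.name)  # SIGKILL -> typically the Linux OOM killer
```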

from openrlhf.

callanwu avatar callanwu commented on June 5, 2024

I use the provided nvidia-docker image, and my modified train_sft_llama.sh is:

set -x

read -r -d '' training_commands <<EOF
../train_sft.py \
    --max_len 200 \
    --dataset OpenOrca \
    --dataset_probs 1.0 \
    --train_batch_size 50 \
    --micro_train_batch_size 1 \
    --max_samples 500 \
    --pretrain Llama-2-7b-hf \
    --save_path ./ckpt/7b_llama \
    --save_steps -1 \
    --logging_steps 1 \
    --eval_steps -1 \
    --ckpt_path ./ckpt/7b_llama/checkpoints_sft \
    --zero_stage 2 \
    --max_epochs 1 \
    --bf16 \
    --flash_attn \
    --learning_rate 5e-6 \
    --adam_offload \
    --gradient_checkpointing
EOF
    # --wandb [WANDB_TOKENS]

if [[ ${1} != "slurm" ]]; then
    export PATH=$HOME/.local/bin/:$PATH
    CUDA_VISIBLE_DEVICES=0 deepspeed $training_commands
fi
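Before re-running the script above, a quick preflight can confirm the host has enough free RAM for the offloaded optimizer states. A minimal sketch using only the standard library (Linux-only; `available_ram_gb` is an illustrative helper, not part of OpenRLHF):

```python
import os

def available_ram_gb() -> float:
    # SC_AVPHYS_PAGES reports currently available physical pages (Linux).
    page_size = os.sysconf("SC_PAGE_SIZE")
    avail_pages = os.sysconf("SC_AVPHYS_PAGES")
    return page_size * avail_pages / 1024**3

if available_ram_gb() < 128:
    print("Warning: under 128 GiB free; --adam_offload may get OOM-killed.")
```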

I have downloaded the dataset and model locally.


hijkzzz avatar hijkzzz commented on June 5, 2024

I have tested it and it works well on my side; perhaps your CPU memory is too small.
Ideally your machine should have at least 128 GB of RAM.
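The 128 GB figure is consistent with a back-of-the-envelope estimate: with `--adam_offload`, ZeRO-2 keeps the fp32 master weights plus both Adam moment tensors in host memory, roughly 12 bytes per parameter. A rough sketch (the 7e9 parameter count is an approximation for Llama-2-7b):

```python
def offloaded_adam_cpu_gib(n_params: float) -> float:
    # fp32 master weights + Adam momentum + Adam variance, 4 bytes each.
    return n_params * 3 * 4 / 1024**3

print(f"{offloaded_adam_cpu_gib(7e9):.0f} GiB")  # ~78 GiB for a 7B model
```

On top of that come pinned communication buffers, the tokenized dataset, and the OS itself, so 128 GB leaves reasonable headroom.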


callanwu avatar callanwu commented on June 5, 2024

Thanks for your reply :)


