Comments (4)
[2023-12-01 05:53:04,949] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
[2023-12-01 05:53:15,433] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-12-01 05:53:15,436] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-12-01 05:53:15,436] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2023-12-01 05:53:15,453] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-12-01 05:53:15,453] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-12-01 05:53:15,453] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2023-12-01 05:53:15,453] [INFO] [stage_1_and_2.py:146:__init__] Reduce bucket size 500,000,000
[2023-12-01 05:53:15,453] [INFO] [stage_1_and_2.py:147:__init__] Allgather bucket size 500,000,000
[2023-12-01 05:53:15,453] [INFO] [stage_1_and_2.py:148:__init__] CPU Offload: True
[2023-12-01 05:53:15,453] [INFO] [stage_1_and_2.py:149:__init__] Round robin gradient partitioning: False
[2023-12-01 05:53:49,783] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2023-12-01 05:53:49,784] [INFO] [utils.py:803:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 12.86 GB Max_CA 13 GB
[2023-12-01 05:53:49,784] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 52.58 GB, percent = 41.9%
[2023-12-01 05:54:12,066] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2440
[2023-12-01 05:54:12,099] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python', '-u', '../train_sft.py', '--local_rank=0', '--max_len', '200', '--dataset', 'OpenOrca', '--dataset_probs', '1.0', '--train_batch_size', '50', '--micro_train_batch_size', '1', '--max_samples', '500', '--pretrain', 'Llama-2-7b-hf', '--save_path', './ckpt/7b_llama', '--save_steps', '-1', '--logging_steps', '1', '--eval_steps', '-1', '--ckpt_path', './ckpt/7b_llama/checkpoints_sft', '--zero_stage', '2', '--max_epochs', '1', '--bf16', '--flash_attn', '--learning_rate', '5e-6', '--adam_offload', '--gradient_checkpointing'] exits with return code = -9
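A return code of -9 means the launcher's subprocess received SIGKILL (signal 9), which on Linux is most often the kernel OOM killer rather than a crash in the training code itself. A minimal sketch of how a SIGKILLed process reports its status (the `sleep` stand-in is purely illustrative):

```shell
# deepspeed's launcher reports "-9" for a subprocess killed by SIGKILL.
# A shell sees the same death as exit status 128 + 9 = 137.
sleep 30 &
pid=$!
kill -9 "$pid"          # simulate what the kernel OOM killer does
status=0
wait "$pid" || status=$?
echo "exit status: $status"   # 137 = 128 + 9 (SIGKILL)
```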
from openrlhf.
I'm using the provided nvidia-docker image, and my modified train_sft_llama.sh is:
set -x
read -r -d '' training_commands <<EOF
../train_sft.py \
--max_len 200 \
--dataset OpenOrca \
--dataset_probs 1.0 \
--train_batch_size 50 \
--micro_train_batch_size 1 \
--max_samples 500 \
--pretrain Llama-2-7b-hf \
--save_path ./ckpt/7b_llama \
--save_steps -1 \
--logging_steps 1 \
--eval_steps -1 \
--ckpt_path ./ckpt/7b_llama/checkpoints_sft \
--zero_stage 2 \
--max_epochs 1 \
--bf16 \
--flash_attn \
--learning_rate 5e-6 \
--adam_offload \
--gradient_checkpointing
EOF
# --wandb [WANDB_TOKENS]
if [[ ${1} != "slurm" ]]; then
export PATH=$HOME/.local/bin/:$PATH
CUDA_VISIBLE_DEVICES=0 deepspeed $training_commands
fi
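Since the -9 exit points at host memory exhaustion, it can help to check available CPU RAM before launching. A Linux-only sketch reading `/proc/meminfo`:

```shell
# Print available host memory in GB before starting training.
# MemAvailable is the kernel's estimate of memory usable without swapping.
awk '/MemAvailable/ {printf "Available: %.1f GB\n", $2/1024/1024}' /proc/meminfo
```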
I have downloaded the dataset and model locally.
I have tested it, and it works well on my side; perhaps your CPU memory is too small.
Ideally your CPU should have at least 128 GB of RAM.
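The 128 GB figure is consistent with a back-of-envelope estimate: with `--adam_offload`, ZeRO stage 2 keeps the fp32 master weights plus Adam's momentum and variance in host memory, roughly 12 bytes per parameter. A rough sketch (the 7B parameter count is approximate, and activations, buffers, and the OS need headroom on top):

```shell
# Rough CPU-RAM estimate for ZeRO-2 with --adam_offload on a ~7B model.
PARAMS=7000000000
BYTES_PER_PARAM=12   # 4 (fp32 master copy) + 4 (momentum) + 4 (variance)
OPT_GB=$(( PARAMS * BYTES_PER_PARAM / 1024 / 1024 / 1024 ))
echo "Optimizer states alone: ~${OPT_GB} GB of CPU RAM"
```

On top of those ~78 GB, DeepSpeed also offloads gradients and needs pinned buffers, so a 64 GB host runs out well before the optimizer states are initialized, matching the OOM kill in the log above.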
thx for your reply :)
Related Issues (20)
- PPO produces a timeout error after switching to ZeRO stage 3 HOT 2
- maybe data bug with dpo trainer HOT 1
- QLORA model loading error HOT 5
- We are comparing the performance of DSChat and OpenRLHF for technology selection; could you provide the fixed DSChat code so we can reproduce the community's performance comparison results HOT 7
- Avoid monkey patching vLLM HOT 1
- Claim your paper on HF HOT 1
- [Question] EOS in reward model dataset HOT 3
- action_log_probs is computed twice HOT 2
- Support Llama-3 models HOT 1
- Incompatibility with Qwen HOT 2
- Suggestion on the configurations HOT 1
- Strange Kill of Critic Model HOT 5
- Training DPO with Deepseek-lite reports "expected mat1 and mat2 to have the same type, but got: float != c10::BFloat16" HOT 2
- Will 2 x GPU setups be supported HOT 1
- Dummy token for prompts in HH datasets HOT 2
- Does this codebase consider using "torch.compile"? HOT 2
- wrong action_log_probs returned? HOT 1
- Can support for SimPO be added?
- zero3 training error HOT 1
- Failed to update weights to vLLM