Comments (11)
What are your training arguments, e.g. batch_size, max_seq_len and gradient_accumulation_steps?
Using the default training arguments takes approximately 22GB-23GB of GPU memory for training the model, NOT including the memory used by the datasets and the Python scripts.
Since you are training on CPU and your available RAM is 22.47 GB, it may lead to system memory swapping, and training on CPU is extremely slow, so it can look as if your system is stuck.
from chatlm-mini-chinese.
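The three arguments trade memory for throughput: the effective batch size is batch_size * gradient_accumulation_steps, so a common way to fit into less memory is to lower batch_size and raise gradient_accumulation_steps. A tiny sketch of that relation (the values are placeholders, not this project's defaults):

batch_size = 4                      # placeholder: samples processed per forward/backward pass
gradient_accumulation_steps = 8     # placeholder: micro-batches accumulated before one optimizer step
effective_batch_size = batch_size * gradient_accumulation_steps
print(effective_batch_size)         # 32: same effective batch as batch_size=32, but far less activation memory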
I forgot to mention that the 22GB-23GB GPU memory usage is based on amp-bfloat16. The CPU can NOT compute the half-float data type and uses fp32 instead, so it may use more than 23GB of RAM.
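Roughly, amp-bfloat16 means the forward pass runs inside an autocast context on the GPU, while on CPU everything stays in fp32. A minimal sketch of the difference (the Linear model and random batch are placeholders, not this project's actual model):

import torch

model = torch.nn.Linear(512, 512)   # placeholder model
batch = torch.randn(8, 512)         # placeholder batch

if torch.cuda.is_available():
    model, batch = model.cuda(), batch.cuda()
    # amp-bfloat16: activations are computed in bf16, roughly halving activation memory
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        out = model(batch)
else:
    # on CPU the forward pass stays in full fp32, so memory usage is higher
    out = model(batch)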
I have solved this problem, thank you for your reply, but I have encountered a new one: during training I keep getting TypeError: 'NoneType' object is not subscriptable when reading data. This error causes the program to exit abnormally. I have roughly checked my data and there is no empty data.
from chatlm-mini-chinese.
The error is thrown by this file, line 102, as the picture below shows.
It looks like the data field prompt or response is of None type; prompt and response must be strings with length > 0.
Check your dataset file for any None-type values (for example, load your dataset with pandas and use DataFrame.dropna() to drop None rows), and make sure your dataset is clean.
from chatlm-mini-chinese.
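A minimal cleaning sketch, assuming the dataset is a Parquet file with prompt and response columns (the file paths are placeholders):

import pandas as pd

df = pd.read_parquet("my_dataset.parquet")                 # placeholder path
df = df.dropna(subset=["prompt", "response"])              # drop rows where either field is None
df = df[(df["prompt"].str.len() > 0) & (df["response"].str.len() > 0)]  # drop empty strings
df.to_parquet("my_dataset_clean.parquet")                  # placeholder path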
Indeed, there was an issue with my data. Now, when training with multiple GPUs, I have run into an issue with NCCL.
from chatlm-mini-chinese.
How many GPUs do you have? What's your Linux kernel version? For running with torch >= 2.1 and accelerate, the Linux kernel version should be >= 5.5.
This code runs well on my machine with 2 GPUs, so it looks like an environment problem on your side.
You should google it or see accelerate/issues/2174#issuecomment-1821295563.
from chatlm-mini-chinese.
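A quick way to check the kernel version and the visible GPUs from Python (standard library and torch calls, not project code):

import platform
import torch

print("kernel:", platform.release())            # e.g. 5.15.0-91-generic, should be >= 5.5
print("gpus  :", torch.cuda.device_count())     # number of visible CUDA devices
if torch.cuda.is_available():
    print("nccl  :", torch.cuda.nccl.version()) # NCCL version bundled with this torch build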
I trained on four A10 GPUs, and the Linux kernel version is 5.15.0-91-generic. Is there any other way to solve this besides upgrading my kernel version? Upgrading involves too many changes. Besides that, I tried setting the environment variables NCCL_P2P_DISABLE=1 and NCCL_IB_DISABLE=1 to disable P2P, but it still does not work.
from chatlm-mini-chinese.
Kernel 5.15 is OK. Does it work with one GPU? Have you run the accelerate config command to configure your machine? Have you tried the commands below? --num_processes means how many GPUs you want to use.
# this project trainer
accelerate launch --multi_gpu --num_processes 4 ./train.py train
# huggingface trainer
accelerate launch --multi_gpu --num_processes 4 pre_train.py
If the Some NCCL operations have failed or timed out error only happens in the evaluate stage, check whether there is any free GPU memory at that time (maybe some of your evaluation sequences are too long, leading to CUDA OOM), or try to skip the evaluate stage and do training only.
from chatlm-mini-chinese.
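One way to see whether long evaluation samples are the culprit is to measure the tokenized lengths of the eval set before launching. A rough sketch, assuming a Parquet eval file with prompt and response columns and a Hugging Face tokenizer directory (both paths are placeholders):

import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./model_save/tokenizer")   # placeholder path
df = pd.read_parquet("eval_dataset.parquet")                          # placeholder path

lengths = [len(tokenizer(p)["input_ids"]) + len(tokenizer(r)["input_ids"])
           for p, r in zip(df["prompt"], df["response"])]
print("max eval length:", max(lengths))   # one very long sample pads its whole batch up to this size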
Yes, I ran the program using the commands above, and it did indeed run out of GPU memory during the evaluation phase. Setting the number of GPUs to 2 with the same commands caused the same problem. I am puzzled why the GPUs go OOM even though I set the batch size to be very small. How can I modify the code to skip the evaluation phase and only do training?
from chatlm-mini-chinese.
- Some of your evaluation sequences are too long; they are padded to the max length in a batch while running evaluate, which may lead to CUDA OOM.
- How to skip the evaluation stage?
- If you are using this project's trainer, comment out or delete model/trainer.py#Line373 to model/trainer.py#Line400; the model checkpoint will still be saved in model/trainer.py#L344.
- If you are using the huggingface trainer, set eval_dataset=None in pre_train.py#Line108 (a sketch follows below).
from chatlm-mini-chinese.
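A minimal sketch of what disabling evaluation looks like with the huggingface Trainer; model and train_dataset stand for whatever pre_train.py already builds, and the argument values are placeholders, not this project's actual settings:

from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./model_save",          # placeholder
    per_device_train_batch_size=8,      # placeholder
    # the evaluation strategy is left at its default ("no"), so no evaluate step is scheduled
)

trainer = Trainer(
    model=model,                 # the model built earlier in pre_train.py
    args=args,
    train_dataset=train_dataset,
    eval_dataset=None,           # no eval dataset: trainer.train() never calls evaluate()
)
trainer.train()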
It is indeed this problem. I commented out the evaluation code and it can now train normally. Thank you for your answer.
from chatlm-mini-chinese.
Related Issues (20)
- Why triplet predictions are incorrect after fine-tuning HOT 5
- Shape mismatch when using train.py HOT 10
- Error during SFT fine-tuning HOT 4
- How to extract the outputs of intermediate layers? HOT 2
- Any plans for a version that supports llama? HOT 1
- RuntimeError: No executable batch size found, reached zero HOT 2
- How to load the model after SFT? HOT 1
- train_3.5M_CN data processing issue HOT 1
- The model seems to lack the ability to hold long conversations; how should it be trained to gain it? HOT 1
- How many tokens do these pre-training datasets add up to? HOT 2
- A very nice open-source project HOT 1
- Must the pre-training dataset be in the {"prompt": ..., "response": ...} format? HOT 2
- Some NCCL operations have failed or timed out. HOT 5
- sft_train HOT 1
- Would you consider also uploading the pre-trained model and the SFT-only model to the platform? HOT 1
- This only learns through prompt-response pairs; is there a way to learn a knowledge base in an MLM fashion? HOT 1
- Pre-training with 1.6 million samples (2GB of sentence pairs) on a 48GB A40 reports OOM no matter whether 1/2/3/4 GPUs are used HOT 1
- Can it be trained on AMD GPUs? HOT 1
- Many tokens in the tokenizer vocabulary contain underscores; what do they mean? HOT 1
- An RTX 4080 can barely handle any data; it errors out once training data exceeds 10,000 samples HOT 2