Comments (11)
What are your training arguments, e.g. batch_size, max_seq_len and gradient_accumulation_steps?
Using the default training arguments takes approximately 22GB-23GB of GPU memory for training the model, NOT including the memory used by the datasets and the Python scripts.
Since you are training on CPU and your available RAM is 22.47 GB, it may lead to system memory swapping, and training on CPU is extremely slow, so it can look as if your system is stuck.
from chatlm-mini-chinese.
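The three arguments trade memory for throughput: the effective batch size is batch_size * gradient_accumulation_steps, so a common way to fit into less memory is to lower batch_size and raise gradient_accumulation_steps. A tiny sketch of that relation (the values are placeholders, not this project's defaults):

batch_size = 4                      # placeholder: samples processed per forward/backward pass
gradient_accumulation_steps = 8     # placeholder: micro-batches accumulated before one optimizer step
effective_batch_size = batch_size * gradient_accumulation_steps
print(effective_batch_size)         # 32: same effective batch as batch_size=32, but far less activation memory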
I forgot to mention that the 22GB-23GB GPU memory usage is based on amp-bfloat16. The CPU can NOT compute the half-float data type and uses fp32 instead, so it may use more than 23GB of RAM.
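Roughly, amp-bfloat16 means the forward pass runs inside an autocast context on the GPU, while on CPU everything stays in fp32. A minimal sketch of the difference (the Linear model and random batch are placeholders, not this project's actual model):

import torch

model = torch.nn.Linear(512, 512)   # placeholder model
batch = torch.randn(8, 512)         # placeholder batch

if torch.cuda.is_available():
    model, batch = model.cuda(), batch.cuda()
    # amp-bfloat16: activations are computed in bf16, roughly halving activation memory
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        out = model(batch)
else:
    # on CPU the forward pass stays in full fp32, so memory usage is higher
    out = model(batch)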
I have solved this problem, thank you for your reply, but I have encountered a new one: during training I keep getting TypeError: 'NoneType' object is not subscriptable when reading data. This error causes the program to exit abnormally. I have roughly checked my data and there is no empty data.
from chatlm-mini-chinese.
The error is thrown by this file, line 102, as the picture below shows.
It looks like the data field prompt or response is of None type; prompt and response must be strings with length > 0.
Check your dataset file for any None-type values (for example, load your dataset with pandas and use DataFrame.dropna() to drop None rows), and make sure your dataset is clean.
from chatlm-mini-chinese.
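A minimal cleaning sketch, assuming the dataset is a Parquet file with prompt and response columns (the file paths are placeholders):

import pandas as pd

df = pd.read_parquet("my_dataset.parquet")                 # placeholder path
df = df.dropna(subset=["prompt", "response"])              # drop rows where either field is None
df = df[(df["prompt"].str.len() > 0) & (df["response"].str.len() > 0)]  # drop empty strings
df.to_parquet("my_dataset_clean.parquet")                  # placeholder path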
Indeed, there was an issue with my data. Now, when training with multiple GPUs, I have run into an issue with NCCL.
from chatlm-mini-chinese.
How many GPUs do you have? What's your Linux kernel version? For running with torch >= 2.1 and accelerate, the Linux kernel version should be >= 5.5.
This code runs well on my machine with 2 GPUs, so it looks like an environment problem on your side.
You should google it or see accelerate/issues/2174#issuecomment-1821295563.
from chatlm-mini-chinese.
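A quick way to check the kernel version and the visible GPUs from Python (standard library and torch calls, not project code):

import platform
import torch

print("kernel:", platform.release())            # e.g. 5.15.0-91-generic, should be >= 5.5
print("gpus  :", torch.cuda.device_count())     # number of visible CUDA devices
if torch.cuda.is_available():
    print("nccl  :", torch.cuda.nccl.version()) # NCCL version bundled with this torch build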
I trained on four A10 GPUs, and the Linux kernel version is 5.15.0-91-generic. Is there any other way to solve this besides upgrading my kernel version? Upgrading involves too many changes. Besides that, I tried setting the environment variables NCCL_P2P_DISABLE=1 and NCCL_IB_DISABLE=1 to disable P2P, but it still does not work.
from chatlm-mini-chinese.
Kernel 5.15 is OK. Does it work with one GPU? Have you run the accelerate config command to configure your machine? Have you tried the commands below? --num_processes means how many GPUs you want to use.
# this project trainer
accelerate launch --multi_gpu --num_processes 4 ./train.py train
# huggingface trainer
accelerate launch --multi_gpu --num_processes 4 pre_train.py
If the Some NCCL operations have failed or timed out error only happens in the evaluate stage, check whether there is any free GPU memory at that time (maybe some of your evaluation sequences are too long, leading to CUDA OOM), or try to skip the evaluate stage and do training only.
from chatlm-mini-chinese.
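One way to see whether long evaluation samples are the culprit is to measure the tokenized lengths of the eval set before launching. A rough sketch, assuming a Parquet eval file with prompt and response columns and a Hugging Face tokenizer directory (both paths are placeholders):

import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./model_save/tokenizer")   # placeholder path
df = pd.read_parquet("eval_dataset.parquet")                          # placeholder path

lengths = [len(tokenizer(p)["input_ids"]) + len(tokenizer(r)["input_ids"])
           for p, r in zip(df["prompt"], df["response"])]
print("max eval length:", max(lengths))   # one very long sample pads its whole batch up to this size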
Yes, I ran the program using the commands above, and it did indeed run out of GPU memory during the evaluation phase. Setting the number of GPUs to 2 with the same commands caused the same problem. I am puzzled why the GPUs go OOM even though I set the batch size to be very small. How can I modify the code to skip the evaluation phase and only do training?
from chatlm-mini-chinese.
- Some of your evaluation sequences are too long; they are padded to the max length in a batch while running evaluate, which may lead to CUDA OOM.
- How to skip the evaluation stage?
- If you are using this project's trainer, comment out or delete model/trainer.py#Line373 to model/trainer.py#Line400; the model checkpoint will still be saved in model/trainer.py#L344.
- If you are using the huggingface trainer, set eval_dataset=None in pre_train.py#Line108 (a sketch follows below).
from chatlm-mini-chinese.
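A minimal sketch of what disabling evaluation looks like with the huggingface Trainer; model and train_dataset stand for whatever pre_train.py already builds, and the argument values are placeholders, not this project's actual settings:

from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./model_save",          # placeholder
    per_device_train_batch_size=8,      # placeholder
    # the evaluation strategy is left at its default ("no"), so no evaluate step is scheduled
)

trainer = Trainer(
    model=model,                 # the model built earlier in pre_train.py
    args=args,
    train_dataset=train_dataset,
    eval_dataset=None,           # no eval dataset: trainer.train() never calls evaluate()
)
trainer.train()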
It is indeed this problem. I commented out the evaluation code and it can now train normally. Thank you for your answer.
from chatlm-mini-chinese.
Related Issues (20)
- Why triplet predictions are incorrect after fine-tuning HOT 5
- Shape mismatch when using train.py HOT 10
- Error during SFT fine-tuning HOT 4
- How to extract the outputs of intermediate layers? HOT 2
- Any plans for a version that supports llama? HOT 1
- RuntimeError: No executable batch size found, reached zero HOT 2
- How to load the model after SFT? HOT 1
- train_3.5M_CN data processing issue HOT 1
- The model seems to lack the ability to hold long conversations; how should it be trained to gain it? HOT 1
- How many tokens do these pre-training datasets add up to? HOT 2
- A very nice open-source project HOT 1
- Must the pre-training dataset be in the {"prompt": ..., "response": ...} format? HOT 2
- Some NCCL operations have failed or timed out. HOT 5
- sft_train HOT 1
- Would you consider also uploading the pre-trained model and the SFT-only model to the platform? HOT 1
- This only learns through prompt-response pairs; is there a way to learn a knowledge base in an MLM fashion? HOT 1
- Pre-training with 1.6 million samples (2GB of sentence pairs) on a 48GB A40 reports OOM no matter whether 1/2/3/4 GPUs are used HOT 1
- Can it be trained on AMD GPUs? HOT 1
- Many tokens in the tokenizer vocabulary contain underscores; what do they mean? HOT 1
- An RTX 4080 can barely handle any data; it errors out once training data exceeds 10,000 samples HOT 2