Comments (11)

charent commented on July 29, 2024

I forgot to mention: the 22GB-23GB GPU memory usage is based on AMP with bfloat16. The CPU cannot compute with half-precision floats and uses fp32 instead, so training on CPU may use more than 23GB of RAM.
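
As a rough illustration of why fp32 needs more memory than bfloat16, the sketch below compares the size of a single activation tensor in both data types. The shape values are hypothetical and not taken from this project, and real usage also depends on weights, gradients and optimizer state.

```python
# Bytes per element for the two data types discussed above.
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2}


def tensor_bytes(shape, dtype):
    """Return the memory needed by one dense tensor of the given shape."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * BYTES_PER_ELEMENT[dtype]


# Hypothetical activation shape: (batch_size, max_seq_len, hidden_size).
shape = (16, 256, 768)
fp32_mib = tensor_bytes(shape, "fp32") / 2**20
bf16_mib = tensor_bytes(shape, "bf16") / 2**20
print(f"fp32: {fp32_mib:.1f} MiB, bf16: {bf16_mib:.1f} MiB")
```

Every tensor that is kept in fp32 instead of bfloat16 takes twice the space, which is why the CPU run can exceed the GPU figure.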

from chatlm-mini-chinese.

charent commented on July 29, 2024

What are your training arguments, e.g. batch_size, max_seq_len and gradient_accumulation_steps?

With the default training arguments, training the model uses approximately 22GB-23GB of GPU memory. This does NOT include the memory used by the datasets and the Python scripts.

Since you are training on CPU and only 22.47 GB of RAM is available, the system may start swapping memory. Training on CPU is also far too slow, so it looks as if your system is stuck.

anyiz commented on July 29, 2024

I have solved this problem. Thank you for your reply. However, I have run into a new problem: during training, I keep getting this error when reading data: TypeError: 'NoneType' object is not subscriptable. The error causes the program to exit abnormally. I have roughly checked my data and found no empty entries.

charent commented on July 29, 2024

The error is thrown at line 102 of this file, as shown in the picture below.
[image]
It looks like the data field prompt or response is None. Both prompt and response must be strings with length > 0.
Check your dataset file for None values (for example, if you load your dataset with pandas, use DataFrame.dropna() to drop rows with missing data) and make sure your dataset is clean.
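
A minimal sketch of such a check in plain Python follows. The field names prompt and response come from the comment above. The helper names and the sample rows are hypothetical. With pandas, DataFrame.dropna(subset=["prompt", "response"]) removes the None rows, but it keeps empty strings, so a length check is still needed.

```python
def is_valid_text(value):
    """A field is valid only if it is a non-empty string."""
    return isinstance(value, str) and len(value) > 0


def split_clean_rows(rows):
    """Split rows into (clean_rows, bad_indices) based on prompt/response."""
    clean, bad_indices = [], []
    for i, row in enumerate(rows):
        if is_valid_text(row.get("prompt")) and is_valid_text(row.get("response")):
            clean.append(row)
        else:
            bad_indices.append(i)
    return clean, bad_indices


# Hypothetical sample rows; real data would come from your dataset file.
rows = [
    {"prompt": "hello", "response": "hi"},
    {"prompt": None, "response": "missing prompt"},
    {"prompt": "empty response", "response": ""},
    {"response": "no prompt key"},
]
clean, bad = split_clean_rows(rows)
print(f"kept {len(clean)} rows, dropped rows at indices {bad}")
```

Printing the indices of the rejected rows makes it easier to find the bad records in the original file than a rough manual check.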

anyiz commented on July 29, 2024

Indeed, there was an issue with my data. Now that I am training with multiple GPUs, I get an NCCL error.

charent commented on July 29, 2024

How many GPUs do you have? What is your Linux kernel version? To run with torch >= 2.1 and accelerate, the Linux kernel version should be >= 5.5.

This code runs well on my machine with 2 GPUs. It looks like a problem with your environment.

You should google it or see accelerate/issues/2174#issuecomment-1821295563.
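
The kernel requirement above can be checked with a short script. This is a sketch using only the standard library. The helper names are hypothetical, and the 5.5 minimum is the value quoted in this comment.

```python
import platform
import re


def parse_kernel_version(release):
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_is_at_least(release, minimum=(5, 5)):
    """Compare as integer tuples so that 5.15 counts as newer than 5.5."""
    return parse_kernel_version(release) >= minimum


if platform.system() == "Linux":
    release = platform.release()
    print(release, "OK" if kernel_is_at_least(release) else "older than 5.5")
```

Comparing integer tuples avoids the common mistake of comparing version strings, where "5.15" would sort before "5.5".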

anyiz commented on July 29, 2024

I am training on four A10 GPUs, and the Linux kernel version is 5.15.0-91-generic. Is there any other way to solve this besides upgrading my kernel? Upgrading would involve too much work. I also tried setting the environment variables NCCL_P2P_DISABLE=1 and NCCL_IB_DISABLE=1 to disable P2P, but it still does not work.

charent commented on July 29, 2024

Kernel 5.15 is OK. Does it work with one GPU? Have you run the accelerate config command to configure your machine? Have you tried the commands below? --num_processes is the number of GPUs you want to use.

# this project trainer
accelerate launch --multi_gpu --num_processes 4 ./train.py train

# huggingface trainer
accelerate launch --multi_gpu --num_processes 4 pre_train.py
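
The same launch can also be scripted from Python, which makes it easy to vary the number of GPUs or to pass NCCL settings. This is a minimal sketch: build_launch is a hypothetical helper, and the two NCCL variables are the ones mentioned earlier in this thread.

```python
import os
import shlex


def build_launch(num_processes, script, script_args=(), disable_p2p_ib=False):
    """Return (command, env) for an accelerate multi-GPU launch."""
    command = [
        "accelerate", "launch", "--multi_gpu",
        "--num_processes", str(num_processes),
        script, *script_args,
    ]
    env = dict(os.environ)  # copy, so the current process is not modified
    if disable_p2p_ib:
        # NCCL reads these variables when the training processes start.
        env["NCCL_P2P_DISABLE"] = "1"
        env["NCCL_IB_DISABLE"] = "1"
    return command, env


command, env = build_launch(4, "./train.py", ["train"], disable_p2p_ib=True)
print(shlex.join(command))
# To actually start training, pass both values to the standard subprocess
# module: subprocess.run(command, env=env, check=True)
```

Testing with num_processes set to 1 first helps separate a multi-GPU communication problem from a problem in the training code itself.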

If the "Some NCCL operations have failed or timed out" error only happens in the evaluation stage, check whether any GPU memory is still available at that time (maybe some of your evaluation sequences are too long, which leads to CUDA OOM), or try skipping the evaluation stage and doing training only.

anyiz commented on July 29, 2024

Yes, I ran the program using the commands above, and it did indeed run out of GPU memory during the evaluation phase. Setting the number of GPUs to 2 with the same commands caused the same problem. I am puzzled why the GPUs run out of memory even though I set the batch size to be very small. How can I modify the code to skip the evaluation phase and only train?

charent commented on July 29, 2024

  1. Some of your evaluation sequences are too long. During evaluation, every sequence is padded to the maximum length in its batch, which may lead to CUDA OOM.
  2. How to skip the evaluation stage?
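
Point 1 explains why a small batch size can still run out of memory. The sketch below counts how many token positions a batch occupies once it is padded to its longest sequence. The sequence lengths are hypothetical, and capping the length at 256 is an assumed mitigation for illustration, not something this project is confirmed to do.

```python
def padded_tokens(lengths):
    """Token positions held in memory when a batch is padded to its longest sequence."""
    return len(lengths) * max(lengths)


def truncate(lengths, max_len):
    """Cap every sequence length at max_len."""
    return [min(n, max_len) for n in lengths]


# Hypothetical evaluation batch: seven short sequences and one outlier.
lengths = [40, 52, 38, 61, 45, 50, 47, 4096]
print("real tokens:  ", sum(lengths))
print("padded tokens:", padded_tokens(lengths))
print("after cap 256:", padded_tokens(truncate(lengths, 256)))
```

A single very long sample makes every other sequence in its batch as long as itself, so checking the maximum sequence length in the evaluation set is a quick first diagnostic.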

anyiz commented on July 29, 2024

It is indeed this problem. With the evaluation code commented out, it trains normally. Thank you for your answer.
