
Issue encountered during offline model training (openchatkit, CLOSED)

yxy123 commented on July 18, 2024

The issue below occurred during offline model training.

from openchatkit.

Comments (5)

orangetin commented on July 18, 2024

How many GPUs are you running this on? IIRC, this is an issue when running on <8 GPUs with world-size=8
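The flag values that appear later in this thread (world-size 2 with pipeline-group-size 2 and data-group-size 1, then world-size 1 with 1 and 1) suggest that world-size equals the product of the two group sizes. Assuming that relationship holds, and assuming one rank is launched per GPU, a pre-flight check could look like the sketch below. `check_parallel_config` is a hypothetical helper, not part of OpenChatKit, and the 4 x 2 split in the first example is only illustrative.

```python
def check_parallel_config(world_size, pipeline_group_size, data_group_size, num_gpus):
    """Return a list of problems with the distributed settings (empty if none).

    Assumptions (inferred from this thread, not confirmed by OpenChatKit docs):
    - world_size should equal pipeline_group_size * data_group_size
    - every rank needs its own GPU, so world_size must not exceed num_gpus
    """
    problems = []
    expected = pipeline_group_size * data_group_size
    if world_size != expected:
        problems.append(
            f"world-size is {world_size}, but pipeline-group-size * "
            f"data-group-size is {expected}"
        )
    if world_size > num_gpus:
        problems.append(
            f"world-size is {world_size}, but only {num_gpus} GPU(s) are available"
        )
    return problems


if __name__ == "__main__":
    # world-size 8 on a 2-GPU machine (the situation described above)
    print(check_parallel_config(8, 4, 2, num_gpus=2))
    # The adjusted settings for 2 GPUs
    print(check_parallel_config(2, 2, 1, num_gpus=2))
```

On a real machine, `num_gpus` could be taken from `torch.cuda.device_count()`.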


yxy123 commented on July 18, 2024

I am running on 2 GPUs and changed the training parameters as follows:
--num-layers 4 --embedding-dim 4096 \
--world-size 2 --pipeline-group-size 2 --data-group-size 1 \

The original error no longer appears, but the run now fails with the memory error below:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.74 GiB total capacity; 14.26 GiB already allocated; 12.69 MiB free; 14.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Running environment (GPU):
Graphics card memory: 32 GB
GPU bandwidth: (not specified)
Memory available: 120 GB
Hard drive available space: 400 G


orangetin commented on July 18, 2024

@yxy123 , you ran out of GPU memory. Try decreasing the batch-size or using a smaller model (like togethercomputer/RedPajama-INCITE-Chat-3B-v1).

Let me know if that works.

Closing issue as the original error is fixed.
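To see why this reads as a genuine out-of-memory condition rather than fragmentation, the figures in the error message can be pulled apart. The sketch below parses the message format quoted earlier in this thread and reports the gap between reserved and allocated memory. `summarize_oom` and the regular expression are hypothetical helpers written for that exact wording; other PyTorch versions may phrase the message differently.

```python
import re

# Matches the OOM wording quoted in this thread; other PyTorch versions may differ.
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<request>[\d.]+) (?P<request_unit>[MG]iB) "
    r"\(GPU (?P<gpu>\d+); "
    r"(?P<total>[\d.]+) (?P<total_unit>[MG]iB) total capacity; "
    r"(?P<allocated>[\d.]+) (?P<allocated_unit>[MG]iB) already allocated; "
    r"(?P<free>[\d.]+) (?P<free_unit>[MG]iB) free; "
    r"(?P<reserved>[\d.]+) (?P<reserved_unit>[MG]iB) reserved"
)


def _to_gib(value, unit):
    """Convert a MiB or GiB figure to GiB."""
    return float(value) / 1024 if unit == "MiB" else float(value)


def summarize_oom(message):
    """Extract the memory figures (in GiB) from a CUDA OOM message."""
    match = OOM_PATTERN.search(message)
    if match is None:
        raise ValueError("not a recognized CUDA OOM message")
    names = ("request", "total", "allocated", "free", "reserved")
    info = {name: _to_gib(match[name], match[name + "_unit"]) for name in names}
    info["gpu"] = int(match["gpu"])
    # Memory PyTorch holds in its cache but has not handed out to tensors.
    info["reserved_but_unallocated"] = info["reserved"] - info["allocated"]
    return info


if __name__ == "__main__":
    msg = (
        "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate "
        "256.00 MiB (GPU 0; 15.74 GiB total capacity; 14.26 GiB already "
        "allocated; 12.69 MiB free; 14.56 GiB reserved in total by PyTorch)"
    )
    for key, value in summarize_oom(msg).items():
        print(key, value)
```

For the message above, the gap between reserved and allocated memory comes out at roughly 0.3 GiB on a 15.74 GiB card, so reserved is not much larger than allocated. Under that reading, tuning `max_split_size_mb` is unlikely to be the main fix, and reducing the memory actually needed (smaller batch size or smaller model) is the more promising route.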


yxy123 commented on July 18, 2024

Preparing the pretrained RedPajama-3B model succeeded:
(myconda) root@qd32bL:/mnt/oldKit/OpenChatKit-main/pretrained/RedPajama-3B# ls
prepare.py togethercomputer_RedPajama-INCITE-Chat-3B-v1
I then ran training with finetune_RedPajama-INCITE-Chat-3B-v1.sh after modifying some parameters, as shown below.
I am not sure whether any other parameters in finetune_RedPajama-INCITE-Chat-3B-v1.sh also need to change.

ARGS="--model-name ${BASE_MODEL} \
--tokenizer-name ${BASE_MODEL} \
--project-name together \
--model-type gptneox \
--optimizer adam \
--seed 42 \
--load-pretrained-model true \
--task-name \
"${DATASETS}" \
--checkpoint-path ${CHECKPOINT_PATH} \
--total-steps ${TOTAL_STEPS} --warmup-steps 0 --train-warmup-steps 0 \
--checkpoint-steps ${CHECKPOINT_STEPS} \
--lr 1e-5 --seq-length 2048 --batch-size 32 --micro-batch-size 1 --gradient-accumulate-step 1 \
--dist-url tcp://127.0.0.1:7033 \
--num-layers 4 --embedding-dim 2560 \
--world-size 1 --pipeline-group-size 1 --data-group-size 1 \
--job-id 0 --net-interface ${netif} \
--fp16 \
--dp-backend nccl \
--dp-mode allreduce \
--pp-mode gpipe --profiling no-profiling"

# Launch only rank 0; the rank 1 process is commented out for the single-GPU run.
(trap 'kill 0' SIGINT; \
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
    &
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 1 --rank 1 &
wait)

Running bash training/finetune_RedPajama-INCITE-Chat-3B-v1.sh also hit a memory error, shown below:

Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 15.74 GiB total capacity; 13.21 GiB already allocated; 284.69 MiB free; 14.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF


yxy123 commented on July 18, 2024

@orangetin, could you help check why running bash training/finetune_RedPajama-INCITE-Chat-3B-v1.sh also hits the memory issue shown above?
