Comments (5)
How many GPUs are you running this on? IIRC, this is an issue when running on <8 GPUs with world-size=8
from openchatkit.
With 2 GPUs; modified the training parameters as below:
--num-layers 4 --embedding-dim 4096 \
--world-size 2 --pipeline-group-size 2 --data-group-size 1 \
That passed without the above error, but hit a memory issue, shown below:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.74 GiB total capacity; 14.26 GiB already allocated; 12.69 MiB free; 14.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
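As the traceback itself suggests, one thing worth trying before anything else is the allocator's max_split_size_mb setting. A minimal sketch (the 128 MiB value is an arbitrary starting point, not a tested recommendation):

```shell
# Set before launching training; caps the size of blocks the CUDA caching
# allocator will split, which can reduce fragmentation (value is a guess).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```

This will not help if the model genuinely does not fit, but it can rescue runs where reserved memory is much larger than allocated memory, as in the error above.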
Running environment:
GPU memory: 32 GB
GPU bandwidth:
Memory available: 120 GB
Hard drive available space: 400 GB
@yxy123 , you ran out of GPU memory. Try decreasing the batch-size or using a smaller model (like togethercomputer/RedPajama-INCITE-Chat-3B-v1).
Let me know if that works.
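Following that suggestion, a lower-memory variant of the ARGS used later in this thread might look like the fragment below. The specific values are untested guesses for illustration; only the model name and batch-related flags differ from the original:

```shell
# Hypothetical lower-memory settings (values are guesses, not tested):
BASE_MODEL=togethercomputer/RedPajama-INCITE-Chat-3B-v1
ARGS="--model-name ${BASE_MODEL} \
--batch-size 8 --micro-batch-size 1 --gradient-accumulate-step 4"
```

Keeping micro-batch-size small while raising gradient-accumulate-step is a common way to trade peak memory for extra steps, though how this trainer interprets the flags should be checked against the script.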
Closing issue as the original error is fixed.
Ran pretraining with RedPajama-3B successfully:
(myconda) root@qd32bL:/mnt/oldKit/OpenChatKit-main/pretrained/RedPajama-3B# ls
prepare.py togethercomputer_RedPajama-INCITE-Chat-3B-v1
Then ran the training script finetune_RedPajama-INCITE-Chat-3B-v1.sh, modifying some parameters as below. I don't know whether any other parameters in finetune_RedPajama-INCITE-Chat-3B-v1.sh also need to change.
ARGS="--model-name ${BASE_MODEL}
--tokenizer-name ${BASE_MODEL}
--project-name together
--model-type gptneox
--optimizer adam
--seed 42
--load-pretrained-model true
--task-name "${DATASETS}"
--checkpoint-path ${CHECKPOINT_PATH}
--total-steps ${TOTAL_STEPS} --warmup-steps 0 --train-warmup-steps 0
--checkpoint-steps ${CHECKPOINT_STEPS}
--lr 1e-5 --seq-length 2048 --batch-size 32 --micro-batch-size 1 --gradient-accumulate-step 1
--dist-url tcp://127.0.0.1:7033
--num-layers 4 --embedding-dim 2560
--world-size 1 --pipeline-group-size 1 --data-group-size 1
--job-id 0 --net-interface ${netif}
--fp16
--dp-backend nccl
--dp-mode allreduce
--pp-mode gpipe --profiling no-profiling"
(trap 'kill 0' SIGINT;
python
&
#python
& \
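One possible reading of the parallelism flags above, based on the numbers in this thread (world-size 8 failing on fewer than 8 GPUs; 2 = 2 × 1; 1 = 1 × 1) and not on the script's documentation, is that world-size must equal pipeline-group-size × data-group-size and must not exceed the number of visible GPUs. A quick sanity check under that assumption:

```shell
# Hypothetical sanity check (assumption from this thread:
# world-size = pipeline-group-size * data-group-size, capped by GPU count).
WORLD_SIZE=1
PIPELINE_GROUP_SIZE=1
DATA_GROUP_SIZE=1
NUM_GPUS=2   # e.g. NUM_GPUS=$(nvidia-smi -L | wc -l)

if [ "$WORLD_SIZE" -ne $((PIPELINE_GROUP_SIZE * DATA_GROUP_SIZE)) ]; then
  echo "world-size must equal pipeline-group-size * data-group-size" >&2
  exit 1
fi
if [ "$WORLD_SIZE" -gt "$NUM_GPUS" ]; then
  echo "world-size exceeds available GPUs" >&2
  exit 1
fi
echo "parallelism config looks consistent"
```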
Ran bash training/finetune_RedPajama-INCITE-Chat-3B-v1.sh and also hit a memory issue, shown below:
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 15.74 GiB total capacity; 13.21 GiB already allocated; 284.69 MiB free; 14.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
@orangetin, could you help check why running bash training/finetune_RedPajama-INCITE-Chat-3B-v1.sh also hits the memory issue above?