Comments (11)
This code doesn't work on multi-GPU yet; I'm still running it on my single RTX 4090. Might adapt to multi-GPU in a bit to speed up training.
(Also, please git pull if you haven't recently; I fixed a bug in the dataset generation code.)
from alpaca-lora.
Just accelerate launch finetune.py. It works.
from alpaca-lora.
The PEFT code needs to be adapted to make better use of accelerate. I think there are some examples of how to do it in the huggingface/peft repo but I can't test them as I don't have a multi-GPU setup myself.
from alpaca-lora.
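Roughly, that adaptation could look like the sketch below; the checkpoint name, LoRA hyperparameters, and optimizer settings here are placeholder assumptions, not the repo's actual values:

import torch
from accelerate import Accelerator
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Accelerator picks up whatever multi-GPU setup `accelerate config` wrote
accelerator = Accelerator()

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # assumed checkpoint
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# prepare() wraps the model for DDP and moves everything to the right devices
model, optimizer = accelerator.prepare(model, optimizer)
# ...and inside the training loop, loss.backward() becomes:
# accelerator.backward(loss)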
Might adapt to multi-GPU in a bit to speed up training.
Could you point me to the reason why this is not working on multiple GPUs? I.e., which part is breaking it? The LoRA stuff or the 8-bit stuff? Something else? Thanks!
from alpaca-lora.
The PEFT code needs to be adapted to make better use of accelerate. I think there are some examples of how to do it in the huggingface/peft repo but I can't test them as I don't have a multi-GPU setup myself.
That would be neat. Kaggle already offers 2xT4 with 2x16GB of VRAM, which would probably be quite slow but likely enough to train a 13B model.
from alpaca-lora.
this may work for you: https://discord.com/channels/1086739839761776660/1087706061022187641/1087944493564698674
This seems to be a private link. What is it about?
from alpaca-lora.
I use the following command to run on multiple GPUs:
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch finetune.py
But only GPUs 0 and 1 are actually used, and I don't know why.
from alpaca-lora.
I use only one GPU; I add some code at the beginning of the script:

import os
import torch

# expose only the chosen physical GPU; this must run before the first CUDA call
gpu_list = [7]
gpu_list_str = ','.join(map(str, gpu_list))
os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)  # no-op if the variable is already set
# with the masking in place, cuda:0 refers to physical GPU 7
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

This assigns the one GPU to torch.
from alpaca-lora.
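To sanity-check that the masking took effect (assuming the snippet above ran before any CUDA call):

print(torch.cuda.device_count())      # should print 1
print(torch.cuda.get_device_name(0))  # should name physical GPU 7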
I use the following command to run on multiple GPUs:
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch finetune.py
But only GPUs 0 and 1 are actually used, and I don't know why.
Try enabling all the GPUs by running accelerate config.
from alpaca-lora.
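If you don't want to re-run the interactive config, you can also pass the GPU count directly on the command line; --multi_gpu and --num_processes are standard accelerate launch flags, and the 4 below is just an example for a 4-GPU machine:

accelerate launch --multi_gpu --num_processes 4 finetune.py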
FWIW, the solution for me was to not use torchrun to launch the script. I was having an issue where single-GPU training worked fine, but with multi-GPU training, after a single update step the model would freeze: gpu-util was at 100% but no more updates happened. Getting rid of torchrun and simply calling the Python script solved it, and it seems to use DDP fine.
It would be great to have more guidance on what kinds of setups work for launching multi-GPU jobs. I'd be happy to contribute information about my setup for that as well :)
from alpaca-lora.
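For reference, a script can stay launcher-agnostic by branching on the environment variables that torchrun and accelerate both set; this is only a sketch of that pattern, assuming the usual WORLD_SIZE/LOCAL_RANK convention:

import os

# WORLD_SIZE > 1 means we were launched under a distributed launcher
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1
if ddp:
    # under DDP, pin this process's model copy to its own GPU
    # instead of letting device_map shard one copy across all GPUs
    device_map = {"": int(os.environ.get("LOCAL_RANK", 0))}
else:
    device_map = "auto"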
FWIW, the solution for me was to not use torchrun to launch the script. I was having an issue where single-GPU training worked fine, but with multi-GPU training, after a single update step the model would freeze: gpu-util was at 100% but no more updates happened. Getting rid of torchrun and simply calling the Python script solved it, and it seems to use DDP fine. It would be great to have more guidance on what kinds of setups work for launching multi-GPU jobs. I'd be happy to contribute information about my setup for that as well :)
Are you able to do multi-GPU inference?
from alpaca-lora.
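In case it helps, one common pattern for multi-GPU inference is to shard the base model across the visible GPUs with device_map="auto" and attach the LoRA weights on top. A sketch, with the model IDs as assumptions rather than whatever your setup uses:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# device_map="auto" lets accelerate spread the layers over all visible GPUs
base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",  # assumed base checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "tloen/alpaca-lora-7b")  # assumed LoRA adapter

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
inputs = tokenizer("Tell me about alpacas.", return_tensors="pt").to(0)  # first shard lives on GPU 0
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))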
Related Issues (20)
- generate error HOT 1
- can't load tokenizer HOT 2
- Load_in_8bit causing issues: Out of memory error with 44Gb VRAM in my GPU or device_map error HOT 1
- AttributeError: module 'gradio' has no attribute 'inputs' HOT 18
- When I set load_in_8bit=true, some errors occurred....
- is there any flag to mark the model is safetensors or pickle format?
- Errors of tuning on 70B LLAMA 2, does alpaca-lora support 70B llama 2 tuning work?
- safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization HOT 15
- generate error after hit submit btn
- The weights are not updated HOT 1
- LAION Open Assistant data is already released
- Loading a quantized checkpoint into non-quantized Linear8bitLt is not supported
- Is it possible to combine alpaca-lora with RAG
- Is there a way to check if this training is all done?
- failed to run on colab: ModulesToSaveWrapper has no attribute `embed_tokens`
- Finetune scenarios
- decapoda-research/llama-7b-hf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models' HOT 2
- Single GPU vs multiple GPUs stack (parallel)
- Why this error? ValueError: We need an `offload_dir` to dispatch this model according to this `device_map`, the following submodules need to be offloaded: base_model.model.model.layers.3, base_model.model.model.layers.4, base_model.model.model.layers.5, base_model.model.model.layers.6, base_model.model.model.layers.7, base_model.model.model.layers.8, base_model.model.model.layers.9, base_model.model.model.layers.10, base_model.model.model.la
- InvalidHeaderDeserialization