Comments (13)
@Vincent131499 @taishiciR Hi!
I was able to reproduce the issue and I've come up with a fix. PEFT LoRA is fully functional in #68 and will be merged into main soon. The only problem is that it only works with `peft>=0.3.0dev0`, which is the latest dev version (so install it with `pip install git+https://github.com/huggingface/peft.git@main`). You can check out the `lora` branch and work with LoRA right away.
P.S. The only model not supporting LoRA right now is vanilla `gpt2`, since it uses convolutions instead of `Linear` layers, and that really breaks everything.
@taishiciR Hi!
Are you using a `device_map`? You can create a device map from a meta model like this: `device_map = tensor_parallel.infer_sharded_device_map(model)` and then pass it to `accelerate.load_checkpoint_in_model` as `device_map=device_map`. Without it, `accelerate` doesn't know where to dispatch tensors and produces the error you're seeing.
It can also be caused by tied weights, but I've only seen that with the T5 model. LLaMA should definitely work.
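For reference, a minimal sketch of that flow; the model id and checkpoint path are placeholders, and the checkpoint is assumed to already be converted to tensor_parallel format:

```python
# Minimal sketch: build a device map from a meta model and let accelerate
# dispatch a checkpoint with it. Model id and paths are placeholders, and the
# checkpoint is assumed to already be in tensor_parallel format.
import tensor_parallel as tp
from accelerate import init_empty_weights, load_checkpoint_in_model
from transformers import AutoConfig, AutoModelForCausalLM

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained("huggyllama/llama-7b"))
model = tp.TensorParallelPreTrainedModel(model)

device_map = tp.infer_sharded_device_map(model)  # where each parameter should live
load_checkpoint_in_model(model, checkpoint="/path/to/converted_checkpoint", device_map=device_map)
```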
@Vincent131499 Hi!
Simply calling `tensor_parallel` on a model with already loaded adapters should just work: auto configuration will take care of the adapters. Exactly what problems have you run into?
Predefined `tensor_parallel` configs for `transformers` models might not work properly, but they won't be triggered on a `PeftModel` anyway. I'll try to come up with an elegant solution for this later.
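A minimal sketch of that usage; the base model id and LoRA hyperparameters below are placeholders:

```python
# Minimal sketch: load adapters first, then shard everything with tensor_parallel.
# The model id and LoRA hyperparameters are placeholders.
import tensor_parallel as tp
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, lora_config)

# Auto configuration shards the LoRA layers along with the base weights
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])
```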
It would be better to support shard-loading the base model and LoRA directly to GPU, if that's possible.
@sgsdxzy It's supported.
You can create a `tensor_parallel` model from a meta model and then dispatch the weights however you like. For example, this demo does it like this (see the sketch after the list):

- Initialize a meta model with `accelerate.init_empty_weights()`
- Create a `TensorParallelPreTrainedModel` from the meta model
- Load model shards and convert them with `tensor_parallel.convert_state_dict`
- Dispatch them with `accelerate.load_checkpoint_in_model`

You could add LoRA to the meta model after step 1, either as real weights or as meta weights to be initialized later.
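A sketch of those four steps; the shard paths, the model id, and the exact `convert_state_dict` arguments are assumptions based on the linked demo, not guaranteed API:

```python
# Sketch of the four steps above. Shard paths, the model id, and the
# convert_state_dict arguments are assumptions based on the linked demo.
import torch
import tensor_parallel as tp
from accelerate import init_empty_weights, load_checkpoint_in_model
from transformers import AutoConfig, AutoModelForCausalLM

# 1) meta model: no real weights are allocated yet
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained("huggyllama/llama-7b"))

# 2) wrap it for tensor parallelism (LoRA could be added to the meta model here)
model = tp.TensorParallelPreTrainedModel(model, device_ids=["cuda:0", "cuda:1"])
device_map = tp.infer_sharded_device_map(model)

# 3) + 4) convert each checkpoint shard, then let accelerate dispatch it
for shard_path in ["/path/to/pytorch_model-00001.bin"]:  # placeholder shard list
    converted = tp.convert_state_dict(
        torch.load(shard_path),
        model.tensor_parallel_config,
        world_size=2,
        for_pretrained=True,
    )
    torch.save(converted, "/tmp/converted_shard.bin")
    load_checkpoint_in_model(model, checkpoint="/tmp/converted_shard.bin", device_map=device_map)
```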
I'm not sure how LoRA is natively dispatched, but dispatching it by hand should always work.
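If it comes to that, a hypothetical by-hand dispatch (assuming the adapters were added as real, non-meta tensors, and a single target device) could look like this:

```python
# Hypothetical by-hand dispatch: pin every LoRA parameter to a concrete device.
# Assumes the adapters exist as real (non-meta) tensors.
from accelerate.utils import set_module_tensor_to_device

for name, param in model.named_parameters():
    if "lora_" in name:
        set_module_tensor_to_device(model, name, "cuda:0", value=param.data)
```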
Hi @BlackSamorez, any plan to support PEFT LoRA models?
I came across the same problem using another TP strategy: when shard-initializing a model, the LoRA layers just don't work well in the forward step.
Are there any examples for fine-tuning LLaMA with LoRA?
Looking forward to your reply, thanks!
When I tried to apply tensor parallelism to LLaMA with LoRA in https://github.com/tloen/alpaca-lora/blob/main/finetune.py, I came across a runtime error in huggingface transformers/trainer.py.
Looking at it in detail, there is no such attribute on a `TensorParallelPreTrainedModel`:
`AttributeError: 'TensorParallelPreTrainedModel' object has no attribute '_is_int8_training_enabled'`
Hi~ any workaround for this issue, @BlackSamorez?
Besides https://www.kaggle.com/code/blacksamorez/tensor-parallel-int8-llm, I also tried the example above with both the main and lora branches; it just doesn't work (changing the model to llama-7b-hf + LoRA).
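One untested workaround sketch, on the assumption that the `Trainer` only reads this private flag (the attribute name is taken straight from the traceback):

```python
# Untested workaround sketch: the TP wrapper doesn't proxy this private flag,
# so set it manually before handing the model to the Trainer.
# The attribute name comes straight from the traceback above.
model._is_int8_training_enabled = False
```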
@BlackSamorez @Vincent131499 @taishiciR I think you can just merge the LoRA before applying TP; that would give the best performance:

```python
from peft import PeftModel

lora_model = PeftModel.from_pretrained(base_model, lora_path)
# merge weights - new merging method from peft, requires peft > 0.2.0
lora_model = lora_model.merge_and_unload()
# apply TP here
```

Though you need to load the entire model first, you can load it on CPU and let TP dispatch it.
@sgsdxzy If we're talking about inference, then yes, merging is the way to go. But for adapter training, merging is obviously not an option.
@BlackSamorez hi~
With the updated dependencies, I tried the example from the demo again, following these steps:
1) Initialize a meta model with `accelerate.init_empty_weights()`
2.1) `get_peft_model(model, lora_config)` # wrap llama-7b with LoRA
2.2) Create a `TensorParallelPreTrainedModel` from the meta model
3) Load model shards and convert them with `tensor_parallel.convert_state_dict`
4) Dispatch them with `accelerate.load_checkpoint_in_model`
I came across an error from accelerate/utils/modeling.py:935 (accelerate 0.18.0): `wrapped_model.module_shards.0.model.layers.0.self_attn.q_proj.weight` doesn't have any device set.
The device_map of the model turns out to contain: `'wrapped_model.module_shards.0.base_model.model.model.layers.0.self_attn.q_proj.tp_wrapped_module.weight': device(type='cuda', index=0)`
Can you please check this?
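A quick debugging sketch to surface the name mismatch; it assumes `device_map` came from `infer_sharded_device_map` and that its keys are parameter names:

```python
# Debugging sketch (assumed variables): list parameters the inferred
# device_map doesn't cover. Here the mismatch is the "base_model.model."
# prefix and the "tp_wrapped_module" wrapper present in only one name set.
uncovered = [name for name, _ in model.named_parameters() if name not in device_map]
print(uncovered[:5])
```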
thank you~
Since there seems to be no more activity, I'm closing this issue.
Feel free to reopen it if anything comes up.