Comments (2)
Here's a simple piece of code which reproduces the issue:
from transformers import AutoModelForCausalLM, AutoTokenizer
import threading
import tensor_parallel as tp
import torch

checkpoint = "bigcode/starcoderbase"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, low_cpu_mem_usage=True, trust_remote_code=True
)
model = tp.tensor_parallel(model, sharded=False)

def task():
    inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to("cuda")
    outputs = model.generate(inputs)
    print(tokenizer.decode(outputs[0], clean_up_tokenization_spaces=False))

thread1 = threading.Thread(target=task)
thread2 = threading.Thread(target=task)
thread1.start()
thread2.start()
thread1.join()
thread2.join()
from tensor_parallel.
@zoubaihan sadly, tensor_parallel models can't be used concurrently, because concurrent calls break the communications.
For example, the code @linfan provided shows how this happens: two concurrent calls reach a round of communication, each trying to broadcast the first part of some tensor. They overwrite each other and leave the second part None, yet both believe the broadcast is complete, since two parts were communicated in total.
For your specific case I'd recommend finding a library that batches requests, so the hardware is used more efficiently. But the forward calls themselves must not be concurrent.
tldr: no concurrency, use locks
Related Issues (20)
- How to use trained models? HOT 3
- Support LLaMA Models, including HuggingFace-adapted variants HOT 7
- Slow inference performance for large Llama models compared to naive MP HOT 22
- How to load lora weights? HOT 13
- set distributed=True, return AttributeError: 'NoneType' object HOT 2
- cuda memory not evenly distributed between devices HOT 6
- What is the difference between this project and autotp of deepspeed? HOT 1
- Huggingface Accelerate HOT 1
- Great work! and can this work with deepspeedzero?
- Torch version requirement HOT 4
- Error in README.Md, hence not able to load model with limited memory. HOT 5
- Not work with 4bit quant HOT 6
- Support for PEFT LoRA and 4-bit quantization HOT 6
- Request to fix the content about parallelformers in README. HOT 1
- Can I parallelize just one large layer? HOT 1
- Does tensor_parallel support multi-node tensor parallel training? HOT 3
- Does tensor_parallel support data parallel and tensor parallel hybrid training?
- Question on custom models HOT 23
- TypeError when multi-thread inference using tensor_parallel HOT 1