Comments (2)
Here's a simple piece of code which reproduces the issue:
from transformers import AutoModelForCausalLM, AutoTokenizer
import threading
import tensor_parallel as tp
import torch

checkpoint = "bigcode/starcoderbase"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, low_cpu_mem_usage=True, trust_remote_code=True
)
model = tp.tensor_parallel(model, sharded=False)

def task():
    inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to("cuda")
    outputs = model.generate(inputs)
    print(tokenizer.decode(outputs[0], clean_up_tokenization_spaces=False))

thread1 = threading.Thread(target=task)
thread2 = threading.Thread(target=task)
thread1.start()
thread2.start()
thread1.join()
thread2.join()
from tensor_parallel.
@zoubaihan sadly, tensor_parallel models can't be used concurrently, because concurrent calls break the communications.
For example, the code @linfan provided shows how this happens: two concurrent calls reach a round of communication, each trying to broadcast the first part of some tensor. They overwrite each other and leave the second part None, yet both believe the broadcast is complete, since two parts were communicated in total.
For your specific case I'd recommend finding a library that batches requests, so the hardware is used more efficiently. But the forward calls themselves must not be concurrent.
tldr: no concurrency, use locks
Related Issues (20)
- How to use trained models? HOT 3
- Support LLaMA Models, including HuggingFace-adapted variants HOT 7
- Slow inference performance for large Llama models compared to naive MP HOT 22
- How to load lora weights? HOT 13
- set distributed=True, return AttributeError: 'NoneType' object HOT 2
- cuda memory not evenly distributed between devices HOT 6
- What is the difference between this project and autotp of deepspeed? HOT 1
- Huggingface Accelerate HOT 1
- Great work! and can this work with deepspeedzero?
- Torch version requirement HOT 4
- Error in README.Md, hence not able to load model with limited memory. HOT 5
- Not work with 4bit quant HOT 6
- Support for PEFT LoRA and 4-bit quantization HOT 6
- Request to fix the content about parallelformers in README. HOT 1
- Can I parallelize just one large layer? HOT 1
- Does tensor_parallel support multi-node tensor parallel training? HOT 3
- Does tensor_parallel support data parallel and tensor parallel hybrid training?
- Question on custom models HOT 23
- TypeError when multi-thread inference using tensor_parallel HOT 1