Comments (4)
Thanks for your questions!
- Correct.
- The max batch size for each model instance will be 8.
- It is for demonstration purposes only. It is mainly meant to show that you can send responses after returning from the `execute` function. It could be a workaround to the memory usage issue you brought up in (1) (i.e. you can have a single model instance with multiple threads inside to avoid extra memory usage).
- That's correct.
- The warning is mainly about not accepting too many requests and spawning a lot of threads (i.e. having a limit on the number of requests that you want to accept). I think it is not required to set the `daemon` flag. You don't need to join the threads before returning `None`. As shown in the example, the threads continue to send responses after returning `None` from the `execute` function (see the sketch after this list).
- I think it is hard to give a general recommendation for your scenario since it depends on a lot of parameters. It would be good to experiment with these different options to see which one performs best for your model. Would love to learn more about which route worked best for you.
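For illustration, here is a minimal sketch of the pattern from (3) and (5), assuming a decoupled model configuration (`model_transaction_policy { decoupled: true }` in `config.pbtxt`); the tensor names `IN`/`OUT` and the response payload are placeholders, not from your model:

```python
import threading

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            # In decoupled mode each request carries its own response sender.
            sender = request.get_response_sender()
            threading.Thread(target=self._respond, args=(request, sender)).start()
        # Returning None tells Triton that responses will arrive later,
        # from the worker threads, via the response senders.
        return None

    def _respond(self, request, sender):
        # Placeholder work: echo the input tensor back as the output.
        in_tensor = pb_utils.get_input_tensor_by_name(request, "IN")
        out_tensor = pb_utils.Tensor("OUT", in_tensor.as_numpy().astype(np.float32))
        sender.send(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        # Close the response stream for this request.
        sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
```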
Let us know if you have any additional questions!
Hi @Tabrizian, thank you very much for your detailed reply and insights! I have a few follow-up questions:
- Regarding question 5, if I don't join the threads before returning `None` in `execute()`, then in order to prevent too many requests from being accepted and too many threads from being spawned, I would have to keep track of the number of requests that are still ongoing / the number of threads that are still in flight, is that correct? Is the `self.inflight_thread_count` member variable in the `square_model.py` file (https://github.com/triton-inference-server/python_backend/blob/main/examples/decoupled/square_model.py) a way to keep track of this? (A condensed paraphrase of that example follows this list.)
- In general, regarding the amount of memory a model needs during inference, the memory used is not just the size of the weights but also the size of the intermediate layer outputs/activations, right? In other words, do I have to leave some additional space beyond the raw weight memory for the intermediate activations during inference?
Thanks a lot for your time and help again!
- Correct. You can also probably use semaphores so that you block until you have acquired enough resources: https://docs.python.org/3/library/threading.html#semaphore-objects
- Yes, it would be good to run your model through Model Analyzer to better understand the memory requirements.
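As a rough sketch of the semaphore approach (the cap `MAX_INFLIGHT` is an arbitrary placeholder and the response-sending details are elided):

```python
import threading

import triton_python_backend_utils as pb_utils

MAX_INFLIGHT = 16  # placeholder limit on concurrently processed requests


class TritonPythonModel:
    def initialize(self, args):
        # execute() blocks once MAX_INFLIGHT requests are being processed.
        self.slots = threading.BoundedSemaphore(MAX_INFLIGHT)

    def execute(self, requests):
        for request in requests:
            self.slots.acquire()  # wait here if too many threads are in flight
            threading.Thread(target=self._respond, args=(request,)).start()
        return None

    def _respond(self, request):
        try:
            sender = request.get_response_sender()
            # ... build and send responses for this request ...
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        finally:
            self.slots.release()  # free the slot once this request is done
```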
I see, thank you very much for your answers and insights!