Comments (7)
@fversaci
Hey. This approach should be easier to achieve. We support a similar scenario with the video input (a single input file results in multiple output batches). This will require using the decoupled model (docs). Let me experiment a bit to see what needs to be adjusted to make it work in this case.
from dali_backend.
Hi all,
I wanted to provide an update on our use case. Since there is currently no general prefeeding available for the DALI backend in Triton, we have implemented internal prefetching in our plugin. We take the original batch we receive from Triton (e.g., bs=4096), split it into mini-batches (e.g., bs=256), and apply prefetching to these mini-batches.
If anyone is experiencing a similar issue, our code is available in this repository.
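To illustrate the idea, here is a minimal, self-contained sketch of this kind of internal prefetching (the names `split_batch`, `run_with_prefetch`, and the `fetch` callback are illustrative, not the plugin's actual API): the large batch from Triton is split into mini-batches, and a fixed number of fetches is kept in flight while results are consumed in order.

```python
from concurrent.futures import ThreadPoolExecutor

def split_batch(batch, mini_bs):
    """Split one large Triton batch into mini-batches of size mini_bs."""
    return [batch[i:i + mini_bs] for i in range(0, len(batch), mini_bs)]

def run_with_prefetch(batch, mini_bs=256, depth=2, fetch=lambda mb: mb):
    """Process mini-batches while keeping up to `depth` fetches in flight."""
    minis = split_batch(batch, mini_bs)
    results = []
    with ThreadPoolExecutor(max_workers=depth) as pool:
        # start the first `depth` fetches up front
        futures = [pool.submit(fetch, mb) for mb in minis[:depth]]
        for i in range(len(minis)):
            # consume the oldest in-flight mini-batch...
            results.append(futures.pop(0).result())
            # ...and immediately enqueue the next one, if any remain
            nxt = i + depth
            if nxt < len(minis):
                futures.append(pool.submit(fetch, minis[nxt]))
    return results
```

With a batch of 4096 UUIDs and `mini_bs=256`, this yields 16 mini-batches, with the fetch of mini-batch N+1 overlapping the processing of mini-batch N.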
@fversaci - very good work. Thank you for sharing your results.
The code is now in the dev branch, along with some (minimal) documentation:
https://github.com/fversaci/cassandra-dali-plugin/tree/dev/examples/triton
Hey @fversaci
Unfortunately, there is currently no way of prefeeding data to inputs in the DALI backend. Internally, we assume that we don't process upcoming requests until we have sent responses for all the previous ones.
We can lift that limitation, and we would like to do so, because it might improve performance in various scenarios. However, this will require a significant rework of the backend, so it's hard to predict when we will be able to tackle it.
If you haven't experimented with this already, you might want to check the performance when you increase the number of model instances (docs). Maybe higher parallelism would help to hide the cost of fetching the data.
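For reference, the number of model instances is set via the `instance_group` field in the model's `config.pbtxt`; a minimal sketch (the `count` value is just an example to tune):

```
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```

Each instance can process a different request in parallel, which may help hide data-fetching latency at the cost of extra GPU memory.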
Hi @banasraf,
Thank you for the information (and your availability in general). I will definitely try increasing the number of model instances to see how it improves the throughput.
Regarding the issue of prefeeding Triton-DALI pipelines, I have been considering a temporary workaround while prefeeding is still not possible. We could provide a mega-batch (e.g., 1024 UUIDs) to the pipeline, and our module could then split it into mini-batches (e.g., 8 mini-batches of size 128) and handle the prefeeding internally.
However, our current code implementing this approach is not functioning properly, since Triton expects to receive a single batch of the same size as the input batch:
E1026 13:09:54.096342 959 dali_model_instance.cc:40] Cannot split a shape list with 128 samples to list shapes of total 1024 samples.
Do you think this issue is easier to fix compared to the general prefeeding problem? In other words, can Triton-DALI handle multi-part answers to queries?
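For context, multi-response (decoupled) execution, where a model may send zero, one, or many responses per request, is enabled in Triton through the model transaction policy in `config.pbtxt`:

```
model_transaction_policy {
  decoupled: true
}
```

Note that decoupled models require a client that can receive multiple responses, such as the gRPC streaming API.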
To see or test our code:

```shell
git clone https://github.com/fversaci/cassandra-dali-plugin.git -b triton
cd cassandra-dali-plugin
docker build -t cassandra-dali-plugin -f Dockerfile.triton .  # this might take some time
docker run --cap-add=sys_admin --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name cass-dali cassandra-dali-plugin
# within the container
./start-and-fill-db.sh
./start-triton.sh  # don't close the container
# in a new shell on the host
docker exec -ti cass-dali fish
# within the container
python3 client-triton.py
```
Hi @banasraf,
Do you have any updates on adapting the decoupled model to our specific use case?
Meanwhile, I have modified our code so that:
- It now has three client implementations to play with: client-http-triton.py, client-grpc-triton.py, and client-grpc-stream-triton.py.
- The model produces a reduced output instead of the full tensors. This means that the bottleneck during testing is no longer in the Python clients, but rather in the Triton server pipeline. As a result, the throughput is much higher than before.
- I set the default max_batch_size in models/dali_cassandra/config.pbtxt to 256, which matches the batch size offered by the clients. When changing max_batch_size to, e.g., 512, the CassandraTriton plugin automatically splits the large batches into smaller ones, which produces this error: "Cannot split a shape list with 256 samples to list shapes of total 512 samples."
- The plugin now logs the input size of each batch it receives and the current status of its internal prefetching mechanism.
Thanks!