Comments (13)

szalpal avatar szalpal commented on July 4, 2024

Hi @kismetro!

As for your first code snippet, as you noticed yourself, you are conducting all the operations on the CPU. While we try to keep the CPU operators reasonably performant, optimizing them is not our priority. If you would like to get a real performance boost, please use the GPU as much as you can.
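
To make this concrete, here is a minimal sketch (not from the original exchange; names and sizes are illustrative) of what "using the GPU" means in a DALI pipeline: decoding with device="mixed" places the decoded images in GPU memory, and every operator that consumes them then runs on the GPU.

import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=4, num_threads=4, device_id=0)
def gpu_preprocessing_pipe():
    # Encoded bytes arrive on the CPU (with dali_backend they are fed by Triton).
    encoded = fn.external_source(device="cpu", name="input_8n", dtype=types.UINT8)
    # "mixed" decodes partly on the CPU, partly on the GPU, and outputs GPU tensors.
    images = fn.decoders.image(encoded, device="mixed", output_type=types.RGB)
    # Operators that receive GPU data execute on the GPU automatically.
    images = fn.resize(images, size=[640, 360])
    return images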

Regarding the second question:

Q. Is there anything I missed using dali-backend with tritonserver for inference?

The reason for this error is that the tritonserver container can't see the GPU in your system. Could you please verify that you are running nvcr.io/nvidia/tritonserver:22.02-py3 with the --gpus all flag?

If the flag is there and the problem persists, please run the tritonserver container and check the GPU hello-world command inside:

[host] $ nvidia-smi
[host] $ docker run -it --gpus all nvcr.io/nvidia/tritonserver:22.02-py3 bash
[container] $ nvidia-smi

The nvidia-smi tables in the host system and the container should be roughly the same.

If all of this seems fine, please let me know and we'll figure out what to do next.

Cheers!


kismetro avatar kismetro commented on July 4, 2024

Hi, @szalpal!

Thanks for the support.
All GPU features are enabled. Five ONNX models are deployed to the Triton model repository via Azure Blob Storage; a local model path was tested as well. Everything seems to run fine on the GPU. I just wanted to replace the preprocessing ONNX model (one of the five) with a dali_backend model to enable the image decoder.
Here is my docker run command.

docker run --gpus all --rm --net=dev-net  -p 8000:8000 -p 8001:8001 -p 8002:8002 -e AZURE_STORAGE_ACCOUNT=$AZURE_STORAGE_ACCOUNT -e AZURE_STORAGE_KEY=$AZURE_STORAGE_KEY --name triton_dev nvcr.io/nvidia/tritonserver:22.02-py3 tritonserver --model-repository=$AZURE_MODEL_REPO_PATH --log-verbose=0

Also, please check the results of nvidia-smi on the host and inside the container. Both seem to be fine.

  • tritonserver container: [nvidia-smi output screenshot]

  • host: [nvidia-smi output screenshot]

I'm looking forward to your next reply!


szalpal avatar szalpal commented on July 4, 2024

@kismetro

Just to rule out the obvious and easy checks, can you confirm that you are using device_id >= 0 and not device_id=None?

One idea to increase performance of the CPU pipeline is to replace these 4 invocations:

    B_norm_c0 = fn.normalize(image_B, mean=0.485, stddev=0.229)
    B_norm_c1 = fn.normalize(image_B, mean=0.456, stddev=0.224)
    B_norm_c2 = fn.normalize(image_B, mean=0.406, stddev=0.225)
    image_B2 = fn.stack( B_norm_c0, B_norm_c1, B_norm_c2 )

with the fn.crop_mirror_normalize operator. You can try it out, although I doubt it will match onnxruntime performance.


kismetro avatar kismetro commented on July 4, 2024

@szalpal !

Thanks for the further guidance!
I spent a little time verifying that the values are equivalent. Here is my updated code based on your comment.
It also seems to be working well.

@pipeline_def(batch_size=max_batch_size, num_threads=4, device_id=0)
def pipe_dali_all_enc():
    input = fn.external_source(source="cpu", name="input_8n", dtype=types.UINT8, parallel=True)
    input = fn.decoders.image(input, output_type=types.DALIImageType.RGB) / 255.0 # HWC layout
  
    image_B = fn.resize(input, size=[640, 360], image_type=types.DALIImageType.RGB)
    image_B = fn.crop_mirror_normalize(image_B, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], \
output_layout="CHW")
    
    input_A = fn.normalize(input[:,:,0], mean=0.5, stddev=1/0.225)
    input_A = fn.expand_dims(input_A, axes=(0))

    return image_B, input_A

However, I'm still facing difficulties with the GPU-enabled version.
I tried setting device_id in pipeline_def to 0, 1, and None.
Here is the error reported when device_id=0 is set together with device="mixed" in fn.decoders.image:

model_repository_manager.cc:1152] failed to load 'preprocess_dali_all_enc' version 1: 
Unknown: DALI Backend error: [/opt/dali/dali/pipeline/pipeline.cc:276] 
Assert on "device_id_ != CPU_ONLY_DEVICE_ID || device == "cpu"" failed: 
Cannot add a mixed operator decoders__Image with a GPU output, device_id should not be CPU_ONLY_DEVICE_ID.

I'm hoping for further help with getting GPU performance out of dali_backend!

Thanks!


szalpal avatar szalpal commented on July 4, 2024

@kismetro ,

I tried to reproduce the error with the information you provided, but unfortunately I wasn't able to. Below are two more ideas you can test to see whether they resolve your problem:

  1. Could you verify that the serialized DALI model in fact contains the mixed image decoder? E.g.:
/tmp/gh_issue_121/model_repository/dali_all_enc/1$ ls
model.dali
/tmp/gh_issue_121/model_repository/dali_all_enc/1$ ack mixed model.dali
devicestring*mixed@*	__Image_10*
  2. I'm attaching all the code I put together to reproduce your error. Could you please check the code below against what you have and look for any discrepancies? If both (yours and mine) are the same, would you be able to provide a simple repro, so I can investigate the problem further? Thanks in advance!

all_enc.py

import nvidia.dali as dali
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

max_batch_size=256
filename="/tmp/gh_issue_121/model_repository/dali_all_enc/1/model.dali"

@pipeline_def(batch_size=max_batch_size, num_threads=4, device_id=0)
def pipe_dali_all_enc():
    input = fn.external_source(source="cpu", name="input_8n", dtype=types.UINT8, parallel=True)
    input = fn.decoders.image(input, device="mixed", output_type=types.DALIImageType.RGB) / 255.0 # HWC layout

    image_B = fn.resize(input, size=[640, 360], image_type=types.DALIImageType.RGB)
    image_B = fn.crop_mirror_normalize(image_B, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], \
output_layout="CHW")

    input_A = fn.normalize(input[:,:,0], mean=0.5, stddev=1/0.225)
    input_A = fn.expand_dims(input_A, axes=(0))

    return image_B, input_A

pipe_dali_all_enc().serialize(filename=filename)

/tmp/gh_issue_121/model_repository/dali_all_enc/config.pbtxt

name: "dali_all_enc"
backend: "dali"
max_batch_size: 256
input [
  {
    name: "input_8n"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]

output [
  {
    name: "DALI_OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ -1,-1,-1 ]
  },
  {
    name: "DALI_OUTPUT_1"
    data_type: TYPE_FP32
    dims: [ -1,-1,-1 ]
  }
]
$ mkdir -p /tmp/gh_issue_121/model_repository/dali_all_enc/1
$ python all_enc.py
$ MODEL_REPO=/tmp/gh_issue_121/model_repository && docker run -it --gpus all --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 --privileged -v $MODEL_REPO:/models nvcr.io/nvidia/tritonserver:22.02-py3 tritonserver --model-repository=/models

[...everything works here...]
I0318 08:46:16.834130 1 server.cc:592]
+--------------+---------+--------+
| Model        | Version | Status |
+--------------+---------+--------+
| dali_all_enc | 1       | READY  |
+--------------+---------+--------+


kismetro avatar kismetro commented on July 4, 2024

@szalpal ,

Your whole sample worked well on my GPU server. It turned out the errors were caused by mistakes in my config.
The preprocessor now achieves much better performance. Do you have any additional advice on the code below?

@pipeline_def(batch_size=max_batch_size, num_threads=4, device_id=0)#, debug=True)
def pipe_dali_all_enc():
    input = fn.external_source(source="cpu", name="input_8n", dtype=types.UINT8, parallel=True)
    input = fn.decoders.image(input, device="mixed", output_type=types.DALIImageType.RGB) / 255.0 # HWC layout
  
    image_B = fn.resize(input, size=[640, 360], image_type=types.DALIImageType.RGB).gpu()
    image_B = fn.crop_mirror_normalize(image_B, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], output_layout="CHW").gpu()
    
    input_A = fn.normalize(input[:,:,0], mean=0.5, stddev=1/0.225).gpu()
    input_A = fn.expand_dims(input_A, axes=(0)).gpu()
    return input_A.gpu(), image_B.gpu()

Thanks for all your support!

However, there are a couple of unexpected situations, described below.

Q1. Dimension mismatch

I transmitted a PNG image of around 30 KB using the Triton SDK (Python and C++). Its shape was (300000,), which worked well in CPU mode, but in GPU mode it raised critical batch-dimension errors. After I changed the shape in my code from (300000,) to (1, 300000) to verify the values, the output also gained one extra dimension. I still wonder why the execution differs between the two modes.

Please advise on this situation.

Q2. Model warm-up

The warm-up config below caused an unsupported-format error in fn.decoders.image when tritonserver loaded the models. Is there any trick to enable warm-up for fn.decoders.image?

model_warmup [
  {
    name: "warming_up"
    batch_size: 1
    inputs: {
      key: "input_8n"
      value: {
        data_type: TYPE_UINT8
        dims: [ 8 ]
        zero_data: true
      }
    }
  }
]


szalpal avatar szalpal commented on July 4, 2024

@kismetro ,

I'm glad I could help! A few remarks on the code above:

  1. You don't need to put .gpu() after every operator. In fact, you don't need it anywhere if you use device="mixed" in the image decoder (see the sketch after this list).
  2. Consider tweaking the num_threads variable. In your pipeline, this affects the CPU stage of the GPU image decoder (yes, the image decoder does some work on the CPU). The intuition is that one thread processes one sample in a batch, so if you have batch_size=2, it's unnecessary to have num_threads=4. But if you have batch_size=256, you can try bumping up the number of CPU threads, depending on your system's capabilities.
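
For illustration, here is a sketch of the pipeline from the previous comment with the redundant .gpu() calls removed (this simply mirrors the earlier snippets and is not a verified drop-in; the operators already run on the GPU because the mixed decoder outputs GPU data):

@pipeline_def(batch_size=max_batch_size, num_threads=4, device_id=0)
def pipe_dali_all_enc():
    # device="cpu" keeps the encoded input on the host; the mixed decoder moves it to the GPU.
    input = fn.external_source(device="cpu", name="input_8n", dtype=types.UINT8, parallel=True)
    input = fn.decoders.image(input, device="mixed", output_type=types.DALIImageType.RGB) / 255.0  # HWC layout

    image_B = fn.resize(input, size=[640, 360])
    image_B = fn.crop_mirror_normalize(image_B, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], output_layout="CHW")

    input_A = fn.normalize(input[:, :, 0], mean=0.5, stddev=1/0.225)
    input_A = fn.expand_dims(input_A, axes=0)

    return input_A, image_B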

Dimension mismatch

Can you paste the error? It's hard to tell what the problem is from your description.

Model warmup

In DALI, the fn.decoders.image operator assumes that the input is a real encoded image. It can't be random data and it can't be zeros; it has to be a real image. Therefore you need to warm up the model using real data. Please refer to the input_data_file documentation on how to do that.
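
For reference, a warm-up entry along these lines should work (a sketch only: the file name raw_png_data and the dims are illustrative; Triton reads the file from a warmup/ sub-directory of the model directory):

model_warmup [
  {
    name: "warming_up"
    batch_size: 1
    inputs: {
      key: "input_8n"
      value: {
        data_type: TYPE_UINT8
        dims: [ 300000 ]
        input_data_file: "raw_png_data"
      }
    }
  }
]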


kismetro avatar kismetro commented on July 4, 2024

@szalpal ,

Regarding num_threads: my current test client sends a single image, but our ensemble model with dynamic batching will be deployed to Azure for a production service. I expect that num_threads and dynamic batching will both affect performance.

The performance is now almost equivalent to the onnxruntime preprocessor that was built last year.

Dimension mismatch
In CPU mode, I just transmit the PNG image as (300000,) and it works well.
In GPU mode with the same input, a batch-size error was shown. I don't remember the exact logs, but the message was something like batch_size <= input. So I changed the input shape from (300000,) to (1, 300000), and then it works fine.
However, I had to change the dimensions in my code to parse the output results.
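
For illustration, a hypothetical client-side snippet (not from the original exchange; tensor and model names follow the earlier snippets) that sends the encoded bytes with an explicit leading batch dimension:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Read the raw, still-encoded PNG bytes.
with open("sample.png", "rb") as f:
    encoded = np.frombuffer(f.read(), dtype=np.uint8)

# Add a leading batch dimension of 1: (N,) -> (1, N).
batched = encoded.reshape(1, -1)

inp = httpclient.InferInput("input_8n", list(batched.shape), "UINT8")
inp.set_data_from_numpy(batched)

result = client.infer(model_name="dali_all_enc", inputs=[inp])
out = result.as_numpy("DALI_OUTPUT_0")  # the output then also carries the batch dimension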

Model warmup
That is what I wanted! Thanks!!


szalpal avatar szalpal commented on July 4, 2024

@kismetro

So, are you still facing a problem with the dimension mismatch, or is everything working fine for you now?


kismetro avatar kismetro commented on July 4, 2024

@szalpal ,

After the code modifications, it works on GPU.
But the same code cannot run in CPU mode because of the dimension mismatch.
That's why I asked whether this is normal or not.


szalpal avatar szalpal commented on July 4, 2024

@kismetro ,

I believe this is not expected behaviour; however, I was not able to reproduce it on my end. Would you be able to provide a simple repro? I'd appreciate it, as that way we can fix whatever bug is there.


kismetro avatar kismetro commented on July 4, 2024

@szalpal ,

There were many code changes on the client side (C++ SDK 2.7.0 and Python SDK 2.19.0).
I have tried to reproduce the situation by extracting parts of the whole code, but I couldn't.
What is clear, however, is that the error was raised by the Triton server, not by the client.

I will keep checking it from time to time and will report it as a new ticket.

Thanks for your great and kind support.


szalpal avatar szalpal commented on July 4, 2024

Thank you @kismetro , please close this issue whenever you see fit.

