Comments (13)
Hi @kismetro!
As for your first code snippet: as you noticed yourself, you are running all the operations on the CPU. While we try to keep CPU operators reasonably fast, optimizing them is not our priority. If you would like to see a real performance boost, please use the GPU as much as you can.
Regarding the second question:
Q. Is there anything I missed using dali-backend with tritonserver for inference?
The reason for this error is that the `tritonserver` container can't see the GPU in your system. Could you please verify that you are running `nvcr.io/nvidia/tritonserver:22.02-py3` with the `--gpus all` flag?
If you verify that the flag is there and the problem persists, please run the `tritonserver` container and check the GPU hello-world command inside:
[host] $ nvidia-smi
[host] $ docker run -it --gpus all nvcr.io/nvidia/tritonserver:22.02-py3 bash
[container] $ nvidia-smi
The `nvidia-smi` tables in the host system and the container should be roughly the same.
If all of this looks fine, please let me know and we'll figure out what to do next.
Cheers!
from dali_backend.
Hi, @szalpal!
Thanks for support.
All GPU features are enabled. Five ONNX models are deployed to the Triton model path using Azure Blob Storage; I also tested a local model path. They seem to run well using GPU resources. I just wanted to replace the preprocessing ONNX model (one of the five) with the DALI backend to enable the image decoder.
Here is my docker run command.
docker run --gpus all --rm --net=dev-net -p 8000:8000 -p 8001:8001 -p 8002:8002 -e AZURE_STORAGE_ACCOUNT=$AZURE_STORAGE_ACCOUNT -e AZURE_STORAGE_KEY=$AZURE_STORAGE_KEY --name triton_dev nvcr.io/nvidia/tritonserver:22.02-py3 tritonserver --model-repository=$AZURE_MODEL_REPO_PATH --log-verbose=0
Also, please check the results of `nvidia-smi` on the host and inside Docker. Both seem to be fine.
- tritonserver container:
- host
I'm looking forward to your next reply!
Just to cross out all the obvious and easy checks, can you confirm that you are using `device_id >= 0`, and not `device_id=None`?
One idea to increase performance of the CPU pipeline is to replace these 4 invocations:
B_norm_c0 = fn.normalize(image_B, mean=0.485, stddev=0.229)
B_norm_c1 = fn.normalize(image_B, mean=0.456, stddev=0.224)
B_norm_c2 = fn.normalize(image_B, mean=0.406, stddev=0.225)
image_B2 = fn.stack( B_norm_c0, B_norm_c1, B_norm_c2 )
with the `fn.crop_mirror_normalize` operator. You can try it out, though I'm not sure it will match the onnxruntime performance.
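For reference, the arithmetic that `fn.crop_mirror_normalize` fuses into one operator can be checked outside DALI. Below is a minimal NumPy sketch (not DALI code; the toy image and shapes are made up, and it assumes the intent of the original snippet was per-channel normalization) showing that per-channel subtract/divide plus a stack is the same math as one fused normalize with an HWC-to-CHW transpose:

```python
import numpy as np

# Toy HWC float image in [0, 1], a stand-in for the decoded DALI tensor.
rng = np.random.default_rng(0)
img = rng.random((4, 4, 3)).astype(np.float32)

mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

# Channel-by-channel normalize followed by a stack (the shape of the
# original four-operator pipeline), producing CHW output:
stacked = np.stack([(img[..., c] - mean[c]) / std[c] for c in range(3)])

# One fused normalize plus transpose, the arithmetic that
# crop_mirror_normalize performs in a single operator:
fused = ((img - mean) / std).transpose(2, 0, 1)

assert np.allclose(stacked, fused)  # identical results, one op instead of four
```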
@szalpal !
Thanks for the next guide!
I spent a little time verifying that the values are equivalent. Here is my updated code based on your comment.
It also seems to be working well.
@pipeline_def(batch_size=max_batch_size, num_threads=4, device_id=0)
def pipe_dali_all_enc():
    input = fn.external_source(source="cpu", name="input_8n", dtype=types.UINT8, parallel=True)
    input = fn.decoders.image(input, output_type=types.DALIImageType.RGB) / 255.0  # HWC layout
    image_B = fn.resize(input, size=[640, 360], image_type=types.DALIImageType.RGB)
    image_B = fn.crop_mirror_normalize(image_B, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225],
                                       output_layout="CHW")
    input_A = fn.normalize(input[:, :, 0], mean=0.5, stddev=1/0.225)
    input_A = fn.expand_dims(input_A, axes=(0))
    return image_B, input_A
However, I'm still facing difficulties with the GPU-enabled path.
I tried setting device_id to 0, 1, and None in pipeline_def.
The error below appears when device_id=0 is set and decoders.image uses device="mixed":
model_repository_manager.cc:1152] failed to load 'preprocess_dali_all_enc' version 1:
Unknown: DALI Backend error: [/opt/dali/dali/pipeline/pipeline.cc:276]
Assert on "device_id_ != CPU_ONLY_DEVICE_ID || device == "cpu"" failed:
Cannot add a mixed operator decoders__Image with a GPU output, device_id should not be CPU_ONLY_DEVICE_ID.
I hope your next suggestion lets me use GPU performance on dali_backend!
Thanks!
I tried to reproduce the error with the information you provided, and unfortunately I wasn't able to. Below I'm attaching two more ideas which you can test to see if they resolve your problem:
- Could you verify that the serialized DALI model in fact has the `mixed` image decoder? E.g.:
/tmp/gh_issue_121/model_repository/dali_all_enc/1$ ls
model.dali
/tmp/gh_issue_121/model_repository/dali_all_enc/1$ ack mixed model.dali
devicestring*mixed@* __Image_10*�
- I'm attaching all the code I put together while trying to reproduce your error. Could you please check the code below against what you have and look for any discrepancies? If both (yours and mine) are the same, would you be able to provide a simple repro, so I can investigate the problem further? Thanks in advance!
all_enc.py
import nvidia.dali as dali
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types
max_batch_size=256
filename="/tmp/gh_issue_121/model_repository/dali_all_enc/1/model.dali"
@pipeline_def(batch_size=max_batch_size, num_threads=4, device_id=0)
def pipe_dali_all_enc():
    input = fn.external_source(source="cpu", name="input_8n", dtype=types.UINT8, parallel=True)
    input = fn.decoders.image(input, device="mixed", output_type=types.DALIImageType.RGB) / 255.0  # HWC layout
    image_B = fn.resize(input, size=[640, 360], image_type=types.DALIImageType.RGB)
    image_B = fn.crop_mirror_normalize(image_B, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225],
                                       output_layout="CHW")
    input_A = fn.normalize(input[:, :, 0], mean=0.5, stddev=1/0.225)
    input_A = fn.expand_dims(input_A, axes=(0))
    return image_B, input_A
pipe_dali_all_enc().serialize(filename=filename)
/tmp/gh_issue_121/model_repository/dali_all_enc/config.pbtxt
name: "dali_all_enc"
backend: "dali"
max_batch_size: 256
input [
  {
    name: "input_8n"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "DALI_OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  },
  {
    name: "DALI_OUTPUT_1"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  }
]
$ mkdir -p /tmp/gh_issue_121/model_repository/dali_all_enc/1
$ python all_enc.py
$ MODEL_REPO=/tmp/gh_issue_121/model_repository && docker run -it --gpus all --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 --privileged -v $MODEL_REPO:/models nvcr.io/nvidia/tritonserver:22.02-py3 tritonserver --model-repository=/models
[...everything works here...]
I0318 08:46:16.834130 1 server.cc:592]
+--------------+---------+--------+
| Model | Version | Status |
+--------------+---------+--------+
| dali_all_enc | 1 | READY |
+--------------+---------+--------+
@szalpal ,
All your samples worked well on my GPU server. It turned out that mistakes in my config were causing the errors.
Now the preprocessor reaches much higher performance. Do you have any additional advice for the code below?
@pipeline_def(batch_size=max_batch_size, num_threads=4, device_id=0)  # , debug=True)
def pipe_dali_all_enc():
    input = fn.external_source(source="cpu", name="input_8n", dtype=types.UINT8, parallel=True)
    input = fn.decoders.image(input, device="mixed", output_type=types.DALIImageType.RGB) / 255.0  # HWC layout
    image_B = fn.resize(input, size=[640, 360], image_type=types.DALIImageType.RGB).gpu()
    image_B = fn.crop_mirror_normalize(image_B, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], output_layout="CHW").gpu()
    input_A = fn.normalize(input[:, :, 0], mean=0.5, stddev=1/0.225).gpu()
    input_A = fn.expand_dims(input_A, axes=(0)).gpu()
    return input_A.gpu(), image_B.gpu()
Thanks for all your support!
But there are some unexpected situations, as below.
Q1. Dimension mismatch
: I transmitted a PNG image of around 30KB using the Triton SDK in Python and C++. Its dimension was (300000,), which worked well in CPU mode, but showed critical batch-dimension errors in GPU mode. After I changed the dimension in my code from (300000,) to (1, 300000) to verify the values, the output also gained one more dimension. I still wonder why the execution differs.
Please advise on this situation.
Q2. Model warm-up
: The warm-up config below caused an unsupported-format error in the decoders.image function when tritonserver loads the models. Is there any trick to enable warm-up for the decoders.image function?
model_warmup [
  {
    name: "warming_up"
    batch_size: 1
    inputs: {
      key: "input_8n"
      value: {
        data_type: TYPE_UINT8
        dims: [ 8 ]
        zero_data: true
      }
    }
  }
]
I'm glad I could help! A few remarks on the code above:
- You don't need to put `.gpu()` after every operator. In fact, you don't need it anywhere if you set `device="mixed"` in the image decoder.
- Consider tweaking the `num_threads` variable. In your pipeline, this impacts the CPU stage of the GPU image decoder (yes, the image decoder does some work on the CPU). The intuition is that one thread processes one sample in a batch, so if you have `batch_size=2`, it's unnecessary to have `num_threads=4`. But if you have `batch_size=256`, you can try to bump up the number of CPU threads, depending on your system's capabilities.
Dimension mismatch
Can you paste the error? It's hard to work out the problem from your description.
Model warmup
In DALI, the `fn.decoders.image` operator assumes that the input is a real sample. It can't be random and it can't be zeros; it needs to be a real image. Therefore you need to warm up the model using real data. Please refer to the input_data_file doc for how to do it.
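Since the decoder needs real encoded bytes, one way to produce a warmup sample without shipping an image asset is to generate a tiny but valid PNG programmatically and point the warmup config at it via `input_data_file` instead of `zero_data`. This is a hedged sketch: the `raw_image_data` filename and its placement under the model's `warmup/` subdirectory are assumptions based on the Triton warmup docs, not something confirmed in this thread.

```python
import struct
import zlib

def png_chunk(tag: bytes, data: bytes) -> bytes:
    """Length + tag + data + CRC32, as the PNG spec requires."""
    return (struct.pack(">I", len(data)) + tag + data
            + struct.pack(">I", zlib.crc32(tag + data) & 0xFFFFFFFF))

def minimal_png(width: int = 16, height: int = 16) -> bytes:
    """Build a tiny valid 8-bit RGB PNG (solid gray)."""
    # Each scanline starts with filter byte 0 (no filtering).
    raw = b"".join(b"\x00" + b"\x80" * (width * 3) for _ in range(height))
    # Width, height, bit depth 8, color type 2 (RGB), default compression/filter/interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (b"\x89PNG\r\n\x1a\n"
            + png_chunk(b"IHDR", ihdr)
            + png_chunk(b"IDAT", zlib.compress(raw))
            + png_chunk(b"IEND", b""))

# input_data_file contents are read as the raw bytes of the input tensor,
# so the file holds the encoded image itself (hypothetical file name):
with open("raw_image_data", "wb") as f:  # place under <model_dir>/warmup/
    f.write(minimal_png())
```

In the warmup config above, `zero_data: true` would then be replaced by `input_data_file: "raw_image_data"`, and `dims` set to the byte length of the file.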
@szalpal ,
Regarding `num_threads`: my current test client sends a single image, but our ensemble model with dynamic batching will be deployed to Azure for a mass-production service. I expect that both `num_threads` and dynamic batching will affect performance.
Its performance now almost matches the onnxruntime preprocessor that we built last year.
Dimension mismatch
In CPU mode, I just transmit the PNG image as (300000,). It worked well.
In GPU mode with the same input, a batch-size error was shown. I don't remember the exact logs, but the message was similar to `batch_size<=input`. So I changed the input shape from (300000,) to (1, 300000), and then it works fine.
But I had to change the dimensions in my client code to parse the output results.
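The reshaping described above can be sketched client-side like this (a NumPy stand-in with made-up sizes, not the actual Triton client code):

```python
import numpy as np

# Illustrative stand-in for the encoded PNG bytes from the thread
# (the real payload was about 300000 bytes).
encoded = np.zeros(300, dtype=np.uint8)

# Batched (GPU) mode wants an explicit leading batch dimension:
batched = encoded[np.newaxis, :]  # shape (1, 300) instead of (300,)
assert batched.shape == (1, 300)

# Symmetrically, each server output gains a batch dimension that the
# client strips before parsing (hypothetical CHW output shape):
out = np.ones((1, 3, 360, 640), dtype=np.float32)
unbatched = np.squeeze(out, axis=0)  # shape (3, 360, 640)
```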
Model warmup
That is what I wanted! Thanks!!
So, do you still face any problem with the dimension mismatch, or is everything working fine for you?
@szalpal ,
After the code modifications, it works on the GPU.
But the same code cannot run in CPU mode because of a dimension mismatch.
That's why I asked whether this is normal or not.
I believe it is not expected behaviour; however, I was not able to reproduce it on my end. Would you be able to provide a simple repro? I'd appreciate it, as this way we can resolve any bug that is there.
@szalpal ,
There were many code changes on the client side (C++ SDK 2.7.0 and Python SDK 2.19.0).
I have tried to reproduce the situation by detaching parts of the whole code, but I couldn't.
However, what is clear is that the error was raised by the Triton server, not the client.
I promise to check it from time to time and will report it as a new ticket.
Thanks for your great and kind support.
Thank you @kismetro , please close this issue whenever you see fit.