triton-inference-server / server Goto Github PK

View Code? Open in Web Editor NEW

7.4K 7.4K 1.4K 33.27 MB

The Triton Inference Server provides an optimized cloud and edge inferencing solution.

Home Page: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

License: BSD 3-Clause "New" or "Revised" License

Python 51.28% Shell 24.49% C++ 19.88% Smarty 0.42% CMake 1.49% Roff 0.97% Dockerfile 0.03% Java 1.43%

cloud datacenter deep-learning edge gpu inference machine-learning

server's Introduction

Triton Inference Server

📣 Triton Meetup at the NVIDIA Headquarters on April 30th 3:00 - 6:30 pm

We are excited to announce that we will be hosting our Triton user meetup at the NVIDIA Headquarters on April 30th 3:00 - 6:30 pm. Join us for this exclusive event where you will learn about the newest Triton features, get a glimpse into the roadmap, and connect with fellow users and the NVIDIA Triton engineering and product teams. Seating is limited and registration confirmation is required to attend - please register here to join the meetup. We can’t wait to welcome you and share what’s next for the Triton Inference Server.

Warning

LATEST RELEASE

You are currently on the main branch which tracks under-development progress towards the next release. The current release is version 2.44.0 and corresponds to the 24.03 container release on NVIDIA GPU Cloud (NGC).

Triton Inference Server is an open source inference serving software that streamlines AI inferencing. Triton enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton Inference Server supports inference across cloud, data center, edge and embedded devices on NVIDIA GPUs, x86 and ARM CPU, or AWS Inferentia. Triton Inference Server delivers optimized performance for many query types, including real time, batched, ensembles and audio/video streaming. Triton inference Server is part of NVIDIA AI Enterprise, a software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.

Major features include:

Supports multiple deep learning frameworks
Supports multiple machine learning frameworks
Concurrent model execution
Dynamic batching
Sequence batching and implicit state management for stateful models
Provides Backend API that allows adding custom backends and pre/post processing operations
Supports writing custom backends in python, a.k.a. Python-based backends.
Model pipelines using Ensembling or Business Logic Scripting (BLS)
HTTP/REST and GRPC inference protocols based on the community developed KServe protocol
A C API and Java API allow Triton to link directly into your application for edge and other in-process use cases
Metrics indicating GPU utilization, server throughput, server latency, and more

New to Triton Inference Server? Make use of these tutorials to begin your Triton journey!

Join the Triton and TensorRT community and stay current on the latest product updates, bug fixes, content, best practices, and more. Need enterprise support? NVIDIA global support is available for Triton Inference Server with the NVIDIA AI Enterprise software suite.

Serve a Model in 3 Easy Steps

# Step 1: Create the example model repository
git clone -b r24.03 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh

# Step 2: Launch triton from the NGC Triton container
docker run --gpus=1 --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:24.03-py3 tritonserver --model-repository=/models

# Step 3: Sending an Inference Request
# In a separate console, launch the image_client example from the NGC Triton SDK container
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:24.03-py3-sdk
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

# Inference should return the following
Image '/workspace/images/mug.jpg':
    15.346230 (504) = COFFEE MUG
    13.224326 (968) = CUP
    10.422965 (505) = COFFEEPOT

Please read the QuickStart guide for additional information regarding this example. The quickstart guide also contains an example of how to launch Triton on CPU-only systems. New to Triton and wondering where to get started? Watch the Getting Started video.

Examples and Tutorials

Check out NVIDIA LaunchPad for free access to a set of hands-on labs with Triton Inference Server hosted on NVIDIA infrastructure.

Specific end-to-end examples for popular models, such as ResNet, BERT, and DLRM are located in the NVIDIA Deep Learning Examples page on GitHub. The NVIDIA Developer Zone contains additional documentation, presentations, and examples.

Documentation

Build and Deploy

The recommended way to build and use Triton Inference Server is with Docker images.

Install Triton Inference Server with Docker containers (Recommended)
Install Triton Inference Server without Docker containers
Build a custom Triton Inference Server Docker container
Build Triton Inference Server from source
Build Triton Inference Server for Windows 10
Examples for deploying Triton Inference Server with Kubernetes and Helm on GCP, AWS, and NVIDIA FleetCommand
Secure Deployment Considerations

Using Triton

Preparing Models for Triton Inference Server

The first step in using Triton to serve your models is to place one or more models into a model repository. Depending on the type of the model and on what Triton capabilities you want to enable for the model, you may need to create a model configuration for the model.

Add custom operations to Triton if needed by your model
Enable model pipelining with Model Ensemble and Business Logic Scripting (BLS)
Optimize your models setting scheduling and batching parameters and model instances.
Use the Model Analyzer tool to help optimize your model configuration with profiling
Learn how to explicitly manage what models are available by loading and unloading models

Configure and Use Triton Inference Server

Read the Quick Start Guide to run Triton Inference Server on both GPU and CPU
Triton supports multiple execution engines, called backends, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, and more
Not all the above backends are supported on every platform supported by Triton. Look at the Backend-Platform Support Matrix to learn which backends are supported on your target platform.
Learn how to optimize performance using the Performance Analyzer and Model Analyzer
Learn how to manage loading and unloading models in Triton
Send requests directly to Triton with the HTTP/REST JSON-based or gRPC protocols

Client Support and Examples

A Triton client application sends inference and other requests to Triton. The Python and C++ client libraries provide APIs to simplify this communication.

Review client examples for C++, Python, and Java
Configure HTTP and gRPC client options
Send input data (e.g. a jpeg image) directly to Triton in the body of an HTTP request without any additional metadata

Extend Triton

Triton Inference Server's architecture is specifically designed for modularity and flexibility

Customize Triton Inference Server container for your use case
Create custom backends in either C/C++ or Python
Create decoupled backends and models that can send multiple responses for a request or not send any responses for a request
Use a Triton repository agent to add functionality that operates when a model is loaded and unloaded, such as authentication, decryption, or conversion
Deploy Triton on Jetson and JetPack
Use Triton on AWS Inferentia

Additional Documentation

Contributing

Contributions to Triton Inference Server are more than welcome. To contribute please review the contribution guidelines. If you have a backend, client, example or similar contribution that is not modifying the core of Triton, then you should file a PR in the contrib repo.

Reporting problems, asking questions

We appreciate any feedback, questions or bug reporting regarding this project. When posting issues in GitHub, follow the process outlined in the Stack Overflow document. Ensure posted examples are:

minimal – use as little code as possible that still produces the same problem
complete – provide all parts needed to reproduce the problem. Check if you can strip external dependencies and still show the problem. The less time we spend on reproducing problems the more time we have to fix it
verifiable – test the code you're about to provide to make sure it reproduces the problem. Remove all other problems that are not related to your request/question.

For issues, please use the provided bug report and feature request templates.

For questions, we recommend posting in our community GitHub Discussions.

For more information

Please refer to the NVIDIA Developer Triton page for more information.

server's People

Contributors

Stargazers

Watchers

Forkers

trojanxu sara4dev yellowoldodd papamadeleine2022 mbrukman shafiahmed bpinaya milittle roy-engineering huangjun6919 horance-liu pangge seovchinnikov vitalyli changya1990 seewoo79 clhne fendaq liushuchun happog someone9388 eundoosong fangjyshanghai tidesq dcrankshaw borisfom alecgunny dl-frameworks dalavancloud renaudwastaken bobbych vilmara cloud-robotics adityadhakal kklaudiakanyvision surfswift213us igordzreyev devhliu sgodithi1 simbazad tilaba aboutian gigony hephaex tony109060581 parano david-imagr hxl1990 fdecayed gaocegege yoosful yupengzte yyuzhongpv wangpuyang canyuchen macoredroid leo-xxx pmcgraw-lucidyne wunaideyipi shadowridgedev chiaqi connectionmaster fengzifrank pcasto ankitshah009 menguangwen-cn-0411 mywulin tomholmisto xx812 lensea ntzzc unsky dselivanov baishiruyue clancylian victorpopa1 kndoye saad502 labchal ioannisparaskevopoulos alfreds-nv flazer99 monovar azizunhakim aminsoroodi hhy5277 shaunstanislauslau awesome-archive anyvisionltd biztrology-kd ezhangle zhengfangwu jobegrabber srandmose ipie2050 learn901 tripti-singhal monstercy junqiangchen chirayug-nvidia

server's Issues

New labels file is not detected

I deployed a project which contains model foder (version 1), config file and labels_1.txt for the given version. When new version of the model is introduced, TRTIS automatically detects new version. However, if labels file for the new version is changed, TRTIS does not update "label_filename" unless I restart TRTIS.

To update labels I add labels_2.txt file to the project folder and update "label_filename" in the config file. Is it possible to detect these changes without restarting TRTIS.

perf_client has limits on concurrency

I cannot max out my two GPUs using one instance of perf_client, but using multiple instances I can.

A little bit more detail:
Using perf_client with GRPC and -a, the throughput does not increase when I go beyond -t 9.
But using four instances with -t 9 simultaneously for the same model, I get much higher total throughput. This is can be verified using trtis metrics.
Is this expected behavior? I can share the precise command that I use when I am back at my desk, but I cannot share the model that I use.

image_client libopencv_highgui.so error

When trying to launch image_client (after comping with docker build, and copying the executables to /tmp on host machine, I get the following error:

./image_client: error while loading shared libraries: libopencv_highgui.so.2.4: cannot open shared object file: No such file or directory

Happy to get help.
Thanks.

How to deploy models where the shape of output tensor is not known

I have a tensorflow frozen graph of a objection detection model, i am unclear about creating a config.pbtxt file for this model since i cannot determine the output shapes before hand and i cannot start the inference server without the "dim" specified. i wanted to know how can i create a config file for this

name: "NF1"
    platform: "tensorflow_graphdef"
    max_batch_size: 16
    
    input [
      {
        name: "image_tensor"
        data_type: TYPE_UINT8
        format: FORMAT_NHWC
        dims: [ 1024, 800, 3 ]
      }
    ]
    
    output [
      {
        name: "num_detections"
        data_type: TYPE_FP32
        dims: [ 300 ]
      },

      {
        name: "detection_boxes"
        data_type: TYPE_FP32
        dims: [ 300, 4  ]
      },

      {
        name: "detection_scores"
        data_type: TYPE_FP32
        dims: [ 300 ]        
      },

      {
        name: "detection_classes"
        data_type: TYPE_FP32
        dims: [ 300 ]        
      }
    ]
    instance_group [    
      {
        gpus: [ 0 ]
      },
      {
        gpus: [ 1 ]
      },
      {
        gpus: [ 2 ]
      },
      {
        gpus: [ 3 ]
      }                  
    ]    
    dynamic_batching {
      preferred_batch_size: [ 16 ]
      max_queue_delay_microseconds: 100
    }

this my config which does not work, i tried fixing the shape to the maximum proposals ie 300. Which i knew wouldn't work

Issue with simple_string model in the examples

Hi, when trying to run trtis nvidia-docker image, I get the following error:
nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v/mnt/workspace/jo/trtis/tensorrt-inference-server/docs/examples/model_repository:/models nvcr.io/nvidia/tensorrtserver:19.01-py3 trtserver --model-store=/models

[libprotobuf ERROR external/protobuf_archive/src/google/protobuf/text_format.cc:307] Error parsing text-format nvidia.inferenceserver.ModelConfig: 9:5: Unknown enumeration value of "TYPE_STRING" for field "data_type".
E0203 11:52:45.284641 1 server.cc:574] Can't parse /models/simple_string/config.pbtxt as text proto
E0203 11:52:47.226951 1 metrics.cc:238] failed to get energy consumption for GPU 0, NVML_ERROR 3

This is after I downloaded the models with the script.

Thanks

Best practice for custom backend with additional resources

I am building a custom backend which depends on an external model. This model is serialized to disk. What is the best way to access the model from my libcustom.so? I would like to put next to the libcustom.so in the model repository. How can I read it without knowing the path to the model repository?

Spammy log: failed to get energy consumption

I'm testing TRTIS on an AWS g3s instance which has an Nvidia M60 inside.

Every two seconds the logs emit:

E0220 07:59:01.944339 1 metrics.cc:238] failed to get energy consumption for GPU 0, NVML_ERROR 3
E0220 07:59:03.955762 1 metrics.cc:238] failed to get energy consumption for GPU 0, NVML_ERROR 3
E0220 07:59:05.963320 1 metrics.cc:238] failed to get energy consumption for GPU 0, NVML_ERROR 3
E0220 07:59:07.974456 1 metrics.cc:238] failed to get energy consumption for GPU 0, NVML_ERROR 3

I assume this GPU just doesn't support that metric.

Perhaps warn just once or never rather than repeating constantly?

(I did not have this issue when using a p3 instance type which has a V100).

Multiple GPU scheduling

I have 2 GTX 1080ti available and a single servable model in my models repository. When I run tensorrtserver:18.12-py3 it does discover both gpu's and an instance of my model is created for both gpu's. However, while testing server with many requests the load is only on first gpu, the second one is idle and not used. When switched back to tensorrtserver:18.09-py3 (which I used previously) TRTIS uses both gpu resources as expected.

New model deployment is not detected

When I add a new versions to an existing project trt server updates itself. However, when new project is deployed to models folder, trt server does not detect it. I have to restart docker image to be able to send request to new project. Is there an api to refresh trt server without restarting docker image?

kInvalidBinNum error

Currently I am running tensorrtserver:19.01-py3 and I got following error message and stopped working:
E0213 17:49:43.688371 1 bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum)

What is the source of this error? Can it be avoided somewhat?
If possible, can you tell when nvcr.io/nvidia/tensorrtserver:19.02-py3 image will be available?

unexpected shape for input 'input' for model

i write a client code for my model. But i got this issue. need a help

tensorrtserver.api.InferenceServerException: [inference:0 0] unexpected shape for input 'input' for model

Does TRTIS support model parallelism?

Hi all, does TRTIS support model parallelism?, I mean, if a single model is copied in several GPUs to maximize inference throughput. Thanks,

TRTIS should support variable-sized input and output tensor dimensions

Currently TRTIS only allows the first dimension of an input/output tensor to be variable sized and only when that dimension represents batching. TRTIS should allow variable-sized dimensions in other cases since these are supported by some of the FWs (e.g. TensorFlow) and not having it limits which models can easily run on TRTIS.

How the tensorflow savedmodel worked in TRTIS?

Our tensorflow savedmodel exported by tf.saved_model.simple_save method and its contains the following files:
1/
saved_model.pb
variables

When we run the tensorflow savedmodel by TRTIS, we get the error as:
tensorrtserver.api.InferenceServerException: [inference:0 4] Attempting to use uninitialized value w1

So, how the tensorflow savedmodel (include variables file) worked in TRTIS?

Could you give me some explanation for that? Thank you.

image_client error

When run image_client -m resnet50_netdef -s INCEPTION examples/data/mug.jpg, It's failed!
Server: nvcr.io/nvidia/inferenceserver:18.08.1-py3 It's OK
Client: Branch:18.08
CUDA Version 9.0.176
Cards: K80

Error message:
failed sending infer request: [inference:0 30] INTERNAL - failed to run model 'resnet50_netdef': [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "gpu_0/res2_0_branch2c_bn" input: "gpu_0/res2_0_branch1_bn" output: "gpu_0/res2_0_branch2c_bn" name: "" type: "Sum" device_option { device_type: 1 cuda_gpu_id: 2 } engine: ""

INTERNAL - unable to enqueue for inference

Hello, I have run my tensorrt model on tensorrt inference server, but when I invoke ctx->Run(&(results->back())) or ctx.GetAsyncRunResults(&(results->back()), request, true) to get model infer results，error occur:INTERNAL - unable to enqueue for inference rec_white_nbi_0_0_gpu0.
According to localhost:8000/api/status , model status is ready, as shown below.

Model Definition Files config.pbtxt as shown follow.

nvidia inference server log

I have looked inside cudnn.h for a description of the error codes, Cudnn Error in execute: 3 refers to bad param, Does it mean there exists some wrong params setting in my TRTIS client program？
my tensorrt inference client progream

I know the error is related to nvinfer1::IExecutionContext，but I don't know the cause of the error, if anyone know the cause, please help me, thank you!

the TRTIS could not load tf-trt frozen model

I build a tf-trt model with tf1.12 an tensorrt 5.0.2, i can run the model on tensorflow for inference.
But when i load it with nvcr.io/nvidia/tensorrtserver:18.12-py3, error as below:

E0121 11:17:43.947746 1 trt_logger.cc:38] DefaultLogger ../builder/cudnnBuilder2.cpp (1508) - Misc Error in buildEngine: -1 (Could not find tensor resnet_model/Pad in tensorScales.)
W0121 11:17:43.948121 1 trt_engine_op.cc:516] Engine creation for batch size 2 failed Internal: Failed to build TensorRT engine
W0121 11:17:43.948140 1 trt_engine_op.cc:287] Engine retrieval for batch size 1 failed. Running native segment for resnet_model/my_trt_op_0

Could you give me any suggestion?

tensorrtserver_clients docker image: ERROR: No supported GPU(s) detected to run this container

I all, after building and running the tensorrtserver_clients docker image I got the below message "ERROR: No supported GPU(s) detected to run this container". I have a server with 4xT4 GPUs

root@R7425-T4:~/tensorrt-inference-server-master# docker run -it --rm --net=host --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 tensorrtserver_clients

===============================
== TensorRT Inference Server ==
===============================

NVIDIA Release 19.03dev (build 5618942)

Copyright (c) 2018-2019, NVIDIA CORPORATION.  All rights reserved.
Copyright 2019 The TensorFlow Authors.  All rights reserved.
Copyright 2019 The TensorFlow Serving Authors.  All rights reserved.
Copyright (c) 2016-present, Facebook Inc. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.
ERROR: No supported GPU(s) detected to run this container

load error when signature changed

I load savedmodel with signature key "predict_object", error come out as fellow, it seems trtis only support "serving_default", is it a bug?
E0215 02:09:14.236638 1 aspired_versions_manager.cc:358] Servable {name: resnet_v1_50_graphdef version: 1} cannot be loaded: Invalid argument: unable to load model 'resnet_v1_50_graphdef', expected 'serving_default' signature

"authentication required" trying to build the server

Trying to build off of the master as per https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/build.html . Getting "authentication required" on some asset. Is there login or a preparation command I am missing?

tensorrt-inference-server>
docker build --pull -t tensorrtserver .
Sending build context to Docker daemon  228.5MB
Step 1/78 : ARG BASE_IMAGE=nvcr.io/nvidia/tensorrtserver:19.01-py3
Step 2/78 : ARG PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:19.01-py3
Step 3/78 : ARG TENSORFLOW_IMAGE=nvcr.io/nvidia/tensorflow:19.01-py3
Step 4/78 : FROM ${PYTORCH_IMAGE} AS trtserver_caffe2
19.01-py3: Pulling from nvidia/pytorch
7b8b6451c85f: Downloading  27.42MB/43.41MB
ab4d1096d9ba: Download complete
e6797d1788ac: Download complete
e25c5c290bde: Download complete
8d4f7f435046: Download complete
b21fc23db604: Waiting
bf299d8fff07: Pulling fs layer
292e4eacabdb: Pulling fs layer
1ee5923bb818: Pulling fs layer
59bd0d39f768: Pulling fs layer
e2ac17a09bf7: Pulling fs layer
cbf50c707aaa: Pulling fs layer
3dab4aa6bb15: Pulling fs layer
300ff60504b1: Pulling fs layer
2df886d82da1: Pulling fs layer
c0eee374cc6a: Pulling fs layer
062c6e56b989: Pulling fs layer
715865addf3f: Pulling fs layer
73562736a4b9: Pulling fs layer
92dbee10397c: Pulling fs layer
54f9714d7f99: Pulling fs layer
fffc25d04080: Pulling fs layer
b1ee8fbbba6a: Pulling fs layer
c08d75365123: Downloading
75bbc772987d: Waiting
45ab0b1a2e9f: Waiting
b0cd331d387e: Waiting
9b6b26955917: Waiting
e0f60e6afce1: Waiting
f305bcb0f8d0: Waiting
c7a2193f501c: Waiting
edd78a593297: Waiting
e536a143e6cd: Waiting
ba710a14404d: Waiting
73911d8f3298: Waiting
cf8085328534: Waiting
d33a8cebdda2: Waiting
e66d1e391b26: Waiting
e577c89cb450: Waiting
5dcc0df3d476: Waiting
8bd702c8a871: Waiting
39107bdec20e: Waiting
aff3f1c03514: Waiting
be213dac00f5: Waiting
82134687be00: Waiting
d28910a18fd3: Waiting
unauthorized: authentication required

the TRTIS can not load the trt model

I have successfully converted the frozen graphs( https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz) into uff by the script convert_to_uff in the docker container(nvcr.io/nvidia/tensorflow:18.11-py3[cuda10+cudnn7.4.1+TensorRT5])

i run inference by loading the uff with the TensorRT sample code, it's ok!

but when I run inference in TRTIS, i start TRTIS by starting the docker image(nvcr.io/nvidia/tensorrtserver:18.11-py3[cuda10+cudnn7.4.1]),

I get the error:
"Assertion `size >= bsize && "Mismatch between allocated memory size and expected size of serialized engine."

my model_repository is as below
plan_model/
├── 1
│ └── model.plan
├── config.pbtxt
└── inception_labels.txt

model.plan is the converted uff file,
config.pbxt is
name: "plan_model" platform: "tensorrt_plan" max_batch_size: 1 input [ { name: "input" data_type: TYPE_FP32 ####format: FORMAT_NHWC dims: [ 299, 299, 3 ] } ] output [ { name: "InceptionV3/Predictions/Reshape_1" data_type: TYPE_FP32 dims: [ 1001 ] label_filename: "inception_labels.txt" } ] instance_group [ { kind: KIND_GPU, count: 1 } ]

and GPU is the p100

Unable to collect inference metrics for nullptr servable

If I shutdown TRTIS while new requests are still coming in, I get a slew of "Unable to collect inference metrics for nullptr servable" messages near the end of the shutdown process.

This is just an annoyance. It doesn't seem to hurt anything.

I am using the http endpoint on tensorrtserver:19.02-py3 with a GTX 1060.

This gist shows the hanky panky:
https://gist.github.com/nieksand/007e70d49b2b6bba63144715d97f481e

My client script is just serially poking TRTIS: send single inference request, wait for the response, send the next/

Linking TRTIS as a library

Add the ability to bring TRTIS as a library into Bazel project.

Primarily, this will allow an external application easier access to the .protos needed to customize a client to talk to TRTIS.

gRPC example:

    name = "com_github_grpc_grpc",
    urls = [
        "https://github.com/grpc/grpc/archive/v1.16.1.tar.gz",
    ],
    strip_prefix = "grpc-1.16.1",
)

load("@com_github_grpc_grpc//bazel:grpc_deps.bzl", "grpc_deps")

This allows one to create a grpc dependency @com_github_grpc_grpc//:grpc++_unsecure.

The TRTIS example might be:

http_archive(
    name = "com_github_nvidia_trtis",
    urls = [
        "https://github.com/NVIDIA/tensorrt-inference-server/archive/v0.10.0.tar.gz",
    ],
    strip_prefix = "tensorrt-inference-server-0.10.0",
)

# this part is missing - there are now top-level dependencies - they are hardcoded into the workspace which is not accessible
load("@com_github_nvidia_trtis//baze:trtis_deps.bzl, "trtis_deps")

This would allow one to build the protos: @com_github_nvidia_trtis//:src/core/api_proto.

How to deploy serialized models?

My inference task will need go through a detection model first and then use the detection result as the input of classification model.
is there any way to put detection and classification in one request,so they can pass the data using share-memory? To avoid the overhead of passing data through http request.

TRTIS does not load model

Hello!

I have a model.plan file that does not load on the TRTIS server. It's supposed to be a relatively small model (a modification of a ResNet50. It was a .caffemodel file at first but I converted into a .plan file with this script.

import pretrainedmodels
import torch
import pretrainedmodels.utils as utils
from torch.nn import DataParallel, Sequential
from utils import ListImagesDataset, append
from torch.utils.data import DataLoader
import argparse
from tqdm import tqdm
import h5py

from torch.autograd import Variable
import torch.onnx
import torchvision
import tensorflow as tf
#import uff
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

datatype = trt.float32

# The Onnx path is used for Onnx models.
def build_engine_onnx(deploy_file, model_file, max_batch_size):
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.CaffeParser() as parser:

        builder.max_workspace_size = 15 <<  20
        builder.max_batch_size = max_batch_size
        # Load the Onnx model and parse it in order to populate the TensorRT network.
        model_tensors = parser.parse(deploy=deploy_file, model=model_file, network=network, dtype=datatype)


        print(network.get_layer(network.num_layers-1).get_output(0).shape)
        network.mark_output(network.get_layer(network.num_layers - 1).get_output(0))
        return builder.build_cuda_engine(network)

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    parser.add_argument("--model_name", default="resnet152Max", type=str)
    parser.add_argument("--deploy_file", default="model", type=str)
    parser.add_argument("--model_file", default="model", type=str)
    parser.add_argument("--output_dir", required=True, type=str)
    parser.add_argument("--num_workers", default=8, type=int)
    parser.add_argument("--batch_size", required=True, type=int)
    args = parser.parse_args()
    deploy_file = args.deploy_file
    model_file = args.model_file




    with build_engine_onnx(args.deploy_file, args.model_file, args.batch_size) as engine:

        with open(args.model_name+'.plan', 'wb') as f:
            print('ok')
            f.write(engine.serialize())

When I try to load it into TRTIS, I get :

===============================
== TensorRT Inference Server ==

NVIDIA Release 18.09 (build 688039)

I0205 16:59:32.507128 1 server.cc:631] Initializing TensorRT Inference Server
I0205 16:59:32.507207 1 server.cc:680] Reporting prometheus metrics on port 8002
I0205 16:59:33.342361 1 metrics.cc:129] found 8 GPUs supported power usage metric
I0205 16:59:33.348772 1 metrics.cc:139] GPU 0: Tesla V100-SXM2-16GB
I0205 16:59:33.360104 1 metrics.cc:139] GPU 1: Tesla V100-SXM2-16GB
I0205 16:59:33.366832 1 metrics.cc:139] GPU 2: Tesla V100-SXM2-16GB
I0205 16:59:33.373640 1 metrics.cc:139] GPU 3: Tesla V100-SXM2-16GB
I0205 16:59:33.381472 1 metrics.cc:139] GPU 4: Tesla V100-SXM2-16GB
I0205 16:59:33.388678 1 metrics.cc:139] GPU 5: Tesla V100-SXM2-16GB
I0205 16:59:33.396049 1 metrics.cc:139] GPU 6: Tesla V100-SXM2-16GB
I0205 16:59:33.403472 1 metrics.cc:139] GPU 7: Tesla V100-SXM2-16GB
I0205 16:59:33.404022 1 server.cc:884] Starting server 'inference:0' listening on
I0205 16:59:33.404050 1 server.cc:888] localhost:8001 for gRPC requests
I0205 16:59:33.404723 1 server.cc:898] localhost:8000 for HTTP requests
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 235] RAW: Entering the event loop ...
I0205 16:59:33.580886 1 server_core.cc:465] Adding/updating models.
I0205 16:59:33.580913 1 server_core.cc:520] (Re-)adding model: classifyNSFW
I0205 16:59:33.580919 1 server_core.cc:520] (Re-)adding model: resnext101_32x4d
I0205 16:59:33.681215 1 basic_manager.cc:739] Successfully reserved resources to load servable {name: resnext101_32x4d version: 1}
I0205 16:59:33.681268 1 loader_harness.cc:66] Approving load for servable version {name: resnext101_32x4d version: 1}
I0205 16:59:33.681277 1 loader_harness.cc:74] Loading servable version {name: resnext101_32x4d version: 1}
I0205 16:59:33.781259 1 basic_manager.cc:739] Successfully reserved resources to load servable {name: classifyNSFW version: 1}
I0205 16:59:33.781313 1 loader_harness.cc:66] Approving load for servable version {name: classifyNSFW version: 1}
I0205 16:59:33.781326 1 loader_harness.cc:74] Loading servable version {name: classifyNSFW version: 1}
I0205 16:59:33.824357 1 plan_bundle.cc:301] Creating instance classifyNSFW_0_0_gpu0 on GPU 0 (7.0) using model.plan
I0205 16:59:34.841770 1 logging.cc:39] Glob Size is 56 bytes.
I0205 16:59:34.843509 1 logging.cc:39] Added linear block of size 8589934597
I0205 16:59:34.843521 1 logging.cc:39] Added linear block of size 18446532056943951878
I0205 16:59:34.843525 1 logging.cc:39] Added linear block of size 47244640284
I0205 16:59:34.843528 1 logging.cc:39] Added linear block of size 154618822688
I0205 16:59:34.843531 1 logging.cc:39] Added linear block of size 18446744069414584508
I0205 16:59:34.843534 1 logging.cc:39] Added linear block of size 17179869216
I0205 16:59:34.843538 1 logging.cc:39] Added linear block of size 1651470960
I0205 16:59:34.843541 1 logging.cc:39] Added linear block of size 1305670057985
I0205 16:59:34.843544 1 logging.cc:39] Added linear block of size 773094113281
I0205 16:59:34.843547 1 logging.cc:39] Added linear block of size 17179869185
I0205 16:59:34.843550 1 logging.cc:39] Added linear block of size 38630843628
I0205 16:59:34.843554 1 logging.cc:39] Added linear block of size 17179869200

It seems like it tries to add some really big linear block, but I have no idea why...
Here is my .plan file:

classifyNSFW.plan.zip

Thank you!

Why TRTIS not support dims: [ -1 ] for input and output tensor?

As for Tensorflow, its shape dim can be define as -1, such as:
shape {
dim {
size: -1
}
}
,but why TRTIS not support dims: [ -1 ] ?

Problem with ssd

Hi, I'm trying to serve a TRT plan which is generated from the TensorRT ssd sample. It looks like something went wrong while deserializing plugins like PriorBox, Normalize and NMS.
E1213 11:48:24.882317 1602 logging.cc:44] getPluginCreator could not find plugin Normalize_TRT version 1 namespace E1213 11:48:24.882339 1602 logging.cc:44] Cannot deserialize plugin Normalize_TR T E1213 11:48:24.882427 1602 logging.cc:44] getPluginCreator could not find plugin PriorBox_TRT version 1 namespace E1213 11:48:24.882442 1602 logging.cc:44] Cannot deserialize plugin PriorBox_TRT E1213 11:48:24.882496 1602 logging.cc:44] getPluginCreator could not find plugin PriorBox_TRT version 1 namespace E1213 11:48:24.882515 1602 logging.cc:44] Cannot deserialize plugin PriorBox_TRT E1213 11:48:24.883639 1602 logging.cc:44] getPluginCreator could not find plugin PriorBox_TRT version 1 namespace E1213 11:48:24.883659 1602 logging.cc:44] Cannot deserialize plugin PriorBox_TRT E1213 11:48:24.884158 1602 logging.cc:44] getPluginCreator could not find plugin PriorBox_TRT version 1 namespace E1213 11:48:24.884172 1602 logging.cc:44] Cannot deserialize plugin PriorBox_TRT E1213 11:48:24.884213 1602 logging.cc:44] getPluginCreator could not find plugin PriorBox_TRT version 1 namespace E1213 11:48:24.884228 1602 logging.cc:44] Cannot deserialize plugin PriorBox_TRT E1213 11:48:24.885036 1602 logging.cc:44] getPluginCreator could not find plugin PriorBox_TRT version 1 namespace E1213 11:48:24.885051 1602 logging.cc:44] Cannot deserialize plugin PriorBox_TRT E1213 11:48:24.885119 1602 logging.cc:44] getPluginCreator could not find plugin NMS_TRT version 1 namespace E1213 11:48:24.885133 1602 logging.cc:44] Cannot deserialize plugin NMS_TRT

backtrace is like:
#0 0x00007f309e28772b in nvinfer1::rt::cuda::PluginV2Layer::allocateResources(nvinfer1::rt::CommonContext const&) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.5 [Current thread is 1 (Thread 0x7f2ca7f40700 (LWP 1812))] (gdb) bt #0 0x00007f309e28772b in nvinfer1::rt::cuda::PluginV2Layer::allocateResources(nvinfer1::rt::CommonContext const&) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.5 #1 0x00007f309e25367f in nvinfer1::rt::Engine::initialize() () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.5 #2 0x00007f309e256d14 in nvinfer1::rt::Engine::deserialize(void const*, unsigned long, nvinfer1::IGpuAllocator&, nvinfer1::IPluginFactory*) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.5 #3 0x00007f309e23fc43 in nvinfer1::Runtime::deserializeCudaEngine(void const*, unsigned long, nvinfer1::IPluginFactory*) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.5 #4 0x00007f30b435d3d9 in nvidia::inferenceserver::PlanBundle::CreateExecutionContext(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<char, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::vector<char, std::allocator<char> > > > > const&) () #5 0x00007f30b435e34b in nvidia::inferenceserver::PlanBundle::CreateExecutionContexts(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<char, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::vector<char, std::allocator<char> > > > > const&) () #6 0x00007f30b43569ae in nvidia::inferenceserver::(anonymous namespace)::CreatePlanBundle(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_delete<nvidia::inferenceserver::PlanBundle> >*) () #7 0x00007f30b4355177 in std::_Function_handler<tensorflow::Status (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_delete<nvidia::inferenceserver::PlanBundle> >*), tensorflow::Status (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_delete<nvidia::inferenceserver::PlanBundle> >*)>::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_delete<nvidia::inferenceserver::PlanBundle> >*&&) () #8 0x00007f30b435526c in std::_Function_handler<tensorflow::Status (std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_delete<nvidia::inferenceserver::PlanBundle> >*), tensorflow::serving::SimpleLoaderSourceAdapter<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, nvidia::inferenceserver::PlanBundle>::Convert(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<tensorflow::serving::Loader, std::default_delete<tensorflow::serving::Loader> >*)::{lambda(std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_delete<nvidia::inferenceserver::PlanBundle> >*)#1}>::_M_invoke(std::_Any_data const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_delete<nvidia::inferenceserver::PlanBundle> >*&&) () #9 0x00007f30b4356eaa in tensorflow::serving::SimpleLoader<nvidia::inferenceserver::PlanBundle>::Load() () #10 0x00007f30b43be159 in std::_Function_handler<tensorflow::Status (), tensorflow::serving::LoaderHarness::Load()::{lambda()#1}>::_M_invoke(std::_Any_data const&) () #11 0x00007f30b43c0577 in tensorflow::serving::Retry(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, long long, std::function<tensorflow::Status ()> const&, std::function<bool ()> const&) () #12 0x00007f30b43bf496 in tensorflow::serving::LoaderHarness::Load() () #13 0x00007f30b43bb7dd in tensorflow::serving::BasicManager::ExecuteLoad(tensorflow::serving::LoaderHarness*) () #14 0x00007f30b43bbbfc in tensorflow::serving::BasicManager::ExecuteLoadOrUnload(tensorflow::serving::BasicManager::LoadOrUnloadRequest const&, tensorflow::serving::LoaderHarness*) () #15 0x00007f30b43bd456 in tensorflow::serving::BasicManager::HandleLoadOrUnloadRequest(tensorflow::serving::BasicManager::LoadOrUnloadRequest const&, std::function<void (tensorflow::Status const&)>) () #16 0x00007f30b43bd54f in std::_Function_handler<void (), tensorflow::serving::BasicManager::LoadOrUnloadServable(tensorflow::serving::BasicManager::LoadOrUnloadRequest const&, std::function<void (tensorflow::Status const&)>)::{lambda()#2}>::_M_invoke(std::_Any_data const&) () #17 0x00007f30ba1769e9 in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () #18 0x00007f30ba174b87 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () #19 0x00007f301fb3ac80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #20 0x00007f30207426ba in start_thread (arg=0x7f2ca7f40700) at pthread_create.c:333 #21 0x00007f301f5a941d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

So is the serialized engine plan broken? Or is something wrong with my config. How to solve this problem?

HTTP 200 response on invalid output names

I am using the http endpoint on tensorrtserver:19.02-py3 with a TensorRT model.

If I pass the correct output tensor name to the http endpoint, everything is fine:

(Pdb) response.code
200
(Pdb) response.body
b'\x08\xd2\xb99\xf0\xb2\x1f=\x1eQd?!Dp=\xf6\xb52:$\xe2\x1a<model_name: "roof_condition"\nmodel_version: 1\nbatch_size: 1\noutput {\n  name: "import/softmax_output/Softmax"\n  raw {\n    dims: 1\n    dims: 1\n    dims: 6\n    batch_byte_size: 24\n  }\n}\n'
(Pdb) [v for v in response.headers.items()]
[('Nv-Inferresponse', 'model_name: "roof_condition" model_version: 1 batch_size: 1 output { name: "import/softmax_output/Softmax" raw { dims: 1 dims: 1 dims: 6 batch_byte_size: 24 } }'), ('Nv-Status', 'code: SUCCESS server_id: "inference:0" request_id: 10083'), ('Content-Type', 'application/octet-stream'), ('Date', 'Fri, 08 Mar 2019 10:55:42 GMT'), ('Content-Length', '207'), ('Connection', 'close')]

But if I ask for a non-existent output name like "potato":

TRTIS still hands back an http 200 response
The Nv-Status header still indicates SUCCESS.

(Pdb) response.code
200
(Pdb) response.body
b'model_name: "roof_condition"\nmodel_version: 1\nbatch_size: 1\n'
(Pdb) [v for v in response.headers.items()]
[('Nv-Inferresponse', 'model_name: "roof_condition" model_version: 1 batch_size: 1'), ('Nv-Status', 'code: SUCCESS server_id: "inference:0" request_id: 10082'), ('Content-Type', 'application/octet-stream'), ('Date', 'Fri, 08 Mar 2019 10:42:55 GMT'), ('Content-Length', '60'), ('Connection', 'close')]

I think TRTIS should return an error (e.g. http 4xx) when asked for a non-existent output name.

Potential GPU memory leak for TensorFlow models?

Hi,

I was testing with tensorrtserver:19.02-py3. However, I found the GPU memory usage became near 100% after running one TensorFlow savedmodel. The memory usage didn't go down even when I unloaded that model. Moreover, I have set --tf-gpu-memory-fraction=0.1 but it didn't help.

So is it possible that trtserver has GPU memory leak for TensorFlow models? Because the memory usage looks good with tensorrt plans.

Thanks

Cannot deserialize plugin RPROI_TRT

I0111 07:26:47.327792 1 server.cc:701] Initializing TensorRT Inference Server
I0111 07:26:47.327845 1 server.cc:751] Reporting prometheus metrics on port 8002
I0111 07:26:47.328767 1 metrics.cc:148] found 1 GPUs supporting NVML metrics
I0111 07:26:47.334173 1 metrics.cc:158] GPU 0: GeForce GTX 1080 Ti
I0111 07:26:47.334408 1 server.cc:1121] Starting server 'inference:0' listening on
I0111 07:26:47.334416 1 server.cc:1125] localhost:8001 for gRPC requests
I0111 07:26:47.334676 1 server.cc:1029] Building nvrpc server
I0111 07:26:47.334689 1 server.cc:1035] Register TensorRT GRPCService
I0111 07:26:47.334695 1 server.cc:1038] Register Infer RPC
I0111 07:26:47.334702 1 server.cc:1042] Register Status RPC
I0111 07:26:47.334706 1 server.cc:1046] Register Profile RPC
I0111 07:26:47.334709 1 server.cc:1050] Register Health RPC
I0111 07:26:47.334713 1 server.cc:1054] Register Executor
I0111 07:26:47.335855 1 server.cc:1135] localhost:8000 for HTTP requests
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 237] RAW: Entering the event loop ...
I0111 07:26:48.090098 1 logging.cc:49] Glob Size is 638866496 bytes.
I0111 07:26:48.162693 1 logging.cc:49] Added linear block of size 96000000
I0111 07:26:48.162704 1 logging.cc:49] Added linear block of size 96000000
I0111 07:26:48.162707 1 logging.cc:49] Added linear block of size 12032000
I0111 07:26:48.162726 1 logging.cc:49] Added linear block of size 221184
I0111 07:26:48.162729 1 logging.cc:49] Added linear block of size 110592
E0111 07:26:48.166348 1 logging.cc:43] getPluginCreator could not find plugin RPROI_TRT version 1 namespace
E0111 07:26:48.166355 1 logging.cc:43] Cannot deserialize plugin RPROI_TRT

Could you help me? Thank you.

Implement Pre-process "add-on" to reduce TRTIS communication bottleneck

Hello I am looking improve the performance of my client application which runs nets on TRTIS that take large inputs (think batches of 10 tensors of size (2048, 512, 8)) and outputs tensors (2048, 512, 9) by implementing a custom pre-process step. deadeyegoodwin mentioned in a reply to a user's question about submitting raw data via IPC that it is possible to implement a pre-process add-on to be built into TRTIS. Could anyone point me towards some resources so I can more quickly implement a first past at this feature?

Bellow is the comment from deadeyegoodwin that I refer to above. I am interested in item two because it seems to be the performance solution I can use:

In general I think your assessment is correct: I/O can be a performance limiter for some models and a primary way to fix this in many cases is to make the pre-processing local with the inference. Here are some variations we think about and where we stand as far as current support:

Pre-processing "service" running on same node as TensorRT Inference Server (TRTIS).
a. Use GRPC (or HTTP) to communicate from pre-processor -> TRTIS. Since communication is now local it may no longer be a bottleneck...
b. For even higher BW between pre-processor -> TRTIS, remove the GRPC/HTTP protocol overhead by implementing a custom/raw socket API. The internal APIs to allow this are already available within TRTIS and we plan to formalize and document them better in the future. Another option here is a flatbuffer interface which we have also thought about but not done anything with as yet.
c. Use shared-memory as you suggest... this would likely require a custom TRTIS API to communicate the shared-memory reference so is similar to (b).
d. For maximum bandwidth you could share GPU memory between pre-processor and TRTIS and use that for communication. The pre-processor would leave the input tensors in GPU memory and just share the location (via CUDA IPC) with TRTIS. We want to add some functionality to TRTIS to support this but currently we have not.
Avoid communication completely by implementing the pre-processor within TRTIS. Again, the
internal APIs to allow this are already available within TRTIS and we plan to formalize and document them better in the future. In general we are interested in generic pre-processor "add-ons" of this kind that we can incorporate into TRTIS as build-time options.

I would suggest that you start with (1a) and see how much benefit that gets you. We are generally interested in improving TRTIS in this area and so would welcome your experience and feedback as you experiment. If you think you could contribute something generally useful we would be very open to working with you on it, just be sure to include us in your plans early on so we can make sure we are all on the same page.

As for your question #3. Yes, for experimenting it is probably fastest to hack up the gRPC service to instead pass the reference instead of the actual data (but keep the rest of the request/response message the same). infer.cc is where the data (raw_input) is read out of the request message so you would need to change that to instead read from shared memory.

Originally posted by @deadeyegoodwin in #1 (comment)

cp: cannot stat 'bazel-bin/src/custom/addsub/libaddsub.so': No such file or directory

When I run:
docker build -t tensorrtserver_clients --target trtserver_build --build-arg "PYVER=2.7" --build-arg "BUILD_CLIENTS_ONLY=1" .

I have some problems：
INFO: Elapsed time: 94.683s, Critical Path: 14.74s
INFO: 1848 processes: 1848 local.
INFO: Build completed successfully, 2001 total actions
INFO: Build completed successfully, 2001 total actions
cp: cannot stat 'bazel-bin/src/custom/addsub/libaddsub.so': No such file or directory
The command '/bin/sh -c (cd /opt/tensorflow && ./nvbuild.sh --python$PYVER --configonly) && (cd tools && mv bazel.rc bazel.orig && cat bazel.orig /opt/tensorflow/.tf_configure.bazelrc > bazel.rc) && bash -c 'if [ "$BUILD_CLIENTS_ONLY" != "1" ]; then bazel build -c opt --config=cuda src/servers/trtserver src/custom/... src/clients/... src/test/...; else bazel build -c opt src/clients/...; fi' && (cd /opt/tensorrtserver && ln -s /workspace/qa qa) && mkdir -p /opt/tensorrtserver/bin && cp bazel-bin/src/clients/c++/image_client /opt/tensorrtserver/bin/. && cp bazel-bin/src/clients/c++/perf_client /opt/tensorrtserver/bin/. && cp bazel-bin/src/clients/c++/simple_client /opt/tensorrtserver/bin/. && mkdir -p /opt/tensorrtserver/lib && cp bazel-bin/src/clients/c++/librequest.so /opt/tensorrtserver/lib/. && cp bazel-bin/src/clients/c++/librequest.a /opt/tensorrtserver/lib/. && mkdir -p /opt/tensorrtserver/custom && cp bazel-bin/src/custom/addsub/libaddsub.so /opt/tensorrtserver/custom/. && mkdir -p /opt/tensorrtserver/pip && bazel-bin/src/clients/python/build_pip /opt/tensorrtserver/pip/. && bash -c 'if [ "$BUILD_CLIENTS_ONLY" != "1" ]; then cp bazel-bin/src/servers/trtserver /opt/tensorrtserver/bin/.; cp bazel-bin/src/test/caffe2plan /opt/tensorrtserver/bin/.; fi' && bazel clean --expunge && rm -rf /root/.cache/bazel && rm -rf /tmp/*' returned a non-zero code: 1

Can you help? Thank you

Warning about Dynamic Batching and thread count

The Dynamic Batching docs should warn about having a sufficient --http-thread-count.

with the default of 8, you can easily fail to hit the preferred_batch_size and will always hit the max_queue_delay_microseconds timeout.

That makes performance awful.

I experienced this with 7 models each with target batch size of 4. More than enough concurrent requests were inflight that each model should have filled batches within an msec or two. But verbose logging indicated that wasn't happening:

I0215 11:16:51.400883 1 plan_bundle.cc:450] Running foo_0_gpu0 with 2 request payloads
I0215 11:16:51.400908 1 plan_bundle.cc:450] Running bar_0_gpu0 with 1 request payloads

That makes sense if each http thread only services a single request at a time. You have 7 unfilled batches, 8 blocked request handling threads, and nothing more coming in to fill/release the batches until timeout.

Bumping --http-thread-count to 128 resolved the issue. Performance was excellent and verbose logging shows full batches being made:

I0215 11:26:14.336939 1 plan_bundle.cc:450] Running foo_0_gpu0 with 4 request payloads
I0215 11:26:14.341266 1 plan_bundle.cc:450] Running bar_0_gpu0 with 4 request payloads

I think a warning in the docs makes sense. Longer term, perhaps flip the http server from threads to an event loop architecture instead?

how to infer by http restful API

i sent request:
/api/infer/resnet50_netdef
POST
header:
content-type: application/octet-stream
NV-InferRequest: batch_size: 1 input { name: "input" byte_size: 602112 } output { name: "output" byte_size: 4000 cls { count: 3 } }
Body
mug.jpg

but it not work, 400 bad request.

NVIDIA Quadro M1000M GPU not supported by tensorrtserver:18.12-py3

After running tensorrtserver:18.12-py3 Docker container, I get the following output which indicates that my video card (NVIDIA Quadro M1000M GPU) is not supported, even though this video card is listed here https://developer.nvidia.com/cuda-gpus as being a CUDA-enabled piece of hardware.

Some environment details:
nvidia-docker version: 2.0.3
NVIDIA-SMI version: 410.79
Driver version: 410.79
CUDA version: 10.0
cudnnGetVersion() : 7402 , CUDNN_VERSION from cudnn.h : 7402 (7.4.2)

nvidia-docker run --rm --name trtserver -p 8000:8000 -p 8001:8001 \ -v /home/temp/tensorrt_models:/models nvcr.io/nvidia/tensorrtserver:18.12-py3 trtserver \ --model-store=/models

Ouput:

== TensorRT Inference Server ==

NVIDIA Release 18.12 (build 880120)

Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
Copyright 2018 The TensorFlow Authors. All rights reserved.
Copyright 2018 The TensorFlow Serving Authors. All rights reserved.
Copyright (c) 2016-present, Facebook Inc. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.
ERROR: Detected NVIDIA Quadro M1000M GPU, which is not supported by this container
ERROR: No supported GPU(s) detected to run this container

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for the inference server. NVIDIA recommends the use of the following flags:
nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

I0122 13:43:38.969179 1 server.cc:701] Initializing TensorRT Inference Server
I0122 13:43:38.969259 1 server.cc:751] Reporting prometheus metrics on port 8002
I0122 13:43:38.971350 1 metrics.cc:148] found 1 GPUs supporting NVML metrics
I0122 13:43:38.976922 1 metrics.cc:158] GPU 0: Quadro M1000M
I0122 13:43:38.977711 1 server.cc:1121] Starting server 'inference:0' listening on
I0122 13:43:38.977732 1 server.cc:1125] localhost:8001 for gRPC requests
I0122 13:43:38.977914 1 server.cc:1029] Building nvrpc server
I0122 13:43:38.977940 1 server.cc:1035] Register TensorRT GRPCService
I0122 13:43:38.977963 1 server.cc:1038] Register Infer RPC
I0122 13:43:38.977980 1 server.cc:1042] Register Status RPC
I0122 13:43:38.977989 1 server.cc:1046] Register Profile RPC
I0122 13:43:38.977998 1 server.cc:1050] Register Health RPC
I0122 13:43:38.978007 1 server.cc:1054] Register Executor
I0122 13:43:38.982098 1 server.cc:1135] localhost:8000 for HTTP requests
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 237] RAW: Entering the event loop ...
I0122 13:43:39.039619 1 server_status.cc:105] New status tracking for model 'model_1'
I0122 13:43:39.040311 1 server_core.cc:465] Adding/updating models.
I0122 13:43:39.040327 1 server_core.cc:562] (Re-)adding model: model_1
I0122 13:43:39.140693 1 basic_manager.cc:739] Successfully reserved resources to load servable {name: model_1 version: 1}
I0122 13:43:39.140735 1 loader_harness.cc:66] Approving load for servable version {name: model_1 version: 1}
I0122 13:43:39.140752 1 loader_harness.cc:74] Loading servable version {name: model_1 version: 1}
I0122 13:43:39.143548 1 base_bundle.cc:168] Creating instance model_1_0_0_gpu0 on GPU 0 (5.0) using model.graphdef
I0122 13:43:39.208124 1 cuda_gpu_executor.cc:957] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I0122 13:43:39.208825 1 gpu_device.cc:1432] Found device 0 with properties:
name: Quadro M1000M major: 5 minor: 0 memoryClockRate(GHz): 1.0715
pciBusID: 0000:01:00.0
totalMemory: 1.96GiB freeMemory: 1.19GiB
I0122 13:43:39.208850 1 gpu_device.cc:1482] Ignoring visible gpu device (device: 0, name: Quadro M1000M, pci bus id: 0000:01:00.0, compute capability: 5.0) with Cuda compute capability 5.0. The minimum required Cuda capability is 5.2.
I0122 13:43:39.208862 1 gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
I0122 13:43:39.208870 1 gpu_device.cc:988] 0
I0122 13:43:39.208888 1 gpu_device.cc:1001] 0: N
I0122 13:43:39.233687 1 base_bundle.cc:168] Creating instance model_1_0_1_gpu0 on GPU 0 (5.0) using model.graphdef
I0122 13:43:39.233723 1 gpu_device.cc:1482] Ignoring visible gpu device (device: 0, name: Quadro M1000M, pci bus id: 0000:01:00.0, compute capability: 5.0) with Cuda compute capability 5.0. The minimum required Cuda capability is 5.2.
I0122 13:43:39.233757 1 gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
I0122 13:43:39.233761 1 gpu_device.cc:988] 0
I0122 13:43:39.233765 1 gpu_device.cc:1001] 0: N
I0122 13:43:39.234945 1 base_bundle.cc:168] Creating instance model_1_0_2_gpu0 on GPU 0 (5.0) using model.graphdef
I0122 13:43:39.234992 1 gpu_device.cc:1482] Ignoring visible gpu device (device: 0, name: Quadro M1000M, pci bus id: 0000:01:00.0, compute capability: 5.0) with Cuda compute capability 5.0. The minimum required Cuda capability is 5.2.
I0122 13:43:39.234998 1 gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
I0122 13:43:39.235016 1 gpu_device.cc:988] 0
I0122 13:43:39.235019 1 gpu_device.cc:1001] 0: N
I0122 13:43:39.236214 1 base_bundle.cc:168] Creating instance model_1_0_3_gpu0 on GPU 0 (5.0) using model.graphdef
I0122 13:43:39.236237 1 gpu_device.cc:1482] Ignoring visible gpu device (device: 0, name: Quadro M1000M, pci bus id: 0000:01:00.0, compute capability: 5.0) with Cuda compute capability 5.0. The minimum required Cuda capability is 5.2.
I0122 13:43:39.236247 1 gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
I0122 13:43:39.236254 1 gpu_device.cc:988] 0
I0122 13:43:39.236278 1 gpu_device.cc:1001] 0: N
I0122 13:43:39.237000 1 infer.cc:788] Starting runner thread 0 at nice 5...
I0122 13:43:39.237058 1 infer.cc:788] Starting runner thread 1 at nice 5...
I0122 13:43:39.237161 1 infer.cc:788] Starting runner thread 2 at nice 5...
I0122 13:43:39.237272 1 loader_harness.cc:86] Successfully loaded servable version {name: model_1 version: 1}
I0122 13:43:39.237268 1 infer.cc:788] Starting runner thread 3 at nice 5...
E0122 13:43:40.977806 1 metrics.cc:213] failed to get power limit for GPU 0, NVML_ERROR 3
E0122 13:43:40.978112 1 metrics.cc:225] failed to get power usage for GPU 0, NVML_ERROR 3
E0122 13:43:40.978226 1 metrics.cc:238] failed to get energy consumption for GPU 0, NVML_ERROR 3
E0122 13:43:42.979284 1 metrics.cc:213] failed to get power limit for GPU 0, NVML_ERROR 3
E0122 13:43:42.979468 1 metrics.cc:225] failed to get power usage for GPU 0, NVML_ERROR 3
E0122 13:43:42.979555 1 metrics.cc:238] failed to get energy consumption for GPU 0, NVML_ERROR 3

How the Input tensor definition for tensorflow GraphDef model in TRTIS graph.pbtxt file?

The tensorflow GraphDef model have two inputs, one is input_a and the other is input_b. In the graph.pbtxt file, the input_a is defined as:
value {
list {
shape {
dim {
size: -1
}
}
}
}
, and the input_b is defined as:
value {
list {
shape {
dim {
size: -1
}
dim {
size: -1
}
}
}
}
, now, I want to konw how the dims definition of config.pbtxt file for input_a and input_b respectively ?

Encountered an error while loading tensorflow savedmodel

While TRTIS loading tensorflow savedmodel,an error occurred like this:
Loading servable: {name: road-tensorflow version: 1} failed: Not found: Op type not registered 'PyFunc' in binary running on efa5e12b876b. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resamplershould be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
what did it mean and what should i do? thanks!

Performance Example Application: [ 0] INTERNAL - No valid requests recorded within time interval. Please use a larger time window.

hi all, I am running the perf_client example applications and getting the below errors:

Server: 4xT4 GPUs
Docker image used for the server: nvcr.io/nvidia/tensorrtserver:19.02-py3
Command used to run the client: nvidia-docker run -it --rm --net=host --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 tensorrtserver_clients

Example 1:

root@R7425-T4:/workspace# /opt/tensorrtserver/bin/perf_client -m resnet50_netdef -p3000 -t4 -v
*** Measurement Settings ***
  Batch size: 1
  Measurement window: 3000 msec

Request concurrency: 4
[ 0] INTERNAL - No valid requests recorded within time interval. Please use a larger time window.
Thread [0] had error: [inference:0 0] INVALID_ARG - unable to parse request for model 'resnet50_netdef'
Thread [1] had error: [inference:0 0] INVALID_ARG - unable to parse request for model 'resnet50_netdef'
Thread [2] had error: [inference:0 0] INVALID_ARG - unable to parse request for model 'resnet50_netdef'
Thread [3] had error: [inference:0 0] INVALID_ARG - unable to parse request for model 'resnet50_netdef'

Example 2:

root@R7425-T4:/workspace# /opt/tensorrtserver/bin/perf_client -m resnet50_netdef -d -c8 -l200 -p5000 -b8
*** Measurement Settings ***
  Batch size: 8
  Measurement window: 5000 msec
  Latency limit: 200 msec
  Concurrency limit: 8 concurrent requests

Request concurrency: 1
[ 0] INTERNAL - No valid requests recorded within time interval. Please use a larger time window.
Thread [0] had error: [inference:0 0] INVALID_ARG - unable to parse request for model 'resnet50_netdef'

Example 3:

root@R7425-T4:/workspace# /opt/tensorrtserver/bin/perf_client -m resnet50_netdef -p3000 -d -l50 -c 3
*** Measurement Settings ***
  Batch size: 1
  Measurement window: 3000 msec
  Latency limit: 50 msec
  Concurrency limit: 3 concurrent requests

Request concurrency: 1
[ 0] INTERNAL - No valid requests recorded within time interval. Please use a larger time window.
Thread [0] had error: [inference:0 0] INVALID_ARG - unable to parse request for model 'resnet50_netdef'

On another hand, the TensorRT Inference Server is throwing these messages:
[libprotobuf ERROR external/protobuf_archive/src/google/protobuf/text_format.cc:307] Error parsing text-format nvidia.inferenceserver.InferRequestHeader: 1:79: Message type "nvidia.inferenceserver.InferRequestHeader" has no field named "id".

TRTIS should support TensorRT models that require custom plugins

image_client.py due to message type error

Hi,
I ran inference servers (using nvcr.io/nvidia/tensorrtserver:19.01-py3) and sent http request using image_client.py without modification (building Dockerfile).

Sent one request
$ python3 src/clients/python/image_client.py -m plan_model -s INCEPTION -u 172.17.0.2:8000 /opt/tensorrtserver/qa/images/mug.jpg

This is part of outputs
Trying 172.17.0.2...
TCP_NODELAY set
Connected to 172.17.0.2 (172.17.0.2) port 8000 (#0)
POST /api/infer/plan_model?format=binary HTTP/1.1
Host: 172.17.0.2:8000
User-Agent: libcurl-agent/1.0
Accept: /
Content-Type: application/octet-stream
NV-InferRequest:batch_size: 1 input { name: "input" dims: 3 dims: 299 dims: 299 } output { name: "InceptionV3/Logits/SpatialSqueeze" cls { count: 1 } }
Content-Length: 1072812

We are completely uploaded and fine
HTTP/1.1 400 Bad Request
NV-Status: code: INVALID_ARG msg: "unable to parse request for model 'plan_model'" server_id: "inference:0"
Content-Type: application/octet-stream
Date: Thu, 14 Feb 2019 01:50:56 GMT
Content-Length: 0

Connection #0 to host 172.17.0.2 left intact
Traceback (most recent call last):
File "src/clients/python/image_client.py", line 304, in
FLAGS.batch_size))
File "/usr/local/lib/python3.5/dist-packages/tensorrtserver/api/init.py", line 844, in run
self._last_request_id = _raise_if_error(c_void_p(_crequest_infer_ctx_run(self._ctx)))
File "/usr/local/lib/python3.5/dist-packages/tensorrtserver/api/init.py", line 183, in _raise_if_error
raise ex
tensorrtserver.api.InferenceServerException: [inference:0 0] unable to parse request for model 'plan_model'

In inference server side, it showed
[libprotobuf ERROR external/protobuf_archive/src/google/protobuf/text_format.cc:307] Error parsing text-format nvidia.inferenceserver.InferRequestHeader: 1:41: Message type "nvidia.inferenceserver.InferRequestHeader.Input" has no field named "dims".
E0214 01:29:37.093842 7363 server_status.cc:374] Unable to collect inference metrics for nullptr servable

Based on this file (https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/core/api.proto), there is dims field in Input.

This is my "config.pbtxt" file.
name: "plan_model"
platform: "tensorrt_plan"
max_batch_size: 1
input [
{
name: "input"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [3, 299, 299]
}
]
output [
{
name: "InceptionV3/Logits/SpatialSqueeze"
data_type: TYPE_FP32
dims: [1, 1, 1001]
label_filename: "imagenet_labels_1001.txt"
}
]

I guess request format has some issues related to dims part.
However, I am not clear.
Do you have any insight of this problem?

Thanks,

Clients build failed when set --build-arg "PYVER=3.6"

Will failed

sudo docker build -t tensorrtserver_clients --target trtserver_build --build-arg "PYVER=3.6" --build-arg "BUILD_CLIENTS_ONLY=1" .

but 3.5 works

sudo docker build -t tensorrtserver_clients --target trtserver_build --build-arg "PYVER=3.5" --build-arg "BUILD_CLIENTS_ONLY=1" .

can not run tensorrt-inference-servere-19.02 with cuda-driver-396.26

I pull the image nvcr.io/nvidia/tensorrtserver:19.02-py3 , and fail to start the server in the container.
The error msg is :

NVIDIA Release 19.02 (build 5627847)

Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
Copyright 2018 The TensorFlow Authors. All rights reserved.
Copyright 2018 The TensorFlow Serving Authors. All rights reserved.
Copyright (c) 2016-present, Facebook Inc. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.

ERROR: This container was built for NVIDIA Driver Release 410 or later, but
version 396.26 was detected and compatibility mode is UNAVAILABLE.
   [[CUDA Driver UNAVAILABLE (cuInit(0) returned 999)]]

But the container can run normally if I change my NVIDIA Driver from verion 396 to 384.
According to https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html , the container is supposed to run normally with NVIDIA Driver 384.111+.

in addition, my environment is Ubuntu 16.04, Tesla P4/P100.

resnet50_netdef does not run on CPU

Changing config.pbtxt for resnet50_netdef model (/docs/examples/model_repository/resnet50_netdef) to make use of the CPU, instead of the GPU, causes TRTIS to crash - see logs attached below.

Is there any way to make resnet50_netdef run on CPU, just like the simple mode (/docs/examples/model_repository/simple)? I know that this would greatly reduce the inference performance, but this is just for experimental usage.

config.pbtxt new content:

name: "resnet50_netdef"
platform: "caffe2_netdef"
max_batch_size: 128
input [
{
name: "data"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [ 3, 224, 224 ]
}
]
output [
{
name: "softmax"
data_type: TYPE_FP32
dims: [ 1000 ]
label_filename: "resnet50_labels.txt"
}
]

Output:

===============================
== TensorRT Inference Server ==

NVIDIA Release 18.12 (build 880120)

I0123 15:31:31.274979 1 server.cc:701] Initializing TensorRT Inference Server
I0123 15:31:31.275056 1 server.cc:751] Reporting prometheus metrics on port 8002
I0123 15:31:31.276949 1 metrics.cc:148] found 1 GPUs supporting NVML metrics
I0123 15:31:31.282843 1 metrics.cc:158] GPU 0: Quadro M1000M
I0123 15:31:31.283517 1 server.cc:1121] Starting server 'inference:0' listening on
I0123 15:31:31.283533 1 server.cc:1125] localhost:8001 for gRPC requests
I0123 15:31:31.283642 1 server.cc:1029] Building nvrpc server
I0123 15:31:31.283659 1 server.cc:1035] Register TensorRT GRPCService
I0123 15:31:31.283670 1 server.cc:1038] Register Infer RPC
I0123 15:31:31.283676 1 server.cc:1042] Register Status RPC
I0123 15:31:31.283681 1 server.cc:1046] Register Profile RPC
I0123 15:31:31.283687 1 server.cc:1050] Register Health RPC
I0123 15:31:31.283710 1 server.cc:1054] Register Executor
I0123 15:31:31.286020 1 server.cc:1135] localhost:8000 for HTTP requests
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 237] RAW: Entering the event loop ...
I0123 15:31:31.342688 1 server_status.cc:105] New status tracking for model 'inception_graphdef'
I0123 15:31:31.342715 1 server_status.cc:105] New status tracking for model 'resnet50_netdef'
I0123 15:31:31.342726 1 server_status.cc:105] New status tracking for model 'simple'
I0123 15:31:31.342834 1 server_core.cc:465] Adding/updating models.
I0123 15:31:31.342847 1 server_core.cc:562] (Re-)adding model: inception_graphdef
I0123 15:31:31.342853 1 server_core.cc:562] (Re-)adding model: resnet50_netdef
I0123 15:31:31.342859 1 server_core.cc:562] (Re-)adding model: simple
I0123 15:31:31.443067 1 basic_manager.cc:739] Successfully reserved resources to load servable {name: resnet50_netdef version: 1}
I0123 15:31:31.443114 1 loader_harness.cc:66] Approving load for servable version {name: resnet50_netdef version: 1}
I0123 15:31:31.443138 1 loader_harness.cc:74] Loading servable version {name: resnet50_netdef version: 1}
I0123 15:31:31.543224 1 basic_manager.cc:739] Successfully reserved resources to load servable {name: simple version: 1}
I0123 15:31:31.543259 1 loader_harness.cc:66] Approving load for servable version {name: simple version: 1}
I0123 15:31:31.543270 1 loader_harness.cc:74] Loading servable version {name: simple version: 1}
I0123 15:31:31.544332 1 base_bundle.cc:168] Creating instance simple_0_gpu0 on GPU 0 (5.0) using model.graphdef
I0123 15:31:31.578120 1 netdef_bundle.cc:218] Creating instance resnet50_netdef_0_gpu0 on GPU 0 (5.0) using init_model.netdef and model.netdef
E0123 15:31:31.578449 1 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0123 15:31:31.578462 1 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0123 15:31:31.578466 1 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
I0123 15:31:31.621830 1 cuda_gpu_executor.cc:957] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I0123 15:31:31.622515 1 gpu_device.cc:1432] Found device 0 with properties:
name: Quadro M1000M major: 5 minor: 0 memoryClockRate(GHz): 1.0715
pciBusID: 0000:01:00.0
totalMemory: 1.96GiB freeMemory: 1.20GiB
I0123 15:31:31.622537 1 gpu_device.cc:1482] Ignoring visible gpu device (device: 0, name: Quadro M1000M, pci bus id: 0000:01:00.0, compute capability: 5.0) with Cuda compute capability 5.0. The minimum required Cuda capability is 5.2.
I0123 15:31:31.622578 1 gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
I0123 15:31:31.622600 1 gpu_device.cc:988] 0
I0123 15:31:31.622612 1 gpu_device.cc:1001] 0: N
I0123 15:31:31.643218 1 basic_manager.cc:739] Successfully reserved resources to load servable {name: inception_graphdef version: 1}
I0123 15:31:31.643238 1 loader_harness.cc:66] Approving load for servable version {name: inception_graphdef version: 1}
I0123 15:31:31.643261 1 loader_harness.cc:74] Loading servable version {name: inception_graphdef version: 1}
I0123 15:31:31.643908 1 loader_harness.cc:86] Successfully loaded servable version {name: simple version: 1}
I0123 15:31:31.643909 1 infer.cc:788] Starting runner thread 0 at nice 5...
I0123 15:31:31.643952 1 base_bundle.cc:168] Creating instance inception_graphdef_0_0_gpu0 on GPU 0 (5.0) using model.graphdef
I0123 15:31:31.643985 1 gpu_device.cc:1482] Ignoring visible gpu device (device: 0, name: Quadro M1000M, pci bus id: 0000:01:00.0, compute capability: 5.0) with Cuda compute capability 5.0. The minimum required Cuda capability is 5.2.
I0123 15:31:31.643995 1 gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
I0123 15:31:31.644002 1 gpu_device.cc:988] 0
I0123 15:31:31.644010 1 gpu_device.cc:1001] 0: N
I0123 15:31:31.856008 1 base_bundle.cc:168] Creating instance inception_graphdef_0_1_gpu0 on GPU 0 (5.0) using model.graphdef
I0123 15:31:31.856098 1 gpu_device.cc:1482] Ignoring visible gpu device (device: 0, name: Quadro M1000M, pci bus id: 0000:01:00.0, compute capability: 5.0) with Cuda compute capability 5.0. The minimum required Cuda capability is 5.2.
I0123 15:31:31.856110 1 gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
I0123 15:31:31.856116 1 gpu_device.cc:988] 0
I0123 15:31:31.856121 1 gpu_device.cc:1001] 0: N
I0123 15:31:32.043441 1 base_bundle.cc:168] Creating instance inception_graphdef_0_2_gpu0 on GPU 0 (5.0) using model.graphdef
I0123 15:31:32.043501 1 gpu_device.cc:1482] Ignoring visible gpu device (device: 0, name: Quadro M1000M, pci bus id: 0000:01:00.0, compute capability: 5.0) with Cuda compute capability 5.0. The minimum required Cuda capability is 5.2.
I0123 15:31:32.043508 1 gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
I0123 15:31:32.043512 1 gpu_device.cc:988] 0
I0123 15:31:32.043534 1 gpu_device.cc:1001] 0: N
W0123 15:31:32.081417 1 workspace.cc:170] Blob gpu_0/data not in the workspace.
terminate called after throwing an instance of 'c10::Error'
what(): [enforce fail at operator.cc:46] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/data
frame # 0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, void const*) + 0x76 (0x7fd803c78416 in /opt/tensorrtserver/lib/libc10.so)
frame # 1: caffe2::OperatorBase::OperatorBase(caffe2::OperatorDef const&, caffe2::Workspace*) + 0x6aa (0x7fd8525ab71a in /opt/tensorrtserver/lib/libcaffe2.so)
frame # 2: + 0x263e359 (0x7fd8064c3359 in /opt/tensorrtserver/lib/libcaffe2_gpu.so)
frame # 3: + 0x2d1c84b (0x7fd806ba184b in /opt/tensorrtserver/lib/libcaffe2_gpu.so)
frame # 4: + 0x2d1df9e (0x7fd806ba2f9e in /opt/tensorrtserver/lib/libcaffe2_gpu.so)
frame # 5: std::_Function_handler<std::unique_ptr<caffe2::OperatorBase, std::default_deletecaffe2::OperatorBase > (caffe2::OperatorDef const&, caffe2::Workspace*), std::unique_ptr<caffe2::OperatorBase, std::default_deletecaffe2::OperatorBase > ()(caffe2::OperatorDef const&, caffe2::Workspace)>::_M_invoke(std::_Any_data const&, caffe2::OperatorDef const&, caffe2::Workspace*&&) + 0x23 (0x7fd85239dfd3 in /opt/tensorrtserver/lib/libcaffe2.so)
frame # 6: + 0x13e45b8 (0x7fd8525a95b8 in /opt/tensorrtserver/lib/libcaffe2.so)
frame # 7: + 0x13e6a19 (0x7fd8525aba19 in /opt/tensorrtserver/lib/libcaffe2.so)
frame # 8: caffe2::CreateOperator(caffe2::OperatorDef const&, caffe2::Workspace*, int) + 0x2cf (0x7fd8525ac5bf in /opt/tensorrtserver/lib/libcaffe2.so)
frame # 9: caffe2::SimpleNet::SimpleNet(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x455 (0x7fd852551355 in /opt/tensorrtserver/lib/libcaffe2.so)
frame # 10: + 0x1390c3e (0x7fd852555c3e in /opt/tensorrtserver/lib/libcaffe2.so)
frame # 11: + 0x138cbb3 (0x7fd852551bb3 in /opt/tensorrtserver/lib/libcaffe2.so)
frame # 12: caffe2::CreateNet(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0xb67 (0x7fd85257adc7 in /opt/tensorrtserver/lib/libcaffe2.so)
frame # 13: caffe2::Workspace::CreateNet(std::shared_ptr<caffe2::NetDef const> const&, bool) + 0x14b (0x7fd85259824b in /opt/tensorrtserver/lib/libcaffe2.so)
frame # 14: caffe2::Workspace::CreateNet(caffe2::NetDef const&, bool) + 0x9f (0x7fd85259991f in /opt/tensorrtserver/lib/libcaffe2.so)
frame # 15: + 0x140d229 (0x7fd8525d2229 in /opt/tensorrtserver/lib/libcaffe2.so)
frame # 16: Caffe2WorkspaceCreate + 0x1307 (0x7fd8525d4277 in /opt/tensorrtserver/lib/libcaffe2.so)
frame # 17: + 0x870054 (0x55dd33ff1054 in trtserver)
frame # 18: + 0x870d77 (0x55dd33ff1d77 in trtserver)
frame # 19: + 0x867e1f (0x55dd33fe8e1f in trtserver)
frame # 20: + 0x863cc8 (0x55dd33fe4cc8 in trtserver)
frame # 21: + 0x863ebc (0x55dd33fe4ebc in trtserver)
frame # 22: + 0x86834a (0x55dd33fe934a in trtserver)
frame # 23: + 0x8e6869 (0x55dd34067869 in trtserver)
frame # 24: + 0x8e8c87 (0x55dd34069c87 in trtserver)
frame # 25: + 0x8e7ba6 (0x55dd34068ba6 in trtserver)
frame # 26: + 0x8e3eed (0x55dd34064eed in trtserver)
frame # 27: + 0x8e430c (0x55dd3406530c in trtserver)
frame # 28: + 0x8e5b66 (0x55dd34066b66 in trtserver)
frame # 29: + 0x8e5c5f (0x55dd34066c5f in trtserver)
frame # 30: + 0x66b41a9 (0x55dd39e351a9 in trtserver)
frame # 31: + 0x66b2347 (0x55dd39e33347 in trtserver)
frame # 32: + 0xb8c80 (0x7fd7f4356c80 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame # 33: + 0x76ba (0x7fd7f4f5e6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame # 34: clone + 0x6d (0x7fd7f3dc541d in /lib/x86_64-linux-gnu/libc.so.6)

Encountered out of memoryError when using perf_client test with various batch sizes

I am trying test Tesla P4 performance with TRT inference server. My GPU memory is 8GB.
I tested using perf_client c++ sample and resetnet50_netdef model, starting with batchsize=1 , and then double the batchsize and go on.
I found test result is OK when batchsize = 8 and test failed when batchsize = 16 due to out of memory error.
what make me confused is the batchsize=4 and batchsize=8 test also failed after the batchsize16 test got bad result. As my thought batchsize < 16 tests should be always OK.
Could you give me some explanation for that? Thanks.

simple_client.py: unexpected additional input data for model 'simple'

Tried to run /src/clients/python/simple_client.py (master branch, latest commit), while having tensorrtserver:18.12-py3 up and running with models from /docs/examples/model_repository, but encountered the following exception:

Exception has occurred: tensorrtserver.api.InferenceServerException
[inference:0 0] unexpected additional input data for model 'simple'
File "/home/x/sources/NVIDIA/tensorrt-inference-server/src/clients/python/simple_client.py", line 72, in
batch_size)

Environment setup:
Python 3.5.3
tensorrtserver:18.12-py3
NVIDIA-SMI: 410.79
Driver Version: 410.79
CUDA Version: 10.0
NVIDIA Docker: 2.0.3

TensorRT Inference Server was started with the following command:

nvidia-docker run --rm --name trtserver -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/x/sources/NVIDIA/tensorrt-inference-server/docs/examples/model_repository:/models nvcr.io/nvidia/tensorrtserver:18.12-py3 trtserver --model-store=/models

Error With Running the samples

I recently pulled Nvidiai Docker container for tensorrt inference engine and Ran Simple_Client.py example
without any change but Inference Engine returns this error which I have not been able to get an insight why it is complaining about . The Error Message is as follow : "nvidia.inferencserver.InferRequestHeader.Input" has no field named "dims" . I checked the message sent by client and it looked fine . I ran both client and engine on same machine and aslo tried to run the engine on another machine but got the same error .
I appreciate if you know what is going on .

with grpc protocol I had another error and it didn't work as well .

The grpc_image_client.py sample also throws an error: ( if I remeber correctly it was compalining about input size 0 )

python grpc_image_client.py -m resnet50_netdef ~/tensorrt-inference-server/qa/images/mug.jpg

Request 0, batch size 1
Traceback (most recent call last):
File "grpc_image_client.py", line 325, in
postprocess(response.meta_data.output, result_filenames[idx], FLAGS.batch_size)
File "grpc_image_client.py", line 188, in postprocess
raise Exception("expected 1 result, got {}".format(len(results)))
Exception: expected 1 result, got 0

image client build error

sudo docker build -t tensorrtserver_clients --target trtserver_build --build-arg "BUILD_CLIENTS_ONLY=1" .

then the error is as below

ERROR: no such package '@local_config_cuda//crosstool': The repository could not be resolved

some env information is as below

Docker version 18.09.0, build 4d60db4
tensorrtserver 18.11-py3
cuda 10
cudnn 7.3
cuda Driver Version: 410.48
bazel version : Build label: 0.20.0- (@non-git)

Does TRTIS support large models that place on multiple devices？

I noticed the description in the TRTIS documentation that "Multiple model support. The server can manage any number and mix of models (limited by system disk and memory resources)". So I want to know if TRTIS supports when my model is so large that can not load on only one device and need to split on multiple devices (GPUs). Thank you!

Submitting raw data via IPC

Hi, thanks for open-sourcing this project!

I experimented with the TensorRT inference server, and I found that with my target model (a TensorRT execution plan that has FP16 inputs and outputs) to max-out my system's two GPUsm I need to send about 1.2 GBytes per second through the network stack. In my view, this means that scaling this architecture to a server with eight (or even more) GPUs either requires (multiple) IB interconnects, or a preprocessor which is co-located with the inference server, which receives compressed images, and sends raw data to the TRT server.

Once we assume that a preprocessor is located on the same physical node as the TRT inference server (and hope that the CPUs does not become a bottleneck now), then it would be much preferable to submit raw data via IPC (e.g. through /dev/shm) to the inference server, and thus avoid the overhead introduced by gRPC.

Here are my questions:

Is the above assessment and the conclusions I draw from it reasonable?
Do you have "submission of raw data via IPC mechanisms" on your roadmap? E.g. a feature where one submits a reference to the blob of preprocessed data in shared memory to the server via gRPC, and the server then loads this blob and uses it as input. If so, when do you plan on releasing it?
If I were to implement a version of this myself, do you agree that a first quick-and-dirty approach would be to a) change the gRPC service proto, and then b) change GRPCInferRequestProvider::GetNextInputContent in tensorrt-inference-server/src/core/infer.cc accordingly? Did I overlook a place where changes are necessary?

Again, thanks for making this tool available.

got problem while serving with TensorRT plan

hi,
I'm trying to serving a TensorRT serialized plan file on tensorrt-inference-server. While initializing server, a segmentation fault came up. Log message looks like:

I1130 15:23:00.955141 12581 server.cc:631] Initializing TensorRT Inference Server
I1130 15:23:00.955237 12581 server.cc:680] Reporting prometheus metrics on port 8002
I1130 15:23:01.874070 12581 metrics.cc:129] found 2 GPUs supported power usage metric
I1130 15:23:01.881493 12581 metrics.cc:139] GPU 0: Tesla P40
I1130 15:23:01.887018 12581 metrics.cc:139] GPU 1: Tesla P40
I1130 15:23:01.887897 12581 server.cc:884] Starting server 'inference:0' listening on
I1130 15:23:01.887919 12581 server.cc:888] localhost:8001 for gRPC requests
I1130 15:23:01.889030 12581 server.cc:898] localhost:8000 for HTTP requests
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 235] RAW: Entering the event loop ...
I1130 15:23:01.932019 12581 server_core.cc:465] Adding/updating models.
I1130 15:23:01.932078 12581 server_core.cc:520] (Re-)adding model: resnet50_plan
I1130 15:23:02.032174 12581 basic_manager.cc:739] Successfully reserved resources to load servable {name: resnet50_plan version: 1}
I1130 15:23:02.032211 12581 loader_harness.cc:66] Approving load for servable version {name: resnet50_plan version: 1}
I1130 15:23:02.032240 12581 loader_harness.cc:74] Loading servable version {name: resnet50_plan version: 1}
I1130 15:23:02.138975 12581 plan_bundle.cc:301] Creating instance resnet50_plan_0_0_gpu1 on GPU 1 (6.1) using model.plan
I1130 15:23:02.840773 12581 logging.cc:39] Glob Size is 56 bytes.
I1130 15:23:02.841601 12581 logging.cc:39] Added linear block of size 8589934597
I1130 15:23:02.841628 12581 logging.cc:39] Added linear block of size 18369233829710790662
I1130 15:23:02.841734 12581 logging.cc:39] Added linear block of size 47244640284
I1130 15:23:02.841755 12581 logging.cc:39] Added linear block of size 154618822688
I1130 15:23:02.841775 12581 logging.cc:39] Added linear block of size 18446744069414584508
I1130 15:23:02.841789 12581 logging.cc:39] Added linear block of size 17179869216
I1130 15:23:02.841804 12581 logging.cc:39] Added linear block of size 1651470960
I1130 15:23:02.841818 12581 logging.cc:39] Added linear block of size 1305670057985
I1130 15:23:02.841837 12581 logging.cc:39] Added linear block of size 773094113281
I1130 15:23:02.841853 12581 logging.cc:39] Added linear block of size 17179869185
I1130 15:23:02.841867 12581 logging.cc:39] Added linear block of size 38547291084
I1130 15:23:02.841881 12581 logging.cc:39] Added linear block of size 17179869200
Segmentation fault (core dumped)

and part of tracing messages from gdb:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by trtserver --model-store=/exchange/model_repository/.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 strlen () at ../sysdeps/x86_64/strlen.S:106
106 ../sysdeps/x86_64/strlen.S: No such file or directory.
[Current thread is 1 (Thread 0x7f6afff40700 (LWP 12791))]
(gdb) bt
#0 strlen () at ../sysdeps/x86_64/strlen.S:106
#1 0x00007f6e75ea3d01 in std::basic_string<char, std::char_traits, std::allocator >::basic_string(char const*, std::allocator const&) ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007f6efdaf181a in nvinfer1::rt::Engine::deserialize(void const*, unsigned long, nvinfer1::IGpuAllocator&, nvinfer1::IPluginFactory*) ()
from /usr/lib/x86_64-linux-gnu/libnvinfer.so.5
#3 0x00007f6efdaf6bd3 in nvinfer1::Runtime::deserializeCudaEngine(void const*, unsigned long, nvinfer1::IPluginFactory*) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.5
#4 0x00007f6f144c2cc4 in nvidia::inferenceserver::PlanBundle::CreateExecutionContext(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, nvidia::inferenceserver::ModelConfig const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::vector<char, std::allocator >, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::vector<char, std::allocator > > > > const&) ()
#5 0x00007f6f144c3a82 in nvidia::inferenceserver::PlanBundle::CreateExecutionContexts(nvidia::inferenceserver::ModelConfig const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::vector<char, std::allocator >, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::vector<char, std::allocator > > > > const&) ()
#6 0x00007f6f144bc1f4 in nvidia::inferenceserver::(anonymous namespace)::CreatePlanBundle(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_deletenvidia::inferenceserver::PlanBundle >) ()
#7 0x00007f6f144ba497 in std::_Function_handler<tensorflow::Status (std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_deletenvidia::inferenceserver::PlanBundle >), tensorflow::Status ()(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_deletenvidia::inferenceserver::PlanBundle >)>::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_deletenvidia::inferenceserver::PlanBundle >&&)
()
#8 0x00007f6f144ba58c in std::_Function_handler<tensorflow::Status (std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_deletenvidia::inferenceserver::PlanBundle >), tensorflow::serving::SimpleLoaderSourceAdapter<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, nvidia::inferenceserver::PlanBundle>::Convert(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::unique_ptr<tensorflow::serving::Loader, std::default_deletetensorflow::serving::Loader >)::{lambda(std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_deletenvidia::inferenceserver::PlanBundle >)#1}>::_M_invoke(std::_Any_data const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_deletenvidia::inferenceserver::PlanBundle >&&) ()
#9 0x00007f6f144bc86a in tensorflow::serving::SimpleLoadernvidia::inferenceserver::PlanBundle::Load() ()
#10 0x00007f6f14594099 in std::_Function_handler<tensorflow::Status (), tensorflow::serving::LoaderHarness::Load()::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#11 0x00007f6f145964b7 in tensorflow::serving::Retry(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, unsigned int, long long, std::function<tensorflow::Status ()> const&, std::function<bool ()> const&) ()
#12 0x00007f6f145953d6 in tensorflow::serving::LoaderHarness::Load() ()
#13 0x00007f6f1459172d in tensorflow::serving::BasicManager::ExecuteLoad(tensorflow::serving::LoaderHarness) ()
#14 0x00007f6f14591b4c in tensorflow::serving::BasicManager::ExecuteLoadOrUnload(tensorflow::serving::BasicManager::LoadOrUnloadRequest const&, tensorflow::serving::LoaderHarness*) ()
#15 0x00007f6f14593396 in tensorflow::serving::BasicManager::HandleLoadOrUnloadRequest(tensorflow::serving::BasicManager::LoadOrUnloadRequest const&, std::function<void (tensorflow::Status const&)>) ()
#16 0x00007f6f1459348f in std::_Function_handler<void (), tensorflow::serving::BasicManager::LoadOrUnloadServable(tensorflow::serving::BasicManager::LoadOrUnloadRequest const&, std::function<void (tensorflow::Status const&)>)::{lambda()#2}>::_M_invoke(std::_Any_data const&) ()
#17 0x00007f6f19e06579 in Eigen::NonBlockingThreadPoolTempltensorflow::thread::EigenEnvironment::WorkerLoop(int) ()
#18 0x00007f6f19e04717 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
()
#19 0x00007f6e75e8bc80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#20 0x00007f6e76a936ba in start_thread (arg=0x7f6afff40700) at pthread_create.c:333
#21 0x00007f6e758fa41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

I'm not sure if it went wrong while serializing the plan file to build an inference engine, so do you got any ideas? or is there anything wrong with my plan file/config file etc.

I used these docker images for testing
tensorrtserver: nvcr.io/nvidia/tensorrtserver 18.09-py3
tensorrt: nvcr.io/nvidia/tensorrt 18.11-py3

triton-inference-server / server Goto Github PK

server's Introduction

Triton Inference Server

LATEST RELEASE

Serve a Model in 3 Easy Steps

Examples and Tutorials

Documentation

Build and Deploy

Using Triton

Preparing Models for Triton Inference Server

Configure and Use Triton Inference Server

Client Support and Examples

Extend Triton

Additional Documentation

Contributing

Reporting problems, asking questions

For more information

server's People

Contributors

Stargazers

Watchers

Forkers

server's Issues

=============================== == TensorRT Inference Server ==

Ouput:

== TensorRT Inference Server ==

=============================== == TensorRT Inference Server ==

Recommend Projects

Recommend Topics

Recommend Org

===============================
== TensorRT Inference Server ==

===============================
== TensorRT Inference Server ==