Comments (4)
@SunnyGhj Can you share complete verbose Triton server logs (--log-verbose=1) from the server?
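For reference, verbose logging is typically enabled on the tritonserver command line. A minimal sketch of the launch command, assuming the /models repository path that appears later in this log:

```shell
# Enable verbose (I-level) logging when starting Triton.
# --model-repository path is taken from the log below; adjust for your setup.
tritonserver --model-repository=/models --log-verbose=1
```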
@tanmayv25 @Tabrizian @fpetrini15
Could you please help me?
@SunnyGhj Can you share complete verbose Triton server logs (--log-verbose=1) from the server?
Yes, of course. The log is below:
I0426 02:33:38.411272 1 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/pytorch/libtriton_pytorch.so
I0426 02:33:38.898145 1 libtorch.cc:2253] TRITONBACKEND_Initialize: pytorch
I0426 02:33:38.898166 1 libtorch.cc:2263] Triton TRITONBACKEND API version: 1.13
I0426 02:33:38.898172 1 libtorch.cc:2269] 'pytorch' TRITONBACKEND API version: 1.13
I0426 02:33:38.898183 1 cache_manager.cc:478] Create CacheManager with cache_dir: '/opt/tritonserver/caches'
I0426 02:33:39.064621 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f6e54000000' with size 268435456
I0426 02:33:39.064895 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0426 02:33:39.086429 1 model_config_utils.cc:647] Server side auto-completed config: name: "relevance_distil_bert3layer"
platform: "tensorrt_plan"
version_policy {
latest {
num_versions: 2
}
}
input {
name: "token_type_ids"
data_type: TYPE_INT32
dims: -1
dims: 128
}
input {
name: "attention_mask"
data_type: TYPE_INT32
dims: -1
dims: 128
}
input {
name: "input_ids"
data_type: TYPE_INT32
dims: -1
dims: 128
}
output {
name: "logits"
data_type: TYPE_FP32
dims: -1
dims: 5
}
instance_group {
count: 3
}
default_model_filename: "model.plan"
optimization {
graph {
level: 1
}
cuda {
graphs: true
busy_wait_events: true
graph_spec {
input {
key: "attention_mask"
value {
dim: 25
dim: 128
}
}
input {
key: "input_ids"
value {
dim: 25
dim: 128
}
}
input {
key: "token_type_ids"
value {
dim: 25
dim: 128
}
}
}
}
}
parameters {
key: "execution_mode"
value {
string_value: "0"
}
}
parameters {
key: "inter_op_thread_count"
value {
string_value: "4"
}
}
parameters {
key: "intra_op_thread_count"
value {
string_value: "4"
}
}
backend: "tensorrt"
I0426 02:33:39.087951 1 model_lifecycle.cc:462] loading: relevance_distil_bert3layer:20240125
I0426 02:33:39.088072 1 backend_model.cc:362] Adding default backend config setting: default-max-batch-size,4
I0426 02:33:39.088093 1 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
I0426 02:33:39.088472 1 tensorrt.cc:65] TRITONBACKEND_Initialize: tensorrt
I0426 02:33:39.088479 1 tensorrt.cc:75] Triton TRITONBACKEND API version: 1.13
I0426 02:33:39.088482 1 tensorrt.cc:81] 'tensorrt' TRITONBACKEND API version: 1.13
I0426 02:33:39.088485 1 tensorrt.cc:105] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","execution-policy":"BLOCKING","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0426 02:33:39.088502 1 tensorrt.cc:178] Registering TensorRT Plugins
I0426 02:33:39.088515 1 logging.cc:49] Plugin creator already registered - ::BatchedNMSDynamic_TRT version 1
I0426 02:33:39.088520 1 logging.cc:49] Plugin creator already registered - ::BatchedNMS_TRT version 1
I0426 02:33:39.088524 1 logging.cc:49] Plugin creator already registered - ::BatchTilePlugin_TRT version 1
I0426 02:33:39.088529 1 logging.cc:49] Plugin creator already registered - ::Clip_TRT version 1
I0426 02:33:39.088534 1 logging.cc:49] Plugin creator already registered - ::CoordConvAC version 1
I0426 02:33:39.088539 1 logging.cc:49] Plugin creator already registered - ::CropAndResizeDynamic version 1
I0426 02:33:39.088544 1 logging.cc:49] Plugin creator already registered - ::CropAndResize version 1
I0426 02:33:39.088549 1 logging.cc:49] Plugin creator already registered - ::DecodeBbox3DPlugin version 1
I0426 02:33:39.088554 1 logging.cc:49] Plugin creator already registered - ::DetectionLayer_TRT version 1
I0426 02:33:39.088558 1 logging.cc:49] Plugin creator already registered - ::EfficientNMS_Explicit_TF_TRT version 1
I0426 02:33:39.088563 1 logging.cc:49] Plugin creator already registered - ::EfficientNMS_Implicit_TF_TRT version 1
I0426 02:33:39.088567 1 logging.cc:49] Plugin creator already registered - ::EfficientNMS_ONNX_TRT version 1
I0426 02:33:39.088578 1 logging.cc:49] Plugin creator already registered - ::EfficientNMS_TRT version 1
I0426 02:33:39.088582 1 logging.cc:49] Plugin creator already registered - ::FlattenConcat_TRT version 1
I0426 02:33:39.088587 1 logging.cc:49] Plugin creator already registered - ::GenerateDetection_TRT version 1
I0426 02:33:39.088592 1 logging.cc:49] Plugin creator already registered - ::GridAnchor_TRT version 1
I0426 02:33:39.088596 1 logging.cc:49] Plugin creator already registered - ::GridAnchorRect_TRT version 1
I0426 02:33:39.088600 1 logging.cc:49] Plugin creator already registered - ::InstanceNormalization_TRT version 1
I0426 02:33:39.088604 1 logging.cc:49] Plugin creator already registered - ::InstanceNormalization_TRT version 2
I0426 02:33:39.088608 1 logging.cc:49] Plugin creator already registered - ::LReLU_TRT version 1
I0426 02:33:39.088613 1 logging.cc:49] Plugin creator already registered - ::ModulatedDeformConv2d version 1
I0426 02:33:39.088617 1 logging.cc:49] Plugin creator already registered - ::MultilevelCropAndResize_TRT version 1
I0426 02:33:39.088621 1 logging.cc:49] Plugin creator already registered - ::MultilevelProposeROI_TRT version 1
I0426 02:33:39.088625 1 logging.cc:49] Plugin creator already registered - ::MultiscaleDeformableAttnPlugin_TRT version 1
I0426 02:33:39.088630 1 logging.cc:49] Plugin creator already registered - ::NMSDynamic_TRT version 1
I0426 02:33:39.088634 1 logging.cc:49] Plugin creator already registered - ::NMS_TRT version 1
I0426 02:33:39.088638 1 logging.cc:49] Plugin creator already registered - ::Normalize_TRT version 1
I0426 02:33:39.088642 1 logging.cc:49] Plugin creator already registered - ::PillarScatterPlugin version 1
I0426 02:33:39.088647 1 logging.cc:49] Plugin creator already registered - ::PriorBox_TRT version 1
I0426 02:33:39.088651 1 logging.cc:49] Plugin creator already registered - ::ProposalDynamic version 1
I0426 02:33:39.088655 1 logging.cc:49] Plugin creator already registered - ::ProposalLayer_TRT version 1
I0426 02:33:39.088659 1 logging.cc:49] Plugin creator already registered - ::Proposal version 1
I0426 02:33:39.088663 1 logging.cc:49] Plugin creator already registered - ::PyramidROIAlign_TRT version 1
I0426 02:33:39.088668 1 logging.cc:49] Plugin creator already registered - ::Region_TRT version 1
I0426 02:33:39.088672 1 logging.cc:49] Plugin creator already registered - ::Reorg_TRT version 1
I0426 02:33:39.088676 1 logging.cc:49] Plugin creator already registered - ::ResizeNearest_TRT version 1
I0426 02:33:39.088681 1 logging.cc:49] Plugin creator already registered - ::ROIAlign_TRT version 1
I0426 02:33:39.088685 1 logging.cc:49] Plugin creator already registered - ::RPROI_TRT version 1
I0426 02:33:39.088688 1 logging.cc:49] Plugin creator already registered - ::ScatterND version 1
I0426 02:33:39.088692 1 logging.cc:49] Plugin creator already registered - ::SpecialSlice_TRT version 1
I0426 02:33:39.088696 1 logging.cc:49] Plugin creator already registered - ::Split version 1
I0426 02:33:39.088700 1 logging.cc:49] Plugin creator already registered - ::VoxelGeneratorPlugin version 1
I0426 02:33:39.088967 1 tensorrt.cc:222] TRITONBACKEND_ModelInitialize: relevance_distil_bert3layer (version 20240125)
I0426 02:33:39.089438 1 model_config_utils.cc:1839] ModelConfig 64-bit fields:
I0426 02:33:39.089445 1 model_config_utils.cc:1841] ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0426 02:33:39.089448 1 model_config_utils.cc:1841] ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0426 02:33:39.089450 1 model_config_utils.cc:1841] ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0426 02:33:39.089453 1 model_config_utils.cc:1841] ModelConfig::ensemble_scheduling::step::model_version
I0426 02:33:39.089455 1 model_config_utils.cc:1841] ModelConfig::input::dims
I0426 02:33:39.089458 1 model_config_utils.cc:1841] ModelConfig::input::reshape::shape
I0426 02:33:39.089460 1 model_config_utils.cc:1841] ModelConfig::instance_group::secondary_devices::device_id
I0426 02:33:39.089467 1 model_config_utils.cc:1841] ModelConfig::model_warmup::inputs::value::dims
I0426 02:33:39.089470 1 model_config_utils.cc:1841] ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0426 02:33:39.089472 1 model_config_utils.cc:1841] ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0426 02:33:39.089474 1 model_config_utils.cc:1841] ModelConfig::output::dims
I0426 02:33:39.089477 1 model_config_utils.cc:1841] ModelConfig::output::reshape::shape
I0426 02:33:39.089479 1 model_config_utils.cc:1841] ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0426 02:33:39.089482 1 model_config_utils.cc:1841] ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0426 02:33:39.089484 1 model_config_utils.cc:1841] ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0426 02:33:39.089487 1 model_config_utils.cc:1841] ModelConfig::sequence_batching::state::dims
I0426 02:33:39.089489 1 model_config_utils.cc:1841] ModelConfig::sequence_batching::state::initial_state::dims
I0426 02:33:39.089492 1 model_config_utils.cc:1841] ModelConfig::version_policy::specific::versions
I0426 02:33:39.089599 1 model_state.cc:308] Setting the CUDA device to GPU0 to auto-complete config for relevance_distil_bert3layer
I0426 02:33:39.089721 1 model_state.cc:354] Using explicit serialized file 'model.plan' to auto-complete config for relevance_distil_bert3layer
I0426 02:33:39.232749 1 logging.cc:46] Loaded engine size: 85 MiB
I0426 02:33:39.327110 1 logging.cc:49] Deserialization required 91472 microseconds.
I0426 02:33:39.327138 1 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +84, now: CPU 0, GPU 84 (MiB)
W0426 02:33:39.337247 1 model_state.cc:522] The specified dimensions in model config for relevance_distil_bert3layer hints that batching is unavailable
I0426 02:33:39.342512 1 model_state.cc:379] post auto-complete:
{
"name": "relevance_distil_bert3layer",
"platform": "tensorrt_plan",
"backend": "tensorrt",
"version_policy": {
"latest": {
"num_versions": 2
}
},
"max_batch_size": 0,
"input": [
{
"name": "token_type_ids",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
-1,
128
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "attention_mask",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
-1,
128
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "input_ids",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
-1,
128
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
}
],
"output": [
{
"name": "logits",
"data_type": "TYPE_FP32",
"dims": [
-1,
5
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"graph": {
"level": 1
},
"priority": "PRIORITY_DEFAULT",
"cuda": {
"graphs": true,
"busy_wait_events": true,
"graph_spec": [
{
"batch_size": 0,
"input": {
"token_type_ids": {
"dim": [
"25",
"128"
]
},
"attention_mask": {
"dim": [
"25",
"128"
]
},
"input_ids": {
"dim": [
"25",
"128"
]
}
}
}
],
"output_copy_stream": false
},
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "relevance_distil_bert3layer_0",
"kind": "KIND_GPU",
"count": 3,
"gpus": [
0
],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "model.plan",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {
"intra_op_thread_count": {
"string_value": "4"
},
"inter_op_thread_count": {
"string_value": "4"
},
"execution_mode": {
"string_value": "0"
}
},
"model_warmup": []
}
I0426 02:33:39.343172 1 model_state.cc:272] model configuration:
{
"name": "relevance_distil_bert3layer",
"platform": "tensorrt_plan",
"backend": "tensorrt",
"version_policy": {
"latest": {
"num_versions": 2
}
},
"max_batch_size": 0,
"input": [
{
"name": "token_type_ids",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
-1,
128
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "attention_mask",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
-1,
128
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "input_ids",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
-1,
128
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
}
],
"output": [
{
"name": "logits",
"data_type": "TYPE_FP32",
"dims": [
-1,
5
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"graph": {
"level": 1
},
"priority": "PRIORITY_DEFAULT",
"cuda": {
"graphs": true,
"busy_wait_events": true,
"graph_spec": [
{
"batch_size": 0,
"input": {
"input_ids": {
"dim": [
"25",
"128"
]
},
"attention_mask": {
"dim": [
"25",
"128"
]
},
"token_type_ids": {
"dim": [
"25",
"128"
]
}
}
}
],
"output_copy_stream": false
},
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "relevance_distil_bert3layer_0",
"kind": "KIND_GPU",
"count": 3,
"gpus": [
0
],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "model.plan",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {
"intra_op_thread_count": {
"string_value": "4"
},
"inter_op_thread_count": {
"string_value": "4"
},
"execution_mode": {
"string_value": "0"
}
},
"model_warmup": []
}
I0426 02:33:39.343297 1 tensorrt.cc:288] TRITONBACKEND_ModelInstanceInitialize: relevance_distil_bert3layer_0_0 (GPU device 0)
I0426 02:33:39.343418 1 backend_model_instance.cc:105] Creating instance relevance_distil_bert3layer_0_0 on GPU 0 (7.0) using artifact 'model.plan'
I0426 02:33:39.343517 1 instance_state.cc:256] Zero copy optimization is disabled
I0426 02:33:39.480353 1 logging.cc:46] Loaded engine size: 85 MiB
I0426 02:33:39.571531 1 logging.cc:49] Deserialization required 90942 microseconds.
I0426 02:33:39.571557 1 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +84, now: CPU 0, GPU 84 (MiB)
I0426 02:33:39.581548 1 model_state.cc:220] Created new runtime on GPU device 0, NVDLA core -1 for relevance_distil_bert3layer
I0426 02:33:39.581561 1 model_state.cc:227] Created new engine on GPU device 0, NVDLA core -1 for relevance_distil_bert3layer
I0426 02:33:39.581965 1 logging.cc:49] Total per-runner device persistent memory is 0
I0426 02:33:39.581972 1 logging.cc:49] Total per-runner host persistent memory is 32
I0426 02:33:39.582203 1 logging.cc:49] Allocated activation device memory of size 511285248
I0426 02:33:39.634158 1 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +488, now: CPU 0, GPU 572 (MiB)
I0426 02:33:39.634173 1 logging.cc:49] CUDA lazy loading is enabled.
I0426 02:33:39.634194 1 instance_state.cc:1797] Detected input_ids as execution binding for relevance_distil_bert3layer_0_0
I0426 02:33:39.634200 1 instance_state.cc:1797] Detected attention_mask as execution binding for relevance_distil_bert3layer_0_0
I0426 02:33:39.634205 1 instance_state.cc:1797] Detected token_type_ids as execution binding for relevance_distil_bert3layer_0_0
I0426 02:33:39.634210 1 instance_state.cc:1797] Detected logits as execution binding for relevance_distil_bert3layer_0_0
I0426 02:33:39.680524 1 instance_state.cc:3746] captured CUDA graph for relevance_distil_bert3layer_0_0, batch size 0
I0426 02:33:39.680545 1 instance_state.cc:188] Created instance relevance_distil_bert3layer_0_0 on GPU 0 with stream priority 0 and optimization profile default[0];
I0426 02:33:39.680677 1 backend_model_instance.cc:806] Starting backend thread for relevance_distil_bert3layer_0_0 at nice 0 on device 0...
I0426 02:33:39.680834 1 tensorrt.cc:288] TRITONBACKEND_ModelInstanceInitialize: relevance_distil_bert3layer_0_1 (GPU device 0)
I0426 02:33:39.680945 1 backend_model_instance.cc:105] Creating instance relevance_distil_bert3layer_0_1 on GPU 0 (7.0) using artifact 'model.plan'
I0426 02:33:39.681052 1 instance_state.cc:256] Zero copy optimization is disabled
I0426 02:33:39.817964 1 logging.cc:46] Loaded engine size: 85 MiB
I0426 02:33:39.908375 1 logging.cc:49] Deserialization required 90162 microseconds.
I0426 02:33:39.908403 1 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +84, now: CPU 0, GPU 656 (MiB)
I0426 02:33:39.918349 1 model_state.cc:227] Created new engine on GPU device 0, NVDLA core -1 for relevance_distil_bert3layer
I0426 02:33:39.918741 1 logging.cc:49] Total per-runner device persistent memory is 0
I0426 02:33:39.918754 1 logging.cc:49] Total per-runner host persistent memory is 32
I0426 02:33:39.919007 1 logging.cc:49] Allocated activation device memory of size 511285248
I0426 02:33:39.969935 1 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +488, now: CPU 0, GPU 1144 (MiB)
I0426 02:33:39.969951 1 logging.cc:49] CUDA lazy loading is enabled.
I0426 02:33:39.969963 1 instance_state.cc:1797] Detected input_ids as execution binding for relevance_distil_bert3layer_0_1
I0426 02:33:39.969967 1 instance_state.cc:1797] Detected attention_mask as execution binding for relevance_distil_bert3layer_0_1
I0426 02:33:39.969971 1 instance_state.cc:1797] Detected token_type_ids as execution binding for relevance_distil_bert3layer_0_1
I0426 02:33:39.969974 1 instance_state.cc:1797] Detected logits as execution binding for relevance_distil_bert3layer_0_1
I0426 02:33:39.974035 1 instance_state.cc:3746] captured CUDA graph for relevance_distil_bert3layer_0_1, batch size 0
I0426 02:33:39.974051 1 instance_state.cc:188] Created instance relevance_distil_bert3layer_0_1 on GPU 0 with stream priority 0 and optimization profile default[0];
I0426 02:33:39.974168 1 backend_model_instance.cc:806] Starting backend thread for relevance_distil_bert3layer_0_1 at nice 0 on device 0...
I0426 02:33:39.974303 1 tensorrt.cc:288] TRITONBACKEND_ModelInstanceInitialize: relevance_distil_bert3layer_0_2 (GPU device 0)
I0426 02:33:39.974406 1 backend_model_instance.cc:105] Creating instance relevance_distil_bert3layer_0_2 on GPU 0 (7.0) using artifact 'model.plan'
I0426 02:33:39.974508 1 instance_state.cc:256] Zero copy optimization is disabled
I0426 02:33:40.112753 1 logging.cc:46] Loaded engine size: 85 MiB
I0426 02:33:40.204935 1 logging.cc:49] Deserialization required 91919 microseconds.
I0426 02:33:40.204967 1 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +85, now: CPU 0, GPU 1229 (MiB)
I0426 02:33:40.214984 1 model_state.cc:227] Created new engine on GPU device 0, NVDLA core -1 for relevance_distil_bert3layer
I0426 02:33:40.215370 1 logging.cc:49] Total per-runner device persistent memory is 0
I0426 02:33:40.215377 1 logging.cc:49] Total per-runner host persistent memory is 32
I0426 02:33:40.215613 1 logging.cc:49] Allocated activation device memory of size 511285248
I0426 02:33:40.267325 1 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +487, now: CPU 0, GPU 1716 (MiB)
I0426 02:33:40.267346 1 logging.cc:49] CUDA lazy loading is enabled.
I0426 02:33:40.267361 1 instance_state.cc:1797] Detected input_ids as execution binding for relevance_distil_bert3layer_0_2
I0426 02:33:40.267365 1 instance_state.cc:1797] Detected attention_mask as execution binding for relevance_distil_bert3layer_0_2
I0426 02:33:40.267369 1 instance_state.cc:1797] Detected token_type_ids as execution binding for relevance_distil_bert3layer_0_2
I0426 02:33:40.267372 1 instance_state.cc:1797] Detected logits as execution binding for relevance_distil_bert3layer_0_2
I0426 02:33:40.271491 1 instance_state.cc:3746] captured CUDA graph for relevance_distil_bert3layer_0_2, batch size 0
I0426 02:33:40.271509 1 instance_state.cc:188] Created instance relevance_distil_bert3layer_0_2 on GPU 0 with stream priority 0 and optimization profile default[0];
I0426 02:33:40.271614 1 backend_model_instance.cc:806] Starting backend thread for relevance_distil_bert3layer_0_2 at nice 0 on device 0...
I0426 02:33:40.271777 1 model_lifecycle.cc:815] successfully loaded 'relevance_distil_bert3layer'
I0426 02:33:40.271839 1 server.cc:603]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0426 02:33:40.271906 1 server.cc:630]
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| tensorrt | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","execution-policy":"BLOCKING","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0426 02:33:40.271937 1 server.cc:673]
+-----------------------------+----------+--------+
| Model | Version | Status |
+-----------------------------+----------+--------+
| relevance_distil_bert3layer | 20240125 | READY |
+-----------------------------+----------+--------+
I0426 02:33:40.317589 1 metrics.cc:808] Collecting metrics for GPU 0: Tesla V100-SXM2-32GB
I0426 02:33:40.317799 1 metrics.cc:701] Collecting CPU metrics
I0426 02:33:40.317958 1 tritonserver.cc:2385]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.35.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /models |
| model_control_mode | MODE_POLL |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0426 02:33:40.318453 1 grpc_server.cc:2339]
+----------------------------------------------+---------+
| GRPC KeepAlive Option | Value |
+----------------------------------------------+---------+
| keepalive_time_ms | 7200000 |
| keepalive_timeout_ms | 20000 |
| keepalive_permit_without_calls | 0 |
| http2_max_pings_without_data | 2 |
| http2_min_recv_ping_interval_without_data_ms | 300000 |
| http2_max_ping_strikes | 2 |
+----------------------------------------------+---------+
I0426 02:33:40.318958 1 grpc_server.cc:99] Ready for RPC 'Check', 0
I0426 02:33:40.318985 1 grpc_server.cc:99] Ready for RPC 'ServerLive', 0
I0426 02:33:40.318991 1 grpc_server.cc:99] Ready for RPC 'ServerReady', 0
I0426 02:33:40.318995 1 grpc_server.cc:99] Ready for RPC 'ModelReady', 0
I0426 02:33:40.319001 1 grpc_server.cc:99] Ready for RPC 'ServerMetadata', 0
I0426 02:33:40.319008 1 grpc_server.cc:99] Ready for RPC 'ModelMetadata', 0
I0426 02:33:40.319014 1 grpc_server.cc:99] Ready for RPC 'ModelConfig', 0
I0426 02:33:40.319020 1 grpc_server.cc:99] Ready for RPC 'SystemSharedMemoryStatus', 0
I0426 02:33:40.319026 1 grpc_server.cc:99] Ready for RPC 'SystemSharedMemoryRegister', 0
I0426 02:33:40.319032 1 grpc_server.cc:99] Ready for RPC 'SystemSharedMemoryUnregister', 0
I0426 02:33:40.319038 1 grpc_server.cc:99] Ready for RPC 'CudaSharedMemoryStatus', 0
I0426 02:33:40.319042 1 grpc_server.cc:99] Ready for RPC 'CudaSharedMemoryRegister', 0
I0426 02:33:40.319048 1 grpc_server.cc:99] Ready for RPC 'CudaSharedMemoryUnregister', 0
I0426 02:33:40.319055 1 grpc_server.cc:99] Ready for RPC 'RepositoryIndex', 0
I0426 02:33:40.319061 1 grpc_server.cc:99] Ready for RPC 'RepositoryModelLoad', 0
I0426 02:33:40.319065 1 grpc_server.cc:99] Ready for RPC 'RepositoryModelUnload', 0
I0426 02:33:40.319071 1 grpc_server.cc:99] Ready for RPC 'ModelStatistics', 0
I0426 02:33:40.319077 1 grpc_server.cc:99] Ready for RPC 'Trace', 0
I0426 02:33:40.319083 1 grpc_server.cc:99] Ready for RPC 'Logging', 0
I0426 02:33:40.319108 1 grpc_server.cc:348] Thread started for CommonHandler
I0426 02:33:40.319191 1 infer_handler.cc:693] New request handler for ModelInferHandler, 0
I0426 02:33:40.319216 1 infer_handler.h:1046] Thread started for ModelInferHandler
I0426 02:33:40.319290 1 infer_handler.cc:693] New request handler for ModelInferHandler, 0
I0426 02:33:40.319315 1 infer_handler.h:1046] Thread started for ModelInferHandler
I0426 02:33:40.319385 1 stream_infer_handler.cc:127] New request handler for ModelStreamInferHandler, 0
I0426 02:33:40.319408 1 infer_handler.h:1046] Thread started for ModelStreamInferHandler
I0426 02:33:40.319412 1 grpc_server.cc:2445] Started GRPCInferenceService at 0.0.0.0:8001
I0426 02:33:40.319618 1 http_server.cc:3555] Started HTTPService at 0.0.0.0:8000
I0426 02:33:40.360609 1 http_server.cc:185] Started Metrics Service at 0.0.0.0:8002
I0426 02:33:40.365821 1 server.cc:374] Polling model repository
I0426 02:33:44.202919 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:33:51.395017 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:33:53.402431 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:33:55.376807 1 server.cc:374] Polling model repository
I0426 02:33:55.408704 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:33:58.223462 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:34:04.201442 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:34:10.392861 1 server.cc:374] Polling model repository
I0426 02:34:18.223519 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:34:21.395169 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:34:23.401222 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:34:24.201572 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:34:25.404892 1 server.cc:374] Polling model repository
I0426 02:34:25.407563 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:34:38.223630 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:34:40.418995 1 server.cc:374] Polling model repository
I0426 02:34:42.463945 1 http_server.cc:3449] HTTP request: 2 /v2/models/relevance_distil_bert3layer/versions/20240125/infer
I0426 02:34:42.465770 1 infer_request.cc:751] [request id: 1] prepared: [0x0x7f6d80004d40] request id: 1, model: relevance_distil_bert3layer, requested version: 20240125, actual version: 20240125, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f6d80003a28] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80003378] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d800053e8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
override inputs:
inputs:
[0x0x7f6d800053e8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80003378] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80003a28] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
original requested outputs:
requested outputs:
logits
I0426 02:34:42.465876 1 tensorrt.cc:381] model relevance_distil_bert3layer, instance relevance_distil_bert3layer_0_0, executing 1 requests
I0426 02:34:42.465897 1 instance_state.cc:360] TRITONBACKEND_ModelExecute: Issuing relevance_distil_bert3layer_0_0 with 1 requests
I0426 02:34:42.465902 1 instance_state.cc:409] TRITONBACKEND_ModelExecute: Running relevance_distil_bert3layer_0_0 with 1 requests
I0426 02:34:42.465955 1 instance_state.cc:1437] Optimization profile default [0] is selected for relevance_distil_bert3layer_0_0
I0426 02:34:42.465983 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e54000090
I0426 02:34:42.466034 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540032a0
I0426 02:34:42.466053 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540064b0
I0426 02:34:42.466077 1 instance_state.cc:900] Context with profile default [0] is being executed for relevance_distil_bert3layer_0_0
I0426 02:34:42.466384 1 infer_response.cc:167] add response output: output: logits, type: FP32, shape: [25,5]
I0426 02:34:42.466395 1 http_server.cc:1101] HTTP: unable to provide 'logits' in GPU, will use CPU
I0426 02:34:42.466402 1 http_server.cc:1121] HTTP using buffer for: 'logits', size: 500, addr: 0x7f6e44008610
I0426 02:34:42.466409 1 pinned_memory_manager.cc:161] pinned memory allocation: size 500, addr 0x7f6e540096c0
I0426 02:34:42.469031 1 http_server.cc:1195] HTTP release: size 500, addr 0x7f6e44008610
I0426 02:34:42.469068 1 instance_state.cc:1294] TRITONBACKEND_ModelExecute: model relevance_distil_bert3layer_0_0 released 1 requests
I0426 02:34:42.469076 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540096c0
I0426 02:34:42.469094 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e54000090
I0426 02:34:42.469099 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540032a0
I0426 02:34:42.469104 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540064b0
I0426 02:34:44.201598 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:34:51.394663 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:34:53.400596 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:34:55.408256 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:34:55.431574 1 server.cc:374] Polling model repository
I0426 02:34:58.223529 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:35:00.663151 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:00.663247 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:01.663843 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:01.663947 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:02.665415 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:02.666906 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:03.667317 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:03.667372 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:04.201815 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:35:04.668259 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:04.668349 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:05.669003 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:05.669067 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:06.669538 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:06.669594 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:07.670177 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:07.671670 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:08.672065 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:08.672122 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:09.672477 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:09.672540 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:10.447313 1 server.cc:374] Polling model repository
I0426 02:35:10.672921 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:10.672986 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:11.673341 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:11.673398 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:12.674220 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:12.674322 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:13.675467 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:13.677511 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:14.677983 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:14.678081 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:15.678377 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:15.678477 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:16.678883 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:16.678949 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:17.679866 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:17.681380 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:18.223516 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:35:18.681780 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:18.681837 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:19.682190 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:19.682268 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:20.722343 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:21.395131 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:35:23.401441 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:35:24.201218 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:35:25.408940 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:35:25.458183 1 server.cc:374] Polling model repository
I0426 02:35:38.223624 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:35:40.468503 1 server.cc:374] Polling model repository
I0426 02:35:44.201183 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:35:51.395171 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:35:53.401414 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:35:55.408792 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:35:55.482191 1 server.cc:374] Polling model repository
I0426 02:35:57.068948 1 http_server.cc:3449] HTTP request: 2 /v2/models/relevance_distil_bert3layer/versions/20240125/infer
I0426 02:35:57.069944 1 infer_request.cc:751] [request id: 1] prepared: [0x0x7f6d80010fa0] request id: 1, model: relevance_distil_bert3layer, requested version: 20240125, actual version: 20240125, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f6d800115c8] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80002188] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80004bc8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
override inputs:
inputs:
[0x0x7f6d80004bc8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80002188] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d800115c8] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
original requested outputs:
requested outputs:
logits
I0426 02:35:57.070030 1 tensorrt.cc:381] model relevance_distil_bert3layer, instance relevance_distil_bert3layer_0_1, executing 1 requests
I0426 02:35:57.070050 1 instance_state.cc:360] TRITONBACKEND_ModelExecute: Issuing relevance_distil_bert3layer_0_1 with 1 requests
I0426 02:35:57.070055 1 instance_state.cc:409] TRITONBACKEND_ModelExecute: Running relevance_distil_bert3layer_0_1 with 1 requests
I0426 02:35:57.070104 1 instance_state.cc:1437] Optimization profile default [0] is selected for relevance_distil_bert3layer_0_1
I0426 02:35:57.070127 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e54000090
I0426 02:35:57.070164 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540032a0
I0426 02:35:57.070182 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540064b0
I0426 02:35:57.070211 1 instance_state.cc:900] Context with profile default [0] is being executed for relevance_distil_bert3layer_0_1
I0426 02:35:57.070467 1 infer_response.cc:167] add response output: output: logits, type: FP32, shape: [25,5]
I0426 02:35:57.070478 1 http_server.cc:1101] HTTP: unable to provide 'logits' in GPU, will use CPU
I0426 02:35:57.070485 1 http_server.cc:1121] HTTP using buffer for: 'logits', size: 500, addr: 0x7f6e3c008610
I0426 02:35:57.070492 1 pinned_memory_manager.cc:161] pinned memory allocation: size 500, addr 0x7f6e540096c0
I0426 02:35:57.073122 1 http_server.cc:1195] HTTP release: size 500, addr 0x7f6e3c008610
I0426 02:35:57.073150 1 instance_state.cc:1294] TRITONBACKEND_ModelExecute: model relevance_distil_bert3layer_0_1 released 1 requests
I0426 02:35:57.073155 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540096c0
I0426 02:35:57.073163 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e54000090
I0426 02:35:57.073168 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540032a0
I0426 02:35:57.073173 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540064b0
I0426 02:35:57.906319 1 http_server.cc:3449] HTTP request: 2 /v2/models/relevance_distil_bert3layer/versions/20240125/infer
I0426 02:35:57.907357 1 infer_request.cc:751] [request id: 1] prepared: [0x0x7f6d80011960] request id: 1, model: relevance_distil_bert3layer, requested version: 20240125, actual version: 20240125, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f6d80012158] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80011fc8] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80011ea8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
override inputs:
inputs:
[0x0x7f6d80011ea8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80011fc8] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80012158] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
original requested outputs:
requested outputs:
logits
I0426 02:35:57.907442 1 tensorrt.cc:381] model relevance_distil_bert3layer, instance relevance_distil_bert3layer_0_2, executing 1 requests
I0426 02:35:57.907461 1 instance_state.cc:360] TRITONBACKEND_ModelExecute: Issuing relevance_distil_bert3layer_0_2 with 1 requests
I0426 02:35:57.907466 1 instance_state.cc:409] TRITONBACKEND_ModelExecute: Running relevance_distil_bert3layer_0_2 with 1 requests
I0426 02:35:57.907523 1 instance_state.cc:1437] Optimization profile default [0] is selected for relevance_distil_bert3layer_0_2
I0426 02:35:57.907546 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e54000090
I0426 02:35:57.907581 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540032a0
I0426 02:35:57.907600 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540064b0
I0426 02:35:57.907619 1 instance_state.cc:900] Context with profile default [0] is being executed for relevance_distil_bert3layer_0_2
I0426 02:35:57.907901 1 infer_response.cc:167] add response output: output: logits, type: FP32, shape: [25,5]
I0426 02:35:57.907913 1 http_server.cc:1101] HTTP: unable to provide 'logits' in GPU, will use CPU
I0426 02:35:57.907920 1 http_server.cc:1121] HTTP using buffer for: 'logits', size: 500, addr: 0x7f6e0c008610
I0426 02:35:57.907928 1 pinned_memory_manager.cc:161] pinned memory allocation: size 500, addr 0x7f6e540096c0
I0426 02:35:57.910527 1 http_server.cc:1195] HTTP release: size 500, addr 0x7f6e0c008610
I0426 02:35:57.910555 1 instance_state.cc:1294] TRITONBACKEND_ModelExecute: model relevance_distil_bert3layer_0_2 released 1 requests
I0426 02:35:57.910560 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540096c0
I0426 02:35:57.910567 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e54000090
I0426 02:35:57.910570 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540032a0
I0426 02:35:57.910573 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540064b0
I0426 02:35:58.223819 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:35:58.562130 1 http_server.cc:3449] HTTP request: 2 /v2/models/relevance_distil_bert3layer/versions/20240125/infer
I0426 02:35:58.565215 1 infer_request.cc:751] [request id: 1] prepared: [0x0x7f6d80012580] request id: 1, model: relevance_distil_bert3layer, requested version: 20240125, actual version: 20240125, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f6d80007128] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80006f68] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80012a68] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
override inputs:
inputs:
[0x0x7f6d80012a68] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80006f68] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80007128] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
original requested outputs:
requested outputs:
logits
I0426 02:35:58.565294 1 tensorrt.cc:381] model relevance_distil_bert3layer, instance relevance_distil_bert3layer_0_0, executing 1 requests
I0426 02:35:58.565312 1 instance_state.cc:360] TRITONBACKEND_ModelExecute: Issuing relevance_distil_bert3layer_0_0 with 1 requests
I0426 02:35:58.565318 1 instance_state.cc:409] TRITONBACKEND_ModelExecute: Running relevance_distil_bert3layer_0_0 with 1 requests
I0426 02:35:58.565354 1 instance_state.cc:1437] Optimization profile default [0] is selected for relevance_distil_bert3layer_0_0
I0426 02:35:58.565387 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e54000090
I0426 02:35:58.565423 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540032a0
I0426 02:35:58.565440 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540064b0
I0426 02:35:58.565460 1 instance_state.cc:900] Context with profile default [0] is being executed for relevance_distil_bert3layer_0_0
I0426 02:35:58.565714 1 infer_response.cc:167] add response output: output: logits, type: FP32, shape: [25,5]
I0426 02:35:58.565724 1 http_server.cc:1101] HTTP: unable to provide 'logits' in GPU, will use CPU
I0426 02:35:58.565739 1 http_server.cc:1121] HTTP using buffer for: 'logits', size: 500, addr: 0x7f6e440098e0
I0426 02:35:58.565746 1 pinned_memory_manager.cc:161] pinned memory allocation: size 500, addr 0x7f6e540096c0
I0426 02:35:58.568341 1 http_server.cc:1195] HTTP release: size 500, addr 0x7f6e440098e0
I0426 02:35:58.568372 1 instance_state.cc:1294] TRITONBACKEND_ModelExecute: model relevance_distil_bert3layer_0_0 released 1 requests
I0426 02:35:58.568378 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540096c0
I0426 02:35:58.568384 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e54000090
I0426 02:35:58.568387 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540032a0
I0426 02:35:58.568391 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540064b0
I0426 02:35:59.153891 1 http_server.cc:3449] HTTP request: 2 /v2/models/relevance_distil_bert3layer/versions/20240125/infer
I0426 02:35:59.154983 1 infer_request.cc:751] [request id: 1] prepared: [0x0x7f6d800075d0] request id: 1, model: relevance_distil_bert3layer, requested version: 20240125, actual version: 20240125, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f6d8000dfd8] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d800072b8] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80007ab8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
override inputs:
inputs:
[0x0x7f6d80007ab8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d800072b8] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d8000dfd8] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
original requested outputs:
requested outputs:
logits
I0426 02:35:59.155044 1 tensorrt.cc:381] model relevance_distil_bert3layer, instance relevance_distil_bert3layer_0_1, executing 1 requests
I0426 02:35:59.155054 1 instance_state.cc:360] TRITONBACKEND_ModelExecute: Issuing relevance_distil_bert3layer_0_1 with 1 requests
I0426 02:35:59.155059 1 instance_state.cc:409] TRITONBACKEND_ModelExecute: Running relevance_distil_bert3layer_0_1 with 1 requests
I0426 02:35:59.155092 1 instance_state.cc:1437] Optimization profile default [0] is selected for relevance_distil_bert3layer_0_1
I0426 02:35:59.155120 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e54000090
I0426 02:35:59.155154 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540032a0
I0426 02:35:59.155172 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540064b0
I0426 02:35:59.155203 1 instance_state.cc:900] Context with profile default [0] is being executed for relevance_distil_bert3layer_0_1
I0426 02:35:59.155456 1 infer_response.cc:167] add response output: output: logits, type: FP32, shape: [25,5]
I0426 02:35:59.155467 1 http_server.cc:1101] HTTP: unable to provide 'logits' in GPU, will use CPU
I0426 02:35:59.155474 1 http_server.cc:1121] HTTP using buffer for: 'logits', size: 500, addr: 0x7f6e3c0098d0
I0426 02:35:59.155481 1 pinned_memory_manager.cc:161] pinned memory allocation: size 500, addr 0x7f6e540096c0
I0426 02:35:59.158098 1 http_server.cc:1195] HTTP release: size 500, addr 0x7f6e3c0098d0
I0426 02:35:59.158128 1 instance_state.cc:1294] TRITONBACKEND_ModelExecute: model relevance_distil_bert3layer_0_1 released 1 requests
I0426 02:35:59.158133 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540096c0
I0426 02:35:59.158139 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e54000090
I0426 02:35:59.158143 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540032a0
I0426 02:35:59.158154 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540064b0
I0426 02:36:04.200829 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:36:10.500690 1 server.cc:374] Polling model repository
https://github.com/triton-inference-server/tensorrt_backend/blob/5c881ce8f74988deedc473bb78a9417ffc650757/src/instance_state.cc#L3817
According to the code above, when max_batch_size == 0 and graph_spec.batch_size == 0, the first element of cuda_graph_key is set to 0. This seems to be a bug: the first element of cuda_graph_key should be set to 1. During inference, the first element of input_dims is forced to 1 (i.e. [1, ...]), which is inconsistent with the key built at capture time, so the CUDA graph cannot be found. Refer to the code below.
https://github.com/triton-inference-server/tensorrt_backend/blob/5c881ce8f74988deedc473bb78a9417ffc650757/src/instance_state.cc#L563
https://github.com/triton-inference-server/tensorrt_backend/blob/5c881ce8f74988deedc473bb78a9417ffc650757/src/instance_state.cc#L3244