Comments (4)

tanmayv25 commented on June 11, 2024

@SunnyGhj Can you share complete verbose triton server logs? (--log-verbose=1)

SunnyGhj commented on June 11, 2024

@tanmayv25 @Tabrizian @fpetrini15
Could you please help me?

SunnyGhj commented on June 11, 2024

@SunnyGhj Can you share complete verbose triton server logs? (--log-verbose=1)

Yes, of course. The log is below:

I0426 02:33:38.411272 1 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/pytorch/libtriton_pytorch.so
I0426 02:33:38.898145 1 libtorch.cc:2253] TRITONBACKEND_Initialize: pytorch
I0426 02:33:38.898166 1 libtorch.cc:2263] Triton TRITONBACKEND API version: 1.13
I0426 02:33:38.898172 1 libtorch.cc:2269] 'pytorch' TRITONBACKEND API version: 1.13
I0426 02:33:38.898183 1 cache_manager.cc:478] Create CacheManager with cache_dir: '/opt/tritonserver/caches'
I0426 02:33:39.064621 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f6e54000000' with size 268435456
I0426 02:33:39.064895 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0426 02:33:39.086429 1 model_config_utils.cc:647] Server side auto-completed config: name: "relevance_distil_bert3layer"
platform: "tensorrt_plan"
version_policy {
  latest {
    num_versions: 2
  }
}
input {
  name: "token_type_ids"
  data_type: TYPE_INT32
  dims: -1
  dims: 128
}
input {
  name: "attention_mask"
  data_type: TYPE_INT32
  dims: -1
  dims: 128
}
input {
  name: "input_ids"
  data_type: TYPE_INT32
  dims: -1
  dims: 128
}
output {
  name: "logits"
  data_type: TYPE_FP32
  dims: -1
  dims: 5
}
instance_group {
  count: 3
}
default_model_filename: "model.plan"
optimization {
  graph {
    level: 1
  }
  cuda {
    graphs: true
    busy_wait_events: true
    graph_spec {
      input {
        key: "attention_mask"
        value {
          dim: 25
          dim: 128
        }
      }
      input {
        key: "input_ids"
        value {
          dim: 25
          dim: 128
        }
      }
      input {
        key: "token_type_ids"
        value {
          dim: 25
          dim: 128
        }
      }
    }
  }
}
parameters {
  key: "execution_mode"
  value {
    string_value: "0"
  }
}
parameters {
  key: "inter_op_thread_count"
  value {
    string_value: "4"
  }
}
parameters {
  key: "intra_op_thread_count"
  value {
    string_value: "4"
  }
}
backend: "tensorrt"

I0426 02:33:39.087951 1 model_lifecycle.cc:462] loading: relevance_distil_bert3layer:20240125
I0426 02:33:39.088072 1 backend_model.cc:362] Adding default backend config setting: default-max-batch-size,4
I0426 02:33:39.088093 1 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
I0426 02:33:39.088472 1 tensorrt.cc:65] TRITONBACKEND_Initialize: tensorrt
I0426 02:33:39.088479 1 tensorrt.cc:75] Triton TRITONBACKEND API version: 1.13
I0426 02:33:39.088482 1 tensorrt.cc:81] 'tensorrt' TRITONBACKEND API version: 1.13
I0426 02:33:39.088485 1 tensorrt.cc:105] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","execution-policy":"BLOCKING","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0426 02:33:39.088502 1 tensorrt.cc:178] Registering TensorRT Plugins
I0426 02:33:39.088515 1 logging.cc:49] Plugin creator already registered - ::BatchedNMSDynamic_TRT version 1
I0426 02:33:39.088520 1 logging.cc:49] Plugin creator already registered - ::BatchedNMS_TRT version 1
I0426 02:33:39.088524 1 logging.cc:49] Plugin creator already registered - ::BatchTilePlugin_TRT version 1
I0426 02:33:39.088529 1 logging.cc:49] Plugin creator already registered - ::Clip_TRT version 1
I0426 02:33:39.088534 1 logging.cc:49] Plugin creator already registered - ::CoordConvAC version 1
I0426 02:33:39.088539 1 logging.cc:49] Plugin creator already registered - ::CropAndResizeDynamic version 1
I0426 02:33:39.088544 1 logging.cc:49] Plugin creator already registered - ::CropAndResize version 1
I0426 02:33:39.088549 1 logging.cc:49] Plugin creator already registered - ::DecodeBbox3DPlugin version 1
I0426 02:33:39.088554 1 logging.cc:49] Plugin creator already registered - ::DetectionLayer_TRT version 1
I0426 02:33:39.088558 1 logging.cc:49] Plugin creator already registered - ::EfficientNMS_Explicit_TF_TRT version 1
I0426 02:33:39.088563 1 logging.cc:49] Plugin creator already registered - ::EfficientNMS_Implicit_TF_TRT version 1
I0426 02:33:39.088567 1 logging.cc:49] Plugin creator already registered - ::EfficientNMS_ONNX_TRT version 1
I0426 02:33:39.088578 1 logging.cc:49] Plugin creator already registered - ::EfficientNMS_TRT version 1
I0426 02:33:39.088582 1 logging.cc:49] Plugin creator already registered - ::FlattenConcat_TRT version 1
I0426 02:33:39.088587 1 logging.cc:49] Plugin creator already registered - ::GenerateDetection_TRT version 1
I0426 02:33:39.088592 1 logging.cc:49] Plugin creator already registered - ::GridAnchor_TRT version 1
I0426 02:33:39.088596 1 logging.cc:49] Plugin creator already registered - ::GridAnchorRect_TRT version 1
I0426 02:33:39.088600 1 logging.cc:49] Plugin creator already registered - ::InstanceNormalization_TRT version 1
I0426 02:33:39.088604 1 logging.cc:49] Plugin creator already registered - ::InstanceNormalization_TRT version 2
I0426 02:33:39.088608 1 logging.cc:49] Plugin creator already registered - ::LReLU_TRT version 1
I0426 02:33:39.088613 1 logging.cc:49] Plugin creator already registered - ::ModulatedDeformConv2d version 1
I0426 02:33:39.088617 1 logging.cc:49] Plugin creator already registered - ::MultilevelCropAndResize_TRT version 1
I0426 02:33:39.088621 1 logging.cc:49] Plugin creator already registered - ::MultilevelProposeROI_TRT version 1
I0426 02:33:39.088625 1 logging.cc:49] Plugin creator already registered - ::MultiscaleDeformableAttnPlugin_TRT version 1
I0426 02:33:39.088630 1 logging.cc:49] Plugin creator already registered - ::NMSDynamic_TRT version 1
I0426 02:33:39.088634 1 logging.cc:49] Plugin creator already registered - ::NMS_TRT version 1
I0426 02:33:39.088638 1 logging.cc:49] Plugin creator already registered - ::Normalize_TRT version 1
I0426 02:33:39.088642 1 logging.cc:49] Plugin creator already registered - ::PillarScatterPlugin version 1
I0426 02:33:39.088647 1 logging.cc:49] Plugin creator already registered - ::PriorBox_TRT version 1
I0426 02:33:39.088651 1 logging.cc:49] Plugin creator already registered - ::ProposalDynamic version 1
I0426 02:33:39.088655 1 logging.cc:49] Plugin creator already registered - ::ProposalLayer_TRT version 1
I0426 02:33:39.088659 1 logging.cc:49] Plugin creator already registered - ::Proposal version 1
I0426 02:33:39.088663 1 logging.cc:49] Plugin creator already registered - ::PyramidROIAlign_TRT version 1
I0426 02:33:39.088668 1 logging.cc:49] Plugin creator already registered - ::Region_TRT version 1
I0426 02:33:39.088672 1 logging.cc:49] Plugin creator already registered - ::Reorg_TRT version 1
I0426 02:33:39.088676 1 logging.cc:49] Plugin creator already registered - ::ResizeNearest_TRT version 1
I0426 02:33:39.088681 1 logging.cc:49] Plugin creator already registered - ::ROIAlign_TRT version 1
I0426 02:33:39.088685 1 logging.cc:49] Plugin creator already registered - ::RPROI_TRT version 1
I0426 02:33:39.088688 1 logging.cc:49] Plugin creator already registered - ::ScatterND version 1
I0426 02:33:39.088692 1 logging.cc:49] Plugin creator already registered - ::SpecialSlice_TRT version 1
I0426 02:33:39.088696 1 logging.cc:49] Plugin creator already registered - ::Split version 1
I0426 02:33:39.088700 1 logging.cc:49] Plugin creator already registered - ::VoxelGeneratorPlugin version 1
I0426 02:33:39.088967 1 tensorrt.cc:222] TRITONBACKEND_ModelInitialize: relevance_distil_bert3layer (version 20240125)
I0426 02:33:39.089438 1 model_config_utils.cc:1839] ModelConfig 64-bit fields:
I0426 02:33:39.089445 1 model_config_utils.cc:1841] 	ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0426 02:33:39.089448 1 model_config_utils.cc:1841] 	ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0426 02:33:39.089450 1 model_config_utils.cc:1841] 	ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0426 02:33:39.089453 1 model_config_utils.cc:1841] 	ModelConfig::ensemble_scheduling::step::model_version
I0426 02:33:39.089455 1 model_config_utils.cc:1841] 	ModelConfig::input::dims
I0426 02:33:39.089458 1 model_config_utils.cc:1841] 	ModelConfig::input::reshape::shape
I0426 02:33:39.089460 1 model_config_utils.cc:1841] 	ModelConfig::instance_group::secondary_devices::device_id
I0426 02:33:39.089467 1 model_config_utils.cc:1841] 	ModelConfig::model_warmup::inputs::value::dims
I0426 02:33:39.089470 1 model_config_utils.cc:1841] 	ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0426 02:33:39.089472 1 model_config_utils.cc:1841] 	ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0426 02:33:39.089474 1 model_config_utils.cc:1841] 	ModelConfig::output::dims
I0426 02:33:39.089477 1 model_config_utils.cc:1841] 	ModelConfig::output::reshape::shape
I0426 02:33:39.089479 1 model_config_utils.cc:1841] 	ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0426 02:33:39.089482 1 model_config_utils.cc:1841] 	ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0426 02:33:39.089484 1 model_config_utils.cc:1841] 	ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0426 02:33:39.089487 1 model_config_utils.cc:1841] 	ModelConfig::sequence_batching::state::dims
I0426 02:33:39.089489 1 model_config_utils.cc:1841] 	ModelConfig::sequence_batching::state::initial_state::dims
I0426 02:33:39.089492 1 model_config_utils.cc:1841] 	ModelConfig::version_policy::specific::versions
I0426 02:33:39.089599 1 model_state.cc:308] Setting the CUDA device to GPU0 to auto-complete config for relevance_distil_bert3layer
I0426 02:33:39.089721 1 model_state.cc:354] Using explicit serialized file 'model.plan' to auto-complete config for relevance_distil_bert3layer
I0426 02:33:39.232749 1 logging.cc:46] Loaded engine size: 85 MiB
I0426 02:33:39.327110 1 logging.cc:49] Deserialization required 91472 microseconds.
I0426 02:33:39.327138 1 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +84, now: CPU 0, GPU 84 (MiB)
W0426 02:33:39.337247 1 model_state.cc:522] The specified dimensions in model config for relevance_distil_bert3layer hints that batching is unavailable
I0426 02:33:39.342512 1 model_state.cc:379] post auto-complete:
{
    "name": "relevance_distil_bert3layer",
    "platform": "tensorrt_plan",
    "backend": "tensorrt",
    "version_policy": {
        "latest": {
            "num_versions": 2
        }
    },
    "max_batch_size": 0,
    "input": [
        {
            "name": "token_type_ids",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1,
                128
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "attention_mask",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1,
                128
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "input_ids",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1,
                128
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "logits",
            "data_type": "TYPE_FP32",
            "dims": [
                -1,
                5
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "graph": {
            "level": 1
        },
        "priority": "PRIORITY_DEFAULT",
        "cuda": {
            "graphs": true,
            "busy_wait_events": true,
            "graph_spec": [
                {
                    "batch_size": 0,
                    "input": {
                        "token_type_ids": {
                            "dim": [
                                "25",
                                "128"
                            ]
                        },
                        "attention_mask": {
                            "dim": [
                                "25",
                                "128"
                            ]
                        },
                        "input_ids": {
                            "dim": [
                                "25",
                                "128"
                            ]
                        }
                    }
                }
            ],
            "output_copy_stream": false
        },
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "relevance_distil_bert3layer_0",
            "kind": "KIND_GPU",
            "count": 3,
            "gpus": [
                0
            ],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.plan",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "intra_op_thread_count": {
            "string_value": "4"
        },
        "inter_op_thread_count": {
            "string_value": "4"
        },
        "execution_mode": {
            "string_value": "0"
        }
    },
    "model_warmup": []
}
I0426 02:33:39.343172 1 model_state.cc:272] model configuration:
{
    "name": "relevance_distil_bert3layer",
    "platform": "tensorrt_plan",
    "backend": "tensorrt",
    "version_policy": {
        "latest": {
            "num_versions": 2
        }
    },
    "max_batch_size": 0,
    "input": [
        {
            "name": "token_type_ids",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1,
                128
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "attention_mask",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1,
                128
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "input_ids",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1,
                128
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "logits",
            "data_type": "TYPE_FP32",
            "dims": [
                -1,
                5
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "graph": {
            "level": 1
        },
        "priority": "PRIORITY_DEFAULT",
        "cuda": {
            "graphs": true,
            "busy_wait_events": true,
            "graph_spec": [
                {
                    "batch_size": 0,
                    "input": {
                        "input_ids": {
                            "dim": [
                                "25",
                                "128"
                            ]
                        },
                        "attention_mask": {
                            "dim": [
                                "25",
                                "128"
                            ]
                        },
                        "token_type_ids": {
                            "dim": [
                                "25",
                                "128"
                            ]
                        }
                    }
                }
            ],
            "output_copy_stream": false
        },
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "relevance_distil_bert3layer_0",
            "kind": "KIND_GPU",
            "count": 3,
            "gpus": [
                0
            ],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.plan",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "intra_op_thread_count": {
            "string_value": "4"
        },
        "inter_op_thread_count": {
            "string_value": "4"
        },
        "execution_mode": {
            "string_value": "0"
        }
    },
    "model_warmup": []
}
I0426 02:33:39.343297 1 tensorrt.cc:288] TRITONBACKEND_ModelInstanceInitialize: relevance_distil_bert3layer_0_0 (GPU device 0)
I0426 02:33:39.343418 1 backend_model_instance.cc:105] Creating instance relevance_distil_bert3layer_0_0 on GPU 0 (7.0) using artifact 'model.plan'
I0426 02:33:39.343517 1 instance_state.cc:256] Zero copy optimization is disabled
I0426 02:33:39.480353 1 logging.cc:46] Loaded engine size: 85 MiB
I0426 02:33:39.571531 1 logging.cc:49] Deserialization required 90942 microseconds.
I0426 02:33:39.571557 1 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +84, now: CPU 0, GPU 84 (MiB)
I0426 02:33:39.581548 1 model_state.cc:220] Created new runtime on GPU device 0, NVDLA core -1 for relevance_distil_bert3layer
I0426 02:33:39.581561 1 model_state.cc:227] Created new engine on GPU device 0, NVDLA core -1 for relevance_distil_bert3layer
I0426 02:33:39.581965 1 logging.cc:49] Total per-runner device persistent memory is 0
I0426 02:33:39.581972 1 logging.cc:49] Total per-runner host persistent memory is 32
I0426 02:33:39.582203 1 logging.cc:49] Allocated activation device memory of size 511285248
I0426 02:33:39.634158 1 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +488, now: CPU 0, GPU 572 (MiB)
I0426 02:33:39.634173 1 logging.cc:49] CUDA lazy loading is enabled.
I0426 02:33:39.634194 1 instance_state.cc:1797] Detected input_ids as execution binding for relevance_distil_bert3layer_0_0
I0426 02:33:39.634200 1 instance_state.cc:1797] Detected attention_mask as execution binding for relevance_distil_bert3layer_0_0
I0426 02:33:39.634205 1 instance_state.cc:1797] Detected token_type_ids as execution binding for relevance_distil_bert3layer_0_0
I0426 02:33:39.634210 1 instance_state.cc:1797] Detected logits as execution binding for relevance_distil_bert3layer_0_0
I0426 02:33:39.680524 1 instance_state.cc:3746] captured CUDA graph for relevance_distil_bert3layer_0_0, batch size 0
I0426 02:33:39.680545 1 instance_state.cc:188] Created instance relevance_distil_bert3layer_0_0 on GPU 0 with stream priority 0 and optimization profile default[0];
I0426 02:33:39.680677 1 backend_model_instance.cc:806] Starting backend thread for relevance_distil_bert3layer_0_0 at nice 0 on device 0...
I0426 02:33:39.680834 1 tensorrt.cc:288] TRITONBACKEND_ModelInstanceInitialize: relevance_distil_bert3layer_0_1 (GPU device 0)
I0426 02:33:39.680945 1 backend_model_instance.cc:105] Creating instance relevance_distil_bert3layer_0_1 on GPU 0 (7.0) using artifact 'model.plan'
I0426 02:33:39.681052 1 instance_state.cc:256] Zero copy optimization is disabled
I0426 02:33:39.817964 1 logging.cc:46] Loaded engine size: 85 MiB
I0426 02:33:39.908375 1 logging.cc:49] Deserialization required 90162 microseconds.
I0426 02:33:39.908403 1 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +84, now: CPU 0, GPU 656 (MiB)
I0426 02:33:39.918349 1 model_state.cc:227] Created new engine on GPU device 0, NVDLA core -1 for relevance_distil_bert3layer
I0426 02:33:39.918741 1 logging.cc:49] Total per-runner device persistent memory is 0
I0426 02:33:39.918754 1 logging.cc:49] Total per-runner host persistent memory is 32
I0426 02:33:39.919007 1 logging.cc:49] Allocated activation device memory of size 511285248
I0426 02:33:39.969935 1 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +488, now: CPU 0, GPU 1144 (MiB)
I0426 02:33:39.969951 1 logging.cc:49] CUDA lazy loading is enabled.
I0426 02:33:39.969963 1 instance_state.cc:1797] Detected input_ids as execution binding for relevance_distil_bert3layer_0_1
I0426 02:33:39.969967 1 instance_state.cc:1797] Detected attention_mask as execution binding for relevance_distil_bert3layer_0_1
I0426 02:33:39.969971 1 instance_state.cc:1797] Detected token_type_ids as execution binding for relevance_distil_bert3layer_0_1
I0426 02:33:39.969974 1 instance_state.cc:1797] Detected logits as execution binding for relevance_distil_bert3layer_0_1
I0426 02:33:39.974035 1 instance_state.cc:3746] captured CUDA graph for relevance_distil_bert3layer_0_1, batch size 0
I0426 02:33:39.974051 1 instance_state.cc:188] Created instance relevance_distil_bert3layer_0_1 on GPU 0 with stream priority 0 and optimization profile default[0];
I0426 02:33:39.974168 1 backend_model_instance.cc:806] Starting backend thread for relevance_distil_bert3layer_0_1 at nice 0 on device 0...
I0426 02:33:39.974303 1 tensorrt.cc:288] TRITONBACKEND_ModelInstanceInitialize: relevance_distil_bert3layer_0_2 (GPU device 0)
I0426 02:33:39.974406 1 backend_model_instance.cc:105] Creating instance relevance_distil_bert3layer_0_2 on GPU 0 (7.0) using artifact 'model.plan'
I0426 02:33:39.974508 1 instance_state.cc:256] Zero copy optimization is disabled
I0426 02:33:40.112753 1 logging.cc:46] Loaded engine size: 85 MiB
I0426 02:33:40.204935 1 logging.cc:49] Deserialization required 91919 microseconds.
I0426 02:33:40.204967 1 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +85, now: CPU 0, GPU 1229 (MiB)
I0426 02:33:40.214984 1 model_state.cc:227] Created new engine on GPU device 0, NVDLA core -1 for relevance_distil_bert3layer
I0426 02:33:40.215370 1 logging.cc:49] Total per-runner device persistent memory is 0
I0426 02:33:40.215377 1 logging.cc:49] Total per-runner host persistent memory is 32
I0426 02:33:40.215613 1 logging.cc:49] Allocated activation device memory of size 511285248
I0426 02:33:40.267325 1 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +487, now: CPU 0, GPU 1716 (MiB)
I0426 02:33:40.267346 1 logging.cc:49] CUDA lazy loading is enabled.
I0426 02:33:40.267361 1 instance_state.cc:1797] Detected input_ids as execution binding for relevance_distil_bert3layer_0_2
I0426 02:33:40.267365 1 instance_state.cc:1797] Detected attention_mask as execution binding for relevance_distil_bert3layer_0_2
I0426 02:33:40.267369 1 instance_state.cc:1797] Detected token_type_ids as execution binding for relevance_distil_bert3layer_0_2
I0426 02:33:40.267372 1 instance_state.cc:1797] Detected logits as execution binding for relevance_distil_bert3layer_0_2
I0426 02:33:40.271491 1 instance_state.cc:3746] captured CUDA graph for relevance_distil_bert3layer_0_2, batch size 0
I0426 02:33:40.271509 1 instance_state.cc:188] Created instance relevance_distil_bert3layer_0_2 on GPU 0 with stream priority 0 and optimization profile default[0];
I0426 02:33:40.271614 1 backend_model_instance.cc:806] Starting backend thread for relevance_distil_bert3layer_0_2 at nice 0 on device 0...
I0426 02:33:40.271777 1 model_lifecycle.cc:815] successfully loaded 'relevance_distil_bert3layer'
I0426 02:33:40.271839 1 server.cc:603] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0426 02:33:40.271906 1 server.cc:630] 
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend  | Path                                                      | Config                                                                                                                                                                                      |
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| pytorch  | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so   | {}                                                                                                                                                                                          |
| tensorrt | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","execution-policy":"BLOCKING","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0426 02:33:40.271937 1 server.cc:673] 
+-----------------------------+----------+--------+
| Model                       | Version  | Status |
+-----------------------------+----------+--------+
| relevance_distil_bert3layer | 20240125 | READY  |
+-----------------------------+----------+--------+

I0426 02:33:40.317589 1 metrics.cc:808] Collecting metrics for GPU 0: Tesla V100-SXM2-32GB
I0426 02:33:40.317799 1 metrics.cc:701] Collecting CPU metrics
I0426 02:33:40.317958 1 tritonserver.cc:2385] 
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.35.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /models                                                                                                                                                                                                         |
| model_control_mode               | MODE_POLL                                                                                                                                                                                                       |
| strict_model_config              | 0                                                                                                                                                                                                               |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0426 02:33:40.318453 1 grpc_server.cc:2339] 
+----------------------------------------------+---------+
| GRPC KeepAlive Option                        | Value   |
+----------------------------------------------+---------+
| keepalive_time_ms                            | 7200000 |
| keepalive_timeout_ms                         | 20000   |
| keepalive_permit_without_calls               | 0       |
| http2_max_pings_without_data                 | 2       |
| http2_min_recv_ping_interval_without_data_ms | 300000  |
| http2_max_ping_strikes                       | 2       |
+----------------------------------------------+---------+

I0426 02:33:40.318958 1 grpc_server.cc:99] Ready for RPC 'Check', 0
I0426 02:33:40.318985 1 grpc_server.cc:99] Ready for RPC 'ServerLive', 0
I0426 02:33:40.318991 1 grpc_server.cc:99] Ready for RPC 'ServerReady', 0
I0426 02:33:40.318995 1 grpc_server.cc:99] Ready for RPC 'ModelReady', 0
I0426 02:33:40.319001 1 grpc_server.cc:99] Ready for RPC 'ServerMetadata', 0
I0426 02:33:40.319008 1 grpc_server.cc:99] Ready for RPC 'ModelMetadata', 0
I0426 02:33:40.319014 1 grpc_server.cc:99] Ready for RPC 'ModelConfig', 0
I0426 02:33:40.319020 1 grpc_server.cc:99] Ready for RPC 'SystemSharedMemoryStatus', 0
I0426 02:33:40.319026 1 grpc_server.cc:99] Ready for RPC 'SystemSharedMemoryRegister', 0
I0426 02:33:40.319032 1 grpc_server.cc:99] Ready for RPC 'SystemSharedMemoryUnregister', 0
I0426 02:33:40.319038 1 grpc_server.cc:99] Ready for RPC 'CudaSharedMemoryStatus', 0
I0426 02:33:40.319042 1 grpc_server.cc:99] Ready for RPC 'CudaSharedMemoryRegister', 0
I0426 02:33:40.319048 1 grpc_server.cc:99] Ready for RPC 'CudaSharedMemoryUnregister', 0
I0426 02:33:40.319055 1 grpc_server.cc:99] Ready for RPC 'RepositoryIndex', 0
I0426 02:33:40.319061 1 grpc_server.cc:99] Ready for RPC 'RepositoryModelLoad', 0
I0426 02:33:40.319065 1 grpc_server.cc:99] Ready for RPC 'RepositoryModelUnload', 0
I0426 02:33:40.319071 1 grpc_server.cc:99] Ready for RPC 'ModelStatistics', 0
I0426 02:33:40.319077 1 grpc_server.cc:99] Ready for RPC 'Trace', 0
I0426 02:33:40.319083 1 grpc_server.cc:99] Ready for RPC 'Logging', 0
I0426 02:33:40.319108 1 grpc_server.cc:348] Thread started for CommonHandler
I0426 02:33:40.319191 1 infer_handler.cc:693] New request handler for ModelInferHandler, 0
I0426 02:33:40.319216 1 infer_handler.h:1046] Thread started for ModelInferHandler
I0426 02:33:40.319290 1 infer_handler.cc:693] New request handler for ModelInferHandler, 0
I0426 02:33:40.319315 1 infer_handler.h:1046] Thread started for ModelInferHandler
I0426 02:33:40.319385 1 stream_infer_handler.cc:127] New request handler for ModelStreamInferHandler, 0
I0426 02:33:40.319408 1 infer_handler.h:1046] Thread started for ModelStreamInferHandler
I0426 02:33:40.319412 1 grpc_server.cc:2445] Started GRPCInferenceService at 0.0.0.0:8001
I0426 02:33:40.319618 1 http_server.cc:3555] Started HTTPService at 0.0.0.0:8000
I0426 02:33:40.360609 1 http_server.cc:185] Started Metrics Service at 0.0.0.0:8002
I0426 02:33:40.365821 1 server.cc:374] Polling model repository
I0426 02:33:44.202919 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:33:51.395017 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:33:53.402431 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:33:55.376807 1 server.cc:374] Polling model repository
I0426 02:33:55.408704 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:33:58.223462 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:34:04.201442 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:34:10.392861 1 server.cc:374] Polling model repository
I0426 02:34:18.223519 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:34:21.395169 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:34:23.401222 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:34:24.201572 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:34:25.404892 1 server.cc:374] Polling model repository
I0426 02:34:25.407563 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:34:38.223630 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:34:40.418995 1 server.cc:374] Polling model repository
I0426 02:34:42.463945 1 http_server.cc:3449] HTTP request: 2 /v2/models/relevance_distil_bert3layer/versions/20240125/infer
I0426 02:34:42.465770 1 infer_request.cc:751] [request id: 1] prepared: [0x0x7f6d80004d40] request id: 1, model: relevance_distil_bert3layer, requested version: 20240125, actual version: 20240125, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f6d80003a28] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80003378] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d800053e8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
override inputs:
inputs:
[0x0x7f6d800053e8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80003378] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80003a28] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
original requested outputs:
requested outputs:
logits

I0426 02:34:42.465876 1 tensorrt.cc:381] model relevance_distil_bert3layer, instance relevance_distil_bert3layer_0_0, executing 1 requests
I0426 02:34:42.465897 1 instance_state.cc:360] TRITONBACKEND_ModelExecute: Issuing relevance_distil_bert3layer_0_0 with 1 requests
I0426 02:34:42.465902 1 instance_state.cc:409] TRITONBACKEND_ModelExecute: Running relevance_distil_bert3layer_0_0 with 1 requests
I0426 02:34:42.465955 1 instance_state.cc:1437] Optimization profile default [0] is selected for relevance_distil_bert3layer_0_0
I0426 02:34:42.465983 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e54000090
I0426 02:34:42.466034 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540032a0
I0426 02:34:42.466053 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540064b0
I0426 02:34:42.466077 1 instance_state.cc:900] Context with profile default [0] is being executed for relevance_distil_bert3layer_0_0
I0426 02:34:42.466384 1 infer_response.cc:167] add response output: output: logits, type: FP32, shape: [25,5]
I0426 02:34:42.466395 1 http_server.cc:1101] HTTP: unable to provide 'logits' in GPU, will use CPU
I0426 02:34:42.466402 1 http_server.cc:1121] HTTP using buffer for: 'logits', size: 500, addr: 0x7f6e44008610
I0426 02:34:42.466409 1 pinned_memory_manager.cc:161] pinned memory allocation: size 500, addr 0x7f6e540096c0
I0426 02:34:42.469031 1 http_server.cc:1195] HTTP release: size 500, addr 0x7f6e44008610
I0426 02:34:42.469068 1 instance_state.cc:1294] TRITONBACKEND_ModelExecute: model relevance_distil_bert3layer_0_0 released 1 requests
I0426 02:34:42.469076 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540096c0
I0426 02:34:42.469094 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e54000090
I0426 02:34:42.469099 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540032a0
I0426 02:34:42.469104 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540064b0
I0426 02:34:44.201598 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:34:51.394663 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:34:53.400596 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:34:55.408256 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:34:55.431574 1 server.cc:374] Polling model repository
I0426 02:34:58.223529 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:35:00.663151 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:00.663247 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:01.663843 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:01.663947 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:02.665415 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:02.666906 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:03.667317 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:03.667372 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:04.201815 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:35:04.668259 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:04.668349 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:05.669003 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:05.669067 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:06.669538 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:06.669594 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:07.670177 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:07.671670 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:08.672065 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:08.672122 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:09.672477 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:09.672540 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:10.447313 1 server.cc:374] Polling model repository
I0426 02:35:10.672921 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:10.672986 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:11.673341 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:11.673398 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:12.674220 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:12.674322 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:13.675467 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:13.677511 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:14.677983 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:14.678081 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:15.678377 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:15.678477 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:16.678883 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:16.678949 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:17.679866 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:17.681380 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:18.223516 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:35:18.681780 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:18.681837 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:19.682190 1 http_server.cc:3449] HTTP request: 0 /v2/models/product-search-ltr-dnn-model-v1
I0426 02:35:19.682268 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:20.722343 1 http_server.cc:3449] HTTP request: 0 /v2/models/relevance_distil_bert3layer
I0426 02:35:21.395131 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:35:23.401441 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:35:24.201218 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:35:25.408940 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:35:25.458183 1 server.cc:374] Polling model repository
I0426 02:35:38.223624 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:35:40.468503 1 server.cc:374] Polling model repository
I0426 02:35:44.201183 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:35:51.395171 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:35:53.401414 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:35:55.408792 1 http_server.cc:3449] HTTP request: 0 /v2/models/triton-product-category-rank-model-v6
I0426 02:35:55.482191 1 server.cc:374] Polling model repository
I0426 02:35:57.068948 1 http_server.cc:3449] HTTP request: 2 /v2/models/relevance_distil_bert3layer/versions/20240125/infer
I0426 02:35:57.069944 1 infer_request.cc:751] [request id: 1] prepared: [0x0x7f6d80010fa0] request id: 1, model: relevance_distil_bert3layer, requested version: 20240125, actual version: 20240125, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f6d800115c8] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80002188] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80004bc8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
override inputs:
inputs:
[0x0x7f6d80004bc8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80002188] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d800115c8] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
original requested outputs:
requested outputs:
logits

I0426 02:35:57.070030 1 tensorrt.cc:381] model relevance_distil_bert3layer, instance relevance_distil_bert3layer_0_1, executing 1 requests
I0426 02:35:57.070050 1 instance_state.cc:360] TRITONBACKEND_ModelExecute: Issuing relevance_distil_bert3layer_0_1 with 1 requests
I0426 02:35:57.070055 1 instance_state.cc:409] TRITONBACKEND_ModelExecute: Running relevance_distil_bert3layer_0_1 with 1 requests
I0426 02:35:57.070104 1 instance_state.cc:1437] Optimization profile default [0] is selected for relevance_distil_bert3layer_0_1
I0426 02:35:57.070127 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e54000090
I0426 02:35:57.070164 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540032a0
I0426 02:35:57.070182 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540064b0
I0426 02:35:57.070211 1 instance_state.cc:900] Context with profile default [0] is being executed for relevance_distil_bert3layer_0_1
I0426 02:35:57.070467 1 infer_response.cc:167] add response output: output: logits, type: FP32, shape: [25,5]
I0426 02:35:57.070478 1 http_server.cc:1101] HTTP: unable to provide 'logits' in GPU, will use CPU
I0426 02:35:57.070485 1 http_server.cc:1121] HTTP using buffer for: 'logits', size: 500, addr: 0x7f6e3c008610
I0426 02:35:57.070492 1 pinned_memory_manager.cc:161] pinned memory allocation: size 500, addr 0x7f6e540096c0
I0426 02:35:57.073122 1 http_server.cc:1195] HTTP release: size 500, addr 0x7f6e3c008610
I0426 02:35:57.073150 1 instance_state.cc:1294] TRITONBACKEND_ModelExecute: model relevance_distil_bert3layer_0_1 released 1 requests
I0426 02:35:57.073155 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540096c0
I0426 02:35:57.073163 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e54000090
I0426 02:35:57.073168 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540032a0
I0426 02:35:57.073173 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540064b0
I0426 02:35:57.906319 1 http_server.cc:3449] HTTP request: 2 /v2/models/relevance_distil_bert3layer/versions/20240125/infer
I0426 02:35:57.907357 1 infer_request.cc:751] [request id: 1] prepared: [0x0x7f6d80011960] request id: 1, model: relevance_distil_bert3layer, requested version: 20240125, actual version: 20240125, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f6d80012158] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80011fc8] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80011ea8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
override inputs:
inputs:
[0x0x7f6d80011ea8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80011fc8] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80012158] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
original requested outputs:
requested outputs:
logits

I0426 02:35:57.907442 1 tensorrt.cc:381] model relevance_distil_bert3layer, instance relevance_distil_bert3layer_0_2, executing 1 requests
I0426 02:35:57.907461 1 instance_state.cc:360] TRITONBACKEND_ModelExecute: Issuing relevance_distil_bert3layer_0_2 with 1 requests
I0426 02:35:57.907466 1 instance_state.cc:409] TRITONBACKEND_ModelExecute: Running relevance_distil_bert3layer_0_2 with 1 requests
I0426 02:35:57.907523 1 instance_state.cc:1437] Optimization profile default [0] is selected for relevance_distil_bert3layer_0_2
I0426 02:35:57.907546 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e54000090
I0426 02:35:57.907581 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540032a0
I0426 02:35:57.907600 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540064b0
I0426 02:35:57.907619 1 instance_state.cc:900] Context with profile default [0] is being executed for relevance_distil_bert3layer_0_2
I0426 02:35:57.907901 1 infer_response.cc:167] add response output: output: logits, type: FP32, shape: [25,5]
I0426 02:35:57.907913 1 http_server.cc:1101] HTTP: unable to provide 'logits' in GPU, will use CPU
I0426 02:35:57.907920 1 http_server.cc:1121] HTTP using buffer for: 'logits', size: 500, addr: 0x7f6e0c008610
I0426 02:35:57.907928 1 pinned_memory_manager.cc:161] pinned memory allocation: size 500, addr 0x7f6e540096c0
I0426 02:35:57.910527 1 http_server.cc:1195] HTTP release: size 500, addr 0x7f6e0c008610
I0426 02:35:57.910555 1 instance_state.cc:1294] TRITONBACKEND_ModelExecute: model relevance_distil_bert3layer_0_2 released 1 requests
I0426 02:35:57.910560 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540096c0
I0426 02:35:57.910567 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e54000090
I0426 02:35:57.910570 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540032a0
I0426 02:35:57.910573 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540064b0
I0426 02:35:58.223819 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:35:58.562130 1 http_server.cc:3449] HTTP request: 2 /v2/models/relevance_distil_bert3layer/versions/20240125/infer
I0426 02:35:58.565215 1 infer_request.cc:751] [request id: 1] prepared: [0x0x7f6d80012580] request id: 1, model: relevance_distil_bert3layer, requested version: 20240125, actual version: 20240125, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f6d80007128] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80006f68] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80012a68] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
override inputs:
inputs:
[0x0x7f6d80012a68] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80006f68] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80007128] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
original requested outputs:
requested outputs:
logits

I0426 02:35:58.565294 1 tensorrt.cc:381] model relevance_distil_bert3layer, instance relevance_distil_bert3layer_0_0, executing 1 requests
I0426 02:35:58.565312 1 instance_state.cc:360] TRITONBACKEND_ModelExecute: Issuing relevance_distil_bert3layer_0_0 with 1 requests
I0426 02:35:58.565318 1 instance_state.cc:409] TRITONBACKEND_ModelExecute: Running relevance_distil_bert3layer_0_0 with 1 requests
I0426 02:35:58.565354 1 instance_state.cc:1437] Optimization profile default [0] is selected for relevance_distil_bert3layer_0_0
I0426 02:35:58.565387 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e54000090
I0426 02:35:58.565423 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540032a0
I0426 02:35:58.565440 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540064b0
I0426 02:35:58.565460 1 instance_state.cc:900] Context with profile default [0] is being executed for relevance_distil_bert3layer_0_0
I0426 02:35:58.565714 1 infer_response.cc:167] add response output: output: logits, type: FP32, shape: [25,5]
I0426 02:35:58.565724 1 http_server.cc:1101] HTTP: unable to provide 'logits' in GPU, will use CPU
I0426 02:35:58.565739 1 http_server.cc:1121] HTTP using buffer for: 'logits', size: 500, addr: 0x7f6e440098e0
I0426 02:35:58.565746 1 pinned_memory_manager.cc:161] pinned memory allocation: size 500, addr 0x7f6e540096c0
I0426 02:35:58.568341 1 http_server.cc:1195] HTTP release: size 500, addr 0x7f6e440098e0
I0426 02:35:58.568372 1 instance_state.cc:1294] TRITONBACKEND_ModelExecute: model relevance_distil_bert3layer_0_0 released 1 requests
I0426 02:35:58.568378 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540096c0
I0426 02:35:58.568384 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e54000090
I0426 02:35:58.568387 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540032a0
I0426 02:35:58.568391 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540064b0
I0426 02:35:59.153891 1 http_server.cc:3449] HTTP request: 2 /v2/models/relevance_distil_bert3layer/versions/20240125/infer
I0426 02:35:59.154983 1 infer_request.cc:751] [request id: 1] prepared: [0x0x7f6d800075d0] request id: 1, model: relevance_distil_bert3layer, requested version: 20240125, actual version: 20240125, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f6d8000dfd8] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d800072b8] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d80007ab8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
override inputs:
inputs:
[0x0x7f6d80007ab8] input: input_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d800072b8] input: attention_mask, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
[0x0x7f6d8000dfd8] input: token_type_ids, type: INT32, original shape: [25,128], batch + shape: [25,128], shape: [25,128]
original requested outputs:
requested outputs:
logits

I0426 02:35:59.155044 1 tensorrt.cc:381] model relevance_distil_bert3layer, instance relevance_distil_bert3layer_0_1, executing 1 requests
I0426 02:35:59.155054 1 instance_state.cc:360] TRITONBACKEND_ModelExecute: Issuing relevance_distil_bert3layer_0_1 with 1 requests
I0426 02:35:59.155059 1 instance_state.cc:409] TRITONBACKEND_ModelExecute: Running relevance_distil_bert3layer_0_1 with 1 requests
I0426 02:35:59.155092 1 instance_state.cc:1437] Optimization profile default [0] is selected for relevance_distil_bert3layer_0_1
I0426 02:35:59.155120 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e54000090
I0426 02:35:59.155154 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540032a0
I0426 02:35:59.155172 1 pinned_memory_manager.cc:161] pinned memory allocation: size 12800, addr 0x7f6e540064b0
I0426 02:35:59.155203 1 instance_state.cc:900] Context with profile default [0] is being executed for relevance_distil_bert3layer_0_1
I0426 02:35:59.155456 1 infer_response.cc:167] add response output: output: logits, type: FP32, shape: [25,5]
I0426 02:35:59.155467 1 http_server.cc:1101] HTTP: unable to provide 'logits' in GPU, will use CPU
I0426 02:35:59.155474 1 http_server.cc:1121] HTTP using buffer for: 'logits', size: 500, addr: 0x7f6e3c0098d0
I0426 02:35:59.155481 1 pinned_memory_manager.cc:161] pinned memory allocation: size 500, addr 0x7f6e540096c0
I0426 02:35:59.158098 1 http_server.cc:1195] HTTP release: size 500, addr 0x7f6e3c0098d0
I0426 02:35:59.158128 1 instance_state.cc:1294] TRITONBACKEND_ModelExecute: model relevance_distil_bert3layer_0_1 released 1 requests
I0426 02:35:59.158133 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540096c0
I0426 02:35:59.158139 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e54000090
I0426 02:35:59.158143 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540032a0
I0426 02:35:59.158154 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f6e540064b0
I0426 02:36:04.200829 1 http_server.cc:139] HTTP request: 0 /metrics
I0426 02:36:10.500690 1 server.cc:374] Polling model repository

SunnyGhj commented on June 11, 2024

https://github.com/triton-inference-server/tensorrt_backend/blob/5c881ce8f74988deedc473bb78a9417ffc650757/src/instance_state.cc#L3817
According to the code above, when max_batch_size == 0 and graph_spec.batch_size == 0, the first element of cuda_graph_key is set to 0. This seems to be a bug: the first element of cuda_graph_key should be set to 1. During inference the first element of input_dims is set to 1, i.e. [1, ...], which does not match the key captured earlier, so the CUDA graph cannot be found; refer to the code linked below (a simplified sketch of the mismatch follows the links).
https://github.com/triton-inference-server/tensorrt_backend/blob/5c881ce8f74988deedc473bb78a9417ffc650757/src/instance_state.cc#L563
https://github.com/triton-inference-server/tensorrt_backend/blob/5c881ce8f74988deedc473bb78a9417ffc650757/src/instance_state.cc#L3244
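
To make the mismatch concrete, here is a minimal, self-contained C++ sketch of the behaviour described above, under the stated assumption that the capture-time key starts with graph_spec.batch_size (0 in this configuration) while the key built at inference time starts with 1. The names BuildCaptureKey and BuildLookupKey and the exact key layout are invented for illustration and are not the actual tensorrt_backend symbols; only the first-element discrepancy reflects the report.

// Illustrative sketch only; the real logic lives in instance_state.cc (links above).
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

using GraphKey = std::vector<int64_t>;

// Key built when the CUDA graph is captured: with max_batch_size == 0 and
// graph_spec.batch_size == 0, the first element ends up being 0.
GraphKey BuildCaptureKey(int64_t graph_spec_batch_size,
                         const std::vector<int64_t>& spec_dims) {
  GraphKey key{graph_spec_batch_size};
  key.insert(key.end(), spec_dims.begin(), spec_dims.end());
  return key;  // {0, 25, 128} for the config in this issue
}

// Key built at request time: the batch-size slot is set to 1 for a
// non-batching model, so the lookup key starts with 1.
GraphKey BuildLookupKey(const std::vector<int64_t>& input_dims) {
  GraphKey key{1};
  key.insert(key.end(), input_dims.begin(), input_dims.end());
  return key;  // {1, 25, 128} for the request shown in the log
}

int main() {
  std::map<GraphKey, const char*> captured_graphs;

  // Capture phase: graph_spec { dim: 25  dim: 128 } with batch_size 0.
  captured_graphs[BuildCaptureKey(0, {25, 128})] = "graph for [25,128]";

  // Inference phase: request with shape [25,128] on a max_batch_size == 0 model.
  const bool found =
      captured_graphs.find(BuildLookupKey({25, 128})) != captured_graphs.end();
  std::cout << (found ? "CUDA graph found" : "CUDA graph NOT found") << '\n';
  // Prints "CUDA graph NOT found": the keys differ only in the first element.
  // Forcing the capture key's first element to 1 (the suggested fix) would
  // make the lookup succeed.
  return 0;
}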

@tanmayv25
