# Transformer inference on TensorRT with INT8 precision

This repository contains inference examples and accuracy validation for quantized transformer TensorRT models.

All ONNX models are published on Hugging Face 🤗. The example notebooks automatically download the appropriate ONNX model and build the TensorRT engine.
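Outside the notebooks, a rough sketch of the engine-build step is a single `trtexec` invocation; the file names below are placeholders, not the actual paths used by this repository:

```shell
# Build a TensorRT engine with INT8 kernels enabled (FP32 fallback stays
# available for layers without quantization info).
# model.onnx / model.plan are placeholder names.
trtexec --onnx=model.onnx \
        --int8 \
        --saveEngine=model.plan
```

For an ONNX model that already carries Q/DQ quantization nodes, `--int8` lets TensorRT use the scales embedded in the model rather than a separate calibration step.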
|                 | TensorRT INT8+FP32 | torch FP16 |
|-----------------|--------------------|------------|
| Lambada Acc     | 72.11%             | 71.43%     |
| Model size (GB) | 2.0                | 3.2        |
|                 | TensorRT INT8+FP32 | torch FP16 | torch FP32 |
|-----------------|--------------------|------------|------------|
| Lambada Acc     | 78.46%             | 79.53%     | -          |
| Model size (GB) | 8.5                | 12.1       | 24.2       |
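The Lambada scores above are last-word prediction accuracy: the model reads a passage and must predict its final word exactly. A minimal sketch of the metric (the function name and data here are illustrative, not part of this repository):

```python
def lambada_accuracy(predictions, references):
    """Fraction of passages whose final word is predicted exactly."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Illustrative data: the model got 1 of 2 final words right
print(lambada_accuracy(["day", "cat"], ["day", "dog"]))  # 0.5
```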
Benchmark environment:
- GPU: RTX 4090
- CPU: 11th Gen Intel(R) Core(TM) i7-11700K
- TensorRT 8.5.3.1
- pytorch 1.13.1+cu116
| Input sequence length | Number of generated tokens | TensorRT INT8+FP32, ms | torch FP16, ms | Acceleration |
|-----------------------|----------------------------|------------------------|----------------|--------------|
| 64                    | 64                         | 462                    | 1190           | 2.58         |
| 64                    | 128                        | 920                    | 2360           | 2.54         |
| 64                    | 256                        | 1890                   | 4710           | 2.54         |
| Input sequence length | Number of generated tokens | TensorRT INT8+FP32, ms | torch FP16, ms | Acceleration |
|-----------------------|----------------------------|------------------------|----------------|--------------|
| 64                    | 64                         | 1040                   | 1610           | 1.55         |
| 64                    | 128                        | 2089                   | 3224           | 1.54         |
| 64                    | 256                        | 4236                   | 6479           | 1.53         |
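The Acceleration column is the ratio of torch FP16 latency to TensorRT INT8+FP32 latency. A quick check against the rows of the second latency table (the helper name is illustrative):

```python
def speedup(torch_ms, trt_ms):
    # Acceleration = torch latency / TensorRT latency, rounded to 2 decimals
    return round(torch_ms / trt_ms, 2)

# (torch FP16 ms, TensorRT INT8+FP32 ms) rows of the second latency table
for torch_ms, trt_ms in [(1610, 1040), (3224, 2089), (6479, 4236)]:
    print(speedup(torch_ms, trt_ms))  # 1.55, 1.54, 1.53
```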