Hello @Nipi64310, Forward has been adapted to TensorRT 8 in Release v2.0.0, and it remains compatible with TensorRT 7. We will keep optimizing in other areas as well, so please be patient, and feel free to open issues :)
The issue above is a cuDNN installation problem; see pytorch/pytorch#41593 for reference.
cmake succeeded, but `make -j` failed:
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- The CUDA compiler identification is NVIDIA 11.1.105
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
CMake Warning at CMakeLists.txt:69 (message):
ENABLE_TORCH_PLUGIN=ON, TORCH_PLUGIN NOT SUPPORT dynamic batch.
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found version "11.1")
-- CUDA_NVCC_FLAGS: -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75
-- Using the single-header code from /home/soft/wp/Forward/source/third_party/json/single_include/
-- Found TensorRT: /home/soft/wp/TensorRT-8.2.0.6/lib/libnvinfer.so;/home/soft/wp/TensorRT-8.2.0.6/lib/libnvinfer_plugin.so;/home/soft/wp/TensorRT-8.2.0.6/lib/libnvonnxparser.so;/home/soft/wp/TensorRT-8.2.0.6/lib/libnvparsers.so (found version "8.2.0")
-- Found CUDA: /usr/local/cuda (found version "11.1")
-- Caffe2: CUDA detected: 11.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.1
-- Found CUDNN: /usr/local/cuda/lib64/libcudnn.so
-- Found cuDNN: v8.2.1 (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn.so)
-- /usr/local/cuda/lib64/libnvrtc.so shorthash is 3a20f2b6
-- Added CUDA NVCC flags for: -gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
-- Found Torch: /home/soft/wp/libtorch/lib/libtorch.so
-- Find Torch VERSION: 1.9.1
-- TORCH_HAS_CUDA, TORCH_CUDA_LIBRARIES = /usr/local/cuda/lib64/stubs/libcuda.so;/usr/local/cuda/lib64/libnvrtc.so;/usr/local/cuda/lib64/libnvToolsExt.so;/usr/local/cuda/lib64/libcudart.so;/home/soft/wp/libtorch/lib/libc10_cuda.so
-- Configuring done
-- Generating done
-- Build files have been written to: /home/soft/wp/Forward/build
`make -j` error output:
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(187): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::getPluginNamespace" is incompatible with that of overridden function "nvinfer1::IPluginV2::getPluginNamespace"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(186): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::setPluginNamespace" is incompatible with that of overridden function "nvinfer1::IPluginV2::setPluginNamespace"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(185): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::destroy" is incompatible with that of overridden function "nvinfer1::IPluginV2::destroy"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(184): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::serialize" is incompatible with that of overridden function "nvinfer1::IPluginV2::serialize"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(183): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::getSerializationSize" is incompatible with that of overridden function "nvinfer1::IPluginV2::getSerializationSize"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(182): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::terminate" is incompatible with that of overridden function "nvinfer1::IPluginV2::terminate"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(181): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::initialize" is incompatible with that of overridden function "nvinfer1::IPluginV2::initialize"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(180): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::getNbOutputs" is incompatible with that of overridden function "nvinfer1::IPluginV2::getNbOutputs"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(179): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::getPluginVersion" is incompatible with that of overridden function "nvinfer1::IPluginV2::getPluginVersion"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(178): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::getPluginType" is incompatible with that of overridden function "nvinfer1::IPluginV2::getPluginType"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(174): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::getOutputDataType" is incompatible with that of overridden function "nvinfer1::IPluginV2Ext::getOutputDataType"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(169): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::enqueue" is incompatible with that of overridden function "nvinfer1::IPluginV2DynamicExt::enqueue(const nvinfer1::PluginTensorDesc *, const nvinfer1::PluginTensorDesc *, const void *const *, void *const *, void *, cudaStream_t)"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(167): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::getWorkspaceSize" is incompatible with that of overridden function "nvinfer1::IPluginV2DynamicExt::getWorkspaceSize(const nvinfer1::PluginTensorDesc *, int32_t, const nvinfer1::PluginTensorDesc *, int32_t) const"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(165): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::configurePlugin" is incompatible with that of overridden function "nvinfer1::IPluginV2DynamicExt::configurePlugin(const nvinfer1::DynamicPluginTensorDesc *, int32_t, const nvinfer1::DynamicPluginTensorDesc *, int32_t)"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(163): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::supportsFormatCombination" is incompatible with that of overridden function "nvinfer1::IPluginV2DynamicExt::supportsFormatCombination"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(160): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::getOutputDimensions" is incompatible with that of overridden function "nvinfer1::IPluginV2DynamicExt::getOutputDimensions(int32_t, const nvinfer1::DimsExprs *, int32_t, nvinfer1::IExprBuilder &)"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(159): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::clone" is incompatible with that of overridden function "nvinfer1::IPluginV2DynamicExt::clone"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(159): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::clone" is incompatible with that of overridden function "nvinfer1::IPluginV2Ext::clone"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(159): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPlugin::clone" is incompatible with that of overridden function "nvinfer1::IPluginV2::clone"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(240): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPluginCreator::getPluginNamespace" is incompatible with that of overridden function "nvinfer1::IPluginCreator::getPluginNamespace"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(238): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPluginCreator::setPluginNamespace" is incompatible with that of overridden function "nvinfer1::IPluginCreator::setPluginNamespace"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(235): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPluginCreator::deserializePlugin" is incompatible with that of overridden function "nvinfer1::IPluginCreator::deserializePlugin"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(232): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPluginCreator::createPlugin" is incompatible with that of overridden function "nvinfer1::IPluginCreator::createPlugin"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(230): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPluginCreator::getFieldNames" is incompatible with that of overridden function "nvinfer1::IPluginCreator::getFieldNames"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(228): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPluginCreator::getPluginVersion" is incompatible with that of overridden function "nvinfer1::IPluginCreator::getPluginVersion"
/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_plugin.h(226): error: exception specification for virtual function "fwd::bert::SkipLayerNormVarSeqlenPluginCreator::getPluginName" is incompatible with that of overridden function "nvinfer1::IPluginCreator::getPluginName"
35 errors detected in the compilation of "/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/emb_layer_norm_plugin/emb_layer_norm_kernel.cu".
35 errors detected in the compilation of "/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/gelu_plugin/gelu_kernel.cu".
CMake Error at trt_engine_generated_emb_layer_norm_kernel.cu.o.cmake:281 (message):
Error generating file
/home/soft/wp/Forward/build/source/trt_engine/CMakeFiles/trt_engine.dir/trt_network_crt/plugins/emb_layer_norm_plugin/./trt_engine_generated_emb_layer_norm_kernel.cu.o
source/trt_engine/CMakeFiles/trt_engine.dir/build.make:105: recipe for target 'source/trt_engine/CMakeFiles/trt_engine.dir/trt_network_crt/plugins/emb_layer_norm_plugin/trt_engine_generated_emb_layer_norm_kernel.cu.o' failed
make[2]: *** [source/trt_engine/CMakeFiles/trt_engine.dir/trt_network_crt/plugins/emb_layer_norm_plugin/trt_engine_generated_emb_layer_norm_kernel.cu.o] Error 1
CMake Error at trt_engine_generated_gelu_kernel.cu.o.cmake:281 (message):
Error generating file
/home/soft/wp/Forward/build/source/trt_engine/CMakeFiles/trt_engine.dir/trt_network_crt/plugins/gelu_plugin/./trt_engine_generated_gelu_kernel.cu.o
source/trt_engine/CMakeFiles/trt_engine.dir/build.make:112: recipe for target 'source/trt_engine/CMakeFiles/trt_engine.dir/trt_network_crt/plugins/gelu_plugin/trt_engine_generated_gelu_kernel.cu.o' failed
make[2]: *** [source/trt_engine/CMakeFiles/trt_engine.dir/trt_network_crt/plugins/gelu_plugin/trt_engine_generated_gelu_kernel.cu.o] Error 1
68 errors detected in the compilation of "/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/qkv_to_context_plugin/qkv_to_context.cu".
CMake Error at trt_engine_generated_qkv_to_context.cu.o.cmake:281 (message):
Error generating file
/home/soft/wp/Forward/build/source/trt_engine/CMakeFiles/trt_engine.dir/trt_network_crt/plugins/qkv_to_context_plugin/./trt_engine_generated_qkv_to_context.cu.o
source/trt_engine/CMakeFiles/trt_engine.dir/build.make:77: recipe for target 'source/trt_engine/CMakeFiles/trt_engine.dir/trt_network_crt/plugins/qkv_to_context_plugin/trt_engine_generated_qkv_to_context.cu.o' failed
make[2]: *** [source/trt_engine/CMakeFiles/trt_engine.dir/trt_network_crt/plugins/qkv_to_context_plugin/trt_engine_generated_qkv_to_context.cu.o] Error 1
68 errors detected in the compilation of "/home/soft/wp/Forward/source/trt_engine/trt_network_crt/plugins/skip_layer_norm_plugin/skip_layer_norm_kernel.cu".
CMake Error at trt_engine_generated_skip_layer_norm_kernel.cu.o.cmake:281 (message):
Error generating file
/home/soft/wp/Forward/build/source/trt_engine/CMakeFiles/trt_engine.dir/trt_network_crt/plugins/skip_layer_norm_plugin/./trt_engine_generated_skip_layer_norm_kernel.cu.o
source/trt_engine/CMakeFiles/trt_engine.dir/build.make:154: recipe for target 'source/trt_engine/CMakeFiles/trt_engine.dir/trt_network_crt/plugins/skip_layer_norm_plugin/trt_engine_generated_skip_layer_norm_kernel.cu.o' failed
make[2]: *** [source/trt_engine/CMakeFiles/trt_engine.dir/trt_network_crt/plugins/skip_layer_norm_plugin/trt_engine_generated_skip_layer_norm_kernel.cu.o] Error 1
CMakeFiles/Makefile2:219: recipe for target 'source/trt_engine/CMakeFiles/trt_engine.dir/all' failed
make[1]: *** [source/trt_engine/CMakeFiles/trt_engine.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
Reinstalling TensorRT 7.2.3.4 made the build succeed; the 8.x versions fail with the errors above.
However, `import forward` fails:
ImportError: /home/soft/download/Forward/build/bin/forward.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN5torch3jit8toIValueEN8pybind116handleERKSt10shared_ptrIN3c104TypeEENS4_8optionalIiEE
The build output directory contains these files (`ls bin/`):
forward.cpython-38-x86_64-linux-gnu.so libfwd_torch.so libsimple-utils.a libtrt_engine.so
The cmake command:
cmake .. -DTensorRT_ROOT=/home/soft/download/TensorRT-7.2.3.4 -DENABLE_TORCH=ON -DENABLE_TORCH_PLUGIN=ON -DPYTHON_EXECUTABLE=/usr/local/bin/python -DBUILD_PYTHON_LIB=ON
The cmake log:
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- The CUDA compiler identification is NVIDIA 11.1.105
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
CMake Deprecation Warning at CMakeLists.txt:28 (cmake_policy):
The OLD behavior for policy CMP0074 will be removed from a future version
of CMake.
The cmake-policies(7) manual explains that the OLD behaviors of all
policies are deprecated and that a policy should be set to OLD only under
specific short-term circumstances. Projects should be ported to the NEW
behavior and not rely on setting a policy to OLD.
CMake Warning at CMakeLists.txt:69 (message):
ENABLE_TORCH_PLUGIN=ON, TORCH_PLUGIN NOT SUPPORT dynamic batch.
CMake Warning at CMakeLists.txt:96 (message):
_GLIBCXX_USE_CXX11_ABI=0 is set for PyTorch libraries. Check dependencies
for this flag.
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found version "11.1")
-- CUDA_NVCC_FLAGS: -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75
-- Using the single-header code from /home/soft/download/Forward/source/third_party/json/single_include/
-- Found TensorRT: /home/soft/download/TensorRT-7.2.3.4/lib/libnvinfer.so;/home/soft/download/TensorRT-7.2.3.4/lib/libnvinfer_plugin.so;/home/soft/download/TensorRT-7.2.3.4/lib/libnvonnxparser.so;/home/soft/download/TensorRT-7.2.3.4/lib/libnvparsers.so (found version "7.2.3")
-- Found CUDA: /usr/local/cuda (found version "11.1")
-- Caffe2: CUDA detected: 11.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.1
-- Found CUDNN: /usr/local/cuda/lib64/libcudnn.so
-- Found cuDNN: v8.0.5 (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn.so)
-- /usr/local/cuda/lib64/libnvrtc.so shorthash is 3a20f2b6
-- Added CUDA NVCC flags for: -gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
-- Found Torch: /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch.so
-- Find Torch VERSION: 1.8.1
-- TORCH_HAS_CUDA, TORCH_CUDA_LIBRARIES = /usr/local/cuda/lib64/stubs/libcuda.so;/usr/local/cuda/lib64/libnvrtc.so;/usr/local/cuda/lib64/libnvToolsExt.so;/usr/local/cuda/lib64/libcudart.so;/usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so
-- Found PythonInterp: /usr/local/bin/python (found version "3.8.10")
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.8.so
-- pybind11 v2.3.dev0
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- LTO enabled
-- Configuring done
-- Generating done
-- Build files have been written to: /home/soft/download/Forward/build
ImportError: /home/soft/download/Forward/build/bin/forward.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN5torch3jit8toIValueEN8pybind116handleERKSt10shared_ptrIN3c104TypeEENS4_8optionalIiEE
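An undefined `torch::jit::toIValue` symbol like this usually points to an ABI or version mismatch between the `forward` extension and the libtorch it is loaded against (note the cmake warning above: `_GLIBCXX_USE_CXX11_ABI=0 is set for PyTorch libraries`). A small diagnostic sketch (it only reports the ABI of the installed torch; it does not fix anything):

```python
import importlib.util

def torch_cxx11_abi():
    """Report whether the installed torch was compiled with the CXX11 ABI.

    Returns None when torch is not importable. The forward extension must be
    built with the same _GLIBCXX_USE_CXX11_ABI setting as libtorch, otherwise
    imports fail with undefined C++ symbols like the one above."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    return torch.compiled_with_cxx11_abi()

print(torch_cxx11_abi())
```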
Following https://github.com/Tencent/Forward/blob/master/doc/cn/usages/torch_usage_CN.md, I installed the CPU build of torch,
and the import now works.
The PyTorch BERT demo runs at roughly 7+ ms on a T4 GPU; I am not sure whether that is reasonable.
It is about 3 ms slower than the engine exported with NVIDIA's official script.
Hello @Nipi64310 ,
Thank you for your interest in the Forward project, and apologies for the pitfalls you hit along the way:
1. Forward was initially developed against TensorRT 7, so since open-sourcing we have recommended version 7.2.1.6.
2. We know the TensorRT 8 user base is growing, and NVIDIA has already released 8.2. Since Forward has not yet been adapted to TensorRT 8, building with newer CUDA and TensorRT versions can run into incompatibilities.
3. The Forward team has been discussing TensorRT 8 support; once the upgrade is complete we will release it right away.
4. As for the PyTorch BERT export performance you mentioned, we will investigate and get back to you.
hello @zhaoyiluo Thanks for the reply.
Also, after building against a CUDA build of PyTorch, the import fails as below (tested with several versions: 1.10.0, 1.8.x, 1.7.x). Can Forward only be built with the CPU build of PyTorch?
ImportError: /home/soft/download/Forward/build/bin/forward.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN5torch3jit8toIValueEN8pybind116handleERKSt10shared_ptrIN3c104TypeEENS4_8optionalIiEE
Hi @Nipi64310 ,
Regarding the PyTorch BERT performance issue, let me first explain how our demo runs. During inference the tensors start on the CPU, and after inference the results end up on the CPU as well, so the full pipeline is: copy the tensors from CPU to GPU, run inference, then copy the results from GPU back to CPU.
With the information we have so far, we cannot rule out that these two copies account for the extra latency. Could you share the official script you used for testing (or a download link), plus your test environment in as much detail as possible, including the TensorRT and PyTorch versions?
As for the PyTorch CUDA issue, based on the error message we suspect a broken symbol link in the CUDA build of PyTorch; we still recommend using the CPU build to avoid potential problems.
Let me elaborate on the PyTorch CUDA question. Our project can work with the CUDA build of PyTorch, but that build ships its own cudatoolkit, which may conflict with the cudatoolkit bundled with your CUDA installation (including cuDNN and other related support). If you can guarantee that the cudatoolkit installed alongside PyTorch CUDA is the same as the one from your CUDA installation (including the related support libraries), everything will work smoothly. This is why we recommend the CPU build: it sidesteps these issues.
Feel free to follow up!
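One way to spot the toolkit mismatch described above is to compare the CUDA version PyTorch was built against (`torch.version.cuda`) with the system toolkit used to compile Forward. A best-effort sketch (assumes `nvcc` may or may not be on PATH; the torch side is left to the caller so the helper stays dependency-free):

```python
import re
import shutil
import subprocess

def system_cuda_version():
    """Best-effort: parse the toolkit release from `nvcc --version`.

    Returns None when nvcc is not installed. Compare the result against
    torch.version.cuda (None for CPU-only wheels) to spot mismatched
    toolkits between PyTorch and the system CUDA installation."""
    if shutil.which("nvcc") is None:
        return None
    out = subprocess.run(["nvcc", "--version"],
                         capture_output=True, text=True).stdout
    m = re.search(r"release (\d+\.\d+)", out)
    return m.group(1) if m else None
```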
hello @zhaoyiluo Thanks for the reply.
I measured the latency of a base BERT at different sequence lengths and compared it against a TensorRT engine I converted myself; Forward's inference latency grows quite noticeably with sequence length.
Test environment:
Tesla T4
TensorRT 7.2.3
CUDA 11.1
cuDNN 8.1.0
PyTorch 1.7.0+cpu
Python 3.8.10
Other library versions are the same as in this comment: #33 (comment)
The Forward test script is below, basically the same as the official one:
from transformers import BertTokenizer, BertModel, BertForSequenceClassification, TFBertForSequenceClassification
import torch
import forward
import time

def TestForward(jit_path):
    tokenizer = BertTokenizer.from_pretrained('/home/jovyan/pretrained_models/macbert/')
    model = BertForSequenceClassification.from_pretrained('/home/jovyan/pretrained_models/macbert/', torchscript=True)
    model.cpu()
    model.eval()
    inputs = tokenizer("你好", max_length=128, pad_to_max_length=True, return_tensors="pt")
    dummy_inputs = (
        inputs["input_ids"],
        inputs["attention_mask"],
        inputs["token_type_ids"],
    )
    print('dummy_inputs :', dummy_inputs)
    traced_model = torch.jit.trace(model, dummy_inputs)
    traced_model.save(jit_path)
    print('Jit model is saved.')
    builder = forward.TorchBuilder()
    builder.set_mode('float32')
    engine = builder.build(jit_path, dummy_inputs)
    engine_path = jit_path + '.engine'
    engine.save(engine_path)
    engine = forward.TorchEngine()
    engine.load(engine_path)
    ground_truth = traced_model(*dummy_inputs)
    print('ground_truth', ground_truth)
    time_list = []
    for _ in range(1000):
        start = time.time()
        # outputs = traced_model(*dummy_inputs)
        outputs = engine.forward(dummy_inputs)
        if _ > 100:
            time_list.append(time.time() - start)
        print('outputs : ', outputs, time.time() - start)
    print('avg time:{}'.format(sum(time_list) / len(time_list)))

if __name__ == "__main__":
    jit_path = 'bert_cpu.pt'
    print("jit_path : ", jit_path)
    TestForward(jit_path)
@Nipi64310 Based on the Forward test script you provided, I ran "你好" at different lengths.
Test environment:
Tesla T4
TensorRT 7.2.1.6
CUDA 10.2
cuDNN 8.0.4
PyTorch 1.7.1
Python 3.6.8
Test results:
32 | 64 | 128 | 256
---|---|---|---
0.005885s | 0.007538s | 0.012257s | 0.023222s
Comparing only the Forward results, our conclusions still differ somewhat from yours. One thing I would like to confirm: is the model under pretrained_models/macbert/ bert-base-uncased?
My test script is as follows:
import torch
import forward
import time
from transformers import BertTokenizer, BertModel, BertForSequenceClassification

def TestForward(jit_path):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', torchscript=True)
    model.cpu()
    model.eval()
    inputs = tokenizer("你好", max_length=256, pad_to_max_length=True, return_tensors="pt")
    dummy_inputs = (
        inputs["input_ids"],
        inputs["attention_mask"],
        inputs["token_type_ids"],
    )
    print('dummy_inputs :', dummy_inputs)
    traced_model = torch.jit.trace(model, dummy_inputs)
    traced_model.save(jit_path)
    print('Jit model is saved.')
    builder = forward.TorchBuilder()
    builder.set_mode('float32')
    engine = builder.build(jit_path, dummy_inputs)
    engine_path = jit_path + '.engine'
    engine.save(engine_path)
    engine = forward.TorchEngine()
    engine.load(engine_path)
    ground_truth = traced_model(*dummy_inputs)
    print('ground_truth', ground_truth)
    time_list = []
    for _ in range(1000):
        start = time.time()
        # outputs = traced_model(*dummy_inputs)
        outputs = engine.forward(dummy_inputs)
        if _ > 100:
            time_list.append(time.time() - start)
        print('outputs : ', outputs, time.time() - start)
    print('avg time:{}'.format(sum(time_list) / len(time_list)))

if __name__ == "__main__":
    jit_path = 'bert_cpu.pt'
    print("jit_path : ", jit_path)
    TestForward(jit_path)
A difference in CUDA versions should not cause such a large performance gap; I need to investigate further on my side.
Also, could you share the official script you used for testing (and, if it is run from the command line, the detailed steps), so we can pinpoint the problem quickly?
Thanks for your support!
@zhaoyiluo Using the script you posted, I saved it under Forward/build/bin as xxx.py and ran `python -u xxx.py` from the command line.
The first run downloads the bert-base-uncased model, but my average latency matches my earlier tests. For each measurement I only changed the max_length parameter in
inputs = tokenizer("你好", max_length=128, pad_to_max_length=True, return_tensors="pt")
Note: the latest tests used cuDNN 8.2.1 (all other versions unchanged), but the measured latency did not change.
32 | 64 | 128 | 256
---|---|---|---
0.0029070s | 0.0047896s | 0.00706928s | 0.013830870s
@Nipi64310 Understood correctly: the tests vary max_length. Which nvidia-tensorrt script did you use (for the runs not done with Forward)?
Hello @zhaoyiluo The nvidia-tensorrt timings above come from a conversion script I adapted from the official demo.
I only wanted a speed comparison against Forward. Apologies, but the modified script has been uploaded to my company's internal repository, so I cannot share it.
Judging from the timings you reproduced, is the Forward inference speed I measured now in line with what is expected for Forward on this GPU?
32 | 64 | 128 | 256
---|---|---|---
0.0029070s | 0.0047896s | 0.00706928s | 0.013830870s
Hi @Nipi64310 ,
For BERT acceleration in FP32 and FP16 modes, we have aligned Forward's performance with engines exported by the official TensorRT 7, so in FP32 and FP16 inference you can treat the two as equivalent. INT8 mode differs somewhat: since INT8 inference loses accuracy, we compensate with mixed FP16/INT8 precision inference.
Based on your test data, the growth of Forward's latency with sequence length for the base BERT is explainable: each Forward inference involves the two copies, and the amount of data copied grows with the length.
For the test code adapted from the NVIDIA official script, I suspect there is no copy step (or only one). That guess is based on max_length=64 performing better than max_length=32, which is the point I have been trying to confirm with you.
Also, I just checked my test environment: the T4 I used is a shared GPU, which may explain the large gap between our results. This does not affect the conclusions above.
Hello @zhaoyiluo
I re-ran the tests making sure the number of copies matched, and it is exactly as you said: Forward's FP32 inference speed is basically the same as the official TensorRT's. The earlier TensorRT numbers were faster because they used FP16. Forward is excellent: simple, easy to use, and fast. I also timed torch-gpu (version 1.10.0+cu111; GPU T4); surprisingly, the PyTorch latencies at lengths 32, 64, and 128 are almost identical. The pytorch-gpu timing script is pasted below; an explanation would be appreciated!
 | 32 | 64 | 128 | 256
---|---|---|---|---
Forward | 0.0031379 | 0.0045251 | 0.0076434 | 0.013988
Forward-FP16 | 0.0018890 | 0.0019043 | 0.0022379 | 0.003538
Tensorrt | 0.0030247 | 0.0042729 | 0.0072241 | 0.013747
Tensorrt-FP16 | 0.0018210 | 0.0018326 | 0.0022883 | 0.003524
Pytorch-GPU | 0.0118754 | 0.0117887 | 0.0116027 | 0.017381
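As a quick sanity check on the table above, the per-token cost drops as the sequence grows, which is the expected pattern when fixed per-inference overhead (kernel launches and the two copies) is amortized over more tokens. A small sketch using the reported Forward FP32 latencies:

```python
# Per-token cost from the Forward FP32 latencies reported in the table above.
forward_fp32 = {32: 0.0031379, 64: 0.0045251, 128: 0.0076434, 256: 0.013988}

per_token = {n: t / n for n, t in forward_fp32.items()}
# Longer sequences amortize fixed overhead, so cost per token shrinks:
assert per_token[256] < per_token[128] < per_token[64] < per_token[32]
```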
from transformers import BertTokenizer, BertForSequenceClassification
import torch
import time

def TestForward(jit_path):
    tokenizer = BertTokenizer.from_pretrained('/home/jovyan/pretrained_models/macbert/')
    model = BertForSequenceClassification.from_pretrained('/home/jovyan/pretrained_models/macbert/', torchscript=True)
    model.cuda()
    model.eval()
    inputs = tokenizer("你好", max_length=32, pad_to_max_length=True, return_tensors="pt")
    dummy_inputs = (
        inputs["input_ids"].cuda(),
        inputs["attention_mask"].cuda(),
        inputs["token_type_ids"].cuda(),
    )
    print('dummy_inputs :', dummy_inputs)
    traced_model = torch.jit.trace(model, dummy_inputs)
    time_list = []
    for _ in range(1000):
        start = time.time()
        inputs = tokenizer("你好", max_length=32, pad_to_max_length=True, return_tensors="pt")
        dummy_inputs = (
            inputs["input_ids"].cuda(),
            inputs["attention_mask"].cuda(),
            inputs["token_type_ids"].cuda(),
        )
        outputs = traced_model(*dummy_inputs)[0].detach().cpu().numpy()
        if _ > 100:
            time_list.append(time.time() - start)
        print('outputs : ', outputs, time.time() - start)
    print('avg time:{}'.format(sum(time_list) / len(time_list)))

if __name__ == "__main__":
    jit_path = 'bert_cpu.pt'
    print("jit_path : ", jit_path)
    TestForward(jit_path)
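One caveat about the time.time() loops used throughout this thread: CUDA kernel launches are asynchronous, so unless something forces a synchronization (here the trailing .cpu().numpy() does), wall-clock timing can measure only launch latency rather than execution time. A small harness sketch (stdlib only; pass torch.cuda.synchronize as sync when timing GPU inference):

```python
import time

def bench(fn, iters=300, warmup=50, sync=None):
    """Time a callable, flushing queued asynchronous work around the timed region.

    `sync` should be torch.cuda.synchronize when benchmarking GPU code; without
    it, asynchronous kernels can make the loop look artificially fast."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    start = time.time()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return (time.time() - start) / iters
```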
One more question: how should I handle the following Forward-TF problem?
builder.build(model_path, dummy_input) returns None; the test code is below. Changing the dummy_input keys to the input_signature keys
input_ids, attention_mask, token_type_ids still returns None.
from transformers import TFBertForSequenceClassification
import numpy as np
import forward  # missing from the original snippet

model = TFBertForSequenceClassification.from_pretrained('/home/jovyan/pretrained_models/macbert/', from_pt=True)
model.save('test_macbert')

model_path = 'test_macbert/saved_model.pb'
engine_path = 'tfbert.engine'
batch_size = 16
seq_len = 64
builder = forward.TfBuilder()
dummy_input = {"input_ids": np.ones([batch_size, seq_len], dtype='int32'),
               "input_mask": np.ones([batch_size, seq_len], dtype='int32'),
               "segment_ids": np.ones([batch_size, seq_len], dtype='int32')}
builder.set_mode('float32')
tf_engine = builder.build(model_path, dummy_input)

need_save = True
if need_save:
    # save engine
    tf_engine.save(engine_path)
    # load saved engine
    tf_engine = forward.TfEngine()
    tf_engine.load(engine_path)

inputs = dummy_input
outputs = tf_engine.forward(inputs)
print(outputs)
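Since TfBuilder.build signals failure by returning None, the script above then crashes later on tf_engine.save with an unrelated AttributeError. A small guard (hypothetical helper name) makes the failure explicit at the point where it happens:

```python
def build_or_fail(builder, model_path, dummy_input):
    """Wrap builder.build() so a silent None becomes an explicit error."""
    engine = builder.build(model_path, dummy_input)
    if engine is None:
        raise RuntimeError(
            "Forward failed to build the TF engine; for pb models the "
            "tf.Variable nodes must be frozen before conversion")
    return engine
```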
Hello @Nipi64310 ,
Regarding your first question and the test code you provided, my understanding is that you want to know why, when running inference directly with the torch-gpu model (without Forward) via outputs = traced_model(*dummy_inputs)[0].detach().cpu().numpy(), the latency changes little across BERT max sequence lengths of 32, 64, and 128. If I have misunderstood the question, please correct me.
If that understanding is correct, the small variation in latency is mainly a GPU matter; a full explanation goes down to the GPU architecture (Turing here) and how it works (scheduling, streaming processors, streaming multiprocessors, warps, thread blocks, and so on). In my humble opinion, if you want to compare GPU inference performance in this situation, a more reasonable experiment is to run the same BERT workload at max sequence lengths of 32, 64, and 128 on two GPUs of different architectures and compare the timings.
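A side note on the measurement itself (a hedged sketch, not Forward's benchmarking code): CUDA launches kernels asynchronously, so a time.time() pair only covers the GPU work if the device is synchronized; in the original snippet the .cpu() copy forces that synchronization implicitly. An explicit variant looks like this:

```python
import time
import torch

def benchmark(model, inputs, warmup=100, iters=900):
    """Average per-call latency; synchronizes so asynchronous CUDA
    kernel launches are fully included in the measured interval."""
    timings = []
    with torch.no_grad():
        for i in range(warmup + iters):
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            start = time.time()
            model(*inputs)
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            if i >= warmup:
                timings.append(time.time() - start)
    return sum(timings) / len(timings)
```

At short sequence lengths the kernels may not saturate the streaming multiprocessors, so the measured latency can stay nearly flat from 32 to 128 tokens, consistent with the observation above.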
As for running the BERT model with fwd-tf: tf_engine must have come back NULL, meaning the engine was not built successfully. For the TF framework we currently require the pb model's tf.Variable nodes to be frozen, otherwise serialization fails. Under the python/bert_helpers directory we provide the corresponding scripts (TF1-based), taken from google-research/bert.
Over the last two days I tried updating those scripts, both manually and with tf_upgrade_v2, but they still fail on the bert model structure (i.e., the model cannot be loaded). In other words, we currently cannot guarantee that the operators inside a freshly downloaded bert model still match these legacy scripts (last updated 3 years ago). I will try other approaches to freeze tf.Variable and will notify you as soon as there is an update.
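To make the freezing requirement concrete, here is a minimal TF1-style sketch (an illustration only, not the bert_helpers script itself; the function name is ours): tf.compat.v1.graph_util.convert_variables_to_constants rewrites every variable reachable from the named outputs into a Const node, after which the GraphDef no longer depends on checkpoint state and can be serialized as a plain pb:

```python
import tensorflow as tf

# TF1-style graph mode; the bert_helpers scripts assume this environment.
tf.compat.v1.disable_eager_execution()

def freeze_graph(sess, output_node_names):
    """Return a GraphDef in which every tf.Variable reachable from the given
    output nodes has been replaced by a Const holding its current value."""
    return tf.compat.v1.graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), output_node_names)
```

The resulting GraphDef can be written out with tf.io.write_graph; whether a given downloaded BERT checkpoint loads cleanly in this mode is exactly the compatibility problem described above.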
If you have further questions, feel free to ask.
Update:
You can refer to "Demo for building BERT model" for the workflow of running BERT inference with TF.
To run the conversion scripts correctly, please use a TF1 environment; we recommend pip install tensorflow-gpu==1.15 or pip install tensorflow==1.15.
For using fwd-tf in a pure TF2 environment, our team will update the scripts as soon as possible.
Thanks for your support!
Hi @zhaoyiluo, thanks for the detailed explanation.
When using Forward-Torch, I run into open-source pretrained weights whose structure differs from standard BERT. For example, the Chinese SimBERT model has a linear projection between the embedding layer and the encoder, because its embedding size is 128 while the encoder hidden size is larger than 128.
Sample code for loading such a model with the transformers library is below; it adds one extra linear layer, self.embedding_hidden_mapping_in. With this layer added, Forward can no longer build: the resulting engine is None. How should cases like this be handled?
```python
from transformers.models.bert.modeling_bert import BertForSequenceClassification, BertModel, BertPreTrainedModel, BertEmbeddings, BertEncoder, BertPooler, BaseModelOutputWithPoolingAndCrossAttentions

class BertModel(BertPreTrainedModel):
    def __init__(self, config, add_pooling_layer=True):
        super().__init__(config)
        self.config = config
        config.hidden_size, config.embedding_size = config.embedding_size, config.hidden_size
        self.embeddings = BertEmbeddings(config)
        config.hidden_size, config.embedding_size = config.embedding_size, config.hidden_size
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config) if add_pooling_layer else None
        self.embedding_hidden_mapping_in = nn.Linear(config.embedding_size, config.hidden_size)
        self.init_weights()

    def get_input_embeddings(self):
        return self.embeddings.word_embeddings

    def set_input_embeddings(self, value):
        self.embeddings.word_embeddings = value

    def _prune_heads(self, heads_to_prune):
        """
        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
        class PreTrainedModel
        """
        for layer, heads in heads_to_prune.items():
            self.encoder.layer[layer].attention.prune_heads(heads)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        past_key_values=None,
        use_cache=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        if self.config.is_decoder:
            use_cache = use_cache if use_cache is not None else self.config.use_cache
        else:
            use_cache = False
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            input_shape = input_ids.size()
            batch_size, seq_length = input_shape
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.size()[:-1]
            batch_size, seq_length = input_shape
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")
        device = input_ids.device if input_ids is not None else inputs_embeds.device
        # past_key_values_length
        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0
        if attention_mask is None:
            attention_mask = torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)
        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        # ourselves in which case we just need to make it broadcastable to all heads.
        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
        # If a 2D or 3D attention mask is provided for the cross-attention
        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
        if self.config.is_decoder and encoder_hidden_states is not None:
            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
            if encoder_attention_mask is None:
                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
        else:
            encoder_extended_attention_mask = None
        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
        embedding_output = self.embeddings(
            input_ids=input_ids,
            position_ids=position_ids,
            token_type_ids=token_type_ids,
            inputs_embeds=inputs_embeds,
            past_key_values_length=past_key_values_length,
        )
        embedding_output = self.embedding_hidden_mapping_in(embedding_output)
        encoder_outputs = self.encoder(
            embedding_output,
            attention_mask=extended_attention_mask,
            head_mask=head_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_extended_attention_mask,
            past_key_values=past_key_values,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = encoder_outputs[0]
        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
        return pooled_output
```
Test code:
```python
config_json = {
    "attention_probs_dropout_prob": 0.0,
    "directionality": "bidi",
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.0,
    "hidden_size": 312,
    "embedding_size": 128,
    "initializer_range": 0.02,
    "intermediate_size": 1248,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 4,
    "type_vocab_size": 2,
    "vocab_size": 13685
}
from torch import nn
import forward
from transformers import BertConfig
from transformers import BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('/home/jovyan/pretrained_models/macbert/')
inputs = tokenizer("你好", max_length=16, pad_to_max_length=True, return_tensors="pt")
dummy_inputs = (
    inputs["input_ids"],
    inputs["attention_mask"],
    inputs["token_type_ids"],
)
model = BertModel(config=BertConfig.from_dict(config_json))
model.eval()
jit_path = 'bert_cpu.pt'
traced_model = torch.jit.trace(model, dummy_inputs)
traced_model.save(jit_path)
builder = forward.TorchBuilder()
builder.set_mode('float32')
engine = builder.build(jit_path, dummy_inputs)
engine
```
Hello @Nipi64310 , we are sorry, but Forward currently supports only the original BERT, plus some common operations users may append at its tail; inference on a BERT whose internal structure has been modified is not supported yet.
Hi @zhaoyiluo,
Got it. I look forward to future support for inference on BERT structures with such simple modifications. Thanks again; this issue can be closed.
We have noted your suggestion. Thanks again for your support of Forward!
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.