From bb8d633a71eb6b3d8e23e62d6792b28bcf95e817 Mon Sep 17 00:00:00 2001 From: yf711 Date: Wed, 26 Jul 2023 12:28:52 -0700 Subject: [PATCH 01/14] Improve TRT doc and advance the EP session option --- .../TensorRT-ExecutionProvider.md | 305 ++++++++++++------ 1 file changed, 202 insertions(+), 103 deletions(-) diff --git a/docs/execution-providers/TensorRT-ExecutionProvider.md b/docs/execution-providers/TensorRT-ExecutionProvider.md index c8443f885c406..aedf8feeddc8a 100644 --- a/docs/execution-providers/TensorRT-ExecutionProvider.md +++ b/docs/execution-providers/TensorRT-ExecutionProvider.md @@ -106,144 +106,139 @@ sess = ort.InferenceSession('model.onnx', providers=['TensorrtExecutionProvider' ``` ## Configurations -There are two ways to configure TensorRT settings, either by environment variables or by execution provider option APIs. +There are two ways to configure TensorRT settings, either by TensorRT execution provider options(recommended) or environment variables. + +There are one-to-one mappings between **TensorRT Execution Provider Session Options** and **Environment Variables** shown as below: + +| TensorRT EP Session Options | Environment Variables | Type | +|:--------------------------------------|:-----------------------------------------------|:-------| +| trt_max_workspace_size | ORT_TENSORRT_MAX_WORKSPACE_SIZE | int | +| trt_max_partition_iterations | ORT_TENSORRT_MAX_PARTITION_ITERATIONS | int | +| trt_min_subgraph_size | ORT_TENSORRT_MIN_SUBGRAPH_SIZE | int | +| trt_fp16_enable | ORT_TENSORRT_FP16_ENABLE | bool | +| trt_int8_enable | ORT_TENSORRT_INT8_ENABLE | bool | +| trt_int8_calibration_table_name | ORT_TENSORRT_INT8_CALIBRATION_TABLE_NAME | string | +| trt_int8_use_native_calibration_table | ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE | bool | +| trt_dla_enable | ORT_TENSORRT_DLA_ENABLE | bool | +| trt_dla_core | ORT_TENSORRT_DLA_CORE | int | +| trt_engine_cache_enable | ORT_TENSORRT_ENGINE_CACHE_ENABLE | bool | +| trt_engine_cache_path | ORT_TENSORRT_CACHE_PATH | string | +| trt_dump_subgraphs | ORT_TENSORRT_DUMP_SUBGRAPHS | bool | +| trt_force_sequential_engine_build | ORT_TENSORRT_FORCE_SEQUENTIAL_ENGINE_BUILD | bool | +| trt_context_memory_sharing_enable | ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE | bool | +| trt_layer_norm_fp32_fallback | ORT_TENSORRT_LAYER_NORM_FP32_FALLBACK | bool | +| trt_timing_cache_enable | ORT_TENSORRT_TIMING_CACHE_ENABLE | bool | +| trt_force_timing_cache | ORT_TENSORRT_FORCE_TIMING_CACHE_ENABLE | bool | +| trt_detailed_build_log | ORT_TENSORRT_DETAILED_BUILD_LOG_ENABLE | bool | +| trt_build_heuristics_enable | ORT_TENSORRT_BUILD_HEURISTICS_ENABLE | bool | +| trt_sparsity_enable | ORT_TENSORRT_SPARSITY_ENABLE | bool | +| trt_builder_optimization_level | ORT_TENSORRT_BUILDER_OPTIMIZATION_LEVEL | int | +| trt_auxiliary_streams | ORT_TENSORRT_AUXILIARY_STREAMS | int | +| trt_tactic_sources | ORT_TENSORRT_TACTIC_SOURCES | string | +| trt_extra_plugin_lib_paths | ORT_TENSORRT_EXTRA_PLUGIN_LIB_PATHS | string | +| trt_profile_min_shapes | ORT_TENSORRT_PROFILE_MIN_SHAPES | string | +| trt_profile_max_shapes | ORT_TENSORRT_PROFILE_MAX_SHAPES | string | +| trt_profile_opt_shapes | ORT_TENSORRT_PROFILE_OPT_SHAPES | string | -### Environment Variables -Following environment variables can be set for TensorRT execution provider. - -* `ORT_TENSORRT_MAX_WORKSPACE_SIZE`: maximum workspace size for TensorRT engine. Default value: 1073741824 (1GB). 
- -* `ORT_TENSORRT_MAX_PARTITION_ITERATIONS`: maximum number of iterations allowed in model partitioning for TensorRT. If target model can't be successfully partitioned when the maximum number of iterations is reached, the whole model will fall back to other execution providers such as CUDA or CPU. Default value: 1000. - -* `ORT_TENSORRT_MIN_SUBGRAPH_SIZE`: minimum node size in a subgraph after partitioning. Subgraphs with smaller size will fall back to other execution providers. Default value: 1. - -* `ORT_TENSORRT_FP16_ENABLE`: Enable FP16 mode in TensorRT. 1: enabled, 0: disabled. Default value: 0. Note not all Nvidia GPUs support FP16 precision. - -* `ORT_TENSORRT_INT8_ENABLE`: Enable INT8 mode in TensorRT. 1: enabled, 0: disabled. Default value: 0. Note not all Nvidia GPUs support INT8 precision. - -* `ORT_TENSORRT_INT8_CALIBRATION_TABLE_NAME`: Specify INT8 calibration table file for non-QDQ models in INT8 mode. Note calibration table should not be provided for QDQ model because TensorRT doesn't allow calibration table to be loded if there is any Q/DQ node in the model. By default the name is empty. - -* `ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE`: Select what calibration table is used for non-QDQ models in INT8 mode. If 1, native TensorRT generated calibration table is used; if 0, ONNXRUNTIME tool generated calibration table is used. Default value: 0. - * **Note: Please copy up-to-date calibration table file to `ORT_TENSORRT_CACHE_PATH` before inference. Calibration table is specific to models and calibration data sets. Whenever new calibration table is generated, old file in the path should be cleaned up or be replaced.** - -* `ORT_TENSORRT_DLA_ENABLE`: Enable DLA (Deep Learning Accelerator). 1: enabled, 0: disabled. Default value: 0. Note not all Nvidia GPUs support DLA. - -* `ORT_TENSORRT_DLA_CORE`: Specify DLA core to execute on. Default value: 0. +> Note: for bool type options, assign them with **True**/**False** in python, or **1**/**0** in C++. -* `ORT_TENSORRT_ENGINE_CACHE_ENABLE`: Enable TensorRT engine caching. The purpose of using engine caching is to save engine build time in the case that TensorRT may take long time to optimize and build engine. Engine will be cached when it's built for the first time so next time when new inference session is created the engine can be loaded directly from cache. In order to validate that the loaded engine is usable for current inference, engine profile is also cached and loaded along with engine. If current input shapes are in the range of the engine profile, the loaded engine can be safely used. Otherwise if input shapes are out of range, profile cache will be updated to cover the new shape and engine will be recreated based on the new profile (and also refreshed in the engine cache). Note each engine is created for specific settings such as model path/name, precision (FP32/FP16/INT8 etc), workspace, profiles etc, and specific GPUs and it's not portable, so it's essential to make sure those settings are not changing, otherwise the engine needs to be rebuilt and cached again. 1: enabled, 0: disabled. Default value: 0. - * **Warning: Please clean up any old engine and profile cache files (.engine and .profile) if any of the following changes:** - * Model changes (if there are any changes to the model topology, opset version, operators etc.) - * ORT version changes (i.e. moving from ORT version 1.8 to 1.9) - * TensorRT version changes (i.e. moving from TensorRT 7.0 to 8.0) - * Hardware changes. 
(Engine and profile files are not portable and optimized for specific Nvidia hardware) +### Execution Provider Options -* `ORT_TENSORRT_CACHE_PATH`: Specify path for TensorRT engine and profile files if `ORT_TENSORRT_ENGINE_CACHE_ENABLE` is 1, or path for INT8 calibration table file if ORT_TENSORRT_INT8_ENABLE is 1. +TensorRT configurations can be set by execution provider options. It's useful when each model and inference session have their own configurations. In this case, execution provider option settings will override any environment variable settings. All configurations should be set explicitly, otherwise default value will be taken. -* `ORT_TENSORRT_DUMP_SUBGRAPHS`: Dumps the subgraphs that are transformed into TRT engines in onnx format to the filesystem. This can help debugging subgraphs, e.g. by using `trtexec --onnx my_model.onnx` and check the outputs of the parser. 1: enabled, 0: disabled. Default value: 0. +* `trt_max_workspace_size`: maximum workspace size for TensorRT engine. + * Default value: 1073741824 (1GB). -* `ORT_TENSORRT_FORCE_SEQUENTIAL_ENGINE_BUILD`: Sequentially build TensorRT engines across provider instances in multi-GPU environment. 1: enabled, 0: disabled. Default value: 0. +* `trt_max_partition_iterations`: maximum number of iterations allowed in model partitioning for TensorRT. + * If target model can't be successfully partitioned when the maximum number of iterations is reached, the whole model will fall back to other execution providers such as CUDA or CPU. + * Default value: 1000. -* `ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE`: Share execution context memory between TensorRT subgraphs. Default 0 = false, nonzero = true. +* `trt_min_subgraph_size`: minimum node size in a subgraph after partitioning. + * Subgraphs with smaller size will fall back to other execution providers. + * Default value: 1. -* `ORT_TENSORRT_LAYER_NORM_FP32_FALLBACK`: Force Pow + Reduce ops in layer norm to FP32. Default 0 = false, nonzero = true. +* `trt_fp16_enable`: Enable FP16 mode in TensorRT. + > Note: not all Nvidia GPUs support FP16 precision. -* `ORT_TENSORRT_TIMING_CACHE_ENABLE`: Enable TensorRT timing cache. Default 0 = false, nonzero = true. Check [Timing cache](#timing-cache) for details. +* `trt_int8_enable`: Enable INT8 mode in TensorRT. + > Note: not all Nvidia GPUs support INT8 precision. -* `ORT_TENSORRT_FORCE_TIMING_CACHE_ENABLE`: Force the TensorRT timing cache to be used even if device profile does not match. Default 0 = false, nonzero = true. +* `trt_int8_calibration_table_name`: Specify INT8 calibration table file for non-QDQ models in INT8 mode. + > Note: calibration table should not be provided for QDQ model because TensorRT doesn't allow calibration table to be loded if there is any Q/DQ node in the model. By default the name is empty. -* `ORT_TENSORRT_DETAILED_BUILD_LOG_ENABLE`: Enable detailed build step logging on TensorRT EP with timing for each engine build. Default 0 = false, nonzero = true. +* `trt_int8_use_native_calibration_table`: Select what calibration table is used for non-QDQ models in INT8 mode. + * If `True`, native TensorRT generated calibration table is used; + * If `False`, ONNXRUNTIME tool generated calibration table is used. + > Note: Please copy up-to-date calibration table file to `trt_engine_cache_path` before inference. Calibration table is specific to models and calibration data sets. Whenever new calibration table is generated, old file in the path should be cleaned up or be replaced. 
-* `ORT_TENSORRT_BUILD_HEURISTICS_ENABLE`: Build engine using heuristics to reduce build time. Default 0 = false, nonzero = true. +* `trt_dla_enable`: Enable DLA (Deep Learning Accelerator). + > Note: Not all Nvidia GPUs support DLA. -* `ORT_TENSORRT_SPARSITY_ENABLE`: Control if sparsity can be used by TRT. Default 0 = false, 1 = true. Check `--sparsity` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags). +* `trt_dla_core`: Specify DLA core to execute on. Default value: 0. -* `ORT_TENSORRT_BUILDER_OPTIMIZATION_LEVEL`: Set the builder optimization level. WARNING: levels below 3 do not guarantee good engine performance, but greatly improve build time. Default 3, valid range [0-5]. Check `--builderOptimizationLevel` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags). +* `trt_engine_cache_enable`: Enable TensorRT engine caching. -* `ORT_TENSORRT_AUXILIARY_STREAMS`: Set maximum number of auxiliary streams per inference stream. Setting this value to 0 will lead to optimal memory usage. Default -1 = heuristics. Check `--maxAuxStreams` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags). + * The purpose of using engine caching is to save engine build time in the case that TensorRT may take long time to optimize and build engine. -* `ORT_TENSORRT_TACTIC_SOURCES`: Specify the tactics to be used by adding (+) or removing (-) tactics from the default tactic sources (default = all available tactics) e.g. "-CUDNN,+CUBLAS" available keys: "CUBLAS", "CUBLAS_LT", "CUDNN" or "EDGE_MASK_CONVOLUTIONS". + * Engine will be cached when it's built for the first time so next time when new inference session is created the engine can be loaded directly from cache. In order to validate that the loaded engine is usable for current inference, engine profile is also cached and loaded along with engine. If current input shapes are in the range of the engine profile, the loaded engine can be safely used. Otherwise if input shapes are out of range, profile cache will be updated to cover the new shape and engine will be recreated based on the new profile (and also refreshed in the engine cache). -* `ORT_TENSORRT_EXTRA_PLUGIN_LIB_PATHS`: Specify extra TensorRT plugin library paths. ORT TRT by default supports any TRT plugins registered in TRT registry in TRT plugin library (i.e., `libnvinfer_plugin.so`). Moreover, if users want to use other TRT plugins that are not in TRT plugin library, for example, FasterTransformer has many TRT plugin implementations for different models, user can specify like this `ORT_TENSORRT_EXTRA_PLUGIN_LIB_PATHS=libvit_plugin.so;libvit_int8_plugin.so`. + * Note each engine is created for specific settings such as model path/name, precision (FP32/FP16/INT8 etc), workspace, profiles etc, and specific GPUs and it's not portable, so it's essential to make sure those settings are not changing, otherwise the engine needs to be rebuilt and cached again. -* `ORT_TENSORRT_PROFILE_MIN_SHAPES`, `ORT_TENSORRT_PROFILE_MAX_SHAPES` and `ORT_TENSORRT_PROFILE_OPT_SHAPES` : Build with dynamic shapes using a profile with the min/max/opt shapes provided. The format of the profile shapes is "input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,..." and these three flags should all be provided in order to enable explicit profile shapes feature. 
Check [Explicit shape range for dynamic shape input](#explicit-shape-range-for-dynamic-shape-input) and TRT doc [optimization profiles](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt_profiles) for more details. + > **Warning: Please clean up any old engine and profile cache files (.engine and .profile) if any of the following changes:** + > + > * Model changes (if there are any changes to the model topology, opset version, operators etc.) + > * ORT version changes (i.e. moving from ORT version 1.8 to 1.9) + > * TensorRT version changes (i.e. moving from TensorRT 7.0 to 8.0) -One can override default values by setting environment variables. e.g. on Linux: +* `trt_engine_cache_path`: Specify path for TensorRT engine and profile files if `trt_engine_cache_enable` is `True`, or path for INT8 calibration table file if `trt_int8_enable` is `True`. -```bash -# Override default max workspace size to 2GB -export ORT_TENSORRT_MAX_WORKSPACE_SIZE=2147483648 +* `trt_dump_subgraphs`: Dumps the subgraphs that are transformed into TRT engines in onnx format to the filesystem. + * This can help debugging subgraphs, e.g. by using `trtexec --onnx my_model.onnx` and check the outputs of the parser. -# Override default maximum number of iterations to 10 -export ORT_TENSORRT_MAX_PARTITION_ITERATIONS=10 +* `trt_force_sequential_engine_build`: Sequentially build TensorRT engines across provider instances in multi-GPU environment. -# Override default minimum subgraph node size to 5 -export ORT_TENSORRT_MIN_SUBGRAPH_SIZE=5 +* `trt_context_memory_sharing_enable`: Share execution context memory between TensorRT subgraphs. -# Enable FP16 mode in TensorRT -export ORT_TENSORRT_FP16_ENABLE=1 +* `trt_layer_norm_fp32_fallback`: Force Pow + Reduce ops in layer norm to FP32. -# Enable INT8 mode in TensorRT -export ORT_TENSORRT_INT8_ENABLE=1 +* `trt_timing_cache_enable`: Enable TensorRT timing cache. + * Check [Timing cache](#timing-cache) for details. -# Use native TensorRT calibration table -export ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE=1 +* `trt_force_timing_cache`: Force the TensorRT timing cache to be used even if device profile does not match. -# Enable TensorRT engine caching -export ORT_TENSORRT_ENGINE_CACHE_ENABLE=1 -# Please Note warning above. This feature is experimental. -# Engine cache files must be invalidated if there are any changes to the model, ORT version, TensorRT version or if the underlying hardware changes. Engine files are not portable across devices. +* `trt_detailed_build_log`: Enable detailed build step logging on TensorRT EP with timing for each engine build. -# Specify TensorRT cache path -export ORT_TENSORRT_CACHE_PATH="/path/to/cache" +* `trt_build_heuristics_enable`: Build engine using heuristics to reduce build time. -# Dump out subgraphs to run on TensorRT -export ORT_TENSORRT_DUMP_SUBGRAPHS=1 +* `trt_sparsity_enable`: Control if sparsity can be used by TRT. + * Check `--sparsity` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags). -# Enable context memory sharing between TensorRT subgraphs. Default 0 = false, nonzero = true -export ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE=1 -``` +* `trt_builder_optimization_level`: Set the builder optimization level. + > WARNING: levels below 3 do not guarantee good engine performance, but greatly improve build time. Default 3, valid range [0-5]. 
Check `--builderOptimizationLevel` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags). -### Execution Provider Options +* `trt_auxiliary_streams`: Set maximum number of auxiliary streams per inference stream. + * Setting this value to 0 will lead to optimal memory usage. + * Default -1 = heuristics. + * Check `--maxAuxStreams` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags). -TensorRT configurations can also be set by execution provider option APIs. It's useful when each model and inference session have their own configurations. In this case, execution provider option settings will override any environment variable settings. All configurations should be set explicitly, otherwise default value will be taken. +* `trt_tactic_sources`: Specify the tactics to be used by adding (+) or removing (-) tactics from the default tactic sources (default = all available tactics) + * e.g. "-CUDNN,+CUBLAS" available keys: "CUBLAS", "CUBLAS_LT", "CUDNN" or "EDGE_MASK_CONVOLUTIONS". -There are one-to-one mappings between **environment variables** and **execution provider options APIs** shown as below: +* `trt_extra_plugin_lib_paths`: Specify extra TensorRT plugin library paths. + * ORT TRT by default supports any TRT plugins registered in TRT registry in TRT plugin library (i.e., `libnvinfer_plugin.so`). + * Moreover, if users want to use other TRT plugins that are not in TRT plugin library, + * for example, FasterTransformer has many TRT plugin implementations for different models, user can specify like this `ORT_TENSORRT_EXTRA_PLUGIN_LIB_PATHS=libvit_plugin.so;libvit_int8_plugin.so`. -> Note: for bool type options, assign them with **True**/**False** in python, or **1**/**0** in C++. +* `trt_profile_min_shapes`, `trt_profile_max_shapes` and `trt_profile_opt_shapes` : Build with dynamic shapes using a profile with the min/max/opt shapes provided. + * The format of the profile shapes is "input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,..." and these three flags should all be provided in order to enable explicit profile shapes feature. + * Check [Explicit shape range for dynamic shape input](#explicit-shape-range-for-dynamic-shape-input) and TRT doc [optimization profiles](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt_profiles) for more details. 
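For instance, the three profile-shape options above are plain strings passed through the TensorRT EP options. The snippet below is only a sketch (the model path and the input name `input_ids` with a dynamic batch dimension are assumptions); the complete walkthrough is in the [Explicit shape range for dynamic shape input](#explicit-shape-range-for-dynamic-shape-input) section.

```python
import onnxruntime as ort

# Sketch: assumes "model.onnx" has one dynamic-shape input named "input_ids" of rank 2.
# All three options must cover the same set of dynamic-shape inputs.
trt_ep_options = {
    "trt_profile_min_shapes": "input_ids:1x77",
    "trt_profile_opt_shapes": "input_ids:4x77",
    "trt_profile_max_shapes": "input_ids:16x77",
}

sess = ort.InferenceSession(
    "model.onnx",
    providers=[("TensorrtExecutionProvider", trt_ep_options), "CUDAExecutionProvider"],
)
```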
-| environment variables | execution provider option APIs | type | -| ---------------------------------------------- | ------------------------------------- | ------ | -| ORT_TENSORRT_MAX_WORKSPACE_SIZE | trt_max_workspace_size | int | -| ORT_TENSORRT_MAX_PARTITION_ITERATIONS | trt_max_partition_iterations | int | -| ORT_TENSORRT_MIN_SUBGRAPH_SIZE | trt_min_subgraph_size | int | -| ORT_TENSORRT_FP16_ENABLE | trt_fp16_enable | bool | -| ORT_TENSORRT_INT8_ENABLE | trt_int8_enable | bool | -| ORT_TENSORRT_INT8_CALIBRATION_TABLE_NAME | trt_int8_calibration_table_name | string | -| ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE | trt_int8_use_native_calibration_table | bool | -| ORT_TENSORRT_DLA_ENABLE | trt_dla_enable | bool | -| ORT_TENSORRT_DLA_CORE | trt_dla_core | int | -| ORT_TENSORRT_ENGINE_CACHE_ENABLE | trt_engine_cache_enable | bool | -| ORT_TENSORRT_CACHE_PATH | trt_engine_cache_path | string | -| ORT_TENSORRT_DUMP_SUBGRAPHS | trt_dump_subgraphs | bool | -| ORT_TENSORRT_FORCE_SEQUENTIAL_ENGINE_BUILD | trt_force_sequential_engine_build | bool | -| ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE | trt_context_memory_sharing_enable | bool | -| ORT_TENSORRT_LAYER_NORM_FP32_FALLBACK | trt_layer_norm_fp32_fallback | bool | -| ORT_TENSORRT_TIMING_CACHE_ENABLE | trt_timing_cache_enable | bool | -| ORT_TENSORRT_FORCE_TIMING_CACHE_ENABLE | trt_force_timing_cache | bool | -| ORT_TENSORRT_DETAILED_BUILD_LOG_ENABLE | trt_detailed_build_log | bool | -| ORT_TENSORRT_BUILD_HEURISTICS_ENABLE | trt_build_heuristics_enable | bool | -| ORT_TENSORRT_SPARSITY_ENABLE | trt_sparsity_enable | bool | -| ORT_TENSORRT_BUILDER_OPTIMIZATION_LEVEL | trt_builder_optimization_level | bool | -| ORT_TENSORRT_AUXILIARY_STREAMS | trt_auxiliary_streams | bool | -| ORT_TENSORRT_TACTIC_SOURCES | trt_tactic_sources | string | -| ORT_TENSORRT_EXTRA_PLUGIN_LIB_PATHS | trt_extra_plugin_lib_paths | string | -| ORT_TENSORRT_PROFILE_MIN_SHAPES | trt_profile_min_shapes | string | -| ORT_TENSORRT_PROFILE_MAX_SHAPES | trt_profile_max_shapes | string | -| ORT_TENSORRT_PROFILE_OPT_SHAPES | trt_profile_opt_shapes | string | Besides, `device_id` can also be set by execution provider option. -#### C++ API example +
+ Click to expand C++ API example + ```c++ Ort::SessionOptions session_options; @@ -263,7 +258,11 @@ trt_options.trt_dump_subgraphs = 1; session_options.AppendExecutionProvider_TensorRT(trt_options); ``` -#### Python API example +
+ +
+ Click to expand Python API example + ```python import onnxruntime as ort @@ -289,6 +288,106 @@ sess_opt = ort.SessionOptions() sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=providers) ``` +
+ +### Environment Variables +Following environment variables can be set for TensorRT execution provider. + +* `ORT_TENSORRT_MAX_WORKSPACE_SIZE`: maximum workspace size for TensorRT engine. Default value: 1073741824 (1GB). + +* `ORT_TENSORRT_MAX_PARTITION_ITERATIONS`: maximum number of iterations allowed in model partitioning for TensorRT. If target model can't be successfully partitioned when the maximum number of iterations is reached, the whole model will fall back to other execution providers such as CUDA or CPU. Default value: 1000. + +* `ORT_TENSORRT_MIN_SUBGRAPH_SIZE`: minimum node size in a subgraph after partitioning. Subgraphs with smaller size will fall back to other execution providers. Default value: 1. + +* `ORT_TENSORRT_FP16_ENABLE`: Enable FP16 mode in TensorRT. 1: enabled, 0: disabled. Default value: 0. Note not all Nvidia GPUs support FP16 precision. + +* `ORT_TENSORRT_INT8_ENABLE`: Enable INT8 mode in TensorRT. 1: enabled, 0: disabled. Default value: 0. Note not all Nvidia GPUs support INT8 precision. + +* `ORT_TENSORRT_INT8_CALIBRATION_TABLE_NAME`: Specify INT8 calibration table file for non-QDQ models in INT8 mode. Note calibration table should not be provided for QDQ model because TensorRT doesn't allow calibration table to be loded if there is any Q/DQ node in the model. By default the name is empty. + +* `ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE`: Select what calibration table is used for non-QDQ models in INT8 mode. If 1, native TensorRT generated calibration table is used; if 0, ONNXRUNTIME tool generated calibration table is used. Default value: 0. + * **Note: Please copy up-to-date calibration table file to `ORT_TENSORRT_CACHE_PATH` before inference. Calibration table is specific to models and calibration data sets. Whenever new calibration table is generated, old file in the path should be cleaned up or be replaced.** + +* `ORT_TENSORRT_DLA_ENABLE`: Enable DLA (Deep Learning Accelerator). 1: enabled, 0: disabled. Default value: 0. Note not all Nvidia GPUs support DLA. + +* `ORT_TENSORRT_DLA_CORE`: Specify DLA core to execute on. Default value: 0. + +* `ORT_TENSORRT_ENGINE_CACHE_ENABLE`: Enable TensorRT engine caching. The purpose of using engine caching is to save engine build time in the case that TensorRT may take long time to optimize and build engine. Engine will be cached when it's built for the first time so next time when new inference session is created the engine can be loaded directly from cache. In order to validate that the loaded engine is usable for current inference, engine profile is also cached and loaded along with engine. If current input shapes are in the range of the engine profile, the loaded engine can be safely used. Otherwise if input shapes are out of range, profile cache will be updated to cover the new shape and engine will be recreated based on the new profile (and also refreshed in the engine cache). Note each engine is created for specific settings such as model path/name, precision (FP32/FP16/INT8 etc), workspace, profiles etc, and specific GPUs and it's not portable, so it's essential to make sure those settings are not changing, otherwise the engine needs to be rebuilt and cached again. 1: enabled, 0: disabled. Default value: 0. + * **Warning: Please clean up any old engine and profile cache files (.engine and .profile) if any of the following changes:** + * Model changes (if there are any changes to the model topology, opset version, operators etc.) + * ORT version changes (i.e. 
moving from ORT version 1.8 to 1.9) + * TensorRT version changes (i.e. moving from TensorRT 7.0 to 8.0) + * Hardware changes. (Engine and profile files are not portable and optimized for specific Nvidia hardware) + +* `ORT_TENSORRT_CACHE_PATH`: Specify path for TensorRT engine and profile files if `ORT_TENSORRT_ENGINE_CACHE_ENABLE` is 1, or path for INT8 calibration table file if ORT_TENSORRT_INT8_ENABLE is 1. + +* `ORT_TENSORRT_DUMP_SUBGRAPHS`: Dumps the subgraphs that are transformed into TRT engines in onnx format to the filesystem. This can help debugging subgraphs, e.g. by using `trtexec --onnx my_model.onnx` and check the outputs of the parser. 1: enabled, 0: disabled. Default value: 0. + +* `ORT_TENSORRT_FORCE_SEQUENTIAL_ENGINE_BUILD`: Sequentially build TensorRT engines across provider instances in multi-GPU environment. 1: enabled, 0: disabled. Default value: 0. + +* `ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE`: Share execution context memory between TensorRT subgraphs. Default 0 = false, nonzero = true. + +* `ORT_TENSORRT_LAYER_NORM_FP32_FALLBACK`: Force Pow + Reduce ops in layer norm to FP32. Default 0 = false, nonzero = true. + +* `ORT_TENSORRT_TIMING_CACHE_ENABLE`: Enable TensorRT timing cache. Default 0 = false, nonzero = true. Check [Timing cache](#timing-cache) for details. + +* `ORT_TENSORRT_FORCE_TIMING_CACHE_ENABLE`: Force the TensorRT timing cache to be used even if device profile does not match. Default 0 = false, nonzero = true. + +* `ORT_TENSORRT_DETAILED_BUILD_LOG_ENABLE`: Enable detailed build step logging on TensorRT EP with timing for each engine build. Default 0 = false, nonzero = true. + +* `ORT_TENSORRT_BUILD_HEURISTICS_ENABLE`: Build engine using heuristics to reduce build time. Default 0 = false, nonzero = true. + +* `ORT_TENSORRT_SPARSITY_ENABLE`: Control if sparsity can be used by TRT. Default 0 = false, 1 = true. Check `--sparsity` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags). + +* `ORT_TENSORRT_BUILDER_OPTIMIZATION_LEVEL`: Set the builder optimization level. WARNING: levels below 3 do not guarantee good engine performance, but greatly improve build time. Default 3, valid range [0-5]. Check `--builderOptimizationLevel` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags). + +* `ORT_TENSORRT_AUXILIARY_STREAMS`: Set maximum number of auxiliary streams per inference stream. Setting this value to 0 will lead to optimal memory usage. Default -1 = heuristics. Check `--maxAuxStreams` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags). + +* `ORT_TENSORRT_TACTIC_SOURCES`: Specify the tactics to be used by adding (+) or removing (-) tactics from the default tactic sources (default = all available tactics) e.g. "-CUDNN,+CUBLAS" available keys: "CUBLAS", "CUBLAS_LT", "CUDNN" or "EDGE_MASK_CONVOLUTIONS". + +* `ORT_TENSORRT_EXTRA_PLUGIN_LIB_PATHS`: Specify extra TensorRT plugin library paths. ORT TRT by default supports any TRT plugins registered in TRT registry in TRT plugin library (i.e., `libnvinfer_plugin.so`). Moreover, if users want to use other TRT plugins that are not in TRT plugin library, for example, FasterTransformer has many TRT plugin implementations for different models, user can specify like this `ORT_TENSORRT_EXTRA_PLUGIN_LIB_PATHS=libvit_plugin.so;libvit_int8_plugin.so`. 
+ +* `ORT_TENSORRT_PROFILE_MIN_SHAPES`, `ORT_TENSORRT_PROFILE_MAX_SHAPES` and `ORT_TENSORRT_PROFILE_OPT_SHAPES` : Build with dynamic shapes using a profile with the min/max/opt shapes provided. The format of the profile shapes is "input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,..." and these three flags should all be provided in order to enable explicit profile shapes feature. Check [Explicit shape range for dynamic shape input](#explicit-shape-range-for-dynamic-shape-input) and TRT doc [optimization profiles](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt_profiles) for more details. + +One can override default values by setting environment variables. +
+ Click to expand examples on Linux: + + # Override default max workspace size to 2GB + export ORT_TENSORRT_MAX_WORKSPACE_SIZE=2147483648 + + # Override default maximum number of iterations to 10 + export ORT_TENSORRT_MAX_PARTITION_ITERATIONS=10 + + # Override default minimum subgraph node size to 5 + export ORT_TENSORRT_MIN_SUBGRAPH_SIZE=5 + + # Enable FP16 mode in TensorRT + export ORT_TENSORRT_FP16_ENABLE=1 + + # Enable INT8 mode in TensorRT + export ORT_TENSORRT_INT8_ENABLE=1 + + # Use native TensorRT calibration table + export ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE=1 + + # Enable TensorRT engine caching + export ORT_TENSORRT_ENGINE_CACHE_ENABLE=1 + # Please Note warning above. This feature is experimental. + # Engine cache files must be invalidated if there are any changes to the model, ORT version, TensorRT version or if the underlying hardware changes. Engine files are not portable across devices. + + # Specify TensorRT cache path + export ORT_TENSORRT_CACHE_PATH="/path/to/cache" + + # Dump out subgraphs to run on TensorRT + export ORT_TENSORRT_DUMP_SUBGRAPHS=1 + + # Enable context memory sharing between TensorRT subgraphs. Default 0 = false, nonzero = true + export ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE=1 + +
+ + ## Performance Tuning For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](./../performance/tune-performance/index.md) From 63cfa93ed6e5953dc68f09bfb69cad526409b6d0 Mon Sep 17 00:00:00 2001 From: yf711 Date: Wed, 26 Jul 2023 12:48:52 -0700 Subject: [PATCH 02/14] restore sample code UI --- .../TensorRT-ExecutionProvider.md | 66 ++++++++----------- 1 file changed, 28 insertions(+), 38 deletions(-) diff --git a/docs/execution-providers/TensorRT-ExecutionProvider.md b/docs/execution-providers/TensorRT-ExecutionProvider.md index aedf8feeddc8a..bce74d260db9d 100644 --- a/docs/execution-providers/TensorRT-ExecutionProvider.md +++ b/docs/execution-providers/TensorRT-ExecutionProvider.md @@ -106,9 +106,8 @@ sess = ort.InferenceSession('model.onnx', providers=['TensorrtExecutionProvider' ``` ## Configurations -There are two ways to configure TensorRT settings, either by TensorRT execution provider options(recommended) or environment variables. +There are two ways to configure TensorRT settings, either by **TensorRT Execution Provider Session Options(recommended)** or **Environment Variables** shown as below: -There are one-to-one mappings between **TensorRT Execution Provider Session Options** and **Environment Variables** shown as below: | TensorRT EP Session Options | Environment Variables | Type | |:--------------------------------------|:-----------------------------------------------|:-------| @@ -236,8 +235,7 @@ TensorRT configurations can be set by execution provider options. It's useful wh Besides, `device_id` can also be set by execution provider option. -
- Click to expand C++ API example +#### C++ API EXAMPLE ```c++ @@ -258,10 +256,7 @@ trt_options.trt_dump_subgraphs = 1; session_options.AppendExecutionProvider_TensorRT(trt_options); ``` -
- -
- Click to expand Python API example +#### PYTHON API EXAMPLE ```python import onnxruntime as ort @@ -288,8 +283,6 @@ sess_opt = ort.SessionOptions() sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=providers) ``` -
- ### Environment Variables Following environment variables can be set for TensorRT execution provider. @@ -349,44 +342,41 @@ Following environment variables can be set for TensorRT execution provider. * `ORT_TENSORRT_PROFILE_MIN_SHAPES`, `ORT_TENSORRT_PROFILE_MAX_SHAPES` and `ORT_TENSORRT_PROFILE_OPT_SHAPES` : Build with dynamic shapes using a profile with the min/max/opt shapes provided. The format of the profile shapes is "input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,..." and these three flags should all be provided in order to enable explicit profile shapes feature. Check [Explicit shape range for dynamic shape input](#explicit-shape-range-for-dynamic-shape-input) and TRT doc [optimization profiles](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt_profiles) for more details. -One can override default values by setting environment variables. -
- Click to expand examples on Linux: - - # Override default max workspace size to 2GB - export ORT_TENSORRT_MAX_WORKSPACE_SIZE=2147483648 +One can override default values by setting environment variables. e.g. on Linux: - # Override default maximum number of iterations to 10 - export ORT_TENSORRT_MAX_PARTITION_ITERATIONS=10 +```bash +# Override default max workspace size to 2GB +export ORT_TENSORRT_MAX_WORKSPACE_SIZE=2147483648 - # Override default minimum subgraph node size to 5 - export ORT_TENSORRT_MIN_SUBGRAPH_SIZE=5 +# Override default maximum number of iterations to 10 +export ORT_TENSORRT_MAX_PARTITION_ITERATIONS=10 - # Enable FP16 mode in TensorRT - export ORT_TENSORRT_FP16_ENABLE=1 +# Override default minimum subgraph node size to 5 +export ORT_TENSORRT_MIN_SUBGRAPH_SIZE=5 - # Enable INT8 mode in TensorRT - export ORT_TENSORRT_INT8_ENABLE=1 +# Enable FP16 mode in TensorRT +export ORT_TENSORRT_FP16_ENABLE=1 - # Use native TensorRT calibration table - export ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE=1 +# Enable INT8 mode in TensorRT +export ORT_TENSORRT_INT8_ENABLE=1 - # Enable TensorRT engine caching - export ORT_TENSORRT_ENGINE_CACHE_ENABLE=1 - # Please Note warning above. This feature is experimental. - # Engine cache files must be invalidated if there are any changes to the model, ORT version, TensorRT version or if the underlying hardware changes. Engine files are not portable across devices. +# Use native TensorRT calibration table +export ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE=1 - # Specify TensorRT cache path - export ORT_TENSORRT_CACHE_PATH="/path/to/cache" +# Enable TensorRT engine caching +export ORT_TENSORRT_ENGINE_CACHE_ENABLE=1 +# Please Note warning above. This feature is experimental. +# Engine cache files must be invalidated if there are any changes to the model, ORT version, TensorRT version or if the underlying hardware changes. Engine files are not portable across devices. - # Dump out subgraphs to run on TensorRT - export ORT_TENSORRT_DUMP_SUBGRAPHS=1 +# Specify TensorRT cache path +export ORT_TENSORRT_CACHE_PATH="/path/to/cache" - # Enable context memory sharing between TensorRT subgraphs. Default 0 = false, nonzero = true - export ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE=1 - -
+# Dump out subgraphs to run on TensorRT +export ORT_TENSORRT_DUMP_SUBGRAPHS=1 +# Enable context memory sharing between TensorRT subgraphs. Default 0 = false, nonzero = true +export ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE=1 +``` ## Performance Tuning For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](./../performance/tune-performance/index.md) From 8fe71720db3e0a5e1127aa94357aa456febfb380 Mon Sep 17 00:00:00 2001 From: yf711 Date: Wed, 26 Jul 2023 15:52:44 -0700 Subject: [PATCH 03/14] mark Environment Variables as deprecated --- docs/execution-providers/TensorRT-ExecutionProvider.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/docs/execution-providers/TensorRT-ExecutionProvider.md b/docs/execution-providers/TensorRT-ExecutionProvider.md index bce74d260db9d..45924d3d30f4d 100644 --- a/docs/execution-providers/TensorRT-ExecutionProvider.md +++ b/docs/execution-providers/TensorRT-ExecutionProvider.md @@ -106,7 +106,7 @@ sess = ort.InferenceSession('model.onnx', providers=['TensorrtExecutionProvider' ``` ## Configurations -There are two ways to configure TensorRT settings, either by **TensorRT Execution Provider Session Options(recommended)** or **Environment Variables** shown as below: +There are two ways to configure TensorRT settings, either by **TensorRT Execution Provider Session Options(recommended)** or **Environment Variables(deprecated)** shown as below: | TensorRT EP Session Options | Environment Variables | Type | @@ -283,7 +283,11 @@ sess_opt = ort.SessionOptions() sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=providers) ``` -### Environment Variables +### Environment Variables(deprecated) + +
+Click to expand: + Following environment variables can be set for TensorRT execution provider. * `ORT_TENSORRT_MAX_WORKSPACE_SIZE`: maximum workspace size for TensorRT engine. Default value: 1073741824 (1GB). @@ -378,6 +382,8 @@ export ORT_TENSORRT_DUMP_SUBGRAPHS=1 export ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE=1 ``` +
+ ## Performance Tuning For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](./../performance/tune-performance/index.md) From d43ddf7b28b9d54e0843e1226189517fbc8106da Mon Sep 17 00:00:00 2001 From: yf711 Date: Wed, 26 Jul 2023 16:10:23 -0700 Subject: [PATCH 04/14] test --- docs/execution-providers/TensorRT-ExecutionProvider.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/execution-providers/TensorRT-ExecutionProvider.md b/docs/execution-providers/TensorRT-ExecutionProvider.md index 45924d3d30f4d..f9f1509c93cb3 100644 --- a/docs/execution-providers/TensorRT-ExecutionProvider.md +++ b/docs/execution-providers/TensorRT-ExecutionProvider.md @@ -286,7 +286,6 @@ sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=provide ### Environment Variables(deprecated)
-Click to expand: Following environment variables can be set for TensorRT execution provider. From 259d262a0b38c5b9ccefc5f8939d40ff874ff18a Mon Sep 17 00:00:00 2001 From: yf711 Date: Wed, 26 Jul 2023 16:13:13 -0700 Subject: [PATCH 05/14] test --- docs/execution-providers/TensorRT-ExecutionProvider.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/docs/execution-providers/TensorRT-ExecutionProvider.md b/docs/execution-providers/TensorRT-ExecutionProvider.md index f9f1509c93cb3..50639abb47a28 100644 --- a/docs/execution-providers/TensorRT-ExecutionProvider.md +++ b/docs/execution-providers/TensorRT-ExecutionProvider.md @@ -286,6 +286,11 @@ sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=provide ### Environment Variables(deprecated)
+ Click to expand! + + + +
Following environment variables can be set for TensorRT execution provider. @@ -381,8 +386,6 @@ export ORT_TENSORRT_DUMP_SUBGRAPHS=1 export ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE=1 ``` -
- ## Performance Tuning For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](./../performance/tune-performance/index.md) From 940eb4b088f24b91d3b16c772e717bd13281a432 Mon Sep 17 00:00:00 2001 From: yf711 Date: Wed, 26 Jul 2023 16:29:07 -0700 Subject: [PATCH 06/14] Fold example code --- .../TensorRT-ExecutionProvider.md | 133 +++++++++++------- 1 file changed, 79 insertions(+), 54 deletions(-) diff --git a/docs/execution-providers/TensorRT-ExecutionProvider.md b/docs/execution-providers/TensorRT-ExecutionProvider.md index 50639abb47a28..f2562b772c096 100644 --- a/docs/execution-providers/TensorRT-ExecutionProvider.md +++ b/docs/execution-providers/TensorRT-ExecutionProvider.md @@ -55,56 +55,22 @@ Ort::Session session(env, model_path, sf); The C API details are [here](../get-started/with-c.md). -### Shape Inference for TensorRT Subgraphs -If some operators in the model are not supported by TensorRT, ONNX Runtime will partition the graph and only send supported subgraphs to TensorRT execution provider. Because TensorRT requires that all inputs of the subgraphs have shape specified, ONNX Runtime will throw error if there is no input shape info. In this case please run shape inference for the entire model first by running script [here](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/symbolic_shape_infer.py) (Check below for sample). - -### TensorRT Plugins Support -ORT TRT can leverage the TRT plugins which come with TRT plugin library in official release. To use TRT plugins, firstly users need to create the custom node (a one-to-one mapping to TRT plugin) with a registered plugin name and `trt.plugins` domain in the ONNX model. So, ORT TRT can recognize this custom node and pass the node together with the subgraph to TRT. Please see following python example to create a new custom node in the ONNX model: - -```python -from onnx import TensorProto, helper - -def generate_model(model_name): - nodes = [ - helper.make_node( - "DisentangledAttention_TRT", # The registered name is from https://github.com/NVIDIA/TensorRT/blob/main/plugin/disentangledAttentionPlugin/disentangledAttentionPlugin.cpp#L36 - ["input1", "input2", "input3"], - ["output"], - "DisentangledAttention_TRT", - domain="trt.plugins", # The domain has to be "trt.plugins" - factor=0.123, - span=128, - ), - ] - - graph = helper.make_graph( - nodes, - "trt_plugin_custom_op", - [ # input - helper.make_tensor_value_info("input1", TensorProto.FLOAT, [12, 256, 256]), - helper.make_tensor_value_info("input2", TensorProto.FLOAT, [12, 256, 256]), - helper.make_tensor_value_info("input3", TensorProto.FLOAT, [12, 256, 256]), - ], - [ # output - helper.make_tensor_value_info("output", TensorProto.FLOAT, [12, 256, 256]), - ], - ) - - model = helper.make_model(graph) - onnx.save(model, model_name) -``` -Note: If users want to use TRT plugins that are not in the TRT plugin library in official release, please see the ORT TRT provider option `trt_extra_plugin_lib_paths` for more details. - ### Python To use TensorRT execution provider, you must explicitly register TensorRT execution provider when instantiating the `InferenceSession`. Note that it is recommended you also register `CUDAExecutionProvider` to allow Onnx Runtime to assign nodes to CUDA execution provider that TensorRT does not support. +Click below for Python API example: + +
+ ```python import onnxruntime as ort # set providers to ['TensorrtExecutionProvider', 'CUDAExecutionProvider'] with TensorrtExecutionProvider having the higher priority. sess = ort.InferenceSession('model.onnx', providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider']) ``` +
+ ## Configurations There are two ways to configure TensorRT settings, either by **TensorRT Execution Provider Session Options(recommended)** or **Environment Variables(deprecated)** shown as below: @@ -235,7 +201,9 @@ TensorRT configurations can be set by execution provider options. It's useful wh Besides, `device_id` can also be set by execution provider option. -#### C++ API EXAMPLE +#### Click below for C++ API example: + +
```c++ @@ -256,7 +224,11 @@ trt_options.trt_dump_subgraphs = 1; session_options.AppendExecutionProvider_TensorRT(trt_options); ``` -#### PYTHON API EXAMPLE +
+ +#### Click below for Python API example: + +
```python import onnxruntime as ort @@ -283,16 +255,15 @@ sess_opt = ort.SessionOptions() sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=providers) ``` +
+ ### Environment Variables(deprecated) -
- Click to expand! +Following environment variables can be set for TensorRT execution provider. Click below for more details. - - -
- -Following environment variables can be set for TensorRT execution provider. +
+ + Click to expand * `ORT_TENSORRT_MAX_WORKSPACE_SIZE`: maximum workspace size for TensorRT engine. Default value: 1073741824 (1GB). @@ -386,11 +357,60 @@ export ORT_TENSORRT_DUMP_SUBGRAPHS=1 export ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE=1 ``` +
+ ## Performance Tuning For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](./../performance/tune-performance/index.md) When/if using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest#onnxruntime-performance-test), use the flag `-e tensorrt`. Check below for sample. +### Shape Inference for TensorRT Subgraphs +If some operators in the model are not supported by TensorRT, ONNX Runtime will partition the graph and only send supported subgraphs to TensorRT execution provider. Because TensorRT requires that all inputs of the subgraphs have shape specified, ONNX Runtime will throw error if there is no input shape info. In this case please run shape inference for the entire model first by running script [here](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/symbolic_shape_infer.py) (Check below for sample). + +### TensorRT Plugins Support +ORT TRT can leverage the TRT plugins which come with TRT plugin library in official release. To use TRT plugins, firstly users need to create the custom node (a one-to-one mapping to TRT plugin) with a registered plugin name and `trt.plugins` domain in the ONNX model. So, ORT TRT can recognize this custom node and pass the node together with the subgraph to TRT. Please see following python example to create a new custom node in the ONNX model: + +Click below for Python API example: + +
+ +```python +from onnx import TensorProto, helper + +def generate_model(model_name): + nodes = [ + helper.make_node( + "DisentangledAttention_TRT", # The registered name is from https://github.com/NVIDIA/TensorRT/blob/main/plugin/disentangledAttentionPlugin/disentangledAttentionPlugin.cpp#L36 + ["input1", "input2", "input3"], + ["output"], + "DisentangledAttention_TRT", + domain="trt.plugins", # The domain has to be "trt.plugins" + factor=0.123, + span=128, + ), + ] + + graph = helper.make_graph( + nodes, + "trt_plugin_custom_op", + [ # input + helper.make_tensor_value_info("input1", TensorProto.FLOAT, [12, 256, 256]), + helper.make_tensor_value_info("input2", TensorProto.FLOAT, [12, 256, 256]), + helper.make_tensor_value_info("input3", TensorProto.FLOAT, [12, 256, 256]), + ], + [ # output + helper.make_tensor_value_info("output", TensorProto.FLOAT, [12, 256, 256]), + ], + ) + + model = helper.make_model(graph) + onnx.save(model, model_name) +``` + +
+ +Note: If users want to use TRT plugins that are not in the TRT plugin library in official release, please see the ORT TRT provider option `trt_extra_plugin_lib_paths` for more details. + ### Timing cache Enabling `trt_timing_cache_enable` will enable ORT TRT to use TensorRT timing cache to accelerate engine build time on a device with the same compute capability. This will work across models as it simply stores kernel latencies for specific configurations. Those files are usually very small (only a few KB or MB) which makes them very easy to ship with an application to accelerate the build time on the user end. @@ -401,7 +421,9 @@ The following examples shows build time reduction with timing cache: |efficientnet-lite4-11 | 34.6 s | 7.7 s| |yolov4 | 108.62 s | 9.4 s| -Here is a python example: +Click below for Python example: + +
```python import onnxruntime as ort @@ -429,11 +451,10 @@ sess.run( None, {"input_ids": np.zeros((1, 77), dtype=np.int32)} ) - - - ``` +
+ ### Explicit shape range for dynamic shape input ORT TRT lets you explicitly specify min/max/opt shapes for each dynamic shape input through three provider options, `trt_profile_min_shapes`, `trt_profile_max_shapes` and `trt_profile_opt_shapes`. If these three provider options are not specified @@ -442,7 +463,9 @@ and model has dynamic shape input, ORT TRT will determine the min/max/opt shapes To use the engine cache built with optimization profiles specified by explicit shape ranges, user still needs to provide those three provider options as well as engine cache enable flag. ORT TRT will firstly compare the shape ranges of those three provider options with the shape ranges saved in the .profile file, and then rebuild the engine if the shape ranges don't match. -Here is a python example: +Click below for Python example: + +
```python import onnxruntime as ort @@ -489,6 +512,8 @@ sess.run(None, args) ``` +
+ Please note that there is a constraint of using this explicit shape range feature, i.e., all the dynamic shape inputs should be provided with corresponding min/max/opt shapes. From 6f8698a4d757103332da9bda38697c0b48252c3f Mon Sep 17 00:00:00 2001 From: yf711 Date: Wed, 26 Jul 2023 16:32:28 -0700 Subject: [PATCH 07/14] clean --- docs/execution-providers/TensorRT-ExecutionProvider.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/execution-providers/TensorRT-ExecutionProvider.md b/docs/execution-providers/TensorRT-ExecutionProvider.md index f2562b772c096..4fb4eb95ddde8 100644 --- a/docs/execution-providers/TensorRT-ExecutionProvider.md +++ b/docs/execution-providers/TensorRT-ExecutionProvider.md @@ -262,9 +262,7 @@ sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=provide Following environment variables can be set for TensorRT execution provider. Click below for more details.
- - Click to expand - + * `ORT_TENSORRT_MAX_WORKSPACE_SIZE`: maximum workspace size for TensorRT engine. Default value: 1073741824 (1GB). * `ORT_TENSORRT_MAX_PARTITION_ITERATIONS`: maximum number of iterations allowed in model partitioning for TensorRT. If target model can't be successfully partitioned when the maximum number of iterations is reached, the whole model will fall back to other execution providers such as CUDA or CPU. Default value: 1000. From 5a6d2228516a8412631c27cfd1a99f66ee3554a1 Mon Sep 17 00:00:00 2001 From: yf711 Date: Wed, 26 Jul 2023 16:37:11 -0700 Subject: [PATCH 08/14] update --- docs/execution-providers/TensorRT-ExecutionProvider.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/execution-providers/TensorRT-ExecutionProvider.md b/docs/execution-providers/TensorRT-ExecutionProvider.md index 4fb4eb95ddde8..2730ed9f810e5 100644 --- a/docs/execution-providers/TensorRT-ExecutionProvider.md +++ b/docs/execution-providers/TensorRT-ExecutionProvider.md @@ -72,10 +72,10 @@ sess = ort.InferenceSession('model.onnx', providers=['TensorrtExecutionProvider'
## Configurations -There are two ways to configure TensorRT settings, either by **TensorRT Execution Provider Session Options(recommended)** or **Environment Variables(deprecated)** shown as below: +There are two ways to configure TensorRT settings, either by **TensorRT Execution Provider Session Options** or **Environment Variables(deprecated)** shown as below: -| TensorRT EP Session Options | Environment Variables | Type | +| TensorRT EP Session Options | Environment Variables(deprecated) | Type | |:--------------------------------------|:-----------------------------------------------|:-------| | trt_max_workspace_size | ORT_TENSORRT_MAX_WORKSPACE_SIZE | int | | trt_max_partition_iterations | ORT_TENSORRT_MAX_PARTITION_ITERATIONS | int | @@ -195,7 +195,8 @@ TensorRT configurations can be set by execution provider options. It's useful wh * for example, FasterTransformer has many TRT plugin implementations for different models, user can specify like this `ORT_TENSORRT_EXTRA_PLUGIN_LIB_PATHS=libvit_plugin.so;libvit_int8_plugin.so`. * `trt_profile_min_shapes`, `trt_profile_max_shapes` and `trt_profile_opt_shapes` : Build with dynamic shapes using a profile with the min/max/opt shapes provided. - * The format of the profile shapes is "input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,..." and these three flags should all be provided in order to enable explicit profile shapes feature. + * The format of the profile shapes is `input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,...` + * These three flags should all be provided in order to enable explicit profile shapes feature. * Check [Explicit shape range for dynamic shape input](#explicit-shape-range-for-dynamic-shape-input) and TRT doc [optimization profiles](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt_profiles) for more details. @@ -206,7 +207,6 @@ Besides, `device_id` can also be set by execution provider option.
```c++ - Ort::SessionOptions session_options; OrtTensorRTProviderOptions trt_options{}; From 29b7bba198594f95c1ccd850a95efa35b5adb37c Mon Sep 17 00:00:00 2001 From: yf711 Date: Wed, 26 Jul 2023 16:41:19 -0700 Subject: [PATCH 09/14] revert --- docs/execution-providers/TensorRT-ExecutionProvider.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/docs/execution-providers/TensorRT-ExecutionProvider.md b/docs/execution-providers/TensorRT-ExecutionProvider.md index 2730ed9f810e5..1df0c467b1560 100644 --- a/docs/execution-providers/TensorRT-ExecutionProvider.md +++ b/docs/execution-providers/TensorRT-ExecutionProvider.md @@ -59,18 +59,12 @@ The C API details are [here](../get-started/with-c.md). To use TensorRT execution provider, you must explicitly register TensorRT execution provider when instantiating the `InferenceSession`. Note that it is recommended you also register `CUDAExecutionProvider` to allow Onnx Runtime to assign nodes to CUDA execution provider that TensorRT does not support. -Click below for Python API example: - -
- ```python import onnxruntime as ort # set providers to ['TensorrtExecutionProvider', 'CUDAExecutionProvider'] with TensorrtExecutionProvider having the higher priority. sess = ort.InferenceSession('model.onnx', providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider']) ``` -
- ## Configurations There are two ways to configure TensorRT settings, either by **TensorRT Execution Provider Session Options** or **Environment Variables(deprecated)** shown as below: From 76bc556d588e9266b151c3d30099ca24f5917c63 Mon Sep 17 00:00:00 2001 From: yf711 Date: Mon, 30 Oct 2023 15:59:22 -0700 Subject: [PATCH 10/14] version update --- docs/build/eps.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/build/eps.md b/docs/build/eps.md index 32990f5dad959..43b5d890a3d5e 100644 --- a/docs/build/eps.md +++ b/docs/build/eps.md @@ -142,7 +142,7 @@ Dockerfile instructions are available [here](https://github.com/microsoft/onnxru ### Build Instructions {: .no_toc } -These instructions are for JetPack SDK 4.6.1. +These instructions are for JetPack SDK 5.1.2. 1. Clone the ONNX Runtime repo on the Jetson host @@ -165,17 +165,17 @@ These instructions are for JetPack SDK 4.6.1. export PATH="/usr/local/cuda/bin:${PATH}" ``` -3. Install the ONNX Runtime build dependencies on the Jetpack 4.6.1 host: +3. Install the ONNX Runtime build dependencies on the Jetpack 5.1.2 host: ```bash sudo apt install -y --no-install-recommends \ build-essential software-properties-common libopenblas-dev \ - libpython3.6-dev python3-pip python3-dev python3-setuptools python3-wheel + libpython3.8-dev python3-pip python3-dev python3-setuptools python3-wheel ``` -4. Cmake is needed to build ONNX Runtime. Because the minimum required version is 3.18, +4. Cmake is needed to build ONNX Runtime. Because the minimum required version is 3.26, it is necessary to build CMake from source. Download Unix/Linux sources from https://cmake.org/download/ - and follow https://cmake.org/install/ to build from source. Version 3.23.0 has been tested on Jetson. + and follow https://cmake.org/install/ to build from source. Version 3.27 has been tested on Jetson. 5. Build the ONNX Runtime Python wheel: From 2d9327ca8b16aa1bd0df391b21456cac6e66d38f Mon Sep 17 00:00:00 2001 From: yf711 Date: Tue, 31 Oct 2023 14:48:29 -0700 Subject: [PATCH 11/14] dep update --- docs/build/eps.md | 76 +++++++++++++++++++++++++++-------------------- 1 file changed, 44 insertions(+), 32 deletions(-) diff --git a/docs/build/eps.md b/docs/build/eps.md index 43b5d890a3d5e..53cfc9bc24883 100644 --- a/docs/build/eps.md +++ b/docs/build/eps.md @@ -142,58 +142,70 @@ Dockerfile instructions are available [here](https://github.com/microsoft/onnxru ### Build Instructions {: .no_toc } -These instructions are for JetPack SDK 5.1.2. +These instructions are for the latest [JetPack SDK 5.1.2](https://developer.nvidia.com/embedded/jetpack-sdk-512). 1. Clone the ONNX Runtime repo on the Jetson host - ```bash - git clone --recursive https://github.com/microsoft/onnxruntime - ``` + ```bash + git clone --recursive https://github.com/microsoft/onnxruntime + ``` 2. Specify the CUDA compiler, or add its location to the PATH. - Cmake can't automatically find the correct nvcc if it's not in the PATH. + 1. Starting with **CUDA 11.8**, Jetson users on **JetPack 5.0+** can upgrade to the latest CUDA release without updating the JetPack version or Jetson Linux BSP (Board Support Package). CUDA version 11.8 with JetPack 5.1.2 has been tested on Jetson when building ONNX Runtime 1.16. - ```bash - export CUDACXX="/usr/local/cuda/bin/nvcc" + 1. Check [this official blog](https://developer.nvidia.com/blog/simplifying-cuda-upgrades-for-nvidia-jetson-users/) for CUDA 11.8 upgrade instruction. - ``` + 2. 
CUDA 12.x might only be available to the latest Jetson Orin series. - or: + 2. CMake can't automatically find the correct `nvcc` if it's not in the `PATH`. `nvcc` can be added to `PATH` via: - ```bash - export PATH="/usr/local/cuda/bin:${PATH}" - ``` + ```bash + export PATH="/usr/local/cuda/bin:${PATH}" + ``` + + or: + + ```bash + export CUDACXX="/usr/local/cuda/bin/nvcc" + ``` 3. Install the ONNX Runtime build dependencies on the Jetpack 5.1.2 host: - ```bash - sudo apt install -y --no-install-recommends \ - build-essential software-properties-common libopenblas-dev \ - libpython3.8-dev python3-pip python3-dev python3-setuptools python3-wheel - ``` + ```bash + sudo apt install -y --no-install-recommends \ + build-essential software-properties-common libopenblas-dev \ + libpython3.8-dev python3-pip python3-dev python3-setuptools python3-wheel + ``` -4. Cmake is needed to build ONNX Runtime. Because the minimum required version is 3.26, - it is necessary to build CMake from source. Download Unix/Linux sources from https://cmake.org/download/ - and follow https://cmake.org/install/ to build from source. Version 3.27 has been tested on Jetson. +4. Cmake is needed to build ONNX Runtime. For ONNX Runtime 1.16, the minimum required CMake version is 3.26 (version 3.27.4 has been tested). This can be either installed by: -5. Build the ONNX Runtime Python wheel: + 1. (Unix/Linux) Build from source. Download sources from https://cmake.org/download/ + and follow https://cmake.org/install/ to build from source. + 2. (Ubuntu) Install deb package via apt repository: e.g https://apt.kitware.com/ - ```bash - ./build.sh --config Release --update --build --parallel --build_wheel \ - --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu - ``` +5. Build the ONNX Runtime Python wheel (update path to CUDA/CUDNN/TensorRT libraries if necessary): - Note: You may optionally build with TensorRT support. + 1. Build `onnxruntime-gpu` wheel with CUDA support: - ```bash - ./build.sh --config Release --update --build --parallel --build_wheel \ - --use_tensorrt --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu \ - --tensorrt_home /usr/lib/aarch64-linux-gnu - ``` + ```bash + ./build.sh --config Release --update --build --parallel --build_wheel \ + --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu + ``` ---- + 2. Build `onnxruntime-gpu` wheel with additional TensorRT support: + + ```bash + ./build.sh --config Release --update --build --parallel --build_wheel \ + --use_tensorrt --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu \ + --tensorrt_home /usr/lib/aarch64-linux-gnu + ``` + +​ Notes: + +* By default, `onnxruntime-gpu` wheel file will be captured under `path_to/onnxruntime/build/Linux/Release/dist/` (build path can be customized by adding `--build_dir` followed by a customized path to the build command above). +* For a portion of Jetson devices like the Xavier series, higher power mode involves more cores (up to 6) to compute but it consumes more resources when building ONNX Runtime. Update `--parallel` with smaller value if system resource is limited and OOM happens. ## oneDNN See more information on oneDNN (formerly DNNL) [here](../execution-providers/oneDNN-ExecutionProvider.md). 
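As a quick sanity check for the Jetson build steps documented above, the following minimal sketch can be run once the `onnxruntime-gpu` wheel from `build/Linux/Release/dist/` has been installed with `pip`. The `model.onnx` path is a placeholder for any ONNX model available on the device; the snippet only verifies provider registration and session creation.

```python
# Minimal post-build smoke test for a Jetson-built onnxruntime-gpu wheel.
# Assumes the wheel under build/Linux/Release/dist/ has been pip-installed
# and that model.onnx is a placeholder for any ONNX model on the device.
import onnxruntime as ort

print("ONNX Runtime version:", ort.__version__)
print("Available providers:", ort.get_available_providers())

# List TensorrtExecutionProvider first so TensorRT gets priority, with CUDA
# and CPU as fallbacks for any subgraphs TensorRT cannot handle.
sess = ort.InferenceSession(
    "model.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
print("Session providers:", sess.get_providers())
```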
From 915ab77c3dbcba5ffefefef2878389d444f474b3 Mon Sep 17 00:00:00 2001 From: yf711 Date: Tue, 31 Oct 2023 15:00:33 -0700 Subject: [PATCH 12/14] link --- docs/build/eps.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/build/eps.md b/docs/build/eps.md index 53cfc9bc24883..cfd291c7b9cfd 100644 --- a/docs/build/eps.md +++ b/docs/build/eps.md @@ -156,7 +156,7 @@ These instructions are for the latest [JetPack SDK 5.1.2](https://developer.nvid 1. Check [this official blog](https://developer.nvidia.com/blog/simplifying-cuda-upgrades-for-nvidia-jetson-users/) for CUDA 11.8 upgrade instruction. - 2. CUDA 12.x might only be available to the latest Jetson Orin series. + 2. CUDA 12.x might only be available to Jetson Orin and newer series. 2. CMake can't automatically find the correct `nvcc` if it's not in the `PATH`. `nvcc` can be added to `PATH` via: @@ -180,9 +180,9 @@ These instructions are for the latest [JetPack SDK 5.1.2](https://developer.nvid 4. Cmake is needed to build ONNX Runtime. For ONNX Runtime 1.16, the minimum required CMake version is 3.26 (version 3.27.4 has been tested). This can be either installed by: - 1. (Unix/Linux) Build from source. Download sources from https://cmake.org/download/ - and follow https://cmake.org/install/ to build from source. - 2. (Ubuntu) Install deb package via apt repository: e.g https://apt.kitware.com/ + 1. (Unix/Linux) Build from source. Download sources from [https://cmake.org/download/](https://cmake.org/download/) + and follow [https://cmake.org/install/](https://cmake.org/install/) to build from source. + 2. (Ubuntu) Install deb package via apt repository: e.g [https://apt.kitware.com/](https://apt.kitware.com/) 5. Build the ONNX Runtime Python wheel (update path to CUDA/CUDNN/TensorRT libraries if necessary): From 656eb93bfbe68406d55db146c063fb2db6e10d29 Mon Sep 17 00:00:00 2001 From: yf711 Date: Tue, 31 Oct 2023 17:22:35 -0700 Subject: [PATCH 13/14] apply comments --- docs/build/eps.md | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/docs/build/eps.md b/docs/build/eps.md index cfd291c7b9cfd..5e876f05f84d6 100644 --- a/docs/build/eps.md +++ b/docs/build/eps.md @@ -156,7 +156,7 @@ These instructions are for the latest [JetPack SDK 5.1.2](https://developer.nvid 1. Check [this official blog](https://developer.nvidia.com/blog/simplifying-cuda-upgrades-for-nvidia-jetson-users/) for CUDA 11.8 upgrade instruction. - 2. CUDA 12.x might only be available to Jetson Orin and newer series. + 2. CUDA 12.x is only available to Jetson Orin and newer series (CUDA compute capability >= 8.7). Check [here](https://developer.nvidia.com/cuda-gpus#collapse5) for compute capability datasheet. 2. CMake can't automatically find the correct `nvcc` if it's not in the `PATH`. `nvcc` can be added to `PATH` via: @@ -186,14 +186,7 @@ These instructions are for the latest [JetPack SDK 5.1.2](https://developer.nvid 5. Build the ONNX Runtime Python wheel (update path to CUDA/CUDNN/TensorRT libraries if necessary): - 1. Build `onnxruntime-gpu` wheel with CUDA support: - - ```bash - ./build.sh --config Release --update --build --parallel --build_wheel \ - --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu - ``` - - 2. Build `onnxruntime-gpu` wheel with additional TensorRT support: + 1. 
Build `onnxruntime-gpu` wheel with CUDA and TensorRT support: ```bash ./build.sh --config Release --update --build --parallel --build_wheel \ From 4011e3fb3d69ef4b443d337a7d43349a9916c0c9 Mon Sep 17 00:00:00 2001 From: yf711 Date: Tue, 31 Oct 2023 17:48:56 -0700 Subject: [PATCH 14/14] add detail --- docs/build/eps.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/build/eps.md b/docs/build/eps.md index 5e876f05f84d6..b6138ecab1f2a 100644 --- a/docs/build/eps.md +++ b/docs/build/eps.md @@ -198,7 +198,7 @@ These instructions are for the latest [JetPack SDK 5.1.2](https://developer.nvid * By default, `onnxruntime-gpu` wheel file will be captured under `path_to/onnxruntime/build/Linux/Release/dist/` (build path can be customized by adding `--build_dir` followed by a customized path to the build command above). -* For a portion of Jetson devices like the Xavier series, higher power mode involves more cores (up to 6) to compute but it consumes more resources when building ONNX Runtime. Update `--parallel` with smaller value if system resource is limited and OOM happens. +* For a portion of Jetson devices like the Xavier series, higher power mode involves more cores (up to 6) to compute but it consumes more resource when building ONNX Runtime. Set `--parallel 2` or smaller in the build command if system is hanging and OOM happens. ## oneDNN See more information on oneDNN (formerly DNNL) [here](../execution-providers/oneDNN-ExecutionProvider.md).
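Since the series steers users toward configuring TensorRT through execution provider session options rather than the deprecated environment-variable route, a short sketch of that approach is included below. The option keys are standard TensorRT EP names; the values, model path, and cache directory are illustrative assumptions only.

```python
# Configure the TensorRT EP through provider options (the session-option route)
# rather than the deprecated ORT_TENSORRT_* environment variables.
# The values, model path, and cache directory below are placeholders.
import onnxruntime as ort

trt_options = {
    "trt_fp16_enable": True,                 # build FP16 engines where the GPU supports it
    "trt_engine_cache_enable": True,         # cache built engines across sessions
    "trt_engine_cache_path": "./trt_cache",  # placeholder cache directory
    "trt_max_workspace_size": 2147483648,    # 2 GB workspace; adjust to the device
}

sess = ort.InferenceSession(
    "model.onnx",
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
    ],
)
print("Session providers:", sess.get_providers())
```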