
PR #18204: Update jetson instruction (merged 16 commits, Nov 2, 2023)

Commit bb8d633a71eb6b3d8e23e62d6792b28bcf95e817 ("Improve TRT doc and advance the EP session option"), yf711, Jul 26, 2023
docs/execution-providers/TensorRT-ExecutionProvider.md: 305 changes (202 additions, 103 deletions)

## Configurations
There are two ways to configure TensorRT settings, either by TensorRT execution provider options (recommended) or by environment variables.

There is a one-to-one mapping between **TensorRT Execution Provider Session Options** and **Environment Variables**, as shown below:

| TensorRT EP Session Options | Environment Variables | Type |
|:--------------------------------------|:-----------------------------------------------|:-------|
| trt_max_workspace_size | ORT_TENSORRT_MAX_WORKSPACE_SIZE | int |
| trt_max_partition_iterations | ORT_TENSORRT_MAX_PARTITION_ITERATIONS | int |
| trt_min_subgraph_size | ORT_TENSORRT_MIN_SUBGRAPH_SIZE | int |
| trt_fp16_enable | ORT_TENSORRT_FP16_ENABLE | bool |
| trt_int8_enable | ORT_TENSORRT_INT8_ENABLE | bool |
| trt_int8_calibration_table_name | ORT_TENSORRT_INT8_CALIBRATION_TABLE_NAME | string |
| trt_int8_use_native_calibration_table | ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE | bool |
| trt_dla_enable | ORT_TENSORRT_DLA_ENABLE | bool |
| trt_dla_core | ORT_TENSORRT_DLA_CORE | int |
| trt_engine_cache_enable | ORT_TENSORRT_ENGINE_CACHE_ENABLE | bool |
| trt_engine_cache_path | ORT_TENSORRT_CACHE_PATH | string |
| trt_dump_subgraphs | ORT_TENSORRT_DUMP_SUBGRAPHS | bool |
| trt_force_sequential_engine_build | ORT_TENSORRT_FORCE_SEQUENTIAL_ENGINE_BUILD | bool |
| trt_context_memory_sharing_enable | ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE | bool |
| trt_layer_norm_fp32_fallback | ORT_TENSORRT_LAYER_NORM_FP32_FALLBACK | bool |
| trt_timing_cache_enable | ORT_TENSORRT_TIMING_CACHE_ENABLE | bool |
| trt_force_timing_cache | ORT_TENSORRT_FORCE_TIMING_CACHE_ENABLE | bool |
| trt_detailed_build_log | ORT_TENSORRT_DETAILED_BUILD_LOG_ENABLE | bool |
| trt_build_heuristics_enable | ORT_TENSORRT_BUILD_HEURISTICS_ENABLE | bool |
| trt_sparsity_enable | ORT_TENSORRT_SPARSITY_ENABLE | bool |
| trt_builder_optimization_level | ORT_TENSORRT_BUILDER_OPTIMIZATION_LEVEL | int |
| trt_auxiliary_streams | ORT_TENSORRT_AUXILIARY_STREAMS | int |
| trt_tactic_sources | ORT_TENSORRT_TACTIC_SOURCES | string |
| trt_extra_plugin_lib_paths | ORT_TENSORRT_EXTRA_PLUGIN_LIB_PATHS | string |
| trt_profile_min_shapes | ORT_TENSORRT_PROFILE_MIN_SHAPES | string |
| trt_profile_max_shapes | ORT_TENSORRT_PROFILE_MAX_SHAPES | string |
| trt_profile_opt_shapes | ORT_TENSORRT_PROFILE_OPT_SHAPES | string |

> Note: for bool type options, assign them **True**/**False** in Python, or **1**/**0** in C++.
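
For instance, a minimal sketch of the two equivalent ways to enable FP16 from Python, using the mapping table above (the `model.onnx` path is a placeholder):

```python
import os
import onnxruntime as ort

# Option 1: environment variable (read when the session is created)
os.environ['ORT_TENSORRT_FP16_ENABLE'] = '1'

# Option 2 (recommended): EP session option; bool options take True/False in Python
providers = [('TensorrtExecutionProvider', {'trt_fp16_enable': True})]
sess = ort.InferenceSession('model.onnx', providers=providers)
```
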
### Execution Provider Options

TensorRT configurations can be set by execution provider options. This is useful when each model and inference session needs its own configuration. In this case, execution provider option settings override any environment variable settings. All configurations should be set explicitly; otherwise the default value will be used.

* `trt_max_workspace_size`: maximum workspace size for TensorRT engine.
  * Default value: 1073741824 (1GB).

* `trt_max_partition_iterations`: maximum number of iterations allowed in model partitioning for TensorRT.
  * If the target model can't be successfully partitioned when the maximum number of iterations is reached, the whole model will fall back to other execution providers such as CUDA or CPU.
  * Default value: 1000.

* `trt_min_subgraph_size`: minimum node size in a subgraph after partitioning.
  * Subgraphs with a smaller size will fall back to other execution providers.
  * Default value: 1.

* `trt_fp16_enable`: Enable FP16 mode in TensorRT.
  > Note: not all Nvidia GPUs support FP16 precision.

* `trt_int8_enable`: Enable INT8 mode in TensorRT.
  > Note: not all Nvidia GPUs support INT8 precision.

* `trt_int8_calibration_table_name`: Specify INT8 calibration table file for non-QDQ models in INT8 mode.
  > Note: a calibration table should not be provided for QDQ models, because TensorRT doesn't allow a calibration table to be loaded if there is any Q/DQ node in the model. By default the name is empty.

* `trt_int8_use_native_calibration_table`: Select what calibration table is used for non-QDQ models in INT8 mode.
  * If `True`, the native TensorRT-generated calibration table is used;
  * If `False`, the ONNX Runtime tool-generated calibration table is used.
  > Note: Please copy the up-to-date calibration table file to `trt_engine_cache_path` before inference. A calibration table is specific to the model and calibration data set, so whenever a new calibration table is generated, the old file in the path should be cleaned up or replaced.

* `trt_dla_enable`: Enable DLA (Deep Learning Accelerator).
  > Note: not all Nvidia GPUs support DLA.

* `trt_dla_core`: Specify the DLA core to execute on. Default value: 0.

* `trt_engine_cache_enable`: Enable TensorRT engine caching (a usage sketch follows this list).
  * The purpose of engine caching is to save engine build time, since TensorRT can take a long time to optimize and build an engine.
  * The engine is cached when it's built for the first time, so the next time a new inference session is created the engine can be loaded directly from the cache. To validate that a loaded engine is usable for the current inference, the engine profile is also cached and loaded along with the engine. If the current input shapes are within the range of the engine profile, the loaded engine can be safely used. Otherwise, if input shapes are out of range, the profile cache is updated to cover the new shapes and the engine is recreated based on the new profile (and also refreshed in the engine cache).
  * Note each engine is created for specific settings (model path/name, precision (FP32/FP16/INT8 etc.), workspace, profiles etc.) and a specific GPU, and is not portable, so it's essential to make sure those settings don't change; otherwise the engine needs to be rebuilt and cached again.

  > **Warning: Please clean up any old engine and profile cache files (.engine and .profile) if any of the following changes:**
  >
  > * Model changes (any changes to the model topology, opset version, operators, etc.)
  > * ORT version changes (e.g. moving from ORT version 1.8 to 1.9)
  > * TensorRT version changes (e.g. moving from TensorRT 7.0 to 8.0)
  > * Hardware changes (engine and profile files are not portable and are optimized for specific Nvidia hardware)

* `trt_engine_cache_path`: Specify the path for TensorRT engine and profile files if `trt_engine_cache_enable` is `True`, or the path for the INT8 calibration table file if `trt_int8_enable` is `True`.

* `trt_dump_subgraphs`: Dumps the subgraphs that are transformed into TRT engines, in ONNX format, to the filesystem.
  * This can help with debugging subgraphs, e.g. by using `trtexec --onnx my_model.onnx` and checking the output of the parser.

* `trt_force_sequential_engine_build`: Sequentially build TensorRT engines across provider instances in a multi-GPU environment.

* `trt_context_memory_sharing_enable`: Share execution context memory between TensorRT subgraphs.

* `trt_layer_norm_fp32_fallback`: Force Pow + Reduce ops in layer norm to FP32.

* `trt_timing_cache_enable`: Enable TensorRT timing cache.
  * Check [Timing cache](#timing-cache) for details.

* `trt_force_timing_cache`: Force the TensorRT timing cache to be used even if device profile does not match.

* `trt_detailed_build_log`: Enable detailed build step logging on TensorRT EP with timing for each engine build.

* `trt_build_heuristics_enable`: Build engine using heuristics to reduce build time.

* `trt_sparsity_enable`: Control if sparsity can be used by TRT.
  * Check `--sparsity` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags).

* `trt_builder_optimization_level`: Set the builder optimization level.
  > WARNING: levels below 3 do not guarantee good engine performance, but greatly improve build time. Default 3, valid range [0-5]. Check `--builderOptimizationLevel` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags).
* `trt_auxiliary_streams`: Set maximum number of auxiliary streams per inference stream.
  * Setting this value to 0 will lead to optimal memory usage.
  * Default -1 = heuristics.
  * Check `--maxAuxStreams` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags).

* `trt_tactic_sources`: Specify the tactics to be used by adding (+) or removing (-) tactics from the default tactic sources (default = all available tactics), e.g. "-CUDNN,+CUBLAS".
  * Available keys: "CUBLAS", "CUBLAS_LT", "CUDNN" or "EDGE_MASK_CONVOLUTIONS".

* `trt_extra_plugin_lib_paths`: Specify extra TensorRT plugin library paths.
  * ORT TRT by default supports any TRT plugin registered in the TRT registry in the TRT plugin library (i.e., `libnvinfer_plugin.so`).
  * Moreover, if users want to use other TRT plugins that are not in the TRT plugin library (for example, FasterTransformer has many TRT plugin implementations for different models), they can be specified like this: `ORT_TENSORRT_EXTRA_PLUGIN_LIB_PATHS=libvit_plugin.so;libvit_int8_plugin.so`.

* `trt_profile_min_shapes`, `trt_profile_max_shapes` and `trt_profile_opt_shapes`: Build with dynamic shapes using a profile with the min/max/opt shapes provided.
  * The format of the profile shapes is `"input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,..."`, and all three flags must be provided in order to enable the explicit profile shapes feature (see the sketch following this list).
  * Check [Explicit shape range for dynamic shape input](#explicit-shape-range-for-dynamic-shape-input) and the TRT doc on [optimization profiles](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt_profiles) for more details.

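Similarly, a minimal engine-caching sketch, which is what the `trt_engine_cache_enable` bullet above refers to (the cache directory and model path are placeholders):

```python
import onnxruntime as ort

providers = [
    ('TensorrtExecutionProvider', {
        'trt_engine_cache_enable': True,            # cache engines after the first build...
        'trt_engine_cache_path': '/tmp/trt_cache',  # ...in this directory
    }),
    'CUDAExecutionProvider',
]
# the first session triggers an engine build and populates the cache; later
# sessions reuse it while model, ORT/TRT versions and hardware are unchanged
sess = ort.InferenceSession('model.onnx', providers=providers)
```
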

In addition, `device_id` can also be set as an execution provider option.

#### C++ API example
<details>
<summary> Click to expand C++ API example</summary>

```c++

Ort::SessionOptions session_options;
OrtTensorRTProviderOptions trt_options{};  // TensorRT EP options struct

// (the option assignments between these lines are elided in this diff hunk;
//  set fields on trt_options here, e.g.:)
trt_options.trt_dump_subgraphs = 1;
session_options.AppendExecutionProvider_TensorRT(trt_options);
```

</details>

#### Python API example
<details>
<summary> Click to expand Python API example</summary>

```python
import onnxruntime as ort

# (most provider options are elided in this diff hunk; they are passed as a
#  ('TensorrtExecutionProvider', {option: value, ...}) tuple in providers)
providers = [('TensorrtExecutionProvider', {'trt_fp16_enable': True}), 'CUDAExecutionProvider']

sess_opt = ort.SessionOptions()
sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=providers)
```

</details>

### Environment Variables
The following environment variables can be set for the TensorRT execution provider.

* `ORT_TENSORRT_MAX_WORKSPACE_SIZE`: maximum workspace size for TensorRT engine. Default value: 1073741824 (1GB).

* `ORT_TENSORRT_MAX_PARTITION_ITERATIONS`: maximum number of iterations allowed in model partitioning for TensorRT. If target model can't be successfully partitioned when the maximum number of iterations is reached, the whole model will fall back to other execution providers such as CUDA or CPU. Default value: 1000.

* `ORT_TENSORRT_MIN_SUBGRAPH_SIZE`: minimum node size in a subgraph after partitioning. Subgraphs with smaller size will fall back to other execution providers. Default value: 1.

* `ORT_TENSORRT_FP16_ENABLE`: Enable FP16 mode in TensorRT. 1: enabled, 0: disabled. Default value: 0. Note not all Nvidia GPUs support FP16 precision.

* `ORT_TENSORRT_INT8_ENABLE`: Enable INT8 mode in TensorRT. 1: enabled, 0: disabled. Default value: 0. Note not all Nvidia GPUs support INT8 precision.

* `ORT_TENSORRT_INT8_CALIBRATION_TABLE_NAME`: Specify INT8 calibration table file for non-QDQ models in INT8 mode. Note a calibration table should not be provided for QDQ models because TensorRT doesn't allow a calibration table to be loaded if there is any Q/DQ node in the model. By default the name is empty.

* `ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE`: Select what calibration table is used for non-QDQ models in INT8 mode. If 1, native TensorRT generated calibration table is used; if 0, ONNXRUNTIME tool generated calibration table is used. Default value: 0.
  * **Note: Please copy the up-to-date calibration table file to `ORT_TENSORRT_CACHE_PATH` before inference. A calibration table is specific to the model and calibration data set, so whenever a new calibration table is generated, the old file in the path should be cleaned up or replaced.**

* `ORT_TENSORRT_DLA_ENABLE`: Enable DLA (Deep Learning Accelerator). 1: enabled, 0: disabled. Default value: 0. Note not all Nvidia GPUs support DLA.

* `ORT_TENSORRT_DLA_CORE`: Specify DLA core to execute on. Default value: 0.

* `ORT_TENSORRT_ENGINE_CACHE_ENABLE`: Enable TensorRT engine caching. The purpose of using engine caching is to save engine build time in the case that TensorRT may take long time to optimize and build engine. Engine will be cached when it's built for the first time so next time when new inference session is created the engine can be loaded directly from cache. In order to validate that the loaded engine is usable for current inference, engine profile is also cached and loaded along with engine. If current input shapes are in the range of the engine profile, the loaded engine can be safely used. Otherwise if input shapes are out of range, profile cache will be updated to cover the new shape and engine will be recreated based on the new profile (and also refreshed in the engine cache). Note each engine is created for specific settings such as model path/name, precision (FP32/FP16/INT8 etc), workspace, profiles etc, and specific GPUs and it's not portable, so it's essential to make sure those settings are not changing, otherwise the engine needs to be rebuilt and cached again. 1: enabled, 0: disabled. Default value: 0.
  * **Warning: Please clean up any old engine and profile cache files (.engine and .profile) if any of the following changes:**
    * Model changes (any changes to the model topology, opset version, operators, etc.)
    * ORT version changes (e.g. moving from ORT version 1.8 to 1.9)
    * TensorRT version changes (e.g. moving from TensorRT 7.0 to 8.0)
    * Hardware changes (engine and profile files are not portable and are optimized for specific Nvidia hardware)

* `ORT_TENSORRT_CACHE_PATH`: Specify path for TensorRT engine and profile files if `ORT_TENSORRT_ENGINE_CACHE_ENABLE` is 1, or path for INT8 calibration table file if `ORT_TENSORRT_INT8_ENABLE` is 1.

* `ORT_TENSORRT_DUMP_SUBGRAPHS`: Dumps the subgraphs that are transformed into TRT engines, in ONNX format, to the filesystem. This can help with debugging subgraphs, e.g. by using `trtexec --onnx my_model.onnx` and checking the output of the parser. 1: enabled, 0: disabled. Default value: 0.

* `ORT_TENSORRT_FORCE_SEQUENTIAL_ENGINE_BUILD`: Sequentially build TensorRT engines across provider instances in multi-GPU environment. 1: enabled, 0: disabled. Default value: 0.

* `ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE`: Share execution context memory between TensorRT subgraphs. Default 0 = false, nonzero = true.

* `ORT_TENSORRT_LAYER_NORM_FP32_FALLBACK`: Force Pow + Reduce ops in layer norm to FP32. Default 0 = false, nonzero = true.

* `ORT_TENSORRT_TIMING_CACHE_ENABLE`: Enable TensorRT timing cache. Default 0 = false, nonzero = true. Check [Timing cache](#timing-cache) for details.

* `ORT_TENSORRT_FORCE_TIMING_CACHE_ENABLE`: Force the TensorRT timing cache to be used even if device profile does not match. Default 0 = false, nonzero = true.

* `ORT_TENSORRT_DETAILED_BUILD_LOG_ENABLE`: Enable detailed build step logging on TensorRT EP with timing for each engine build. Default 0 = false, nonzero = true.

* `ORT_TENSORRT_BUILD_HEURISTICS_ENABLE`: Build engine using heuristics to reduce build time. Default 0 = false, nonzero = true.

* `ORT_TENSORRT_SPARSITY_ENABLE`: Control if sparsity can be used by TRT. Default 0 = false, 1 = true. Check `--sparsity` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags).

* `ORT_TENSORRT_BUILDER_OPTIMIZATION_LEVEL`: Set the builder optimization level. WARNING: levels below 3 do not guarantee good engine performance, but greatly improve build time. Default 3, valid range [0-5]. Check `--builderOptimizationLevel` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags).

* `ORT_TENSORRT_AUXILIARY_STREAMS`: Set maximum number of auxiliary streams per inference stream. Setting this value to 0 will lead to optimal memory usage. Default -1 = heuristics. Check `--maxAuxStreams` in `trtexec` command-line flags for [details](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags).

* `ORT_TENSORRT_TACTIC_SOURCES`: Specify the tactics to be used by adding (+) or removing (-) tactics from the default tactic sources (default = all available tactics), e.g. "-CUDNN,+CUBLAS". Available keys: "CUBLAS", "CUBLAS_LT", "CUDNN" or "EDGE_MASK_CONVOLUTIONS".

* `ORT_TENSORRT_EXTRA_PLUGIN_LIB_PATHS`: Specify extra TensorRT plugin library paths. ORT TRT by default supports any TRT plugin registered in the TRT registry in the TRT plugin library (i.e., `libnvinfer_plugin.so`). Moreover, if users want to use other TRT plugins that are not in the TRT plugin library (for example, FasterTransformer has many TRT plugin implementations for different models), they can be specified like this: `ORT_TENSORRT_EXTRA_PLUGIN_LIB_PATHS=libvit_plugin.so;libvit_int8_plugin.so`.

* `ORT_TENSORRT_PROFILE_MIN_SHAPES`, `ORT_TENSORRT_PROFILE_MAX_SHAPES` and `ORT_TENSORRT_PROFILE_OPT_SHAPES`: Build with dynamic shapes using a profile with the min/max/opt shapes provided. The format of the profile shapes is `"input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,..."`, and all three flags must be provided in order to enable the explicit profile shapes feature. Check [Explicit shape range for dynamic shape input](#explicit-shape-range-for-dynamic-shape-input) and the TRT doc on [optimization profiles](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt_profiles) for more details.

One can override default values by setting environment variables.
<details>
<summary> Click to expand examples on Linux</summary>

```bash
# Override default max workspace size to 2GB
export ORT_TENSORRT_MAX_WORKSPACE_SIZE=2147483648

# Override default maximum number of iterations to 10
export ORT_TENSORRT_MAX_PARTITION_ITERATIONS=10

# Override default minimum subgraph node size to 5
export ORT_TENSORRT_MIN_SUBGRAPH_SIZE=5

# Enable FP16 mode in TensorRT
export ORT_TENSORRT_FP16_ENABLE=1

# Enable INT8 mode in TensorRT
export ORT_TENSORRT_INT8_ENABLE=1

# Use native TensorRT calibration table
export ORT_TENSORRT_INT8_USE_NATIVE_CALIBRATION_TABLE=1

# Enable TensorRT engine caching
export ORT_TENSORRT_ENGINE_CACHE_ENABLE=1
# Please Note warning above. This feature is experimental.
# Engine cache files must be invalidated if there are any changes to the model, ORT version, TensorRT version or if the underlying hardware changes. Engine files are not portable across devices.

# Specify TensorRT cache path
export ORT_TENSORRT_CACHE_PATH="/path/to/cache"

# Dump out subgraphs to run on TensorRT
export ORT_TENSORRT_DUMP_SUBGRAPHS=1

# Enable context memory sharing between TensorRT subgraphs. Default 0 = false, nonzero = true
export ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE=1

```

</details>


## Performance Tuning
For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](./../performance/tune-performance/index.md)