diff --git a/docs/execution-providers/TensorRT-ExecutionProvider.md b/docs/execution-providers/TensorRT-ExecutionProvider.md
index a106067eade05..538eff28e4ccc 100644
--- a/docs/execution-providers/TensorRT-ExecutionProvider.md
+++ b/docs/execution-providers/TensorRT-ExecutionProvider.md
@@ -439,6 +439,67 @@ export ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE=1
+## TensorRT EP Caches
+There are three major TRT EP caches:
+* TRT timing cache
+* TRT engine cache
+* Embedded engine model / EPContext model
+
+### Caches can help reduce session creation time from minutes to seconds
+
+The following numbers were measured by initializing a session with TRT EP for the SD UNet model.
+* No cache (default) – 384 seconds
+  - The first run (warmup) can be very long because building the engine involves exhaustively profiling every kernel to select the optimal one.
+* Timing cache used – 42 seconds
+  - Keeps layer-profiling information and reuses it to expedite engine build time
+  - A timing cache can be shared across multiple models if their layers are the same
+* Engine cache used – 9 seconds
+  - Serializes the engine from memory to disk for later use
+  - Skips the entire engine build and deserializes the engine cache into memory
+* Embedded engine used (no builder instantiation) – 1.9 seconds
+  - The serialized engine cache is wrapped inside an ONNX model
+  - No builder is instantiated and no engine is built
+  - Loads the engine quickly with fewer steps
+
+![image](https://github.com/microsoft/onnxruntime/assets/54722500/ef1ce168-74f7-4df4-beac-b14bf2cb3e00)
+
+### How to set caches
+* Use Timing cache (.timing):
+  - `trt_timing_cache_enable = true`
+  - `trt_timing_cache_path = .\`
+  - `trt_force_timing_cache = true` (accepts a slight GPU mismatch within the same compute capability)
+* Use Engine cache (.engine):
+  - `trt_engine_cache_enable = true`
+  - `trt_engine_cache_path = .\trt_engines`
+* Use Embedded engine (_ctx.onnx):
+  - Generate the embedded engine model via a warmup run with the original model
+  - `trt_engine_cache_enable = true`
+  - `trt_dump_ep_context_model = true`
+  - `trt_ep_context_file_path = .\`
+  - The generated model's inputs/outputs are identical to the original model's
+  - Run the embedded engine model just like the original model!
+
+The folder structure of the caches:
+
+![image](https://github.com/microsoft/onnxruntime/assets/54722500/5be4a087-79c8-4d34-af8b-75138642079c)
+
+With the following command, the embedded engine model (`model_ctx.onnx`) will be generated along with the engine cache in the same directory.
+
+Note: The example does not specify `trt_engine_cache_path` because `onnxruntime_perf_test` requires a specific folder structure to run the inference. However, we still recommend specifying `trt_engine_cache_path` to better organize the caches.
+```bash
+$./onnxruntime_perf_test -e tensorrt -r 1 -i "trt_engine_cache_enable|true trt_dump_ep_context_model|true" /model_database/transformer_model/model.onnx
+```
+Once the inference is complete, the embedded engine model is saved to disk. Users can then run this model just like the original one, but with a significantly shorter session creation time.
+```bash
+$./onnxruntime_perf_test -e tensorrt -r 1 /model_database/transformer_model/model_ctx.onnx
+```
+
+### More about Embedded engine model / EPContext model
+* One constraint is that the entire model needs to be TRT eligible
+* When running the embedded engine model, the default setting is `trt_ep_context_embed_mode=0`, where the engine cache path is embedded and TRT EP looks for the engine cache on disk. Alternatively, users can set `trt_ep_context_embed_mode=1`, which embeds the entire engine binary data as a string in the model. However, this mode increases initialization time because ORT graph optimization hashes the long string. Therefore, we recommend using `trt_ep_context_embed_mode=0`.
+* The default name of an embedded engine model will have `_ctx.onnx` appended to the end.
+Users can specify `trt_ep_context_file_path=my_ep_context_model.onnx` to override this default name.
+
 ## Performance Tuning
 For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](./../performance/tune-performance/index.md)
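
The `onnxruntime_perf_test` commands above pass the cache settings through the `-i` flag as space-separated `key|value` pairs. As a small illustrative sketch, assuming that option-string format, the pairs can be composed from a plain dictionary; the helper function below is hypothetical, not part of ONNX Runtime:

```python
# Hypothetical helper (not part of ONNX Runtime) that formats TRT EP provider
# options as the space-separated key|value string expected by the -i flag of
# onnxruntime_perf_test.
def format_perf_test_options(options):
    return " ".join(f"{key}|{value}" for key, value in options.items())

# Cache settings from the section above; booleans are written as lowercase strings.
cache_options = {
    "trt_engine_cache_enable": "true",
    "trt_dump_ep_context_model": "true",
}

print(format_perf_test_options(cache_options))
# trt_engine_cache_enable|true trt_dump_ep_context_model|true
```

The same string can then be quoted and handed to `-i`, as in the commands shown earlier.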