[TensorRT EP] TRT EP cache docs (#20785)
Add a section on how to use the three major TRT EP caches: the TRT timing
cache, the TRT engine cache, and the embedded engine model / EPContext model.
Some materials are from my GTC slides.
chilo-ms authored May 25, 2024
1 parent 3ae8137 commit e3e4f21
Showing 1 changed file with 61 additions and 0 deletions: docs/execution-providers/TensorRT-ExecutionProvider.md

@@ -439,6 +439,67 @@ export ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE=1

</details>

## TensorRT EP Caches
There are three major TRT EP caches:
* TRT timing cache
* TRT engine cache
* Embedded engine model / EPContext model

### Caches can help reduce session creation time from minutes to seconds

The following numbers were measured for session initialization with TRT EP on the SD UNet model.
* No cache (default) – 384 seconds
  - The first run (warmup) can be very long because building the engine involves exhaustive profiling of every kernel to select the optimal one.
* Timing cache used – 42 seconds
  - Keeps layer-profiling information and reuses it to expedite engine build time
  - The timing cache can be shared across multiple models if their layers are the same
* Engine cache used – 9 seconds
  - Serializes the engine from memory to disk for later use
  - Skips the entire engine build and deserializes the engine cache back into memory
* Embedded engine used (no builder instantiation) – 1.9 seconds
  - The serialized engine cache is wrapped inside an ONNX model
  - No builder will be instantiated, and no engine will be built
  - Quickly loads the engine with fewer steps needed

![image](https://github.com/microsoft/onnxruntime/assets/54722500/ef1ce168-74f7-4df4-beac-b14bf2cb3e00)
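To reproduce this kind of measurement, session creation can be timed directly. Below is a minimal sketch using the ONNX Runtime Python API; the model path is a placeholder.

```python
import time

import onnxruntime as ort

start = time.perf_counter()
# Session creation is where TRT EP builds the engine (or loads it from a
# cache), so this duration is what the numbers above refer to.
session = ort.InferenceSession(
    "unet.onnx",  # placeholder path to the model under test
    providers=["TensorrtExecutionProvider"],
)
print(f"Session creation took {time.perf_counter() - start:.1f} s")
```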

### How to set caches
* Use the timing cache (`.timing`):
  - `trt_timing_cache_enable = true`
  - `trt_timing_cache_path = .\`
  - `trt_force_timing_cache = true` (accepts a slight GPU mismatch within the same compute capability)
* Use the engine cache (`.engine`):
  - `trt_engine_cache_enable = true`
  - `trt_engine_cache_path = .\trt_engines`
* Use the embedded engine (`_ctx.onnx`):
  - Generate the embedded engine model via a warmup run with the original model
  - `trt_engine_cache_enable = true`
  - `trt_dump_ep_context_model = true`
  - `trt_ep_context_file_path = .\`
  - The embedded engine model is generated with inputs/outputs identical to the original model
  - Run the embedded engine model just like the original model (see the Python sketch after this list)
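As one way to pass these options programmatically, here is a minimal sketch using the ONNX Runtime Python API that enables all three caches at once; the paths and model name are placeholders.

```python
import onnxruntime as ort

# TRT EP provider options enabling the timing cache, the engine cache,
# and the embedded engine (EPContext) model dump. Paths are placeholders.
trt_ep_options = {
    "trt_timing_cache_enable": True,
    "trt_timing_cache_path": "./",
    "trt_force_timing_cache": True,  # accept a slight GPU mismatch within the same CC
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "./trt_engines",
    "trt_dump_ep_context_model": True,
    "trt_ep_context_file_path": "./",
}

session = ort.InferenceSession(
    "model.onnx",  # placeholder: the original model
    providers=[("TensorrtExecutionProvider", trt_ep_options)],
)
```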

The folder structure of the caches:

![image](https://github.com/microsoft/onnxruntime/assets/54722500/5be4a087-79c8-4d34-af8b-75138642079c)


With the following command, the embedded engine model (`model_ctx.onnx`) will be generated along with the engine cache in the same directory.

Note: The example does not specify `trt_engine_cache_path` because `onnxruntime_perf_test` requires a specific folder structure to run the inference. However, we still recommend specifying `trt_engine_cache_path` to better organize the caches.
```bash
$./onnxruntime_perf_test -e tensorrt -r 1 -i "trt_engine_cache_enable|true trt_dump_ep_context_model|true" /model_database/transformer_model/model.onnx
```
Once the inference is complete, the embedded engine model is saved to disk. Users can then run this model just like the original one, but with a significantly quicker session creation time.
```bash
$./onnxruntime_perf_test -e tensorrt -r 1 /model_database/transformer_model/model_ctx.onnx
```

### More about Embedded engine model / EPContext model
* One constraint is that the entire model needs to be TRT eligible
* When running the embedded engine model, the default setting is `trt_ep_context_embed_mode=0`, where only the engine cache path is embedded and TRT EP looks for the engine cache on disk. Alternatively, users can set `trt_ep_context_embed_mode=1`, which embeds the entire engine binary data as a string in the model. However, this mode increases initialization time because ORT graph optimization hashes the long string, so we recommend `trt_ep_context_embed_mode=0`.
* The default name of an embedded engine model ends with `_ctx.onnx`. Users can specify `trt_ep_context_file_path=my_ep_context_model.onnx` to override this default name (see the sketch below).
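As an illustration, both options can be set when generating the embedded engine model; a minimal sketch with a placeholder model path follows.

```python
import onnxruntime as ort

trt_ep_options = {
    "trt_engine_cache_enable": True,
    "trt_dump_ep_context_model": True,
    # 0 (recommended): embed only the engine cache path;
    # 1: embed the entire engine binary as a string.
    "trt_ep_context_embed_mode": 0,
    # Override the default *_ctx.onnx output name.
    "trt_ep_context_file_path": "my_ep_context_model.onnx",
}

session = ort.InferenceSession(
    "model.onnx",  # placeholder: the original model
    providers=[("TensorrtExecutionProvider", trt_ep_options)],
)
```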

## Performance Tuning
For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](./../performance/tune-performance/index.md)
