[TensorRT EP] TRT EP cache docs (#20785)
Add a section on how to use the three major TRT EP caches: the TRT timing
cache, the TRT engine cache, and the embedded engine model / EPContext model.
Some materials are from my GTC slides.
chilo-ms authored May 25, 2024
1 parent 3ae8137 commit e3e4f21
Showing 1 changed file with 61 additions and 0 deletions: docs/execution-providers/TensorRT-ExecutionProvider.md

@@ -439,6 +439,67 @@ export ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE=1

</details>

## TensorRT EP Caches
There are three major TRT EP caches:
* TRT timing cache
* TRT engine cache
* Embedded engine model / EPContext model

### Caches can help reduce session creation time from minutes to seconds

The following numbers were measured for session initialization with TRT EP on the SD UNet model.
* No cache (default) – 384 seconds
  - The first run (warmup) can be very long because building the engine involves exhaustive profiling of every kernel to select the optimal one.
* Timing cache used – 42 seconds
  - Keeps layer-profiling information and reuses it to expedite engine build time
  - The timing cache can be shared across multiple models if their layers are the same
* Engine cache used – 9 seconds
  - Serializes the engine from memory to disk for later use
  - Skips the entire engine build and deserializes the engine cache back into memory
* Embedded engine used (no builder instantiation) – 1.9 seconds
  - The serialized engine cache is wrapped inside an ONNX model
  - No builder will be instantiated, and no engine will be built
  - Quickly loads the engine with fewer steps needed

![image](https://github.com/microsoft/onnxruntime/assets/54722500/ef1ce168-74f7-4df4-beac-b14bf2cb3e00)
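To reproduce this kind of measurement, session creation can be timed directly. Below is a minimal sketch using the ONNX Runtime Python API; the model path is a placeholder.

```python
import time

import onnxruntime as ort

start = time.perf_counter()
# Session creation is where TRT EP builds the engine (or loads it from a
# cache), so this duration is what the numbers above refer to.
session = ort.InferenceSession(
    "unet.onnx",  # placeholder path to the model under test
    providers=["TensorrtExecutionProvider"],
)
print(f"Session creation took {time.perf_counter() - start:.1f} s")
```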

### How to set caches
* Use the timing cache (`.timing`):
  - `trt_timing_cache_enable = true`
  - `trt_timing_cache_path = .\`
  - `trt_force_timing_cache = true` (accepts a slight GPU mismatch within the same compute capability)
* Use the engine cache (`.engine`):
  - `trt_engine_cache_enable = true`
  - `trt_engine_cache_path = .\trt_engines`
* Use the embedded engine (`_ctx.onnx`):
  - Generate the embedded engine model via a warmup run with the original model
  - `trt_engine_cache_enable = true`
  - `trt_dump_ep_context_model = true`
  - `trt_ep_context_file_path = .\`
  - The embedded engine model is generated with inputs/outputs identical to the original model
  - Run the embedded engine model just like the original model (see the Python sketch after this list)
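As one way to pass these options programmatically, here is a minimal sketch using the ONNX Runtime Python API that enables all three caches at once; the paths and model name are placeholders.

```python
import onnxruntime as ort

# TRT EP provider options enabling the timing cache, the engine cache,
# and the embedded engine (EPContext) model dump. Paths are placeholders.
trt_ep_options = {
    "trt_timing_cache_enable": True,
    "trt_timing_cache_path": "./",
    "trt_force_timing_cache": True,  # accept a slight GPU mismatch within the same CC
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "./trt_engines",
    "trt_dump_ep_context_model": True,
    "trt_ep_context_file_path": "./",
}

session = ort.InferenceSession(
    "model.onnx",  # placeholder: the original model
    providers=[("TensorrtExecutionProvider", trt_ep_options)],
)
```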

The folder structure of the caches:

![image](https://github.com/microsoft/onnxruntime/assets/54722500/5be4a087-79c8-4d34-af8b-75138642079c)


With the following command, the embedded engine model (`model_ctx.onnx`) will be generated along with the engine cache in the same directory.

Note: The example does not specify `trt_engine_cache_path` because `onnxruntime_perf_test` requires a specific folder structure to run the inference. However, we still recommend specifying `trt_engine_cache_path` to better organize the caches.
```bash
$./onnxruntime_perf_test -e tensorrt -r 1 -i "trt_engine_cache_enable|true trt_dump_ep_context_model|true" /model_database/transformer_model/model.onnx
```
Once the inference is complete, the embedded engine model is saved to disk. Users can then run this model just like the original one, but with a significantly quicker session creation time.
```bash
$./onnxruntime_perf_test -e tensorrt -r 1 /model_database/transformer_model/model_ctx.onnx
```

### More about Embedded engine model / EPContext model
* One constraint is that the entire model needs to be TRT eligible
* When running the embedded engine model, the default setting is `trt_ep_context_embed_mode=0`, where only the engine cache path is embedded and TRT EP looks for the engine cache on disk. Alternatively, users can set `trt_ep_context_embed_mode=1`, which embeds the entire engine binary data as a string in the model. However, this mode increases initialization time because ORT graph optimization hashes the long string, so we recommend `trt_ep_context_embed_mode=0`.
* The default name of an embedded engine model ends with `_ctx.onnx`. Users can specify `trt_ep_context_file_path=my_ep_context_model.onnx` to override this default name (see the sketch below).
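As an illustration, both options can be set when generating the embedded engine model; a minimal sketch with a placeholder model path follows.

```python
import onnxruntime as ort

trt_ep_options = {
    "trt_engine_cache_enable": True,
    "trt_dump_ep_context_model": True,
    # 0 (recommended): embed only the engine cache path;
    # 1: embed the entire engine binary as a string.
    "trt_ep_context_embed_mode": 0,
    # Override the default *_ctx.onnx output name.
    "trt_ep_context_file_path": "my_ep_context_model.onnx",
}

session = ort.InferenceSession(
    "model.onnx",  # placeholder: the original model
    providers=[("TensorrtExecutionProvider", trt_ep_options)],
)
```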

## Performance Tuning
For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](./../performance/tune-performance/index.md)
