
[TensorRT EP] Switch to enqueueV3 with support DDS output #17751

Closed
wants to merge 31 commits

Conversation

chilo-ms
Contributor

@chilo-ms chilo-ms commented Sep 30, 2023

There are two phases to switching to enqueueV3, and this PR is the second phase. (The first-phase PR is here.)

One of the ways TRT handles data-dependent shape (DDS) outputs is to rely on the user to provide an allocator as a callback. TRT calls this allocator at runtime, once it knows the tensor's shape, to allocate the output memory. So, here, we need a way to bind that allocated output to the kernel context output.

"If the output tensor has data-dependent shape, TRT EP will provide an IOutputAllocator for enqueueV3 to dynamically allocate memory buffer.
Once enqueueV3 returns, the TRT EP will then bind the output allocation to the ORT kernel context output.
(Please note that we take strategy A mentioned in https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#dynamic-shaped-output,
in which we defer allocation until the size is known and don't call IExecutionContext::setTensorAddress.)
Otherwise, if the shape of the output tensor is known prior to runtime, ORT will pre-allocate the memory buffer for the output tensor for enqueueV3."

Add a new ORT KernelContext_SetOutput() API that calls the existing SetOutputMLValue(), which was previously only used by training.
This is needed because compile-based EPs can only use the public OrtKernelContext APIs, not the internal OpKernelContext API.
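
For context, the sketch below shows the known-shape path that is already possible through the public C API: the compute function asks ORT to pre-allocate the output via KernelContext_GetOutput. For a DDS output those dims are not available before enqueueV3 returns, which is what motivates the new KernelContext_SetOutput API. The function name and the {1, 1000} shape are illustrative.

```cpp
// Hedged sketch of the known-shape path, using only the public ORT C API.
// ComputeKnownShapeOutput and the {1, 1000} shape are illustrative; a real EP
// derives the dims from the engine's output metadata.
#include <onnxruntime_c_api.h>

OrtStatus* ComputeKnownShapeOutput(const OrtApi* api, OrtKernelContext* context) {
  const int64_t dims[] = {1, 1000};
  OrtValue* output = nullptr;
  if (OrtStatus* st = api->KernelContext_GetOutput(context, /*index=*/0, dims, 2, &output))
    return st;  // ORT pre-allocates the output tensor for us

  void* output_data = nullptr;
  if (OrtStatus* st = api->GetTensorMutableData(output, &output_data))
    return st;

  // output_data can now be handed to TRT via setTensorAddress before enqueueV3.
  (void)output_data;
  return nullptr;
}
```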

@yf711 yf711 self-requested a review October 2, 2023 23:25
@chilo-ms chilo-ms changed the title [TensorRT EP] Switch to use new TRT APIs from deprecated ones [TensorRT EP] Switch to enqueueV3 with support DDS output Oct 18, 2023
@chilo-ms chilo-ms requested a review from jywu-msft October 18, 2023 00:36
@chilo-ms chilo-ms marked this pull request as ready for review October 18, 2023 00:36
@chilo-ms chilo-ms requested review from jslhcl and souptc October 20, 2023 17:33
jywu-msft
jywu-msft previously approved these changes Dec 4, 2023
*
* \since Version 1.17.
*/
ORT_API2_STATUS(KernelContext_SetOutput, _Inout_ OrtKernelContext* context, _In_ size_t index,
Member

KernelContext_SetOutput

Why do we want to expose it in the C API? Do we have a TRT-based custom op?

Member

Compile-API-based EPs need to implement compute_func, which only has access to the public OrtKernelContext API, not the internal OpKernelContext API.
We need to use SetOutputMLValue(), which is why it's plumbed through to the public API in this PR.

souptc
souptc previously approved these changes Dec 4, 2023
Member

@souptc souptc left a comment

:shipit:

@jywu-msft
Member

This will be replaced by a version that copies the output rather than binding the output to the kernel context, since we don't want to expose that API publicly. #18714
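
A hedged sketch of what that copy-based approach can look like (the function, its arguments, and the stream handling are illustrative, not the actual #18714 code): once enqueueV3 finishes and the DDS shape is known, request the kernel context output with that shape and copy the allocator-owned device buffer into it.

```cpp
// Hedged sketch of the copy-based alternative. Names and arguments are illustrative.
#include <onnxruntime_c_api.h>
#include <cuda_runtime_api.h>
#include <vector>

OrtStatus* CopyDDSOutput(const OrtApi* api, OrtKernelContext* context, size_t output_index,
                         const void* dds_buffer, const std::vector<int64_t>& dds_shape,
                         size_t num_bytes, cudaStream_t stream) {
  OrtValue* output = nullptr;
  if (OrtStatus* st = api->KernelContext_GetOutput(context, output_index, dds_shape.data(),
                                                   dds_shape.size(), &output))
    return st;

  void* dst = nullptr;
  if (OrtStatus* st = api->GetTensorMutableData(output, &dst))
    return st;

  // Device-to-device copy from the IOutputAllocator's buffer into the ORT output tensor.
  cudaMemcpyAsync(dst, dds_buffer, num_bytes, cudaMemcpyDeviceToDevice, stream);
  return nullptr;
}
```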

@jywu-msft jywu-msft closed this Dec 8, 2023
jywu-msft added a commit that referenced this pull request Dec 14, 2023
…on) (#18714)

It's branched off from
#17751 but removes the
KernelContext_SetOutput() API. It copies the output allocation buffer to
the kernel context.

---------

Co-authored-by: George Wu <[email protected]>
jywu-msft pushed a commit that referenced this pull request Jan 12, 2024
When the TRT engine cache (precompiled engine) is present, it doesn't
make sense to go through model verification, model
optimization, the TRT EP's GetCapability(), the TRT EP's model proto
reconstruction, calling the TRT parser, and engine compilation.
This PR makes the TRT EP skip those steps and directly load the engine
to perform inference.

The feature request:
#18072

Features:

- Replace the original model with a TRT-engine-wrapped ONNX model. This can save
a lot of time, as mentioned above.

- How to get a TRT-engine-wrapped ONNX model:
  1. Set the `trt_dump_ep_context_model` provider option to "true" and run the
  inference. You will find "xxx_wrapper.onnx" at the engine cache path.
  (This follows the same logic as generating the engine cache.)
  2. Use gen_trt_engine_wrapper_onnx_model.py.

- Three provider options are added (see the usage sketch after this commit message):
  `trt_dump_ep_context_model`: Enable dumping the wrapped ONNX model from the TRT EP.
  `trt_ep_context_embed_mode`: Add embed_mode as an attribute. 0 means the engine
  cache path is stored, 1 means the engine binary data is embedded.
  `trt_ep_context_compute_capability_enable`: Add hardware_arch as an
  attribute. When running the model, the TRT EP will check consistency between
  the model's hardware_arch and the GPU's compute capability.

- When the engine cache path is given in the wrapped model, the TRT EP will
first search for the engine file using that path relative to the model
path; if it can't find it, it will use the path as-is
(which, depending on the user, could be relative to the working directory or absolute).

Note:

1. This PR includes the change of
#17751

Constraints:

1. The whole model should be fully supported by TRT.
2. Users need to make sure the engine is built with min/max/opt
optimization profiles that are large enough to cover the range of all
inputs. The TRT EP will simply fail and won't rebuild the engine if the
input shape is out of range at runtime.
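
As a usage illustration (not part of this commit), the new provider options can be set through the public C/C++ API roughly as below. The option key names come from the description above; the value strings and the model path are assumptions.

```cpp
// Hedged sketch: enabling the engine cache and dumping the TRT-engine-wrapped ONNX model.
// Option keys come from the description above; value strings and model path are assumptions.
#include <onnxruntime_cxx_api.h>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "trt_ep_context");
  const OrtApi& api = Ort::GetApi();

  OrtTensorRTProviderOptionsV2* trt_options = nullptr;
  Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&trt_options));

  std::vector<const char*> keys{"trt_engine_cache_enable", "trt_dump_ep_context_model",
                                "trt_ep_context_embed_mode"};
  std::vector<const char*> values{"true", "true", "0"};  // embed_mode 0: engine cache path, 1: engine binary data
  Ort::ThrowOnError(api.UpdateTensorRTProviderOptions(trt_options, keys.data(), values.data(), keys.size()));

  Ort::SessionOptions session_options;
  session_options.AppendExecutionProvider_TensorRT_V2(*trt_options);

  // Running a session with these options should dump "xxx_wrapper.onnx" at the engine cache path.
  Ort::Session session(env, "model.onnx", session_options);

  api.ReleaseTensorRTProviderOptions(trt_options);
  return 0;
}
```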
mszhanyi pushed a commit that referenced this pull request Jan 15, 2024
(Same commit message as above.)