Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TensorRT EP] Fix bug for DDS output handling for empty tensor #19575

Merged
merged 12 commits into from
Mar 5, 2024

Conversation

chilo-ms
Copy link
Contributor

@chilo-ms chilo-ms commented Feb 20, 2024

When the DDS output is empty tensor (i.e. any of the dimension is 0), TRT EP won't perform either cudaMemcpyAsync() nor cuda::Impl_Cast(), to prevent accidentally overwriting other location that might belong to other tensors.

This PR also refactors the code to only allocate single bytes for all empty tensors.

#TODO: add unit tests to cover the DDS code paths or doing more testing with concurrent,sequential, threaded faster-rcnn using onnx_test_runner and verifying outputs

@jywu-msft
Copy link
Member

pls fix lint issues

@jywu-msft
Copy link
Member

jywu-msft commented Feb 27, 2024

i think we can experiment with just allocating single byte for all zero tensor cases (currently we allocate number of bytes for that datatype).
or i wonder if we can even pre-allocate single byte and use the same address everywhere? (probably need to ask Nvidia/TensorRT what the requirement is)
that can potentially simplify the code even more.

@chilo-ms
Copy link
Contributor Author

chilo-ms commented Feb 28, 2024

i think we can experiment with just allocating single byte for all zero tensor cases (currently we allocate number of bytes for that datatype). or i wonder if we can even pre-allocate single byte and use the same address everywhere? (probably need to ask Nvidia/TensorRT what the requirement is) that can potentially simplify the code even more.

I did test with ORT allocating only one byte allocation for empty tensor with faster-rcnn model and it works good.
Also, per TRT doc

If an engine binding is an empty tensor, it still needs a non-null memory address, and different tensors should have different addresses. This is consistent with the C++ rule that every object has a unique address, for example, new float[0] returns a non-null pointer. If using a memory allocator that might return a null pointer for zero bytes, ask for at least one byte instead.

Different tensor should have different address but we can have TRT EP to only allocate one dummy byte for each empty tensor

@chilo-ms chilo-ms marked this pull request as ready for review March 1, 2024 02:40
@chilo-ms chilo-ms merged commit d9730c7 into main Mar 5, 2024
94 of 95 checks passed
@chilo-ms chilo-ms deleted the chi/trt_dds_fix branch March 5, 2024 22:39
zz002 pushed a commit to zz002/onnxruntime that referenced this pull request Mar 7, 2024
…soft#19575)

When the DDS output is empty tensor (i.e. any of the dimension is 0),
TRT EP won't perform either cudaMemcpyAsync() nor cuda::Impl_Cast(), to
prevent accidentally overwriting other location that might belong to
other tensors.

This PR also refactors the code to only allocate single bytes for all
empty tensors.

#TODO: add unit tests to cover the DDS code paths or doing more testing
with concurrent,sequential, threaded faster-rcnn using onnx_test_runner
and verifying outputs

---------

Co-authored-by: Chi Lo <[email protected]>
yf711 added a commit that referenced this pull request Mar 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants