
Improve DML session initialization time & fix error when unpacking Tensors in DML Backend #19220

Open · wants to merge 2 commits into main from unpack_tensor_perf
Conversation

mtavenrath (Contributor):
Description

Avoid a temporary buffer when reading external tensors to decrease startup time.

Motivation and Context

The motivation is to decrease startup time when loading (huge) inference models, where I/O bandwidth and memory bandwidth can account for a significant portion of total startup time. Worst-case memory consumption is also reduced by the size of the largest external tensor.

This PR consists of two changes: the first eliminates a memset by replacing std::vector with std::unique_ptr and new[] for temporary data; the second removes the need for a temporary buffer entirely when reading an external tensor.
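For illustration, a minimal sketch of the first change (names are illustrative, not the actual ONNX Runtime code): std::vector<unsigned char> value-initializes its elements, which costs an O(n) memset over the whole buffer, while new[] of a trivially constructible type leaves the storage uninitialized:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the actual file read.
void FillFromFile(unsigned char* dst, std::size_t n) { /* read n bytes into dst */ }

void Before(std::size_t num_bytes) {
  // std::vector zero-initializes every element: a memset over the whole
  // buffer that is wasted work because the read overwrites it immediately.
  std::vector<unsigned char> buffer(num_bytes);
  FillFromFile(buffer.data(), num_bytes);
}

void After(std::size_t num_bytes) {
  // new[] of unsigned char default-initializes, i.e. leaves the memory
  // untouched, so the memset disappears.
  std::unique_ptr<unsigned char[]> buffer(new unsigned char[num_bytes]);
  FillFromFile(buffer.get(), num_bytes);
}
```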

Long term, the tensors shouldn't be read into CPU memory at all, but streamed directly into GPU memory with DirectStorage. For modern high-speed NVMe devices a single-core memcpy is often slower than a direct DMA transfer, which makes the DirectStorage path faster in all cases where the NVMe device is faster than memcpy. The downside of DirectStorage is that in cases where storage is slower than a memcpy, startup time will increase.

mtavenrath (Contributor, Author):

@gedoensmax Can you please take a look at this change?

mtavenrath (Contributor, Author):

@jeffbloo @PatriceVignola Hi, can you take a look at this change to start a discussion about how best to implement this?

@mtavenrath mtavenrath changed the title WIP: Improve DML session initialization time & fix error when unpacking Tensors in DML Backend Improve DML session initialization time & fix error when unpacking Tensors in DML Backend Jan 29, 2024
// Uses the tensor_proto_dir to construct the full path for external data. If tensor_proto_dir == nullptr
// then uses the current directory instead.
// This function does not unpack string_data of an initializer tensor
Status ReadExternalDataForTensor(const ONNX_NAMESPACE::TensorProto& tensor_proto,
Contributor (reviewer):

Does this function need the overflow vector? It feels like the only place this function is used in this file is with a known byte size, and I'm not sure we should silently succeed if the expected byte size doesn't match the actual byte size. Maybe there's something I'm missing here (e.g. alignment, padding)?

mtavenrath (Contributor, Author):

The original function had this interface for reasons I'm not aware of. Presumably it is so the function can be called with an expected_input_size of 0, in which case it takes care of the memory allocation itself. I decided to be conservative here for now and keep the existing functionality.

For the CUDA / DML execution providers we will eventually end up with a code path which optionally uses DirectStorage to upload a raw buffer (and hopefully an inlined buffer as well) onto the GPU if NVMe storage is detected or if it is explicitly requested.
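For reference, such a DirectStorage file-to-GPU-buffer upload could look roughly like the sketch below, which follows the public DirectStorage samples. This is illustrative only and not the code path proposed in this PR; error handling is omitted and `device` and `dst` are assumed to exist:

```cpp
#include <cstdint>
#include <dstorage.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Stream num_bytes from `path` straight into the GPU buffer `dst`.
void UploadWithDirectStorage(ID3D12Device* device, ID3D12Resource* dst,
                             const wchar_t* path, uint32_t num_bytes) {
  ComPtr<IDStorageFactory> factory;
  DStorageGetFactory(IID_PPV_ARGS(&factory));

  ComPtr<IDStorageFile> file;
  factory->OpenFile(path, IID_PPV_ARGS(&file));

  DSTORAGE_QUEUE_DESC queue_desc{};
  queue_desc.Capacity = DSTORAGE_MAX_QUEUE_CAPACITY;
  queue_desc.Priority = DSTORAGE_PRIORITY_NORMAL;
  queue_desc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
  queue_desc.Device = device;
  ComPtr<IDStorageQueue> queue;
  factory->CreateQueue(&queue_desc, IID_PPV_ARGS(&queue));

  // The file-to-buffer request bypasses the CPU-side staging copy entirely.
  DSTORAGE_REQUEST request{};
  request.Options.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
  request.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_BUFFER;
  request.Source.File.Source = file.Get();
  request.Source.File.Offset = 0;
  request.Source.File.Size = num_bytes;
  request.Destination.Buffer.Resource = dst;
  request.Destination.Buffer.Offset = 0;
  request.Destination.Buffer.Size = num_bytes;
  queue->EnqueueRequest(&request);

  // Signal a fence so the caller knows when the data has landed on the GPU.
  ComPtr<ID3D12Fence> fence;
  device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
  HANDLE event = CreateEventW(nullptr, FALSE, FALSE, nullptr);
  fence->SetEventOnCompletion(1, event);
  queue->EnqueueSignal(fence.Get(), 1);
  queue->Submit();
  WaitForSingleObject(event, INFINITE);
  CloseHandle(event);
}
```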

mtavenrath (Contributor, Author):

Besides this open question, is there anything left to do to merge this PR?

Contributor (reviewer):

I don't think so; everything seems to make sense. I started a CI run.

@mtavenrath mtavenrath force-pushed the unpack_tensor_perf branch from 6405a7c to aa0c01c Compare July 11, 2024 13:19
@mtavenrath
Copy link
Contributor Author

Load-time performance is coming up as a topic again. I have reapplied my changes on top of master to get the optimizations back.

mtavenrath and others added 2 commits September 27, 2024 15:09
Also read unpacked tensor data directly into the destination buffer so that the endian conversion happens in place, making the conversion a no-op on little-endian systems and reducing memory consumption and improving cache efficiency on big-endian systems.
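A minimal sketch of what such an in-place conversion can look like (hypothetical helper, not the actual ONNX Runtime code; assumes C++20 for std::endian):

```cpp
#include <algorithm>
#include <bit>       // std::endian
#include <cstddef>

// Byte-swap each element_size-sized element of `buffer` in place.
// ONNX stores raw tensor data little-endian, so on little-endian hosts
// this is a no-op: the bytes were already read into their final location.
void ConvertEndiannessInPlace(std::byte* buffer, std::size_t num_bytes,
                              std::size_t element_size) {
  if constexpr (std::endian::native == std::endian::little) {
    return;
  }
  for (std::size_t i = 0; i < num_bytes; i += element_size) {
    std::reverse(buffer + i, buffer + i + element_size);
  }
}
```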