Improve DML session initialization time & fix error when unpacking Tensors in DML Backend #19220
base: main
Conversation
@gedoensmax Can you please take a look at this change?
Force-pushed from fa38170 to 89ae5d1
@jeffbloo @PatriceVignola Hi, can you take a look at this change to start a discussion on how best to implement this?
// Uses the tensor_proto_dir to construct the full path for external data. If tensor_proto_dir == nullptr
// then uses the current directory instead.
// This function does not unpack string_data of an initializer tensor
Status ReadExternalDataForTensor(const ONNX_NAMESPACE::TensorProto& tensor_proto,
Does this function need the overflow vector? It feels like the only place this function is used in this file is with a known byte size, and I'm not sure we should silently succeed if the expected byte size doesn't match the actual byte size. Maybe there's something I'm missing here (e.g. alignment, padding)?
The original function had this interface for some reason I am not aware of. Probably it is so that the function can be called with an `expected_input_size` of 0, in which case it takes care of the memory allocation itself? I decided to be conservative here for now and keep the existing functionality. A hedged sketch of that interface is below.
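Roughly, the two calling modes being discussed would look like this (illustrative names and a simplified status type, not the actual onnxruntime signature):

```cpp
#include <cstdint>
#include <istream>
#include <memory>
#include <string>

// Simplified stand-in for onnxruntime's Status, for illustration only.
struct StatusLike {
  bool ok;
  std::string message;
};

// expected_size == 0: allocate and read everything.
// expected_size != 0: the caller already knows the byte count, so fail loudly
// on a mismatch instead of silently succeeding.
StatusLike ReadExternalDataSketch(std::istream& file, uint64_t actual_size,
                                  size_t expected_size,
                                  std::unique_ptr<uint8_t[]>& out) {
  if (expected_size != 0 && expected_size != static_cast<size_t>(actual_size)) {
    return {false, "external data size mismatch"};
  }
  // new[] rather than std::vector: default-initialized storage, no memset pass.
  out.reset(new uint8_t[actual_size]);
  if (!file.read(reinterpret_cast<char*>(out.get()),
                 static_cast<std::streamsize>(actual_size))) {
    return {false, "short read from external data file"};
  }
  return {true, {}};
}
```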
For the CUDA / DML execution providers we will eventually end up with a code path that optionally uses DirectStorage to upload a raw buffer (and hopefully an inlined buffer as well) onto the GPU if NVMe storage is detected or if it is explicitly requested.
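As a rough sketch of what that path could look like with the public DirectStorage API (the function and its parameters are illustrative and not part of this PR; the destination is assumed to be an already-created default-heap buffer):

```cpp
#include <d3d12.h>
#include <dstorage.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

#define RETURN_IF_FAILED(expr)    \
  do {                            \
    HRESULT _hr = (expr);         \
    if (FAILED(_hr)) return _hr;  \
  } while (0)

// Illustrative only: stream byte_size bytes at file_offset from an NVMe-backed
// file straight into a GPU buffer via DMA, bypassing the CPU memcpy entirely.
HRESULT UploadTensorViaDirectStorage(ID3D12Device* device, ID3D12Resource* gpu_buffer,
                                     const wchar_t* path, uint64_t file_offset,
                                     uint32_t byte_size) {
  ComPtr<IDStorageFactory> factory;
  RETURN_IF_FAILED(DStorageGetFactory(IID_PPV_ARGS(&factory)));

  ComPtr<IDStorageFile> file;
  RETURN_IF_FAILED(factory->OpenFile(path, IID_PPV_ARGS(&file)));

  DSTORAGE_QUEUE_DESC queue_desc{};
  queue_desc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
  queue_desc.Capacity = DSTORAGE_MAX_QUEUE_CAPACITY;
  queue_desc.Priority = DSTORAGE_PRIORITY_NORMAL;
  queue_desc.Device = device;
  ComPtr<IDStorageQueue> queue;
  RETURN_IF_FAILED(factory->CreateQueue(&queue_desc, IID_PPV_ARGS(&queue)));

  DSTORAGE_REQUEST request{};
  request.Options.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
  request.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_BUFFER;
  request.Source.File.Source = file.Get();
  request.Source.File.Offset = file_offset;
  request.Source.File.Size = byte_size;
  request.Destination.Buffer.Resource = gpu_buffer;
  request.Destination.Buffer.Offset = 0;
  request.Destination.Buffer.Size = byte_size;
  queue->EnqueueRequest(&request);

  // Signal a fence after the transfer so we know when the data is resident.
  ComPtr<ID3D12Fence> fence;
  RETURN_IF_FAILED(device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence)));
  HANDLE event = CreateEventW(nullptr, FALSE, FALSE, nullptr);
  RETURN_IF_FAILED(fence->SetEventOnCompletion(1, event));
  queue->EnqueueSignal(fence.Get(), 1);
  queue->Submit();
  WaitForSingleObject(event, INFINITE);
  CloseHandle(event);
  return S_OK;
}
```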
Besides this open question, is there anything left to do to merge this PR?
I don't think so; everything seems to make sense. I started a CI run.
Force-pushed from 6405a7c to aa0c01c
Load time perf is coming up again. I have reapplied my changes to master to get the optimizations back.
Also read unpacked tensor data directly into the destination buffer so that the endian conversion happens in place, making the conversion a no-op on little-endian systems while reducing memory consumption and improving cache efficiency on big-endian systems.
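A minimal sketch of that in-place scheme (illustrative names, not the actual onnxruntime helpers); ONNX stores raw tensor data little-endian, so the swap loop only ever runs on big-endian hosts:

```cpp
#include <algorithm>
#include <bit>
#include <cstddef>
#include <cstdint>

// The external bytes are read straight into the destination buffer first; this
// then fixes up the byte order in place. On little-endian hosts it is a no-op.
void ByteSwapInPlaceIfNeeded(uint8_t* data, size_t num_bytes, size_t element_size) {
  if (std::endian::native == std::endian::little) {
    return;  // on-disk layout already matches the host
  }
  for (size_t i = 0; i + element_size <= num_bytes; i += element_size) {
    std::reverse(data + i, data + i + element_size);
  }
}
```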
Force-pushed from aa0c01c to 99ecfbc
Description
Avoid a temporary buffer when reading external tensors to decrease startup time.
Motivation and Context
The motivation is to decrease startup time when loading (huge) inference models, where I/O bandwidth and memory bandwidth can account for a significant share of total startup time. Worst-case memory consumption also decreases by the size of the largest external tensor.
This PR consists of two changes. The first eliminates a memset by replacing `std::vector` with `std::unique_ptr` and `new[]` for temporary data. The second completely removes the need for a temporary buffer when reading an external tensor; a sketch of both is shown below.

Long term, the tensors shouldn't be read into CPU memory at all, but into GPU memory with DirectStorage. For modern high-speed NVMe devices a single-core memcpy is often slower than a direct DMA transfer, making the DirectStorage path faster in all cases where the NVMe device is faster than memcpy. The downside of DirectStorage is that in cases where storage is slower than a memcpy, startup time will increase.
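A hedged before/after sketch of the buffer handling (illustrative names, not the actual onnxruntime code):

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <memory>
#include <vector>

// Before: stage through a value-initialized std::vector (allocation + memset),
// then copy into the tensor's buffer (a second pass over the data).
bool ReadWithTempBuffer(std::ifstream& file, std::streamoff offset,
                        size_t byte_size, void* tensor_buffer) {
  std::vector<uint8_t> temp(byte_size);  // zero-fills every byte
  if (!file.seekg(offset) ||
      !file.read(reinterpret_cast<char*>(temp.data()), byte_size)) {
    return false;
  }
  std::memcpy(tensor_buffer, temp.data(), byte_size);
  return true;
}

// After: read straight into the destination buffer. Where a temporary is still
// unavoidable, new[] via std::unique_ptr skips the memset that std::vector does.
bool ReadInPlace(std::ifstream& file, std::streamoff offset,
                 size_t byte_size, void* tensor_buffer) {
  return static_cast<bool>(
      file.seekg(offset) &&
      file.read(reinterpret_cast<char*>(tensor_buffer), byte_size));
}
```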