Improve DML session initialization time & fix error when unpacking Tensors in DML Backend #19220
base: main
Conversation
@gedoensmax Can you please take a look at this change?
Force-pushed from fa38170 to 89ae5d1
@jeffbloo @PatriceVignola Hi, can you take a look at this change to start a discussion on how best to implement this?
// Uses the tensor_proto_dir to construct the full path for external data. If tensor_proto_dir == nullptr
// then uses the current directory instead.
// This function does not unpack string_data of an initializer tensor
Status ReadExternalDataForTensor(const ONNX_NAMESPACE::TensorProto& tensor_proto,
Does this function need the overflow vector? It feels like the only place this function is used in this file is with a known byte size, and I'm not sure we should silently succeed if the expected byte size doesn't match the actual byte size. Maybe there's something I'm missing here (e.g. alignment, padding)?
The original function had this interface for some reason I am not aware of. Probably it is so that the function can be called with an `expected_input_size` of 0, in which case it takes care of the memory allocation itself? I decided to be conservative here for now and keep the existing functionality. A hedged sketch of that interface is below.
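Roughly, the two calling modes being discussed would look like this (illustrative names and a simplified status type, not the actual onnxruntime signature):

```cpp
#include <cstdint>
#include <istream>
#include <memory>
#include <string>

// Simplified stand-in for onnxruntime's Status, for illustration only.
struct StatusLike {
  bool ok;
  std::string message;
};

// expected_size == 0: allocate and read everything.
// expected_size != 0: the caller already knows the byte count, so fail loudly
// on a mismatch instead of silently succeeding.
StatusLike ReadExternalDataSketch(std::istream& file, uint64_t actual_size,
                                  size_t expected_size,
                                  std::unique_ptr<uint8_t[]>& out) {
  if (expected_size != 0 && expected_size != static_cast<size_t>(actual_size)) {
    return {false, "external data size mismatch"};
  }
  // new[] rather than std::vector: default-initialized storage, no memset pass.
  out.reset(new uint8_t[actual_size]);
  if (!file.read(reinterpret_cast<char*>(out.get()),
                 static_cast<std::streamsize>(actual_size))) {
    return {false, "short read from external data file"};
  }
  return {true, {}};
}
```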
For the CUDA / DML execution providers we will eventually end up with a code path that optionally uses DirectStorage to upload a raw buffer (and hopefully an inlined buffer as well) onto the GPU if NVMe storage is detected or if it is explicitly requested.
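As a rough sketch of what that path could look like with the public DirectStorage API (the function and its parameters are illustrative and not part of this PR; the destination is assumed to be an already-created default-heap buffer):

```cpp
#include <d3d12.h>
#include <dstorage.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

#define RETURN_IF_FAILED(expr)    \
  do {                            \
    HRESULT _hr = (expr);         \
    if (FAILED(_hr)) return _hr;  \
  } while (0)

// Illustrative only: stream byte_size bytes at file_offset from an NVMe-backed
// file straight into a GPU buffer via DMA, bypassing the CPU memcpy entirely.
HRESULT UploadTensorViaDirectStorage(ID3D12Device* device, ID3D12Resource* gpu_buffer,
                                     const wchar_t* path, uint64_t file_offset,
                                     uint32_t byte_size) {
  ComPtr<IDStorageFactory> factory;
  RETURN_IF_FAILED(DStorageGetFactory(IID_PPV_ARGS(&factory)));

  ComPtr<IDStorageFile> file;
  RETURN_IF_FAILED(factory->OpenFile(path, IID_PPV_ARGS(&file)));

  DSTORAGE_QUEUE_DESC queue_desc{};
  queue_desc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
  queue_desc.Capacity = DSTORAGE_MAX_QUEUE_CAPACITY;
  queue_desc.Priority = DSTORAGE_PRIORITY_NORMAL;
  queue_desc.Device = device;
  ComPtr<IDStorageQueue> queue;
  RETURN_IF_FAILED(factory->CreateQueue(&queue_desc, IID_PPV_ARGS(&queue)));

  DSTORAGE_REQUEST request{};
  request.Options.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
  request.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_BUFFER;
  request.Source.File.Source = file.Get();
  request.Source.File.Offset = file_offset;
  request.Source.File.Size = byte_size;
  request.Destination.Buffer.Resource = gpu_buffer;
  request.Destination.Buffer.Offset = 0;
  request.Destination.Buffer.Size = byte_size;
  queue->EnqueueRequest(&request);

  // Signal a fence after the transfer so we know when the data is resident.
  ComPtr<ID3D12Fence> fence;
  RETURN_IF_FAILED(device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence)));
  HANDLE event = CreateEventW(nullptr, FALSE, FALSE, nullptr);
  RETURN_IF_FAILED(fence->SetEventOnCompletion(1, event));
  queue->EnqueueSignal(fence.Get(), 1);
  queue->Submit();
  WaitForSingleObject(event, INFINITE);
  CloseHandle(event);
  return S_OK;
}
```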
Besides this open question, is there anything left to do to merge this PR?
I don't think so; everything seems to make sense. I started a CI run.
Force-pushed from 6405a7c to aa0c01c
Load time perf is coming up again. I have reapplied my changes to master to get the optimizations back.
Also read unpacked tensor data directly into the destination buffer so that the endian conversion happens in place, making the conversion a no-op on little-endian systems while reducing memory consumption and improving cache efficiency on big-endian systems.
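A minimal sketch of that in-place scheme (illustrative names, not the actual onnxruntime helpers); ONNX stores raw tensor data little-endian, so the swap loop only ever runs on big-endian hosts:

```cpp
#include <algorithm>
#include <bit>
#include <cstddef>
#include <cstdint>

// The external bytes are read straight into the destination buffer first; this
// then fixes up the byte order in place. On little-endian hosts it is a no-op.
void ByteSwapInPlaceIfNeeded(uint8_t* data, size_t num_bytes, size_t element_size) {
  if (std::endian::native == std::endian::little) {
    return;  // on-disk layout already matches the host
  }
  for (size_t i = 0; i + element_size <= num_bytes; i += element_size) {
    std::reverse(data + i, data + i + element_size);
  }
}
```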
Force-pushed from aa0c01c to 99ecfbc
Description
Avoid a temporary buffer when reading external tensors to decrease startup time.
Motivation and Context
The motivation is to decrease startup time when loading (huge) inference models, where I/O bandwidth and memory bandwidth can account for a significant share of total startup time. Worst-case memory consumption also decreases by the size of the largest external tensor.
This PR consists of two changes. The first eliminates a memset by replacing `std::vector` with `std::unique_ptr` and `new[]` for temporary data. The second completely removes the need for a temporary buffer when reading an external tensor; a sketch of both is shown below.

Long term, the tensors shouldn't be read into CPU memory at all, but into GPU memory with DirectStorage. For modern high-speed NVMe devices a single-core memcpy is often slower than a direct DMA transfer, making the DirectStorage path faster in all cases where the NVMe device is faster than memcpy. The downside of DirectStorage is that in cases where storage is slower than a memcpy, startup time will increase.
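A hedged before/after sketch of the buffer handling (illustrative names, not the actual onnxruntime code):

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <memory>
#include <vector>

// Before: stage through a value-initialized std::vector (allocation + memset),
// then copy into the tensor's buffer (a second pass over the data).
bool ReadWithTempBuffer(std::ifstream& file, std::streamoff offset,
                        size_t byte_size, void* tensor_buffer) {
  std::vector<uint8_t> temp(byte_size);  // zero-fills every byte
  if (!file.seekg(offset) ||
      !file.read(reinterpret_cast<char*>(temp.data()), byte_size)) {
    return false;
  }
  std::memcpy(tensor_buffer, temp.data(), byte_size);
  return true;
}

// After: read straight into the destination buffer. Where a temporary is still
// unavoidable, new[] via std::unique_ptr skips the memset that std::vector does.
bool ReadInPlace(std::ifstream& file, std::streamoff offset,
                 size_t byte_size, void* tensor_buffer) {
  return static_cast<bool>(
      file.seekg(offset) &&
      file.read(reinterpret_cast<char*>(tensor_buffer), byte_size));
}
```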