[Feature Request] Memory Commit Savings. Possible total memory savings. Allow fully optimized model to be serialized to disk and used as-is without large heap allocs #21448
Labels: feature request (request for unsupported feature or enhancement), model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.), platform:windows (issues related to the Windows platform)
github-actions bot added the model:transformer and platform:windows labels on Jul 22, 2024.
frank-dong-ms added a commit that referenced this issue on Oct 25, 2024:
### Description
Part of #21448. This change is intended to save CPU memory during model load for inference. It adds the session option save_prepacked_constant_initializers. With save_prepacked_constant_initializers turned on:
1. Optimize the model with an inference session; pre-packed external initializers are saved into the external data file.
2. Load the optimized model and the external data file containing the pre-packed initializers; no pre-packing is needed at load time.
3. Run inference with the optimized model and data file.

Tested with the model Phi-3-mini-instruct-onnx. With ORT 1.12.0:
![image](https://github.com/user-attachments/assets/3c0337be-f340-4bb7-8f9f-30f3552072ef)
With this change:
![image](https://github.com/user-attachments/assets/23282990-2e1e-4a1f-92de-afa8ed7e6a43)
Peak memory usage dropped from **5.438 GB to 2.726 GB**.

This change takes advantage of the fact that ORT loads external initializers with mmap on CPU. Pre-packing uses extra heap memory, so omitting the pre-pack step at load time saves roughly the size of the external initializers. Next step: implement and properly test the PrePack path for all CPU kernels that support it; that will be done in a follow-up PR.
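A minimal sketch of how this two-phase flow might look from the public ORT C++ API. The config key string, file paths, and the exact split between the two phases are assumptions inferred from the option name described here, not a confirmed API surface; the real key should be checked in onnxruntime_session_options_config_keys.h for the ORT build in use.

```cpp
// Sketch only: illustrative two-phase use of the option described above.
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "prepack-save");

  // Phase 1 (done once, offline): optimize the model and ask ORT to write the
  // pre-packed constant initializers out with the optimized model's data file.
  {
    Ort::SessionOptions so;
    so.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    so.SetOptimizedModelFilePath(L"model.optimized.onnx");                    // illustrative path
    so.AddConfigEntry("session.save_prepacked_constant_initializers", "1");   // assumed key string
    Ort::Session build_session(env, L"model.onnx", so);
  }

  // Phase 2 (every run): load the optimized model + data file; since the
  // initializers are already pre-packed on disk, no pre-pack heap copies
  // should be needed at session-creation time.
  Ort::SessionOptions so;
  so.DisableCpuMemArena();
  Ort::Session session(env, L"model.optimized.onnx", so);
  // ... run inference with `session` ...
  return 0;
}
```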
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this issue on Nov 19, 2024.
ankitm3k pushed commits to intel/onnxruntime that referenced this issue on Dec 11, 2024.
Describe the feature request
Overview
Users want memory utilization to be as low as possible, and system performance to be as good as possible, when running AI models.
The benefits are large (multi-GB) process memory commit savings and, for some models, possible total overall memory savings.
The feature ask is to allow a fully optimized model to be serialized to disk and used as-is, without large heap allocations.
Windows examples are used in the following, but this would likely apply to other OSes as well, since memory-mapping APIs are available on other OSes and the relevant ORT code is cross-platform.
About
If not all of a model's weights are always used (sparse tensors?), then only the weights actually accessed need to be read in from disk, and they occupy memory only when accessed. In that case, total memory usage for running an AI model is less than the model's on-disk size.
For most Large Language Models (LLMs), all the weights in the attention mechanism are usually needed and accessed during inference. However, the way they are accessed and read in from disk has performance and memory implications.
Reducing Process Memory Commit Usage
Using the LLM case above as the example, there are techniques (using OS memory-mapping APIs) to reduce the memory commit usage of a process, and sometimes obtain higher performance, including inference performance, especially under low-memory conditions. Some AI models are large and much more likely to push a system to its memory limits. If these memory-mapping APIs are used and heap memory does not need to be allocated, then the weights / initializer data from an ONNX model can be loaded only when accessed and would not occupy process commit.
This is very beneficial because system total commit is a precious resource. The commit limit is physical memory plus pagefile size, e.g. 16 GB RAM + 16 GB pagefile = at most 32 GB of memory that can be allocated. Once this limit is reached, no more memory can be allocated anywhere on the system. See the links below (and the sketch after them) for more detail:
Commit_charge
Pushing the Limits of Windows: Virtual Memory
Virtual Address Space and Physical Storage
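To put numbers on the commit charge and commit limit discussed above, a small standalone Windows sketch (plain documented psapi APIs, nothing ORT-specific) could look like this:

```cpp
// Sketch: print system-wide commit charge and commit limit.
// Link with psapi.lib; GetPerformanceInfo reports these values in pages.
#include <windows.h>
#include <psapi.h>
#include <cstdio>

int main() {
  PERFORMANCE_INFORMATION pi;
  pi.cb = sizeof(pi);
  if (GetPerformanceInfo(&pi, sizeof(pi))) {
    const double page_gb = static_cast<double>(pi.PageSize) / (1024.0 * 1024.0 * 1024.0);
    std::printf("Commit charge: %.2f GB\n", pi.CommitTotal * page_gb);
    std::printf("Commit limit:  %.2f GB\n", pi.CommitLimit * page_gb);   // ~ RAM + pagefile
    std::printf("Physical RAM:  %.2f GB\n", pi.PhysicalTotal * page_gb);
  }
  return 0;
}
```

On the 16 GB RAM + 16 GB pagefile example above, the commit limit printed here would be roughly 32 GB.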
For all further examples, we use a Small Language Model (SLM), Phi Silica, of around 1.85 GB on-disk size (3.2B parameters).
Part 1 - Use ONNX External Data file with proper alignment + disable ORT Arena allocator
For our first experiment, we used ONNX external data files with proper alignment fixes to generate a file whose large initializers could all be successfully memory-mapped on Windows.
See the related issue: External Data Conversion is not saving most data with alignment support. Therefore, mmap support disabled for these initializers
We also disabled the arena memory allocator, because on CPU it greedily consumes a lot more memory and clouds the memory picture:
m_session_options.DisableCpuMemArena();
With this in place, ONNX Runtime was able to save a few hundred MB (233 MB) of process commit, purely from having an aligned external data file and thus letting ORT use its memory-mapped file support. However, this does not save much commit relative to the entire size of the model.
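For reference, the Part 1 configuration amounts to roughly the following (a minimal sketch against the public ORT C++ API; the model path is illustrative and the model is assumed to already use an aligned external-data file):

```cpp
// Sketch of the Part 1 setup: aligned external-data model + CPU arena disabled,
// letting ORT memory-map the large initializers instead of copying them onto the heap.
#include <onnxruntime_cxx_api.h>

Ort::Session CreatePart1Session(Ort::Env& env) {
  Ort::SessionOptions session_options;
  session_options.DisableCpuMemArena();  // avoid greedy arena growth clouding the measurements
  // "model.onnx" (with its aligned "model.onnx.data") is an illustrative path.
  return Ort::Session(env, L"model.onnx", session_options);
}
```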
Part 2 - Disable pre-packing
For our second experiment, in addition to the technique and settings above, we disabled ORT pre-packing, which we determined from tracing was still responsible for the largest allocations (SessionState::PrepackConstantInitializedTensors):
// Disable pre-packing - saves commit but REALLY REALLY bad for inf perf and overall runtime
m_session_options.AddConfigEntry(kOrtSessionOptionsConfigDisablePrepacking, "1");
With this, the commit memory savings were large and in line with most of the size of the model (about 77%, in this case around 1436 MB of commit). The issue, though, is that disabling pre-packing severely hurt runtime inference performance (200x worse), making the model unusable performance-wise, though great memory-wise.
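The per-process commit numbers quoted in Parts 1 and 2 can be reproduced with a small helper like the one below (a sketch using the documented GetProcessMemoryInfo API; PrivateUsage is the process commit charge):

```cpp
// Sketch: report the calling process's commit charge (private bytes).
// Link with psapi.lib.
#include <windows.h>
#include <psapi.h>
#include <cstdio>

void PrintProcessCommit(const char* label) {
  PROCESS_MEMORY_COUNTERS_EX pmc = {};
  pmc.cb = sizeof(pmc);
  if (GetProcessMemoryInfo(GetCurrentProcess(),
                           reinterpret_cast<PROCESS_MEMORY_COUNTERS*>(&pmc),
                           sizeof(pmc))) {
    std::printf("%s: commit (PrivateUsage) = %.1f MB\n",
                label, pmc.PrivateUsage / (1024.0 * 1024.0));
  }
}
// Usage: call once before and once after Ort::Session creation and compare the delta.
```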
General framework for implementing the feature
What follows is technical information on the general approach ORT might take to both pre-pack a model and then serialize that to disk, such that memory mapping could work AND large memory allocations were not needed by ORT. This would give the best of both worlds: great runtime performance while getting the best utilization of system memory.
Changes in how the model weights are accessed
The OS would simply page in the initializers and weights on demand as needed during inference. Weights that are routinely accessed would be kept in physical memory, not much differently from how heap memory for active working sets is kept in physical memory when needed. The difference from before is that these pages would be backed by the model file rather than by private heap commit, so they would not count against process commit and could be discarded and re-read from disk under memory pressure.
Feature request suggestions
So how to go about implementing this feature request?
ONNX Runtime already has the notion of graph optimizations that can be serialized/written to disk, for example in offline mode tied to a specific class of hardware - graph-optimizations.
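For context, the existing offline flow looks roughly like this (a sketch using the documented SetOptimizedModelFilePath API; paths are illustrative):

```cpp
// Sketch: serialize the graph-optimized model to disk once ("offline mode");
// later sessions can then load model.optimized.onnx directly.
#include <onnxruntime_cxx_api.h>

void SaveOptimizedModel(Ort::Env& env) {
  Ort::SessionOptions so;
  so.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
  so.SetOptimizedModelFilePath(L"model.optimized.onnx");  // illustrative path
  Ort::Session session(env, L"model.onnx", so);  // creating the session writes the optimized file
}
```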
However, even when using this offline-optimized model, large memory allocations will still occur in ORT due to something called pre-packing. Pre-packing has large positive runtime inference performance benefits. However, our view is that these pre-packing optimizations should be done once and be serializable to disk, so that the data structure on disk matches the most optimized in-memory layout ORT will use.
Once the pre-packed data is serialized on disk and used during session load, the data structures needed for inference are already mapped into the process address space via memory mapping and MapViewOfFile. With no other major allocations needed, when ORT accesses a data structure or weights, the OS would simply page those weights in from disk.
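To make the mechanism concrete, this is roughly what a read-only, file-backed mapping of a weights file looks like with the Win32 APIs mentioned above (a standalone sketch, not ORT internals; the helper name and path handling are illustrative):

```cpp
// Sketch: map an external-data file read-only into the process address space.
// Pages are faulted in from the file on first touch and do not count as private
// commit; under memory pressure they can be discarded and re-read from the file.
#include <windows.h>

const void* MapWeightsFile(const wchar_t* path, HANDLE* out_file, HANDLE* out_mapping) {
  HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                            OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
  if (file == INVALID_HANDLE_VALUE) return nullptr;

  HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
  if (!mapping) { CloseHandle(file); return nullptr; }

  const void* view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);  // map the whole file
  if (!view) { CloseHandle(mapping); CloseHandle(file); return nullptr; }

  *out_file = file;
  *out_mapping = mapping;
  return view;  // caller: UnmapViewOfFile(view); CloseHandle(mapping); CloseHandle(file);
}
```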
FYI @pranavsharma and @yuslepukhin, with whom we have already been working on this in ORT.
Describe scenario use case
This will be useful for optimizing memory usage in on-device client scenarios with limited physical RAM running large CPU models.
Larger models on disk (1 GB+), for example those with billions of parameters, would utilize memory better with fully working memory-map support.
Examples include Large Language Models (LLMs) and Small Language Models such as Phi Silica 3.