
Enable QNN HTP spill fill buffer setting to save RAM usage. #22853

Merged: 12 commits into main from qnn_spill_fill, Dec 6, 2024

Conversation

HectorSVC
Contributor

@HectorSVC HectorSVC commented Nov 15, 2024

Description

Enable QNN HTP spill fill buffer setting to save RAM usage.
This feature is available starting with QNN 2.28, and the QNN context binary needs to be re-generated to use it.
https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_backend.html#qnn-htp-backend-api

Requirements:

  1. Linux x86_64 platform + QNN 2.28.2.241116
  2. Weight sharing must be enabled
  3. Re-generate the ONNX model with the QNN context binary by setting the EP option enable_htp_spill_fill_buffer = 1
  4. Works for a model with multiple context binaries; the multiple ONNX models with context binaries must be manually merged into one ONNX model

Example command to generate the context model:
./onnxruntime_qnn_ctx_gen -i "soc_model|60 htp_graph_finalization_optimization_mode|3 enable_htp_spill_fill_buffer|1" /mnt/model/share1_part_1.onnx,/mnt/model/share2_part_1.onnx
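
The same EP option can also be wired up through the ONNX Runtime Python API when dumping a context model. A minimal sketch, assuming the documented ep.context_* session config keys and QNN EP provider options; the model paths are hypothetical, and this shows only the option wiring, not the weight-sharing merge of multiple models:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Ask ORT to dump an EPContext model when the session compiles the graph.
so.add_session_config_entry("ep.context_enable", "1")
so.add_session_config_entry("ep.context_file_path", "model_ctx.onnx")  # hypothetical output path

# Creating the session triggers compilation and writes the context model.
sess = ort.InferenceSession(
    "model.onnx",  # hypothetical source model
    sess_options=so,
    providers=["QNNExecutionProvider"],
    provider_options=[{
        "backend_path": "libQnnHtp.so",  # HTP backend (Linux)
        "soc_model": "60",
        "htp_graph_finalization_optimization_mode": "3",
        "enable_htp_spill_fill_buffer": "1",  # the option this PR adds
    }],
)
```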

No extra steps are needed when running model inference.
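
For illustration, a minimal inference sketch assuming the Python API; the model path is hypothetical, and the dummy-input construction is purely illustrative:

```python
import numpy as np
import onnxruntime as ort

# Run the pre-generated context model; no spill-fill-specific options are
# needed at inference time.
sess = ort.InferenceSession(
    "model_ctx.onnx",  # hypothetical merged EP context model
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "libQnnHtp.so"}],
)

# Build a dummy feed from the model's own input metadata; dynamic dims are
# replaced by 1 and float32 is assumed, purely for illustration.
inp = sess.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
outputs = sess.run(None, {inp.name: np.zeros(shape, dtype=np.float32)})
```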

The generated EPContext node will have a max_size attribute that records the maximum spill fill buffer size for the context binary.
[screenshot: EPContext node showing the max_size attribute]

@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@HectorSVC HectorSVC closed this Nov 15, 2024
@HectorSVC HectorSVC reopened this Nov 15, 2024
@HectorSVC HectorSVC added the ep:QNN issues related to QNN execution provider label Nov 15, 2024
@HectorSVC
Contributor Author

@chiwwang, could you help take a look?

@chiwwang

Hi Hector,
This looks good to me, but let me ping others and see if they can also take a look.

@HectorSVC
Contributor Author

HectorSVC commented Nov 22, 2024

Comments from QC: The approach has the limitation that it always takes the max spill fill buffer size from the first QNN context, but the max spill fill buffer size should be taken across all QNN contexts. To fill the gap, we would need to go through all QNN contexts to:

  1. Load each QNN context binary buffer and extract its max spill fill buffer size
  2. Compare the max spill fill buffer sizes across all QNN contexts and track the index of the context with the largest one
  3. Load and deserialize the QNN context with the max spill fill buffer size first (to get the graph info for later execution), set the max spill fill buffer size, and set its group handle to 0
  4. Load and deserialize the other QNN contexts, set the max spill fill buffer size, and set their group handle to the context from step 3

Since this feature mostly targets large models with large context binaries, steps 1 and 2 would add significant overhead. Another approach is to dump the max spill fill buffer size for each QNN context into the EPContext node when we generate the model, so the information is ready ahead of time instead of being computed during normal session creation. We can then read the information from all EPContext nodes, find the max size, and load that context first.
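
For illustration, a minimal sketch of that EPContext-attribute approach using the onnx Python package; the merged model path is hypothetical, and max_size is assumed to be an integer attribute as shown in the screenshot above:

```python
import onnx
from onnx import helper

model = onnx.load("merged_ctx.onnx")  # hypothetical merged context model
ep_ctx_nodes = [n for n in model.graph.node if n.op_type == "EPContext"]

# Read the max_size attribute recorded on each EPContext node at generation
# time, so no context binary has to be loaded just to measure it.
sizes = []
for node in ep_ctx_nodes:
    attrs = {a.name: a for a in node.attribute}
    size = helper.get_attribute_value(attrs["max_size"]) if "max_size" in attrs else 0
    sizes.append(size)

first = max(range(len(sizes)), key=sizes.__getitem__)
print(f"Deserialize context #{first} first (max_size = {sizes[first]} bytes); "
      f"the remaining contexts then share its spill-fill group handle.")
```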

…ad QnnSystem lib which is not available for Windows x86_64 platform, so that not breaking existing workflow on x86_64 system
@quic-ashigarg

Based on my understanding, the following is occurring:

  • Determine the maximum spill-fill size for each context (a context can contain multiple graphs).
  • Identify the maximum size among all the contexts associated with a group. The first index will always be the context with the maximum size.
  • For each context in the list, if it is the first one in the group (i.e., the one with the maximum spill-fill value), use a context ID of 0x0; otherwise, use the context ID of the first one in the group.

If so, this seems good.
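
A hedged pseudocode rendering of that rule; create_context and the handles are illustrative stand-ins, not the actual QNN API:

```python
def create_context(ctx_binary, group_handle):
    # Stand-in for the real QNN create-context-from-binary call carrying the
    # spill-fill group config; returns a fake handle for illustration.
    return object()

def attach_contexts(contexts_sorted_desc_by_max_size):
    handles = []
    group_handle = 0x0  # the first (largest) context uses group handle 0x0
    for i, ctx in enumerate(contexts_sorted_desc_by_max_size):
        handle = create_context(ctx, group_handle=group_handle)
        if i == 0:
            # Later contexts join the group anchored by the first context.
            group_handle = handle
        handles.append(handle)
    return handles
```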

@HectorSVC HectorSVC closed this Dec 6, 2024
@HectorSVC HectorSVC reopened this Dec 6, 2024
@HectorSVC HectorSVC merged commit 401d16c into main Dec 6, 2024
116 checks passed
@HectorSVC HectorSVC deleted the qnn_spill_fill branch December 6, 2024 19:36
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Dec 11, 2024
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Dec 11, 2024
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Dec 11, 2024
tarekziade pushed a commit to tarekziade/onnxruntime that referenced this pull request Jan 10, 2025