
Support for Qualcomm AI Hub sliding window models #1138

Merged: 23 commits into main on Dec 18, 2024

Conversation

@baijumeswani (Collaborator) commented Dec 10, 2024

Language models sourced from Qualcomm AI Hub have a sliding window structure. This pull request adds support for sliding window models to the decoder pipeline.

With a model sourced from Qualcomm AI Hub, execution goes through two stages:

  • Context (prompt) processing: The model runs several iterations, processing window_size tokens at a time, to compute the key-value cache and generate the first token. In each iteration, the window slides by window_size over the key-value cache buffer so the new key-value cache entries can be computed.
  • Token generation: After the input prompt has been processed, execution switches over to the token generation phase, where new tokens are processed one at a time. In this phase, the key-value cache buffer slides by 1 at each iteration.
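The prompt-processing stage above can be sketched as follows. This is a minimal illustration of the windowing idea, not the actual decoder-pipeline code; `prompt_windows` is a hypothetical helper name.

```python
def prompt_windows(prompt, window_size, pad_token=0):
    """Yield (input_ids, position_ids) windows of length window_size.

    The first window is left-padded with pad_token so that the total
    padded prompt length is a multiple of window_size. Padding positions
    reuse position 0, matching the worked example below.
    """
    n = len(prompt)
    pad = (-n) % window_size  # left-padding needed for the first window
    padded = [pad_token] * pad + list(prompt)
    positions = [0] * pad + list(range(n))
    for i in range(0, len(padded), window_size):
        yield padded[i:i + window_size], positions[i:i + window_size]
```

For an 8-token prompt and a window size of 5, this yields two windows, matching the two iterations in the example that follows.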

For example, let's assume the following inputs:

  • Prompt: ["What", "is", "the", "square", "root", "of", "16", "?"]
  • Window size: 5
  • Pad token: 0
  • Context length: 12

Given that the prompt has 8 tokens (assuming each word is a token), processing the entire prompt requires 2 iterations.

  • Iteration 1:

    • input_ids: [0, 0, "What", "is", "the"]
    • position_ids: [0, 0, 0, 1, 2]
    • attention_mask: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
  • Iteration 2:

    • input_ids: ["square", "root", "of", "16", "?"]
    • position_ids: [3, 4, 5, 6, 7]
    • attention_mask: [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

The above two iterations complete the prompt processing phase. Subsequently, the token generation phase begins, processing 1 token at a time.

At each iteration, the key-value cache buffer slides by window_size tokens during the prompt processing phase and by 1 token during the token generation phase.
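The attention masks in the example follow a simple pattern over the context-length buffer: valid tokens are right-aligned, and the count of valid tokens grows by window_size per prompt-processing iteration and by 1 per generated token. A minimal sketch (hypothetical helper, not the decoder-pipeline implementation):

```python
def attention_mask(context_length, valid):
    """Mask over the context buffer: 1 for the `valid` most recent
    (right-aligned) token slots, 0 for empty/pad slots."""
    valid = min(valid, context_length)
    return [0] * (context_length - valid) + [1] * valid
```

With context_length 12, the first prompt iteration has 3 valid tokens (the padded window contributes only 3), the second has all 8, and each subsequent generation step adds one more.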

This pull request introduces the above changes to the decoder pipeline to support sliding window models from Qualcomm AI Hub.

In addition to the above changes, the pull request introduces the QNN_WITH_SHARED_MEMORY device type so memory can be allocated using the QNN allocator. Thanks to @edgchen1 for making these changes.

Open items:

  • Continuous decoding support for these model types.
  • Incorrect results when the number of generated tokens grows large.
  • Performance.
  • Tests.

Review comments (resolved) on: src/models/debugging.cpp, src/models/input_ids.h, src/models/kv_cache.cpp, src/models/kv_cache.h, src/models/model.cpp, src/models/position_inputs.h
@baijumeswani baijumeswani changed the title Support for sliding window models Support for Qualcomm AI Hub sliding window models Dec 11, 2024
Review comments (resolved) on: cmake/cxx_standard.cmake, src/models/input_ids.cpp, src/models/input_ids.h, src/models/kv_cache.h, src/models/kv_cache.cpp, src/models/model.cpp, src/models/position_inputs.cpp, src/config.h, src/generators.cpp, src/ort_genai_c.cpp, src/generators.h
@baijumeswani baijumeswani merged commit 7a5e7f1 into main Dec 18, 2024
13 of 14 checks passed
@baijumeswani baijumeswani deleted the baijumeswani/decoder-pipeline branch December 18, 2024 20:55
@baijumeswani (Collaborator, Author):

Thank you for the review. :)
