Support for Qualcomm AI Hub sliding window models #1138
Merged
Conversation
edgchen1 reviewed (Dec 11, 2024)
edgchen1 reviewed (Dec 11, 2024)
edgchen1 reviewed (Dec 11, 2024)
baijumeswani changed the title from "Support for sliding window models" to "Support for Qualcomm AI Hub sliding window models" (Dec 11, 2024)
edgchen1 reviewed (Dec 12, 2024)
edgchen1 reviewed (Dec 16, 2024)
edgchen1 approved these changes (Dec 18, 2024)
kunal-vaishnavi approved these changes (Dec 18, 2024)
Thank you for the review. :)
Language models sourced from Qualcomm AI Hub have a sliding window structure. This pull request adds support for sliding window models to the decoder-pipeline.
Models sourced from the Qualcomm AI Hub execute in two stages: prompt processing and token generation. During prompt processing, the model consumes `window_size` tokens at a time to compute the key-value cache and generate the first token. In each iteration, we slide the window (by `window_size` tokens) over the key-value cache buffer so the new key-value cache can be computed.

For example, assume a `window_size` of 4 and a prompt of 8 tokens (treating each word as a token). Processing the entire prompt then requires 2 iterations:
Iteration 1: process the first 4 tokens of the prompt and compute their key-value cache.
Iteration 2: slide the window, process the remaining 4 tokens, and generate the first token.
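The iteration scheme above can be sketched in a few lines of Python. This is an illustrative sketch, not the decoder-pipeline implementation; the function name `prompt_processing_iterations` and its parameters are hypothetical.

```python
# Hypothetical sketch of the prompt-processing scheme described above.
# `prompt_processing_iterations` is an illustrative name, not a real API.
def prompt_processing_iterations(prompt_tokens, window_size):
    """Yield the chunk of tokens the model consumes in each iteration."""
    for start in range(0, len(prompt_tokens), window_size):
        yield prompt_tokens[start:start + window_size]

# An 8-token prompt with window_size = 4 takes 2 iterations, matching
# the example above.
windows = list(prompt_processing_iterations(list(range(8)), window_size=4))
# windows == [[0, 1, 2, 3], [4, 5, 6, 7]]
```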
The above two iterations complete the prompt-processing phase. The token-generation phase then begins, processing 1 token at a time. At each iteration, the key-value cache buffer is slid by `window_size` tokens during prompt processing and by 1 token during token generation.
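The cache-sliding behavior in both phases can be modeled with a fixed-capacity buffer. This is a toy illustration, assuming a cache capacity of 8 entries; it is not how the decoder-pipeline actually stores key-value tensors.

```python
from collections import deque

# Illustrative model of the sliding key-value cache buffer (not the real
# implementation): a fixed-capacity buffer that evicts the oldest entries
# as new ones are appended.
cache = deque(maxlen=8)  # assumed cache capacity, for illustration only

# Prompt processing: the buffer advances by window_size (4) entries per step.
cache.extend(f"kv{i}" for i in range(4))      # iteration 1
cache.extend(f"kv{i}" for i in range(4, 8))   # iteration 2

# Token generation: the buffer advances by 1 entry per step.
cache.append("kv8")  # the oldest entry ("kv0") slides out
```

With a `deque(maxlen=...)`, the eviction of old entries as the window slides is automatic, which keeps the sketch short.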
This pull request introduces the above changes to the decoder-pipeline to support sliding window models from the Qualcomm AI Hub.
In addition to the above changes, the pull request introduces the `QNN_WITH_SHARED_MEMORY` device type so memory can be allocated using the QNN allocator. Thanks to @edgchen1 for making these changes.

Open items: