
Support for Qualcomm AI Hub sliding window models #1138

Merged: 23 commits into main on Dec 18, 2024

Conversation

@baijumeswani (Collaborator) commented Dec 10, 2024

Language models sourced from Qualcomm AI Hub have a sliding window structure. This pull request adds support for sliding window models to the decoder pipeline.

With a model sourced from Qualcomm AI Hub, execution goes through two stages:

  • Context (prompt) processing: The model runs several iterations, processing window_size tokens at a time, to compute the key-value cache and generate the first token. In each iteration, the window slides by window_size over the key-value cache buffer so the new key-value cache entries can be computed.
  • Token generation: After the input prompt has been processed, execution switches over to the token generation phase, where new tokens are processed one at a time. In this phase, the key-value cache buffer slides by 1 at each iteration.
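The prompt-processing stage above can be sketched as follows. This is a minimal illustration of the windowing idea, not the actual decoder-pipeline code; `prompt_windows` is a hypothetical helper name.

```python
def prompt_windows(prompt, window_size, pad_token=0):
    """Yield (input_ids, position_ids) windows of length window_size.

    The first window is left-padded with pad_token so that the total
    padded prompt length is a multiple of window_size. Padding positions
    reuse position 0, matching the worked example below.
    """
    n = len(prompt)
    pad = (-n) % window_size  # left-padding needed for the first window
    padded = [pad_token] * pad + list(prompt)
    positions = [0] * pad + list(range(n))
    for i in range(0, len(padded), window_size):
        yield padded[i:i + window_size], positions[i:i + window_size]
```

For an 8-token prompt and a window size of 5, this yields two windows, matching the two iterations in the example that follows.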

For example, let's assume the following inputs:

  • Prompt: ["What", "is", "the", "square", "root", "of", "16", "?"]
  • Window size: 5
  • Pad token: 0
  • Context length: 12

Given that the prompt has 8 tokens (assuming each word is a token), processing the entire prompt requires 2 iterations.

  • Iteration 1:

    • input_ids: [0, 0, "What", "is", "the"]
    • position_ids: [0, 0, 0, 1, 2]
    • attention_mask: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
  • Iteration 2:

    • input_ids: ["square", "root", "of", "16", "?"]
    • position_ids: [3, 4, 5, 6, 7]
    • attention_mask: [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

The above two iterations complete the prompt processing phase. Subsequently, the token generation phase begins, processing 1 token at a time.

At each iteration, the key-value cache buffer slides by window_size tokens during the prompt processing phase and by 1 token during the token generation phase.
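The attention masks in the example follow a simple pattern over the context-length buffer: valid tokens are right-aligned, and the count of valid tokens grows by window_size per prompt-processing iteration and by 1 per generated token. A minimal sketch (hypothetical helper, not the decoder-pipeline implementation):

```python
def attention_mask(context_length, valid):
    """Mask over the context buffer: 1 for the `valid` most recent
    (right-aligned) token slots, 0 for empty/pad slots."""
    valid = min(valid, context_length)
    return [0] * (context_length - valid) + [1] * valid
```

With context_length 12, the first prompt iteration has 3 valid tokens (the padded window contributes only 3), the second has all 8, and each subsequent generation step adds one more.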

This pull request introduces the above changes to the decoder pipeline to support sliding window models from Qualcomm AI Hub.

In addition to the above changes, the pull request introduces the QNN_WITH_SHARED_MEMORY device type so memory can be allocated using the QNN allocator. Thanks to @edgchen1 for making these changes.

Open items:

  • Continuous decoding support for these model types.
  • Incorrect results when the number of generated tokens grows large.
  • Performance.
  • Tests.

Review comments (resolved) on: src/models/debugging.cpp, src/models/input_ids.h, src/models/kv_cache.cpp, src/models/kv_cache.h, src/models/model.cpp, src/models/position_inputs.h
@baijumeswani baijumeswani changed the title Support for sliding window models Support for Qualcomm AI Hub sliding window models Dec 11, 2024
Review comments (resolved) on: cmake/cxx_standard.cmake, src/models/input_ids.cpp, src/models/input_ids.h, src/models/kv_cache.h, src/models/kv_cache.cpp, src/models/model.cpp, src/models/position_inputs.cpp, src/config.h, src/generators.cpp, src/ort_genai_c.cpp, src/generators.h
@baijumeswani baijumeswani merged commit 7a5e7f1 into main Dec 18, 2024
13 of 14 checks passed
@baijumeswani baijumeswani deleted the baijumeswani/decoder-pipeline branch December 18, 2024 20:55
@baijumeswani (Collaborator, Author):

Thank you for the review. :)
