
Add documentation about exporting pte on Qualcomm backend without quantization #7133

Open
DongGeun123 opened this issue Dec 2, 2024 · 3 comments
Labels:
- actionable: items in the backlog waiting for an appropriate impl/fix
- enhancement: not as big of a feature, but technically not a bug; should be easy to fix
- module: qnn: related to Qualcomm's QNN delegate
- partner: qualcomm: for backend delegation, kernels, demos, etc. from the 3rd-party partner, Qualcomm
- triaged: this issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

DongGeun123 commented Dec 2, 2024

The documentation for the Qualcomm backend (found here) only describes exporting to a .pte file with quantization.

Is there a way to use prequantized models provided in the Llama3 repository? For example, the XNNPACK backend supports using prequantized checkpoints (SpinQuant) from Hugging Face.
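(For reference, a minimal sketch of what a non-quantized QNN export could look like, based on the flow in the ExecuTorch Qualcomm tutorial: the graph is exported with plain torch.export and lowered to the QNN delegate with no quantizer or calibration step. The exact module paths, the SM8650 SoC target, and the TinyModel/model_fp.pte names are assumptions for illustration and may differ across ExecuTorch versions.)

```python
import torch
from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
from executorch.backends.qualcomm.serialization.qc_schema import QcomChipset
from executorch.backends.qualcomm.utils.utils import (
    generate_htp_compiler_spec,
    generate_qnn_executorch_compiler_spec,
)
from executorch.exir import to_edge_transform_and_lower


class TinyModel(torch.nn.Module):  # hypothetical stand-in for the real model
    def forward(self, x):
        return torch.nn.functional.relu(x)


model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# Plain torch.export with no quantize/calibrate step, so weights stay float.
exported = torch.export.export(model, example_inputs)

# Compiler spec targeting the HTP backend of a specific SoC; use_fp16=True
# runs the float graph in fp16 on the NPU.
backend_options = generate_htp_compiler_spec(use_fp16=True)
compiler_specs = generate_qnn_executorch_compiler_spec(
    soc_model=QcomChipset.SM8650,  # assumed target; pick your device's SoC
    backend_options=backend_options,
)

# Partition and lower to the QNN delegate, then serialize the .pte.
edge = to_edge_transform_and_lower(
    exported,
    partitioner=[QnnPartitioner(compiler_specs)],
)
with open("model_fp.pte", "wb") as f:
    edge.to_executorch().write_to_file(f)
```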

JacobSzwejbka added the module: qnn label Dec 2, 2024
JacobSzwejbka (Contributor) commented:
@cccclai or @larryliu0820 do you know the answer to this?

cccclai (Contributor) commented Dec 2, 2024

The prequantized models provided in the Llama3 repository (if you mean https://ai.meta.com/blog/meta-llama-quantized-lightweight-models/) mainly target CPU rather than general NPU use cases, as they bake in quite a few assumptions.

@chiwwang in case you know more about QC's plan for the quantized model.

cccclai added the partner: qualcomm label Dec 2, 2024
dbort changed the title from "export pte on Qualcomm backend without quantization" to "Add documentation about exporting pte on Qualcomm backend without quantization" Dec 2, 2024
dbort added the enhancement, actionable, and triaged labels Dec 2, 2024
chiwwang (Collaborator) commented Dec 3, 2024

@haowhsu-quic @shewu-quic

Hmm... it would be good if we could accept pre-quantized models, but Llama 3.2 is special. As things stand, we need quite a few customizations to the model structure, and some of them are hard to achieve via graph transformations, e.g., moving the KV cache to graph I/O.

So there is still a sizable gap in front of us before we can support the quantized LLMs from the official Llama repo.
