
Add documentation about exporting pte on Qualcomm backend without quantization #7133

Open
DongGeun123 opened this issue Dec 2, 2024 · 3 comments
Labels:
- actionable: items in the backlog waiting for an appropriate impl/fix
- enhancement: not as big of a feature, but technically not a bug; should be easy to fix
- module: qnn: related to Qualcomm's QNN delegate
- partner: qualcomm: for backend delegation, kernels, demos, etc. from the 3rd-party partner, Qualcomm
- triaged: this issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

DongGeun123 commented Dec 2, 2024

The documentation for the Qualcomm backend (found here) only describes exporting to a .pte file with quantization.

Is there a way to use prequantized models provided in the Llama3 repository? For example, the XNNPACK backend supports using prequantized checkpoints (SpinQuant) from Hugging Face.
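(For reference, a minimal sketch of what a non-quantized QNN export could look like, based on the flow in the ExecuTorch Qualcomm tutorial: the graph is exported with plain torch.export and lowered to the QNN delegate with no quantizer or calibration step. The exact module paths, the SM8650 SoC target, and the TinyModel/model_fp.pte names are assumptions for illustration and may differ across ExecuTorch versions.)

```python
import torch
from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
from executorch.backends.qualcomm.serialization.qc_schema import QcomChipset
from executorch.backends.qualcomm.utils.utils import (
    generate_htp_compiler_spec,
    generate_qnn_executorch_compiler_spec,
)
from executorch.exir import to_edge_transform_and_lower


class TinyModel(torch.nn.Module):  # hypothetical stand-in for the real model
    def forward(self, x):
        return torch.nn.functional.relu(x)


model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# Plain torch.export with no quantize/calibrate step, so weights stay float.
exported = torch.export.export(model, example_inputs)

# Compiler spec targeting the HTP backend of a specific SoC; use_fp16=True
# runs the float graph in fp16 on the NPU.
backend_options = generate_htp_compiler_spec(use_fp16=True)
compiler_specs = generate_qnn_executorch_compiler_spec(
    soc_model=QcomChipset.SM8650,  # assumed target; pick your device's SoC
    backend_options=backend_options,
)

# Partition and lower to the QNN delegate, then serialize the .pte.
edge = to_edge_transform_and_lower(
    exported,
    partitioner=[QnnPartitioner(compiler_specs)],
)
with open("model_fp.pte", "wb") as f:
    edge.to_executorch().write_to_file(f)
```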

JacobSzwejbka added the module: qnn label Dec 2, 2024
JacobSzwejbka (Contributor) commented:
@cccclai or @larryliu0820 do you know the answer to this?

cccclai (Contributor) commented Dec 2, 2024

The prequantized models provided in the Llama3 repository (if you mean https://ai.meta.com/blog/meta-llama-quantized-lightweight-models/) mainly target CPU rather than general NPU use cases, as they bake in quite a few assumptions.

@chiwwang in case you know more about QC's plan for the quantized model.

cccclai added the partner: qualcomm label Dec 2, 2024
dbort changed the title from "export pte on Qualcomm backend without quantization" to "Add documentation about exporting pte on Qualcomm backend without quantization" Dec 2, 2024
dbort added the enhancement, actionable, and triaged labels Dec 2, 2024
chiwwang (Collaborator) commented Dec 3, 2024

@haowhsu-quic @shewu-quic

Hmm... it would be good if we could accept pre-quantized models, but Llama 3.2 is special. As things stand, we need quite a few customizations to the model structure, and some of them are hard to achieve via graph transformations, e.g., moving the KV cache to graph I/O.

So there is still a sizable gap in front of us before we can support the quantized LLMs from the official Llama repo.
