
Fusing a mat mul op followed by a scale op on the CPU #5

Draft · ikawrakow wants to merge 1 commit into main
Conversation

ikawrakow (Owner) commented Jul 27, 2024

This is useful for Bitnet, where almost all matrix multiplications are followed by scale operations.
As a result, we get a ~2% boost in Bitnet PP performance.
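To illustrate the pattern being fused, here is a minimal standalone sketch of detecting a matmul whose result is consumed only by a scale op, so the two can be computed in one kernel call. The names (`Node`, `can_fuse_mul_mat_scale`, `n_consumers`) are illustrative assumptions, not the actual ggml/iqk_mul_mat API:

```cpp
#include <cstdio>

enum class Op { MUL_MAT, SCALE, OTHER };

struct Node {
    Op    op;
    Node* src0  = nullptr;  // input node (the MUL_MAT result, for a SCALE)
    float scale = 1.0f;     // scalar parameter, only meaningful for SCALE
    int   n_consumers = 0;  // how many later nodes read this node's output
};

// True if `node` is a SCALE whose input is a MUL_MAT consumed only by this
// SCALE - the pattern that can be collapsed into a single fused kernel.
static bool can_fuse_mul_mat_scale(const Node* node) {
    return node->op == Op::SCALE &&
           node->src0 != nullptr &&
           node->src0->op == Op::MUL_MAT &&
           node->src0->n_consumers == 1;
}

int main() {
    Node mm { Op::MUL_MAT };
    mm.n_consumers = 1;
    Node sc { Op::SCALE, &mm, 0.25f };

    if (can_fuse_mul_mat_scale(&sc)) {
        // One fused call replaces MUL_MAT + SCALE: the matmul kernel applies
        // sc.scale while writing the result, skipping a second memory pass.
        printf("fuse: mul_mat * %g in a single kernel\n", sc.scale);
    }
}
```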

Implementation is easy when the matrix multiplication is done by iqk_mul_mat. But if iqk_mul_mat is not implemented for the given quant type/architecture, the scaling would need to be added to the llamafile sgemm and to ggml itself, which is much messier, so I haven't done it yet.
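A minimal sketch of why the fusion is cheap once you control the matmul kernel (this is not the actual iqk_mul_mat code, just the idea): the scale is applied in the epilogue as each accumulated dot product is stored, instead of re-reading the whole result tensor in a separate SCALE pass.

```cpp
#include <cstdio>
#include <cstddef>

// C[M x N] = scale * (A[M x K] * B[K x N]), row-major, naive loops.
void mul_mat_scaled(const float* A, const float* B, float* C,
                    size_t M, size_t N, size_t K, float scale) {
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k) {
                acc += A[i*K + k] * B[k*N + j];
            }
            // Fused epilogue: one extra multiply per output element,
            // versus a full second pass over C for a separate scale op.
            C[i*N + j] = scale * acc;
        }
    }
}

int main() {
    const float A[4] = {1, 2, 3, 4};  // 2 x 2
    const float B[4] = {5, 6, 7, 8};  // 2 x 2
    float       C[4];
    mul_mat_scaled(A, B, C, 2, 2, 2, 0.5f);
    printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]);  // 9.5 11 21.5 25
}
```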

Given that Bitnet is a niche thing for now, I'll leave this as a draft PR.
