[MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation #21193

Merged on Jul 10, 2024 (14 commits)

@edgchen1 edgchen1 commented Jun 27, 2024

Description

Update the AArch64 SQNBitGemm CompInt8 kernels to process the output matrix in tiles. For example, computing the output in 2x2 tiles yields four output elements from a single read of two rows of A and two columns of B.
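
To illustrate the tiling idea, here is a minimal sketch in plain Python. It is not the MLAS kernel: the real kernels work on quantized int8 data with per-block scales and NEON instructions, whereas this sketch uses ordinary integers, and the function name `matmul_tiled_2x2` is hypothetical. It only shows how one pass over two rows of A and two columns of B can feed four accumulators.

```python
def matmul_tiled_2x2(A, B):
    """Compute C = A @ B, producing the output in 2x2 tiles.

    A is M x K and B is K x N, both given as lists of lists.
    Illustrative only; edge tiles shrink when M or N is odd.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for m in range(0, M, 2):
        for n in range(0, N, 2):
            rows = range(m, min(m + 2, M))  # up to two rows of A
            cols = range(n, min(n + 2, N))  # up to two columns of B
            for k in range(K):
                # Each A and B value is read once per tile ...
                a_vals = [A[i][k] for i in rows]
                b_vals = [B[k][j] for j in cols]
                # ... and reused for every output element in the tile.
                for di, a in enumerate(a_vals):
                    for dj, b in enumerate(b_vals):
                        C[m + di][n + dj] += a * b
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0], [0, 1], [2, 3]]
print(matmul_tiled_2x2(A, B))
```

In a full 2x2 tile, each loaded value of A or B contributes to two output elements, so the number of input reads per output element is roughly half that of computing one output element at a time.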

Also moved some code around, as it was getting too big for a single file.

Measurements

Baseline: 9eb1c2a
Updated: 3d8fe4d

Microbenchmarks

Run on an Azure VM (ARM64 Linux) with compute type CompInt8, 4 threads, and M=128, K=4096, N=4096.

| blklen | symmetric | baseline time (ns) | updated time (ns) |
|---|---|---|---|
| 16 | 1 | 76617120 | 51562668 |
| 16 | 0 | 83473648 | 58836170 |
| 32 | 1 | 35161580 | 29884869 |
| 32 | 0 | 42832905 | 33177712 |
| 64 | 1 | 35889788 | 30322006 |
| 64 | 0 | 38865041 | 31257731 |

E2E test

Run the onnxruntime-genai benchmark with Phi-3 mini using 4 threads.

| machine | baseline prompt processing (tokens/s) | updated prompt processing (tokens/s) |
|---|---|---|
| Samsung Galaxy S21 | 11.95 | 15.40 |
| Surface Pro 9 | 23.24 | 31.51 |
| Azure VM | 16.06 | 18.89 |
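
For readers who want the relative gains, the measurements above can be turned into ratios with a short script. This is only a convenience sketch: the numbers are copied from the two tables, and the variable and function names are hypothetical.

```python
# (blklen, symmetric) -> (baseline time in ns, updated time in ns)
micro_ns = {
    (16, 1): (76617120, 51562668),
    (16, 0): (83473648, 58836170),
    (32, 1): (35161580, 29884869),
    (32, 0): (42832905, 33177712),
    (64, 1): (35889788, 30322006),
    (64, 0): (38865041, 31257731),
}

# machine -> (baseline tokens/s, updated tokens/s) for prompt processing
e2e_tps = {
    "Samsung Galaxy S21": (11.95, 15.40),
    "Surface Pro 9": (23.24, 31.51),
    "Azure VM": (16.06, 18.89),
}


def time_speedup(baseline_ns, updated_ns):
    """Speedup for a latency metric: lower time is better."""
    return baseline_ns / updated_ns


def throughput_gain(baseline_tps, updated_tps):
    """Gain for a throughput metric: higher tokens/s is better."""
    return updated_tps / baseline_tps


micro_speedups = {key: time_speedup(*vals) for key, vals in micro_ns.items()}
e2e_gains = {name: throughput_gain(*vals) for name, vals in e2e_tps.items()}

for (blklen, symmetric), s in micro_speedups.items():
    print(f"blklen={blklen} symmetric={symmetric}: {s:.2f}x")
for name, g in e2e_gains.items():
    print(f"{name}: {g:.2f}x")
```

Note that the two metrics point in opposite directions: the microbenchmark ratio divides baseline by updated time, while the E2E ratio divides updated by baseline throughput.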

Motivation and Context

Improve prompt processing (M>1) performance.

@edgchen1 edgchen1 changed the title from "[MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation" to "[WIP][MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation" Jun 27, 2024
@edgchen1 edgchen1 marked this pull request as ready for review June 29, 2024 02:18
@edgchen1 edgchen1 requested a review from a team as a code owner June 29, 2024 02:18
edgchen1 commented Jun 29, 2024

In the microbenchmark measurements, why is blklen 64 asymmetric faster than symmetric?

Edit: as of 3d8fe4d, symmetric is faster.

@edgchen1 edgchen1 changed the title from "[WIP][MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation" to "[MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation" Jun 29, 2024
liqunfu previously approved these changes Jul 3, 2024
@edgchen1 edgchen1 merged commit 20cd339 into main Jul 10, 2024
98 of 100 checks passed
@edgchen1 edgchen1 deleted the edgchen1/sqnbitgemm_multi_row branch July 10, 2024 22:39