[MLAS][AArch64] SQNBitGemm CompInt8 - Use 4x2 tiles #21380

edgchen1 · 2024-07-17T03:31:10Z

Update the SQNBitGemm ARM NEON kernel to compute a 4x2 tile of output.

Note: Also tried 2x4 and 4x4 tiles but observed the best microbenchmark results with 4x2 tiles.
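
For readers unfamiliar with output tiling, the following is a minimal scalar sketch of what computing a 4x2 tile means. It is not the MLAS kernel: the real implementation uses NEON intrinsics and block-quantized 4-bit weights, while this sketch assumes plain int8 inputs, an `N x K` layout for B, and `M` and `N` divisible by the tile size. The function names `GemmReference` and `GemmTiled4x2` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference GEMM: one output element at a time.
// A is M x K (row-major). B is stored as N x K, so the K values for each
// output column are contiguous (an assumption made for this sketch).
// C is M x N (row-major).
std::vector<int32_t> GemmReference(const std::vector<int8_t>& A,
                                   const std::vector<int8_t>& B,
                                   size_t M, size_t N, size_t K) {
    std::vector<int32_t> C(M * N, 0);
    for (size_t m = 0; m < M; ++m) {
        for (size_t n = 0; n < N; ++n) {
            int32_t acc = 0;
            for (size_t k = 0; k < K; ++k) {
                acc += int32_t(A[m * K + k]) * int32_t(B[n * K + k]);
            }
            C[m * N + n] = acc;
        }
    }
    return C;
}

// Tiled GEMM: each inner iteration updates a 4 (rows) x 2 (columns) tile.
// Per step along K, 4 values of A and 2 values of B are loaded once and
// reused for 8 multiply-accumulates.
std::vector<int32_t> GemmTiled4x2(const std::vector<int8_t>& A,
                                  const std::vector<int8_t>& B,
                                  size_t M, size_t N, size_t K) {
    // Simplification: no remainder handling for partial tiles.
    assert(M % 4 == 0 && N % 2 == 0);
    std::vector<int32_t> C(M * N, 0);
    for (size_t m = 0; m < M; m += 4) {
        for (size_t n = 0; n < N; n += 2) {
            int32_t acc[4][2] = {};  // accumulators for the 4x2 tile
            for (size_t k = 0; k < K; ++k) {
                const int32_t b0 = B[n * K + k];
                const int32_t b1 = B[(n + 1) * K + k];
                for (size_t i = 0; i < 4; ++i) {
                    const int32_t a = A[(m + i) * K + k];
                    acc[i][0] += a * b0;
                    acc[i][1] += a * b1;
                }
            }
            for (size_t i = 0; i < 4; ++i) {
                C[(m + i) * N + n] = acc[i][0];
                C[(m + i) * N + n + 1] = acc[i][1];
            }
        }
    }
    return C;
}
```

Both functions return the same result. The tiled version only changes how often each input value is loaded per multiply-accumulate, which is the kind of reuse a larger output tile is meant to provide.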

Baseline: 20cd339
Updated: aecb18a

Run on an Azure VM (ARM64 Linux) with compute type CompInt8, 4 threads, and M=128, K=4096, N=4096:

| blklen | symmetric | baseline time (ns) | updated time (ns) |
| --- | --- | --- | --- |
| 16 | 1 | 51511766 | 44803227 |
| 16 | 0 | 58870228 | 49002040 |
| 32 | 1 | 29887367 | 25812083 |
| 32 | 0 | 33208816 | 26632430 |
| 64 | 1 | 30344972 | 26624130 |
| 64 | 0 | 31460702 | 25966747 |

Ran the onnxruntime-genai benchmark with Phi-3 mini using 4 threads:

| machine | baseline prompt processing tokens/second | updated prompt processing tokens/second |
| --- | --- | --- |
| Samsung Galaxy S21 | 15.93 | 17.93 |
| Surface Pro 9 | 33.36 | 36.60 |
| Azure VM | 18.92 | 21.84 |

Improve prompt processing (M>1) performance.

edgchen1 added 7 commits July 10, 2024 16:56

add Compute4x2 impl for blklen 32

ab134ac

add 2x4 impl for blklen 32

f837239

Revert "add 2x4 impl for blklen 32"

277df8d

This reverts commit f837239. The 4x2 impl was faster in microbenchmark measurements.

use 4x2 tiles for blklen > 32

a2f5c6b

4x2 impl for blklen 16

768050a

move functions around

d796966

make dot variables const

aecb18a

edgchen1 marked this pull request as ready for review July 17, 2024 22:55

edgchen1 requested a review from a team as a code owner July 17, 2024 22:55

liqunfu approved these changes Jul 18, 2024

fajin-corp approved these changes Jul 18, 2024

edgchen1 merged commit 05fc0c6 into main Jul 18, 2024
99 checks passed

edgchen1 deleted the edgchen1/sqnbitgemm_larger_tiles branch July 18, 2024 20:37