[MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation #21193

edgchen1 · 2024-06-27T18:48:40Z

Description

Update the AArch64 SQNBitGemm CompInt8 kernels to process the output matrix in tiles. For example, computing the output in 2x2 tiles lets us compute four output elements with a single read of two rows of A and two columns of B.
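To illustrate the 2x2 tiling idea, here is a minimal scalar sketch. It is not the NEON kernel from this PR: it leaves out the 4-bit unpacking, per-block scales, and zero points that SQNBitGemm handles, and only shows how each loaded value of A and B is reused for two outputs. The names `DotInt8`, `GemmInt8Reference`, and `GemmInt8Tiled2x2` are hypothetical, and the layout (A row-major, B stored with each column contiguous) is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: int8 dot product of length K, accumulated in int32.
static std::int32_t DotInt8(const std::int8_t* a, const std::int8_t* b, std::size_t K) {
    std::int32_t acc = 0;
    for (std::size_t k = 0; k < K; ++k) {
        acc += static_cast<std::int32_t>(a[k]) * static_cast<std::int32_t>(b[k]);
    }
    return acc;
}

// Reference: one output element at a time.
// Assumed layout: A is M x K row-major; column n of B is stored contiguously at B[n * K].
std::vector<std::int32_t> GemmInt8Reference(const std::vector<std::int8_t>& A,
                                            const std::vector<std::int8_t>& B,
                                            std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == N * K);
    std::vector<std::int32_t> C(M * N, 0);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            C[m * N + n] = DotInt8(A.data() + m * K, B.data() + n * K, K);
        }
    }
    return C;
}

// Tiled: each pass over K reads two rows of A and two columns of B once
// and updates four accumulators (a 2x2 tile of C).
std::vector<std::int32_t> GemmInt8Tiled2x2(const std::vector<std::int8_t>& A,
                                           const std::vector<std::int8_t>& B,
                                           std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == N * K);
    std::vector<std::int32_t> C(M * N, 0);
    std::size_t m = 0;
    for (; m + 2 <= M; m += 2) {
        const std::int8_t* a0 = A.data() + m * K;
        const std::int8_t* a1 = a0 + K;
        std::size_t n = 0;
        for (; n + 2 <= N; n += 2) {
            const std::int8_t* b0 = B.data() + n * K;
            const std::int8_t* b1 = b0 + K;
            std::int32_t c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (std::size_t k = 0; k < K; ++k) {
                // Four loads feed four multiply-accumulates.
                const std::int32_t x0 = a0[k], x1 = a1[k];
                const std::int32_t y0 = b0[k], y1 = b1[k];
                c00 += x0 * y0;
                c01 += x0 * y1;
                c10 += x1 * y0;
                c11 += x1 * y1;
            }
            C[m * N + n] = c00;
            C[m * N + n + 1] = c01;
            C[(m + 1) * N + n] = c10;
            C[(m + 1) * N + n + 1] = c11;
        }
        // Leftover column when N is odd: fall back to single dot products.
        for (; n < N; ++n) {
            C[m * N + n] = DotInt8(a0, B.data() + n * K, K);
            C[(m + 1) * N + n] = DotInt8(a1, B.data() + n * K, K);
        }
    }
    // Leftover row when M is odd.
    for (; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            C[m * N + n] = DotInt8(A.data() + m * K, B.data() + n * K, K);
        }
    }
    return C;
}
```

In the untiled version, every output element needs 2K loads for K multiply-accumulates. In the 2x2 tile, 4K loads serve 4K multiply-accumulates, which halves the loads per output element. That is the kind of reuse the description refers to.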

Also split the implementation across multiple files, as it was getting too big for a single file.

Measurements

Baseline: 9eb1c2a
Updated: 3d8fe4d

Microbenchmarks

Run on an Azure VM (ARM64 Linux) with compute type CompInt8, 4 threads, and M=128, K=4096, N=4096.

blklen	symmetric	baseline time (ns)	updated time (ns)
16	1	76617120	51562668
16	0	83473648	58836170
32	1	35161580	29884869
32	0	42832905	33177712
64	1	35889788	30322006
64	0	38865041	31257731

E2E test

Run onnxruntime-genai benchmark with Phi-3 mini using 4 threads.

machine	baseline prompt processing tokens/second	updated prompt processing tokens/second
Samsung Galaxy S21	11.95	15.40
Surface Pro 9	23.24	31.51
Azure VM	16.06	18.89

Motivation and Context

Improve prompt processing (M>1) performance.

edgchen1 · 2024-06-29T02:29:11Z

In the microbenchmark measurements, why is blklen 64 asymmetric faster than symmetric?

Edit: in 3d8fe4d, symmetric is faster.

onnxruntime/core/mlas/lib/sqnbitgemm.cpp

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2.cpp

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_neon_int8.cpp

onnxruntime/core/mlas/lib/sqnbitgemm.h

edgchen1 added 10 commits June 18, 2024 14:34

initial impl for m=2 kernel that computes 2x2 outputs at a time, for …

0b3db2b

…blklen > 32

support blklen 32, zero point

e338ef0

implement tiling for blklen 32

10505a6

move to tiling approach for sqnbitgemm compint8 impl

b7bb4d2

fix returned registered test count

fdfd25a

use variable for HasZeroPoint template parameter value

19f13ff

split out sqnbitgemm ARM NEON impl into multiple files

abc7010

Merge remote-tracking branch 'origin/main' into edgchen1/sqnbitgemm_m…

36c47c7

…ulti_row

update sqnbitgemm avx code to use new SQ4BitGemmKernel_CompInt8 inter…

f616808

…face

put impl into unnamed namespace, comment

f5b1817

edgchen1 changed the title ~~[MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation~~ [WIP][MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation Jun 27, 2024

fix zp loading

e35f2b3

edgchen1 marked this pull request as ready for review June 29, 2024 02:18

edgchen1 requested a review from a team as a code owner June 29, 2024 02:18

edgchen1 changed the title ~~[WIP][MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation~~ [MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation Jun 29, 2024

edgchen1 commented Jun 29, 2024

onnxruntime/core/mlas/lib/sqnbitgemm.cpp (outdated, resolved)

edgchen1 added 2 commits July 1, 2024 12:41

fix post processor call arguments

fbc6c8a

helper functions for advancing row/col ptrs

3d8fe4d

liqunfu reviewed Jul 3, 2024

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2.cpp (resolved)

liqunfu reviewed Jul 3, 2024

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_neon_int8.cpp (resolved)

liqunfu reviewed Jul 3, 2024

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_neon_int8.cpp (resolved)

liqunfu previously approved these changes Jul 3, 2024

yufenglee reviewed Jul 4, 2024

onnxruntime/core/mlas/lib/sqnbitgemm.h (outdated, resolved)

fix indentation

828e8de

edgchen1 dismissed liqunfu’s stale review via 828e8de July 5, 2024 18:02

yufenglee approved these changes Jul 10, 2024

edgchen1 merged commit 20cd339 into main Jul 10, 2024
98 of 100 checks passed

edgchen1 deleted the edgchen1/sqnbitgemm_multi_row branch July 10, 2024 22:39
