Mlas Gemm 4bit avx2, avx512, and avx512vnni kernels #20163

liqunfu · 2024-03-31T23:05:44Z

Description

Perf data from (21a892b)

Avx2:
Int8
            NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   90.96       96.10         5%                     7.65          12.55           64%
Blklen32:   90.73       79.84         -12%                   7.86          15.11           92%
Blklen64:   89.49       98.01         9%                     8.30          16.04           93%
Blklen128:  87.38       102.04        16%                    7.90          16.21           105%
Blklen256:  89.45       94.13         5%                     8.30          16.60           100%

Fp32
            NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   91.36       102.60        12%                    7.57          9.07            19%
Blklen32:   89.30       83.08         -6%                    7.65          10.27           34%
Blklen64:   89.53       102.24        14%                    7.97          10.24           28%
Blklen128:  85.23       102.94        20%                    7.86          10.41           32%
Blklen256:  88.46       102.62        16%                    8.32          10.72           28%

Avx512vnni:
Int8
            NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   132.18      105.47        -20%                   10.34         13.09           26%
Blklen32:   168.28      106.43        -36%                   11.85         16.35           37%
Blklen64:   201.81      104.47        -48%                   12.36         17.48           41%
Blklen128:  194.92      104.69        -46%                   13.03         17.35           33%
Blklen256:  218.76      112.06        -48%                   13.33         16.99           27%

Fp32
            NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   102.81      117.29        14%                    8.41          12.17           44%
Blklen32:   109.49      112.87        3%                     8.83          13.40           51%
Blklen64:   104.13      111.07        6%                     9.32          11.17           19%
Blklen128:  108.45      113.08        4%                     9.58          11.39           18%
Blklen256:  109.43      113.46        3%                     9.19          11.97           30%
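
The Gain/Loss columns can be reproduced from the NS and MLAS columns. The sketch below is an illustration, not code from the PR: `GainPercent` is a hypothetical helper, and it assumes the figures are throughput numbers where higher is better and that the percentage is truncated toward zero, which appears to match the rows above.

```cpp
#include <cassert>
#include <cmath>

// Relative gain (positive) or loss (negative) of MLAS over NS, in whole percent.
// Assumption: the fractional part is truncated toward zero rather than rounded,
// because that is what the tabulated values appear to do.
int GainPercent(double ns, double mlas) {
    assert(ns > 0.0);  // NS is the baseline, so it must be positive
    return static_cast<int>(std::trunc((mlas / ns - 1.0) * 100.0));
}
```

For the Avx2 Int8 Blklen16 row, `GainPercent(90.96, 96.10)` gives 5 for prompt and `GainPercent(7.65, 12.55)` gives 64 for token generation.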

(The following are perf results from 1d88398, left here for reference. MLAS prompt compute for Int8 has since been sped up by routing it to fp32; those results are shown above.)

Avx2:
Int8
            NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   90.96       25.15         -72%                   7.65          11.71           53%
Blklen32:   90.73       48.55         -46%                   7.86          14.28           81%
Blklen64:   89.49       68.84         -23%                   8.30          15.78           90%
Blklen128:  87.38       78.37         -10%                   7.90          16.05           103%
Blklen256:  89.45       82.36         -7%                    8.30          16.56           99%

Fp32
            NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   91.36       105.18        15%                    7.57          9.52            25%
Blklen32:   89.30       105.99        18%                    7.65          9.68            26%
Blklen64:   89.53       101.41        13%                    7.97          9.84            23%
Blklen128:  85.23       99.71         16%                    7.86          10.39           32%
Blklen256:  88.46       97.94         10%                    8.32          10.23           22%

Avx512vnni:
Int8
            NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   132.18      21.56         -83%                   10.34         11.48           11%
Blklen32:   168.28      43.69         -74%                   11.85         14.73           24%
Blklen64:   201.81      60.29         -70%                   12.36         15.47           25%
Blklen128:  194.92      57.04         -71%                   13.03         14.67           12%
Blklen256:  218.76      70.20         -68%                   13.33         16.31           22%

Fp32
            NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   102.81      92.74         -9%                    8.41          9.18            9%
Blklen32:   109.49      97.08         -11%                   8.83          11.51           30%
Blklen64:   104.13      101.57        -2%                    9.32          12.00           28%
Blklen128:  108.45      103.69        -4%                    9.58          12.45           29%
Blklen256:  109.43      106.43        -2%                    9.19          12.2            32%


liqunfu · 2024-04-10T05:08:03Z

2 implementations:
USE_NCOLs = false: process one row from A and one column from B.
USE_NCOLs = true: process one row from A and NCols (4) columns from B.
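
The difference between the two variants can be sketched with plain float arithmetic. This is a simplified illustration, not the PR's kernel: it ignores 4-bit quantization and SIMD, the names `DotOneCol`, `DotNCols`, and `kNCols` are hypothetical, and B is assumed to be stored with each column contiguous over K. One plausible reason the NCols variant is faster (an assumption, not stated in the PR) is that each loaded element of A is reused for four accumulators.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kNCols = 4;  // NCols (4) from the description above

// USE_NCOLs = false: one row of A times one column of B.
// b_col points at K contiguous values of a single column.
float DotOneCol(const float* a_row, const float* b_col, std::size_t K) {
    assert(a_row != nullptr && b_col != nullptr);
    float acc = 0.0f;
    for (std::size_t k = 0; k < K; ++k) {
        acc += a_row[k] * b_col[k];
    }
    return acc;
}

// USE_NCOLs = true: one row of A times kNCols adjacent columns of B.
// b_cols holds kNCols columns back to back, each K values long.
void DotNCols(const float* a_row, const float* b_cols, std::size_t K, float* out) {
    assert(a_row != nullptr && b_cols != nullptr && out != nullptr);
    float acc[kNCols] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t k = 0; k < K; ++k) {
        const float a = a_row[k];  // loaded once, used for all four columns
        for (std::size_t n = 0; n < kNCols; ++n) {
            acc[n] += a * b_cols[n * K + k];
        }
    }
    for (std::size_t n = 0; n < kNCols; ++n) {
        out[n] = acc[n];
    }
}
```

Both functions produce the same dot products; only the amount of work done per pass over the row of A differs.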

Here are the benchmark run results:
start /B /HIGH onnxruntime_mlas_benchmark.exe --benchmark_filter="SQNBITGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time" --benchmark_repetitions=10

options:name	mean_real	mean_cpu
USE_NCOLs = false	SQNBITGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time_mean	1857795610ns
USE_NCOLs = true	SQNBITGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4	1731348910ns

start /B /HIGH onnxruntime_mlas_benchmark.exe --benchmark_filter="SQNBITGEMM<4>/BlkLen:128/M:1024/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time" --benchmark_repetitions=10

options:name	mean_real	mean_cpu
USE_NCOLs = false	SQNBITGEMM<4>/BlkLen:128/M:1024/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4	614658820ns
USE_NCOLs = true	SQNBITGEMM<4>/BlkLen:128/M:1024/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4	559375000ns
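
To put the two runs on a common scale, the mean times can be converted to a relative time reduction. The helper below is a hypothetical sketch for that arithmetic only; it takes the nanosecond values listed above as inputs.

```cpp
#include <cassert>

// Percent reduction in mean time when going from base_ns to new_ns.
double TimeReductionPercent(double base_ns, double new_ns) {
    assert(base_ns > 0.0);  // the baseline time must be positive
    return (base_ns - new_ns) / base_ns * 100.0;
}
```

With the numbers above, `TimeReductionPercent(1857795610.0, 1731348910.0)` is about 6.8 for BlkLen 32 and `TimeReductionPercent(614658820.0, 559375000.0)` is about 9.0 for BlkLen 128, in favour of USE_NCOLs = true.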


liqunfu · 2024-04-24T21:08:28Z


The prompt performance for fp32 is much better than the int8. I think we can use the fp32 prompt kernel for the int8 case.

Good point! Routed the M>2 int8 compute cases to fp32. Will roll back to int8 compute after int8 compute for M>1 is improved.
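
The routing described in this reply could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's dispatch code: `ComputeType`, `SelectCompute`, and `m_threshold` are hypothetical names, and the default threshold of 2 simply follows the "M>2" wording of the reply.

```cpp
#include <cassert>
#include <cstddef>

enum class ComputeType { Fp32, Int8 };  // hypothetical names, for illustration only

// Returns the compute type that would actually run for a GEMM with M rows.
// Assumption: int8 requests with M above m_threshold fall back to fp32,
// because the fp32 prompt kernels measured faster in the tables above.
ComputeType SelectCompute(ComputeType requested, std::size_t M,
                          std::size_t m_threshold = 2) {
    assert(M > 0);  // a GEMM needs at least one row of A
    if (requested == ComputeType::Int8 && M > m_threshold) {
        return ComputeType::Fp32;
    }
    return requested;
}
```

Token generation (M = 1) keeps the int8 path, where MLAS showed gains, while prompt-sized inputs take the fp32 path.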


Signed-off-by: Liqun Fu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Co-authored-by: edgchen1 <[email protected]>

liqunfu and others added 2 commits March 30, 2024 14:18

sqnbitgemm_kernel_avx2/avx512

bbd7cf6

Signed-off-by: Liqun Fu <[email protected]>

avx512 port from q4gemm - pass M1N1K1 w/o bias

a605e6a

Signed-off-by: liqunfu <[email protected]>

liqunfu requested a review from a team as a code owner March 31, 2024 23:05

liqunfu marked this pull request as draft March 31, 2024 23:06

liqunfu added 14 commits April 1, 2024 07:48

draft M1 avx2

5ca97a7

Signed-off-by: liqunfu <[email protected]>

pass avx2 M1 tests

f3b1298

Signed-off-by: liqunfu <[email protected]>

port draft for M1Int8

868a4dc

Signed-off-by: liqunfu <[email protected]>

pass avx512 M1 int8 except blklen16

fe3f8fa

Signed-off-by: liqunfu <[email protected]>

draft dequant B. still layout is not ready for GemmFloatKernel

63eaea6

Signed-off-by: liqunfu <[email protected]>

pass M* Fp32 tests

b86157e

Signed-off-by: liqunfu <[email protected]>

pass blklen16/Int8

b5cd8f8

Signed-off-by: liqunfu <[email protected]>

Merge branch 'main' into liqun/mlas-4bit-cpu

652c172

fix FoldAccumulators

649f899

Signed-off-by: liqunfu <[email protected]>

try subblk 64

3b43b67

Signed-off-by: liqunfu <[email protected]>

subblk len=64 works with avx512(20% improvelent) avx2 (much slower an…

e01570c

…d shall use 32) Signed-off-by: liqunfu <[email protected]>

int8 Both USE_NCOLs true and false are passing. in USE_NCOLs=true, bl…

563eeec

…klen>=64 are roughly optimized, in USE_NCOLs=false, blklen>=32 are roughly optimized. both are improved over last commit Signed-off-by: liqunfu <[email protected]>

having USE_NCOLs options for NCols=4 and NCols=1: former is 5-10% faster

ce285aa

Signed-off-by: liqunfu <[email protected]>

rename avx2 to avx512 because avx512 ops are used. avx2 not supported

949bdbe

Signed-off-by: liqunfu <[email protected]>

liqunfu added 4 commits April 12, 2024 21:58

experiment load int4 and comments

021ed71

Signed-off-by: liqunfu <[email protected]>

fp32 avx512 M1 5 times improvement by porting existing q4 code, fp32 …

4b02e45

…M>1 20% improvement by using implementing simd dequantization. int8 blklen=16 significantly improved Signed-off-by: liqunfu <[email protected]>

MlasQ80BlkQuantRow_avx2

23e981c

Signed-off-by: liqunfu <[email protected]>

support avx2, refactor for avx2, avx512, vnni kernels

6620aab

Signed-off-by: liqunfu <[email protected]>

yufenglee reviewed Apr 17, 2024

onnxruntime/core/mlas/lib/platform.cpp

yufenglee reviewed Apr 17, 2024

onnxruntime/test/contrib_ops/matmul_4bits_test.cc

yufenglee reviewed Apr 17, 2024

onnxruntime/test/mlas/unittest/test_sqnbitgemm.cpp

yufenglee reviewed Apr 17, 2024

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx512.cpp

fix avx2

c40bc23

Signed-off-by: Liqun Fu <[email protected]>

liqunfu marked this pull request as ready for review April 18, 2024 03:30

liqunfu changed the title ~~Liqun/mlas 4bit cpu~~ Liqun/mlas Gemm 4bit avx2, avx512, and avx512vnni kernels Apr 18, 2024

liqunfu added 2 commits April 23, 2024 16:42

remove a std::cout

7f5fa06

Signed-off-by: liqunfu <[email protected]>

bring back avx512 and vnni

1d88398

Signed-off-by: liqunfu <[email protected]>

yufenglee reviewed Apr 24, 2024

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2.cpp

onnxruntime_USE_NEURAL_SPEED OFF

64d777e

Signed-off-by: Liqun Fu <[email protected]>

edgchen1 reviewed Apr 24, 2024

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx_common.h

edgchen1 reviewed Apr 24, 2024

liqunfu and others added 14 commits April 24, 2024 15:32

use fp32 for M>1 cases for int8 compute

4a173d9

Signed-off-by: liqunfu <[email protected]>

Merge branch 'liqun/mlas-4bit-cpu' of https://github.com/microsoft/on…

d3a00a1

…nxruntime into liqun/mlas-4bit-cpu merged with other changes

incorrect use of _cvtepu8_epi16

4a09bf3

Signed-off-by: Liqun Fu <[email protected]>

avoid condition in loops

919dd76

Signed-off-by: Liqun Fu <[email protected]>

Merge branch 'main' into liqun/mlas-4bit-cpu

ddebcb0

update doc

7e1fef1

Signed-off-by: Liqun Fu <[email protected]>

fix b0ptr += count_half_4 / 2; missing

81b865f

Signed-off-by: liqunfu <[email protected]>

fix bug - used wrong pointer in SQ4BitGemmM1Kernel_CompInt8_Impl_BlkL…

53577a9

…en32

fix parameter ordering in test

13aa5c1

lint

9228531

Signed-off-by: Liqun Fu <[email protected]>

lint

21c7896

Signed-off-by: Liqun Fu <[email protected]>

move count_half_4 out of loop

3b2a4e9

Signed-off-by: Liqun Fu <[email protected]>

comments

8a69e1c

Signed-off-by: Liqun Fu <[email protected]>

reduce dml test to original

20649bc

Signed-off-by: Liqun Fu <[email protected]>

yufenglee approved these changes Apr 26, 2024

liqunfu merged commit cc26b2d into main Apr 26, 2024
89 of 94 checks passed

liqunfu deleted the liqun/mlas-4bit-cpu branch April 26, 2024 04:30

sophies927 added the release:1.18.0 and triage:approved labels May 1, 2024

yihonglyu added the cherry-picked label May 4, 2024

yihonglyu added the rel-merged label May 8, 2024
