-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mlas Gemm 4bit avx2, avx512, and avx512vnni kernels #20163
Conversation
Signed-off-by: Liqun Fu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
…d shall use 32) Signed-off-by: liqunfu <[email protected]>
…klen>=64 are roughly optimized, in USE_NCOLs=false, blklen>=32 are roughly optimized. both are improved over last commit Signed-off-by: liqunfu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
2 implementations: here are the benchmark run results:
start /B /HIGH onnxruntime_mlas_benchmark.exe --benchmark_filter="SQNBITGEMM<4>/BlkLen:128/M:1024/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time" --benchmark_repetitions=10
|
Signed-off-by: liqunfu <[email protected]>
…M>1 20% improvement by using implementing simd dequantization. int8 blklen=16 significantly improved Signed-off-by: liqunfu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: Liqun Fu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: Liqun Fu <[email protected]>
good point! routed M>2 cases int8 compute to fp32. Will role bask to in8 compute after int8 compute for M>1 is improved. |
Signed-off-by: liqunfu <[email protected]>
…nxruntime into liqun/mlas-4bit-cpu merged with other changes
Signed-off-by: Liqun Fu <[email protected]>
Signed-off-by: Liqun Fu <[email protected]>
Signed-off-by: Liqun Fu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: Liqun Fu <[email protected]>
Signed-off-by: Liqun Fu <[email protected]>
Signed-off-by: Liqun Fu <[email protected]>
Signed-off-by: Liqun Fu <[email protected]>
Signed-off-by: Liqun Fu <[email protected]>
### Description ``` Avx2: Int8 NS(Prompt) MLAS(Prompt) MLAS(Prompt)Gain/Loss NS(TokenGen) MLAS(TokenGen) MLAS(TokenGen)Gain/Loss Blklen16: 90.96 25.15 -72% 7.65 11.71 53% Blklen32: 90.73 48.55 -46% 7.86 14.28 81% Blklen64: 89.49 68.84 -23% 8.30 15.78 90% Blklen128: 87.38 78.37 -10% 7.90 16.05 103% Blklen256: 89.45 82.36 -7% 8.30 16.56 99% Fp32 NS(Prompt) MLAS(Prompt) MLAS(Prompt)Gain/Loss NS(TokenGen) MLAS(TokenGen) MLAS(TokenGen)Gain/Loss Blklen16: 91.36 105.18 15% 7.57 9.52 25% Blklen32: 89.30 105.99 18% 7.65 9.68 26% Blklen64: 89.53 101.41 13% 7.97 9.84 23% Blklen128: 85.23 99.71 16% 7.86 10.39 32% Blklen256: 88.46 97.94 10% 8.32 10.23 22% Avx512vnni: Int8 NS(Prompt) MLAS(Prompt) MLAS(Prompt)Gain/Loss NS(TokenGen) MLAS(TokenGen) MLAS(TokenGen)Gain/Loss Blklen16: 132.18 21.56 -83% 10.34 11.48 11% Blklen32: 168.28 43.69 -74% 11.85 14.73 24% Blklen64: 201.81 60.29 -70% 12.36 15.47 25% Blklen128: 194.92 57.04 -71% 13.03 14.67 12% Blklen256: 218.76 70.20 -68% 13.33 16.31 22% Fp32 NS(Prompt) MLAS(Prompt) MLAS(Prompt)Gain/Loss NS(TokenGen) MLAS(TokenGen) MLAS(TokenGen)Gain/Loss Blklen16: 102.81 92.74 -9% 8.41 9.18 9% Blklen32: 109.49 97.08 -11% 8.83 11.51 30% Blklen64: 104.13 101.57 -2% 9.32 12.00 28% Blklen128: 108.45 103.69 -4% 9.58 12.45 29% Blklen256: 109.43 106.43 -2% 9.19 12.2 32% ``` --------- Signed-off-by: Liqun Fu <[email protected]> Signed-off-by: liqunfu <[email protected]> Co-authored-by: edgchen1 <[email protected]>
### Description ``` Avx2: Int8 NS(Prompt) MLAS(Prompt) MLAS(Prompt)Gain/Loss NS(TokenGen) MLAS(TokenGen) MLAS(TokenGen)Gain/Loss Blklen16: 90.96 25.15 -72% 7.65 11.71 53% Blklen32: 90.73 48.55 -46% 7.86 14.28 81% Blklen64: 89.49 68.84 -23% 8.30 15.78 90% Blklen128: 87.38 78.37 -10% 7.90 16.05 103% Blklen256: 89.45 82.36 -7% 8.30 16.56 99% Fp32 NS(Prompt) MLAS(Prompt) MLAS(Prompt)Gain/Loss NS(TokenGen) MLAS(TokenGen) MLAS(TokenGen)Gain/Loss Blklen16: 91.36 105.18 15% 7.57 9.52 25% Blklen32: 89.30 105.99 18% 7.65 9.68 26% Blklen64: 89.53 101.41 13% 7.97 9.84 23% Blklen128: 85.23 99.71 16% 7.86 10.39 32% Blklen256: 88.46 97.94 10% 8.32 10.23 22% Avx512vnni: Int8 NS(Prompt) MLAS(Prompt) MLAS(Prompt)Gain/Loss NS(TokenGen) MLAS(TokenGen) MLAS(TokenGen)Gain/Loss Blklen16: 132.18 21.56 -83% 10.34 11.48 11% Blklen32: 168.28 43.69 -74% 11.85 14.73 24% Blklen64: 201.81 60.29 -70% 12.36 15.47 25% Blklen128: 194.92 57.04 -71% 13.03 14.67 12% Blklen256: 218.76 70.20 -68% 13.33 16.31 22% Fp32 NS(Prompt) MLAS(Prompt) MLAS(Prompt)Gain/Loss NS(TokenGen) MLAS(TokenGen) MLAS(TokenGen)Gain/Loss Blklen16: 102.81 92.74 -9% 8.41 9.18 9% Blklen32: 109.49 97.08 -11% 8.83 11.51 30% Blklen64: 104.13 101.57 -2% 9.32 12.00 28% Blklen128: 108.45 103.69 -4% 9.58 12.45 29% Blklen256: 109.43 106.43 -2% 9.19 12.2 32% ``` --------- Signed-off-by: Liqun Fu <[email protected]> Signed-off-by: liqunfu <[email protected]> Co-authored-by: edgchen1 <[email protected]>
Description
Perf data from (21a892b)
(followings are perf results from 1d88398. leave it here for reference. Mlas Prompt compute for Int8 has then been speed up by routing it to fp32. perf results shown above)