
Integrate high-performance x64 gemm library to MLAS #17669

Merged 115 commits into microsoft:main on Dec 19, 2023

Conversation

@luoyu-intel (Contributor) commented on Sep 22, 2023:

Description

Improve MLAS to support high-performance x64 INT4 kernels

Motivation and Context

  1. Improve LLM inference performance on Intel CPUs.
  2. Support more 4-bit quantization types: nf4, fp4.
  3. Support dynamic block sizes: block sizes aligned with the kernel's tiling size (e.g. 4 for the VNNI kernel), or per-channel on the N dimension.
  4. Support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni, amx_bf16, amx_int8, avx512_fp16.
  5. Support MatMulNBits' data format.
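
To make the block-size points above concrete, here is a minimal sketch of symmetric block-wise INT4 quantization. All names here are illustrative, not the actual MLAS API; one fp32 scale is stored per block of `block_size` weights, and `block_size == K` corresponds to the per-channel case.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative sketch, not the actual MLAS API: symmetric block-wise
// INT4 quantization along K, one fp32 scale per block.
struct QuantizedBlocks {
  std::vector<int8_t> q;      // values in [-8, 7], one per weight
  std::vector<float> scales;  // one scale per block
};

QuantizedBlocks QuantizeSymInt4(const std::vector<float>& w, size_t block_size) {
  QuantizedBlocks out;
  out.q.resize(w.size());
  for (size_t start = 0; start < w.size(); start += block_size) {
    size_t end = std::min(start + block_size, w.size());
    float amax = 0.0f;
    for (size_t i = start; i < end; ++i) amax = std::max(amax, std::fabs(w[i]));
    float scale = amax / 7.0f;  // map [-amax, amax] onto [-7, 7]
    out.scales.push_back(scale);
    for (size_t i = start; i < end; ++i) {
      int v = scale > 0.0f ? static_cast<int>(std::lround(w[i] / scale)) : 0;
      out.q[i] = static_cast<int8_t>(std::clamp(v, -8, 7));
    }
  }
  return out;
}

float Dequantize(const QuantizedBlocks& qb, size_t block_size, size_t i) {
  return qb.q[i] * qb.scales[i / block_size];
}
```

Smaller blocks mean more scales (more accuracy, more metadata); the per-channel case keeps a single scale for the whole column.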

Tasks

  • Support block_size: 32, 128, -1 (per-channel)
  • Get the weight pack size without allocating memory
  • Use ORT's thread pool for parallelism
  • Support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8
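
The "pack size without memory allocation" task usually implies a two-call pattern: a pure size query first, then packing into a caller-provided buffer, so the caller (e.g. ORT's allocator) owns all memory. A hedged sketch, where `Q4PackedSize` is a hypothetical helper assuming two 4-bit values per byte plus one fp32 scale per block (K even):

```cpp
#include <cstddef>

// Hypothetical size query: computes the packed-buffer size without
// allocating, so the caller can allocate once and then pack in place.
size_t Q4PackedSize(size_t N, size_t K, size_t block_size) {
  size_t blocks_per_col = (K + block_size - 1) / block_size;
  size_t data_bytes = N * K / 2;                // packed int4 weights
  size_t scale_bytes = N * blocks_per_col * 4;  // one fp32 scale per block
  return data_bytes + scale_bytes;
}
```

A caller would query the size, allocate from its own pool, and then invoke the pack routine with that buffer; the actual MLAS layout may carry extra metadata (zero points, alignment padding) beyond this sketch.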

Benchmark

Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+, 56 cores:

| Benchmark | Time | CPU | Iterations |
| --- | ---: | ---: | ---: |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 47613 | 47401 | 12970 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 6347792 | 6317562 | 109 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 11814014 | 11757847 | 59 |
| Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 50222 | 50031 | 13759 |
| Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 2038222 | 2028743 | 341 |
| Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 3792832 | 3774485 | 191 |
| Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 58717 | 58501 | 11467 |
| Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 1360846 | 1354598 | 543 |
| Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 2564232 | 2551365 | 266 |
| Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 57929 | 57694 | 12047 |
| Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5495330 | 5465810 | 126 |
| Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10676240 | 10617817 | 66 |
| Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 68305 | 68047 | 10026 |
| Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5504862 | 5476215 | 126 |
| Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 11758623 | 11697337 | 66 |
| Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 67713 | 67451 | 10298 |
| Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5508325 | 5480237 | 126 |
| Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10738528 | 10681656 | 64 |
| Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 60708 | 60486 | 11321 |
| Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5523784 | 5495736 | 126 |
| Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10829633 | 10772161 | 67 |

Reference:

| Benchmark | Time | CPU | Iterations |
| --- | ---: | ---: | ---: |
| Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time | 53088 | 52911 | 13364 |
| Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time | 6268981 | 6230335 | 110 |
| Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time | 11701237 | 11632339 | 59 |

Windows 11 + Intel Core i9-12900K, 8 cores:

| Benchmark | Time | CPU | Iterations |
| --- | ---: | ---: | ---: |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time | 215976 | 211295 | 2884 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time | 60960590 | 60937500 | 10 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time | 1.18E+08 | 1.19E+08 | 5 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time | 470377 | 453059 | 1414 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time | 1.54E+08 | 1.53E+08 | 5 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time | 3.18E+08 | 3.13E+08 | 2 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time | 569072 | 559398 | 1229 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time | 1.54E+08 | 1.52E+08 | 4 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time | 3.22E+08 | 3.28E+08 | 2 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time | 1486055 | 1473325 | 403 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time | 4.14E+08 | 4.14E+08 | 2 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time | 8.88E+08 | 8.59E+08 | 1 |
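
As a sanity check on the numbers above, effective throughput can be estimated from any row as 2\*M\*N\*K operations divided by the wall time, assuming the Time column is in nanoseconds (Google Benchmark's default unit):

```cpp
// Back-of-envelope throughput from a benchmark row: a GEMM performs
// about 2*M*N*K multiply-accumulate operations, so ops per nanosecond
// equals GOPS (billions of operations per second).
double EffectiveGops(double m, double n, double k, double time_ns) {
  return 2.0 * m * n * k / time_ns;
}
```

For example, the Q4G32SymInt8 M:2048/N:4096/K:4096 Xeon row (11814014) works out to roughly 5.8 TOPS on the int8 compute path.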

@yufenglee (Member):

/azp run Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@yufenglee (Member):

/azp run Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).


Azure Pipelines successfully started running 8 pipeline(s).

@yufenglee (Member):

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).

@yihonglyu (Contributor):

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, Linux QNN CI Pipeline


Azure Pipelines successfully started running 9 pipeline(s).

@yufenglee merged commit 5f00bc9 into microsoft:main on Dec 19, 2023 (65 of 66 checks passed).
@yufenglee (Member):

Thanks Luoyu!

edgchen1 added a commit that referenced this pull request Jan 5, 2024
…19015)

Allow MatMulNBits `accuracy_level` attribute (added in #17669) to be set to a particular value when the model is quantized.
snnn pushed a commit that referenced this pull request Feb 7, 2024
Revert PR #19016
Revert PR #17669
skottmckay pushed a commit that referenced this pull request Feb 15, 2024
Revert PR #19016
Revert PR #17669
jslap-ubi pushed a commit to jslap-ubi/onnxruntime that referenced this pull request Apr 5, 2024
…icrosoft#19015)

Allow MatMulNBits `accuracy_level` attribute (added in microsoft#17669) to be set to a particular value when the model is quantized.