
Integrate high-performance x64 gemm library to MLAS #17669

Merged 115 commits into microsoft:main on Dec 19, 2023

Conversation

@luoyu-intel (Contributor) commented on Sep 22, 2023:

Description

Improve MLAS to support high-performance x64 INT4 kernels

Motivation and Context

  1. Improve LLM inference performance on Intel CPUs.
  2. Support more 4-bit quantization types: nf4, fp4.
  3. Support dynamic block sizes: block sizes aligned with the kernel's tiling size (e.g. 4 for the VNNI kernel), or per-channel on the N dimension.
  4. Support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni, amx_bf16, amx_int8, avx512_fp16.
  5. Support MatMulNBits' data format.
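
To make the block-size points above concrete, here is a minimal sketch of symmetric block-wise INT4 quantization. All names here are illustrative, not the actual MLAS API; one fp32 scale is stored per block of `block_size` weights, and `block_size == K` corresponds to the per-channel case.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative sketch, not the actual MLAS API: symmetric block-wise
// INT4 quantization along K, one fp32 scale per block.
struct QuantizedBlocks {
  std::vector<int8_t> q;      // values in [-8, 7], one per weight
  std::vector<float> scales;  // one scale per block
};

QuantizedBlocks QuantizeSymInt4(const std::vector<float>& w, size_t block_size) {
  QuantizedBlocks out;
  out.q.resize(w.size());
  for (size_t start = 0; start < w.size(); start += block_size) {
    size_t end = std::min(start + block_size, w.size());
    float amax = 0.0f;
    for (size_t i = start; i < end; ++i) amax = std::max(amax, std::fabs(w[i]));
    float scale = amax / 7.0f;  // map [-amax, amax] onto [-7, 7]
    out.scales.push_back(scale);
    for (size_t i = start; i < end; ++i) {
      int v = scale > 0.0f ? static_cast<int>(std::lround(w[i] / scale)) : 0;
      out.q[i] = static_cast<int8_t>(std::clamp(v, -8, 7));
    }
  }
  return out;
}

float Dequantize(const QuantizedBlocks& qb, size_t block_size, size_t i) {
  return qb.q[i] * qb.scales[i / block_size];
}
```

Smaller blocks mean more scales (more accuracy, more metadata); the per-channel case keeps a single scale for the whole column.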

Tasks

  • Support block_size: 32, 128, -1 (per-channel)
  • Get the weight pack size without allocating memory
  • Use ORT's thread pool for parallelism
  • Support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8
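
The "pack size without memory allocation" task usually implies a two-call pattern: a pure size query first, then packing into a caller-provided buffer, so the caller (e.g. ORT's allocator) owns all memory. A hedged sketch, where `Q4PackedSize` is a hypothetical helper assuming two 4-bit values per byte plus one fp32 scale per block (K even):

```cpp
#include <cstddef>

// Hypothetical size query: computes the packed-buffer size without
// allocating, so the caller can allocate once and then pack in place.
size_t Q4PackedSize(size_t N, size_t K, size_t block_size) {
  size_t blocks_per_col = (K + block_size - 1) / block_size;
  size_t data_bytes = N * K / 2;                // packed int4 weights
  size_t scale_bytes = N * blocks_per_col * 4;  // one fp32 scale per block
  return data_bytes + scale_bytes;
}
```

A caller would query the size, allocate from its own pool, and then invoke the pack routine with that buffer; the actual MLAS layout may carry extra metadata (zero points, alignment padding) beyond this sketch.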

Benchmark

Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+, 56 cores:

| Benchmark | Time | CPU | Iterations |
| --- | ---: | ---: | ---: |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 47613 | 47401 | 12970 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 6347792 | 6317562 | 109 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 11814014 | 11757847 | 59 |
| Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 50222 | 50031 | 13759 |
| Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 2038222 | 2028743 | 341 |
| Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 3792832 | 3774485 | 191 |
| Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 58717 | 58501 | 11467 |
| Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 1360846 | 1354598 | 543 |
| Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 2564232 | 2551365 | 266 |
| Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 57929 | 57694 | 12047 |
| Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5495330 | 5465810 | 126 |
| Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10676240 | 10617817 | 66 |
| Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 68305 | 68047 | 10026 |
| Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5504862 | 5476215 | 126 |
| Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 11758623 | 11697337 | 66 |
| Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 67713 | 67451 | 10298 |
| Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5508325 | 5480237 | 126 |
| Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10738528 | 10681656 | 64 |
| Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 60708 | 60486 | 11321 |
| Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5523784 | 5495736 | 126 |
| Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10829633 | 10772161 | 67 |

Reference:

| Benchmark | Time | CPU | Iterations |
| --- | ---: | ---: | ---: |
| Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time | 53088 | 52911 | 13364 |
| Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time | 6268981 | 6230335 | 110 |
| Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time | 11701237 | 11632339 | 59 |

Windows 11 + Intel Core i9-12900K, 8 cores:

| Benchmark | Time | CPU | Iterations |
| --- | ---: | ---: | ---: |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time | 215976 | 211295 | 2884 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time | 60960590 | 60937500 | 10 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time | 1.18E+08 | 1.19E+08 | 5 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time | 470377 | 453059 | 1414 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time | 1.54E+08 | 1.53E+08 | 5 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time | 3.18E+08 | 3.13E+08 | 2 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time | 569072 | 559398 | 1229 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time | 1.54E+08 | 1.52E+08 | 4 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time | 3.22E+08 | 3.28E+08 | 2 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time | 1486055 | 1473325 | 403 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time | 4.14E+08 | 4.14E+08 | 2 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time | 8.88E+08 | 8.59E+08 | 1 |
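
As a sanity check on the numbers above, effective throughput can be estimated from any row as 2\*M\*N\*K operations divided by the wall time, assuming the Time column is in nanoseconds (Google Benchmark's default unit):

```cpp
// Back-of-envelope throughput from a benchmark row: a GEMM performs
// about 2*M*N*K multiply-accumulate operations, so ops per nanosecond
// equals GOPS (billions of operations per second).
double EffectiveGops(double m, double n, double k, double time_ns) {
  return 2.0 * m * n * k / time_ns;
}
```

For example, the Q4G32SymInt8 M:2048/N:4096/K:4096 Xeon row (11814014) works out to roughly 5.8 TOPS on the int8 compute path.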

@yufenglee (Member):

/azp run Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@yufenglee (Member):

/azp run Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).


Azure Pipelines successfully started running 8 pipeline(s).

@yufenglee (Member):

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).

@yihonglyu (Contributor):

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, Linux QNN CI Pipeline


Azure Pipelines successfully started running 9 pipeline(s).

@yufenglee merged commit 5f00bc9 into microsoft:main on Dec 19, 2023 (65 of 66 checks passed).
@yufenglee (Member):

Thanks Luoyu!

edgchen1 added a commit that referenced this pull request Jan 5, 2024
…19015)

Allow MatMulNBits `accuracy_level` attribute (added in #17669) to be set to a particular value when the model is quantized.
snnn pushed a commit that referenced this pull request Feb 7, 2024
Revert PR #19016
Revert PR #17669
skottmckay pushed a commit that referenced this pull request Feb 15, 2024
Revert PR #19016
Revert PR #17669
jslap-ubi pushed a commit to jslap-ubi/onnxruntime that referenced this pull request Apr 5, 2024
…icrosoft#19015)

Allow MatMulNBits `accuracy_level` attribute (added in microsoft#17669) to be set to a particular value when the model is quantized.