[MLAS AArch64] SQNBitGemm CompInt8 kernel #18953

Merged (36 commits) Jan 13, 2024

edgchen1 (Contributor) commented Dec 29, 2023

Description

Implements an ARM NEON SQNBitGemm kernel that first block-quantizes A to int8 and then performs the multiplication in int8.
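The two-step scheme described above can be sketched in portable scalar C++. This is a hypothetical illustration, not the MLAS implementation: the names `QuantizedRow`, `QuantizeRowA` and `DotInt8Q4`, the choice of scale = max|a| / 127, the implicit zero point of 8 for symmetric 4-bit B values, and storing one 4-bit code per byte (a real kernel would pack two codes per byte and use NEON instructions) are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for one block-quantized row of A (not an MLAS type).
struct QuantizedRow {
    std::vector<int8_t> q;      // int8 codes, one per element of the row
    std::vector<float> scales;  // one float scale per block of blk_len elements
};

// Step 1: symmetric block quantization of a float row of A to int8.
// Each block gets scale = max|a| / 127, so a[i] ~= scale * q[i].
QuantizedRow QuantizeRowA(const std::vector<float>& a, size_t blk_len) {
    assert(blk_len > 0);
    QuantizedRow r;
    r.q.resize(a.size());
    for (size_t start = 0; start < a.size(); start += blk_len) {
        const size_t len = std::min(blk_len, a.size() - start);
        float amax = 0.0f;
        for (size_t i = 0; i < len; ++i) {
            amax = std::max(amax, std::fabs(a[start + i]));
        }
        const float scale = amax / 127.0f;
        const float inv_scale = (scale != 0.0f) ? 1.0f / scale : 0.0f;
        for (size_t i = 0; i < len; ++i) {
            const long v = std::lround(a[start + i] * inv_scale);
            r.q[start + i] = static_cast<int8_t>(std::clamp(v, -127L, 127L));
        }
        r.scales.push_back(scale);
    }
    return r;
}

// Step 2: dot product of a quantized row of A with one column of B whose
// 4-bit codes (0..15, one per byte here) use the same blk_len blocks as A.
// The symmetric case is assumed to use an implicit zero point of 8.
// Products are accumulated in int32; float scaling happens once per block.
float DotInt8Q4(const QuantizedRow& a, const std::vector<uint8_t>& b_q4,
                const std::vector<float>& b_scales, size_t blk_len) {
    assert(b_q4.size() == a.q.size());
    assert(b_scales.size() == a.scales.size());
    float sum = 0.0f;
    for (size_t blk = 0; blk < a.scales.size(); ++blk) {
        const size_t start = blk * blk_len;
        const size_t end = std::min(start + blk_len, a.q.size());
        int32_t acc = 0;
        for (size_t i = start; i < end; ++i) {
            acc += int32_t{a.q[i]} * (int32_t{b_q4[i]} - 8);
        }
        sum += a.scales[blk] * b_scales[blk] * static_cast<float>(acc);
    }
    return sum;
}
```

In this sketch a row of A is quantized once and the resulting `QuantizedRow` can be reused for every column of B, so the quantization cost is amortized over N.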

Benchmarks

Note: ComputeType:1 is the original implementation, which uses float multiplication; ComputeType:4 is the new implementation, which uses int8 multiplication.
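Conceptually, and under assumptions this PR does not spell out (symmetric blockwise quantization over index blocks $\mathcal{B}_j$ of BlkLen elements; a float scale $s^B_j$ and 4-bit codes $q^B_k$ for B with an implicit zero point of 8; an int8 scale $s^A_j$ and codes $q^A_k$ for A), the two compute types evaluate one output element $c$ from a row $a$ of A and a column of B roughly as:

$$
\text{ComputeType:1:}\quad c = \sum_{j} \sum_{k \in \mathcal{B}_j} a_k \, s^B_j \, (q^B_k - 8)
$$

$$
\text{ComputeType:4:}\quad c \approx \sum_{j} s^A_j \, s^B_j \sum_{k \in \mathcal{B}_j} q^A_k \, (q^B_k - 8), \qquad q^A_k = \operatorname{round}\!\left(a_k / s^A_j\right)
$$

In the int8 form the inner sum is an integer multiply-accumulate and only one floating-point multiply per block remains; the cost is the rounding error from quantizing A, which is why the second line is an approximation.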

Benchmark results are from commit 2337375.

All runs are `SQNBITGEMM<4>` benchmarks with M:1, Threads:8, Symmetric:1, measured with real_time. Times are medians in ns.

| BlkLen | N | K | ComputeType | median_real (ns) | median_cpu (ns) |
|---|---|---|---|---|---|
| 16 | 4096 | 11008 | 1 | 1700949 | 1700130 |
| 16 | 4096 | 11008 | 4 | 1154813 | 1150777 |
| 256 | 4096 | 11008 | 1 | 1530306 | 1535207 |
| 256 | 4096 | 11008 | 4 | 706529 | 697831 |
| 16 | 11008 | 4096 | 1 | 1747413 | 1744727 |
| 16 | 11008 | 4096 | 4 | 1141534 | 1143927 |
| 256 | 11008 | 4096 | 1 | 1581004 | 1562500 |
| 256 | 11008 | 4096 | 4 | 741127 | 735486 |
| 16 | 4096 | 4096 | 1 | 618847 | 617318 |
| 16 | 4096 | 4096 | 4 | 277728 | 276930 |
| 256 | 4096 | 4096 | 1 | 570383 | 558036 |
| 256 | 4096 | 4096 | 4 | 204531 | 200321 |
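The speedup of the int8 path can be read off the benchmark results as the ratio of ComputeType:1 to ComputeType:4 median real time. The small helper below (hypothetical names `Case`, `kCases`, `Speedup`; the numbers are copied from the benchmark results) does that arithmetic. By this calculation the int8 kernel is about 1.47x, 1.53x and 2.23x faster at BlkLen:16, and about 2.17x, 2.13x and 2.79x faster at BlkLen:256, for the three shapes in order.

```cpp
#include <cassert>
#include <cstdio>

// One benchmark shape with its median real times (ns) for
// ComputeType:1 (float multiplication) and ComputeType:4 (int8 multiplication).
struct Case {
    const char* label;
    double fp32_ns;
    double int8_ns;
};

// Values copied from the benchmark results (M:1, Threads:8, Symmetric:1).
constexpr Case kCases[] = {
    {"BlkLen:16  N:4096  K:11008", 1700949.0, 1154813.0},
    {"BlkLen:256 N:4096  K:11008", 1530306.0, 706529.0},
    {"BlkLen:16  N:11008 K:4096 ", 1747413.0, 1141534.0},
    {"BlkLen:256 N:11008 K:4096 ", 1581004.0, 741127.0},
    {"BlkLen:16  N:4096  K:4096 ", 618847.0, 277728.0},
    {"BlkLen:256 N:4096  K:4096 ", 570383.0, 204531.0},
};

// Speedup > 1 means the int8 path is faster.
double Speedup(const Case& c) {
    assert(c.int8_ns > 0.0);
    return c.fp32_ns / c.int8_ns;
}

void PrintSpeedups() {
    for (const Case& c : kCases) {
        std::printf("%s  %.2fx\n", c.label, Speedup(c));
    }
}
```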

Motivation and Context

Improve ARM NEON SQNBitGemm kernel performance.

edgchen1 changed the title [MLAS AArch64] SQNBitGemm CompInt8 kernel [WIP][MLAS AArch64] SQNBitGemm CompInt8 kernel Dec 29, 2023
edgchen1 changed the title [WIP][MLAS AArch64] SQNBitGemm CompInt8 kernel [MLAS AArch64] SQNBitGemm CompInt8 kernel Jan 2, 2024
edgchen1 marked this pull request as ready for review January 2, 2024 23:20
edgchen1 requested a review from a team as a code owner January 2, 2024 23:20
yufenglee previously approved these changes Jan 12, 2024
edgchen1 merged commit 150c4cb into main Jan 13, 2024
92 of 94 checks passed
edgchen1 deleted the edgchen1/sqnbitgemm_quantize_a branch January 13, 2024 01:58
mszhanyi pushed a commit that referenced this pull request Jan 15, 2024