[WIP] [AMD] Emit AMD specific intrinsics for dot #4594

Closed · wants to merge 3 commits

Conversation


@binarman binarman commented Aug 28, 2024

This PR:

  • Makes the AccelerateAMDMatmul pass emit FMA for the i8xi8->i32 and fp16xfp16->fp32 cases
  • Extends AMD FMA dot code generation with the new v_dot instructions for fp16xfp16 and int8 dtypes

This PR is part of a PR series. The final goal is to improve the efficiency of small dot operations and to bypass as many shared memory accesses as possible.
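As a rough illustrative model (not the actual backend code), the semantics of the AMD v_dot-style instructions this PR targets can be sketched in plain Python: a packed fp16 pair dot product accumulated into fp32, and a four-way signed i8 dot product accumulated into i32. The function names `v_dot2_f32_f16` and `v_dot4_i32_i8` mirror the ISA mnemonics; the fp16 rounding helper is an assumption about how to emulate half precision on the host.

```python
import struct

def f16(x):
    # Round a Python float to fp16 precision by packing/unpacking
    # with struct's half-float ('e') format.
    return struct.unpack('<e', struct.pack('<e', x))[0]

def v_dot2_f32_f16(a, b, c):
    # Illustrative model of v_dot2_f32_f16: multiply two fp16 pairs and
    # accumulate the products plus c in fp32 (host floats stand in for fp32).
    return f16(a[0]) * f16(b[0]) + f16(a[1]) * f16(b[1]) + c

def v_dot4_i32_i8(a, b, c):
    # Illustrative model of v_dot4_i32_i8: four signed i8 products
    # accumulated into an i32 along with the accumulator c.
    assert all(-128 <= x <= 127 for x in tuple(a) + tuple(b))
    return sum(x * y for x, y in zip(a, b)) + c
```

For example, `v_dot4_i32_i8((1, 2, 3, 4), (5, 6, 7, 8), 10)` accumulates 5 + 12 + 21 + 32 + 10 = 80 in a single modeled instruction, which is why emitting these intrinsics instead of scalar FMAs reduces the instruction count for small dots.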

Rough list of PRs:

This PR:
- Refactors the FMA dot implementation
- Supports dot3d in the FMA path
- Fixes several issues in operand offset computation
- Enables small dot operands
Commit messages:
…ompilation time and reduce the number of instructions in assembly

fix bug with wrong order field used for shared mem load size computation
@binarman (Contributor, Author) commented:

This PR depends on #4516

@binarman (Contributor, Author) commented:

Closing this PR for now.
Will reopen it if the base PR #4516 is merged.

@binarman closed this Nov 18, 2024