[WIP] [AMD] Emit AMD specific intrinsics for dot #4594

Closed · wants to merge 3 commits

Conversation


@binarman binarman commented Aug 28, 2024

This PR:

  • Makes the AccelerateAMDMatmul pass emit FMA for the i8xi8->i32 and fp16xfp16->fp32 cases
  • Extends AMD FMA dot code generation with the new v_dot instructions for fp16xfp16 and int8 dtypes

This PR is part of a PR series. The final goal is to improve the efficiency of small dot operations and to bypass as many shared memory accesses as possible.
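As a rough illustrative model (not the actual backend code), the semantics of the AMD v_dot-style instructions this PR targets can be sketched in plain Python: a packed fp16 pair dot product accumulated into fp32, and a four-way signed i8 dot product accumulated into i32. The function names `v_dot2_f32_f16` and `v_dot4_i32_i8` mirror the ISA mnemonics; the fp16 rounding helper is an assumption about how to emulate half precision on the host.

```python
import struct

def f16(x):
    # Round a Python float to fp16 precision by packing/unpacking
    # with struct's half-float ('e') format.
    return struct.unpack('<e', struct.pack('<e', x))[0]

def v_dot2_f32_f16(a, b, c):
    # Illustrative model of v_dot2_f32_f16: multiply two fp16 pairs and
    # accumulate the products plus c in fp32 (host floats stand in for fp32).
    return f16(a[0]) * f16(b[0]) + f16(a[1]) * f16(b[1]) + c

def v_dot4_i32_i8(a, b, c):
    # Illustrative model of v_dot4_i32_i8: four signed i8 products
    # accumulated into an i32 along with the accumulator c.
    assert all(-128 <= x <= 127 for x in tuple(a) + tuple(b))
    return sum(x * y for x, y in zip(a, b)) + c
```

For example, `v_dot4_i32_i8((1, 2, 3, 4), (5, 6, 7, 8), 10)` accumulates 5 + 12 + 21 + 32 + 10 = 80 in a single modeled instruction, which is why emitting these intrinsics instead of scalar FMAs reduces the instruction count for small dots.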

Rough list of PRs:

This PR:
- Refactors the FMA dot implementation
- Supports dot3d in the FMA path
- Fixes several issues in operand offset computation
- Enables small dot operands
Commit messages:
…ompilation time and reduce the number of instructions in assembly

fix bug with wrong order field used for shared mem load size computation
@binarman (Contributor, Author) commented:

This PR depends on #4516

@binarman (Contributor, Author) commented:

Closing this PR for now.
Will reopen it if the base PR #4516 is merged.

@binarman closed this Nov 18, 2024