Add INT4 quant/de-quant kernels #620

Add Perf Kernels

This is a combination of 2 commits.

* Add Perf Kernels
* Add Perf Kernels

This is a combination of 6 commits.

* add perf-kernels
* fix formatting issues
* fix unused variables and other bugs
* fix other issues
* remove scripts
* save

check changes format save save try pre-commit check save

Change all block pointers to tensor pointers

Block pointers are for NVIDIA TMAs. They are useful for regular loads as well, but are not well supported. Also cleaned up some code I came across along the way and updated the comment at the top.

Add support for layouts commonly used by users.

Add an option for the varlen / thd layout to specify equal context lengths for all batches. This is also often used by users.
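As a rough illustration of the varlen / thd idea mentioned above, the sketch below assumes the common convention in which sequences are packed along a single token dimension and a cumulative-lengths array marks where each batch entry starts and ends. The function names (`cu_seqlens_from_lengths`, `equal_cu_seqlens`, `sequence_slice`) are hypothetical and are not taken from this PR; the actual option name and argument layout in the kernel are not shown in the commit message.

```python
from itertools import accumulate


def cu_seqlens_from_lengths(lengths):
    """Return cumulative sequence offsets, starting at 0 (hypothetical helper)."""
    return [0] + list(accumulate(lengths))


def equal_cu_seqlens(batch, seqlen):
    """Special case: every batch entry has the same context length."""
    return cu_seqlens_from_lengths([seqlen] * batch)


def sequence_slice(cu_seqlens, b):
    """Return the (start, end) token range of batch entry b in the packed layout."""
    return cu_seqlens[b], cu_seqlens[b + 1]


if __name__ == "__main__":
    offsets = equal_cu_seqlens(batch=3, seqlen=4)
    print(offsets)
    print(sequence_slice(offsets, 1))
```

With equal context lengths the offsets are simply multiples of the sequence length, which is presumably why a dedicated option is convenient: the caller does not need to build the offsets array by hand.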

* remove on push for Integration Tests
* rename
* add post merge test
* save
* dtype params
* skip bad config
* fix more stuff

Increase CI timeout

Couple of FA optimizations

Set SM scale multiplication to a constexpr. Minor asm improvement.

Changed acc scaling to adjust for softmax division to multiplication with reciprocal. ~10% perf improvement.

---------

Co-authored-by: Michael Melesse <[email protected]>
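The acc-scaling change described above can be sketched in plain Python. This is not the Triton kernel from the PR; it is a minimal single-row model, with the hypothetical name `attend_row`, showing that dividing every accumulator element by the softmax denominator gives the same result (up to rounding) as computing the reciprocal once and multiplying.

```python
import math


def attend_row(scores, values, use_reciprocal=True):
    """Softmax-weighted sum of value vectors for one query row (illustrative only)."""
    m = max(scores)                          # subtract the max for numerical stability
    p = [math.exp(s - m) for s in scores]    # unnormalized softmax weights
    l = sum(p)                               # softmax denominator
    d = len(values[0])
    # Unnormalized accumulator: sum_j p[j] * values[j]
    acc = [sum(p[j] * values[j][k] for j in range(len(p))) for k in range(d)]
    if use_reciprocal:
        inv_l = 1.0 / l                      # one division per row
        return [a * inv_l for a in acc]      # then one multiply per element
    return [a / l for a in acc]              # one division per element


if __name__ == "__main__":
    scores = [0.1, 2.0, -1.5]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(attend_row(scores, values, use_reciprocal=True))
    print(attend_row(scores, values, use_reciprocal=False))
```

Multiplication is generally cheaper than division on GPUs, so hoisting the reciprocal out of the per-element work is a plausible source of the speedup the commit reports.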

Commits on Jul 29, 2024

Add INT4 quant/de-quant kernels

Rahul Batra committed Jul 29, 2024
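The commit message does not describe the quantization scheme, so the following is only a generic sketch of what INT4 quant/de-quant usually involves, assuming symmetric scaling with one scale per group and two signed 4-bit values packed into each byte. All function names here are hypothetical, and the actual kernels in this PR may use a different scheme (for example zero points or a different packing order).

```python
def quantize_int4(values):
    """Map floats to signed 4-bit integers in [-8, 7] with one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale


def pack_int4(q):
    """Pack pairs of 4-bit values into bytes: even index in the low nibble."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count
    return bytes(((hi & 0xF) << 4) | (lo & 0xF) for lo, hi in zip(q[0::2], q[1::2]))


def unpack_int4(packed, n):
    """Recover the first n signed 4-bit values from packed bytes."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)  # sign-extend
    return out[:n]


def dequantize_int4(q, scale):
    """Map 4-bit integers back to approximate float values."""
    return [v * scale for v in q]


if __name__ == "__main__":
    data = [0.5, -1.0, 0.25, 0.75, -0.1]
    q, scale = quantize_int4(data)
    packed = pack_int4(q)
    restored = dequantize_int4(unpack_int4(packed, len(data)), scale)
    print(len(packed), restored)
```

Packing two values per byte halves the storage relative to INT8, at the cost of an unpack step before the values can be used in arithmetic.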

Commit 71c5845


Commits on Jul 16, 2024

Commits on Jul 18, 2024

Commits on Jul 19, 2024
