[Backend] Implement scaled_dot(mxfp4, fp8)
#4904
Conversation
@@ -39,7 +39,7 @@ tt.func @matmul_loop(%lb : index, %ub : index, %step : index, %A : !tt.ptr<f16>,
   // CHECK: offset = 0, size = 4608
   %a = triton_gpu.convert_layout %a_ : tensor<128x32xf16, #AL> -> tensor<128x32xf16, #A_DOT>
   %b_ = tt.load %b_ptr, %b_mask, %b_other : tensor<32x128x!tt.ptr<f16>, #BL>
-  // CHECK-NEXT: offset = 0, size = 4224
+  // CHECK-NEXT: offset = 0, size = 4352
N.B. These changes come from the change in lib/Analysis/Allocation.cpp.
It's OK; this path was never tested anyway. It will be tested in my next PR.
// This should be getElemOrder, but we don't have such a method
// TODO Implement getElemOrder and make sure it's consistent with
// getContigPerThread
auto inOrd = gpu::getThreadOrder(srcLayout);
I think we assume getElemOrder == getOrder
`getThreadOrder` is the same as `getOrder` except for AMD's `AMDMfmaEncodingAttr`. I haven't investigated it deeply; pinging @zhanglx13 for expertise.
Note that I changed the definition of `getThreadOrder` in this PR.
To be specific, I was referring to:
SmallVector<unsigned> AMDMfmaEncodingAttr::getThreadOrder() const {
  auto order = ::getOrder(*this);
  if (getIsTransposed())
    std::swap(order[0], order[1]);
  return order;
}
I'm not sure if we should use `getOrder` or `getThreadOrder` for this encoding.
auto ha = getValuesFromDotOperandLayoutStruct(
    typeConverter, loc, rewriter, loadedA, repBatch, repM, repK, aTensorTy);

// FIXME [Dot LL]
// max(repN / 2, 1) is wrong for repN = 1!
Can you elaborate on `// max(repN / 2, 1) is wrong for repN = 1!`? Why is repN = 1 wrong?
We are taking this `max(repN / 2, 1)` here, and then in the loop inside `getValuesFromDotOperandLayoutStruct` we are packing 4 elements at a time. Rather than that, the correct implementation packs 2 elements inside the function for opIdx=1 and iterates `repN` times.
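To make the counting concrete, here is a minimal standalone sketch of the two iteration schemes; `packElements` is a hypothetical stand-in for the packing that `getValuesFromDotOperandLayoutStruct` actually performs, not the real code.

#include <algorithm>
#include <cstdio>

// Hypothetical stand-in for the packing step; it just reports what would
// be packed at each repetition.
static void packElements(int rep, int elemsPerPack) {
  std::printf("rep %d: packing %d elements\n", rep, elemsPerPack);
}

int main() {
  int repN = 1;
  // Current scheme: max(repN / 2, 1) iterations, 4 elements each.
  // For repN == 1 this still packs 4 elements, which is too many.
  for (int n = 0; n < std::max(repN / 2, 1); ++n)
    packElements(n, 4);
  // Scheme described above: 2 elements per iteration, repN iterations,
  // so repN == 1 packs exactly one pair.
  for (int n = 0; n < repN; ++n)
    packElements(n, 2);
}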
Got it
This is a tentative PR to check how much breaks if we fix this.
Looks good overall, although I didn't look in detail at the LL TODOs. Just added a few minor comments.
lib/Dialect/TritonGPU/IR/Dialect.cpp
// FIXME: mma should just return getOrderForDotOperand(0, order.size(),
// kMajor=false)
I'm also confused by this comment.
Here I just meant that the logic in mma is probably wrong and we just want this function to return what I wrote there. The point is that, in terms of order, the mma layout is the same as `DotOperandEncoding(opIdx=0)`.
I had another go at the comment. Third time's a charm.
order = getOrderForDotOperand(dotOpLayout.getOpIdx(), order.size(),
                              /*kMajor*/ false);
Why is kMajor always false here?
This is getting the warp order, not the element order. So m is the fastest-changing dimension in opIdx=0. I think the confusion may arise from the variable name `kMajor`.
I don't have a suggestion for improvement though. Maybe just add some additional comments.
Yep, similarly to wgmma, we want the warps to have the exterior dimension (i.e., not K) as their fastest-running dimension.
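For what it's worth, here is a minimal sketch of how I read the kMajor semantics for rank 2; the function name and shape conventions are illustrative, not the actual Triton implementation (which also handles batch dimensions).

#include <array>

// Sketch of the order computation as described in this thread.
std::array<unsigned, 2> orderForDotOperand(unsigned opIdx, bool kMajor) {
  // For opIdx=0 (A: M x K) the K dimension is dim 1;
  // for opIdx=1 (B: K x N) it is dim 0.
  unsigned kDim = (opIdx == 0) ? 1u : 0u;
  unsigned nonK = 1u - kDim;
  // order[0] is the fastest-varying dimension. With kMajor=false the
  // exterior (non-K) dimension runs fastest, which is the warp order we want.
  return kMajor ? std::array<unsigned, 2>{kDim, nonK}
                : std::array<unsigned, 2>{nonK, kDim};
}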
        vType.getShape(), vType.getElementType(), newVEncoding);
    return rewriter.create<ConvertLayoutOp>(v.getLoc(), newVType, v);
  } else {
    auto newVEncoding = DotOperandEncodingAttr::get(
nit: assert that this is an fp8 type?
Done, although it's a bit redundant, as we are already asserting this at the beginning of the function and in `semantics.py`.
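For reference, a minimal sketch of what such an assertion could look like; the exact fp8 predicates to accept here (isFloat8E4M3FN / isFloat8E5M2) are an assumption on my part, not necessarily what the PR checks.

// Sketch only: verify the element type is one of MLIR's fp8 types.
// Which fp8 variants this path should accept is an assumption.
auto elemTy = vType.getElementType();
assert((elemTy.isFloat8E4M3FN() || elemTy.isFloat8E5M2()) &&
       "expected an fp8 element type");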
LGTM
This PR includes #4891 and #4895. I will rebase once those have landed. It includes a number of hacks to work around bugs in `DotOperandEncodingAttr`. All these are marked as `FIXME [Dot LL]` to be easy to grep for. @Jokeren is working on a comprehensive revamp of `DotOperandEncodingAttr` which will get rid of all these. #4895 is the first step in this direction.