[QST] Why do we only need the result of the last k-loop in `cute::gemm` dispatch-5? #1629

sjfeng1999 · 2024-07-12T10:46:11Z

The original code is as follows:

// Dispatch [5]: (V,M,K) x (V,N,K) => (V,M,N)
template <class MMA,
          class TD, class DLayout,
          class TA, class ALayout,
          class TB, class BLayout,
          class TC, class CLayout,
          __CUTE_REQUIRES(DLayout::rank == 3 && is_rmem<TD>::value &&
                          ALayout::rank == 3 && is_rmem<TA>::value &&
                          BLayout::rank == 3 && is_rmem<TB>::value &&
                          CLayout::rank == 3 && is_rmem<TC>::value)>
CUTE_HOST_DEVICE
void
gemm(MMA_Atom<MMA>       const& mma,
     Tensor<TD, DLayout>      & D,  // (V,M,N) Logical data
     Tensor<TA, ALayout> const& A,  // (V,M,K) Logical data
     Tensor<TB, BLayout> const& B,  // (V,N,K) Logical data
     Tensor<TC, CLayout> const& C)  // (V,M,N) Logical data
{
  CUTE_STATIC_ASSERT_V(size<1>(A) == size<1>(C));  // AM == CM
  CUTE_STATIC_ASSERT_V(size<1>(B) == size<2>(C));  // BN == CN
  CUTE_STATIC_ASSERT_V(size<2>(A) == size<2>(B));  // AK == BK
  CUTE_STATIC_ASSERT_V(size<0>(C) == size<0>(D) && size<1>(C) == size<1>(D) && size<2>(C) == size<2>(D));
  auto K = size<2>(A);

  CUTE_UNROLL
  for (int k = 0; k < K; ++k) {
    gemm(mma, D, A(_,_,k), B(_,_,k), C);
  }
}

In the for-loop over k, every iteration computes D = A_k x B_k + C from the same C, so the result of the last iteration overwrites the results of the previous ones.
For example, the following code

  auto tA = make_tensor<int>(make_layout(make_shape(_1{}, _1{}, _2{}))); // V=1, M=1, K=2
  auto tB = make_tensor<int>(make_layout(make_shape(_1{}, _1{}, _2{}))); // V=1, N=1, K=2
  auto tC = make_tensor<int>(make_layout(make_shape(_1{}, _1{}, _1{}))); // V=1, M=1, N=1
  auto tD = make_tensor<int>(make_layout(make_shape(_1{}, _1{}, _1{}))); // V=1, M=1, N=1

  fill(tA, 1); // A = [1, 1]
  fill(tB, 1); // B = [1, 1]
  fill(tC, 10); // C = [10]

  gemm(tD, tA, tB, tC);
  print_tensor(tD); // should be 1 x 1 + 1 x 1 + 10 = 12

will get

ptr[32b](0x7fff402f0840) o (_1,_1,_1):(_0,_0,_0):
    11

instead of 12.

The result is only correct when C and D refer to the same registers, so that each iteration accumulates onto the previous one. Is this a restriction on calling this function (cute::gemm dispatch-5)?
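The behaviour described above can be reproduced without CuTe. The following is a minimal scalar sketch (V = M = N = 1) in plain C++; `fma_step`, `k_loop_distinct`, `k_loop_aliased`, `kOnesA` and `kOnesB` are hypothetical names introduced only for this illustration and are not part of the library.

```cpp
// Scalar model of the dispatch-5 k-loop. Plain C++14, no CuTe required.
// All names here are hypothetical and exist only for this sketch.

// One k-step of the atom: returns a * b + c.
constexpr int fma_step(int a, int b, int c) { return a * b + c; }

// Distinct C and D: every step reads the original c and overwrites d,
// so only the last k-step survives.
constexpr int k_loop_distinct(const int* a, const int* b, int c, int K) {
  int d = 0;
  for (int k = 0; k < K; ++k) {
    d = fma_step(a[k], b[k], c);
  }
  return d;
}

// Aliased C and D: the same storage is read and written each step,
// so the k-steps accumulate.
constexpr int k_loop_aliased(const int* a, const int* b, int c, int K) {
  int acc = c;
  for (int k = 0; k < K; ++k) {
    acc = fma_step(a[k], b[k], acc);
  }
  return acc;
}

// Same inputs as the example above: A = [1, 1], B = [1, 1], C = [10].
constexpr int kOnesA[2] = {1, 1};
constexpr int kOnesB[2] = {1, 1};

static_assert(k_loop_distinct(kOnesA, kOnesB, 10, 2) == 11,
              "distinct C and D keep only the last k-step");
static_assert(k_loop_aliased(kOnesA, kOnesB, 10, 2) == 12,
              "aliased C and D accumulate over k");
```

The two `static_assert`s mirror the observed 11 and the expected 12; the only difference between the two loops is whether the addend is re-read from the accumulator.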

sjfeng1999 · 2024-07-12T10:48:10Z

#1618 @thakkarV

thakkarV · 2024-07-15T15:57:16Z

@ccecka can you please take a look at this one? We discussed it offline, but this does seem legit.

ccecka · 2024-07-15T17:24:15Z

I agree with this MR and believe it is no-cost in terms of perf. Let's open it back up and approve.

sjfeng1999 · 2024-07-15T17:38:21Z

I believe no if-condition remains in the final assembly once the loop is fully unrolled.
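The change in #1618 is not shown in this thread, so the following is only a guess at the kind of fix being discussed: read C on the first k-step and D on every later one. It is a scalar sketch in plain C++; `mac`, `k_loop_first_reads_c`, `kFixA` and `kFixB` are hypothetical names. Because the trip count `K` is a compile-time constant, `k == 0` is a known constant in each copy of a fully unrolled loop, which is why a compiler would be expected to fold the branch away.

```cpp
// Hypothetical sketch of a "first step reads C, later steps read D" loop.
// This is NOT the code from #1618; it only illustrates the idea. C++14.

// One k-step of the atom: returns a * b + c.
constexpr int mac(int a, int b, int c) { return a * b + c; }

// K is a template parameter to mirror the static trip count of the
// CuTe loop, which is what allows full unrolling.
template <int K>
constexpr int k_loop_first_reads_c(const int (&a)[K], const int (&b)[K], int c) {
  int d = 0;
  for (int k = 0; k < K; ++k) {
    // k == 0 seeds the sum from c; later steps accumulate onto d.
    d = mac(a[k], b[k], (k == 0) ? c : d);
  }
  return d;
}

// Same inputs as the original report: A = [1, 1], B = [1, 1], C = [10].
constexpr int kFixA[2] = {1, 1};
constexpr int kFixB[2] = {1, 1};

static_assert(k_loop_first_reads_c(kFixA, kFixB, 10) == 12,
              "C and D are distinct, yet all k-steps are accumulated");
```

With this shape, C and D no longer need to alias for the result to be 12.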

sjfeng1999 · 2024-07-18T17:24:24Z

Would you mind merging PR #1618 to fix this problem? It seems I do not have permission to reopen it.

github-actions · 2024-08-17T18:05:39Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

sjfeng1999 added the ? - Needs Triage and question labels Jul 12, 2024

mnicely removed the ? - Needs Triage label Jul 15, 2024

github-actions bot added the inactive-30d label Aug 17, 2024
