striding syntax fails tests in cuda builds #426

Open

Infinoid opened this issue Mar 17, 2021 · 13 comments

Labels: bug (Indicates an unexpected problem or unintended behavior)

@Infinoid
Contributor

The windowing stride syntax doesn't work correctly when TACO is configured with -DCUDA=ON.

These 3 test cases fail:

	890 - windowing/stride.windowing/(dense,compressed) (Failed)
	891 - windowing/stride.windowing/(compressed,dense) (Failed)
	892 - windowing/stride.windowing/(compressed,compressed) (Failed)

I dug into this a bit, and the stride syntax causes taco to emit C code like this:

    if (jB2_window % 5 != 0) {
      jB++;
      continue;
    }

But the CUDA codegen doesn't emit continue for a Continue op. Instead it returns, like this:

    if (jB2_window % 5 != 0) {
      jB = jB + 1;
      return;
    }

Returning is wrong in this case: it effectively drops the remainder of the thread's work on the floor, producing incorrect output. Changing that return to a continue fixes the problem. Adding a flag to force it to emit continue in just this one case allows the tests to pass.
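
To make the difference concrete, here is a minimal, hypothetical kernel in the same shape as the generated code (the kernel name and arguments are made up for illustration). Each thread owns one row and walks that row's nonzeros sequentially, so a continue skips just the current nonzero while a return abandons the rest of the row:

    // Minimal sketch, not TACO output: each thread handles one row and walks
    // that row's nonzeros sequentially, keeping only coordinates on the stride.
    __global__ void strideFilter(int nRows, const int *pos, const int *crd,
                                 const double *vals, double *out, int stride) {
      int row = blockIdx.x * blockDim.x + threadIdx.x;
      if (row >= nRows) return;  // legitimate return: this thread has no row
      for (int p = pos[row]; p < pos[row + 1]; p++) {
        if (crd[p] % stride != 0) {
          continue;   // correct: skip just this nonzero
          // return;  // wrong: silently drops the rest of this thread's row
        }
        out[row] += vals[p];
      }
    }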

I cooked up a hacky workaround that does that. I don't think it's quite right: it just forces the CUDA codegen to emit a continue, rather than giving the codegen enough information to decide for itself. But hopefully it illustrates the problem.

Cc: @rohany

@rohany
Contributor

rohany commented Mar 17, 2021

Ah, I never understood why the CUDA code generator decided to emit break for a Continue operation. Do you think you could post the full generated code here? In my mind, if the for loop being continue'd out of is parallelized over GPU threads, then we would want to return for a continue.

cc @fredrikbk: it seems like we should be running CI tests on nodes that have GPUs, rather than finding bugs like these after the fact. Is there a reason we don't (money)?

@Infinoid
Contributor Author

Infinoid commented Mar 17, 2021

Sure.

// Generated by the Tensor Algebra Compiler (tensor-compiler.org)

__global__
void computeDeviceKernel0(taco_tensor_t * __restrict__ A, taco_tensor_t * __restrict__ B, taco_tensor_t * __restrict__ C){
  double* __restrict__ A_vals = (double*)(A->vals);
  int* __restrict__ B2_pos = (int*)(B->indices[1][0]);
  int* __restrict__ B2_crd = (int*)(B->indices[1][1]);
  double* __restrict__ B_vals = (double*)(B->vals);
  int C2_dimension = (int)(C->dimensions[1]);
  double* __restrict__ C_vals = (double*)(C->vals);

  int32_t i125 = blockIdx.x;
  int32_t i126 = (threadIdx.x % (256));
  if (threadIdx.x >= 256) {
    return;
  }

  int32_t i = i125 * 256 + i126;
  int32_t iB = i * 5;
  int32_t iC = i * 5;
  if (i >= 2)
    return;

  int32_t j = 0;
  int32_t jB = taco_binarySearchAfter(B2_crd, (uint64_t) B2_pos[iB], (uint64_t) B2_pos[(iB + 1)], (uint64_t) 0);
  int32_t pB2_end = B2_pos[(iB + 1)];

  while (jB < pB2_end) {
    int32_t jB2_window = B2_crd[jB];
    if (jB2_window % 5 != 0) {
      jB = jB + 1;
      return; // ← THIS IS THE PROBLEM
    }
    int32_t jB0 = jB2_window / 5;
    if (jB0 >= 10)
      break;

    int32_t jC = iC * C2_dimension + j * 5;
    if (jB0 == j) {
      int32_t jA = i * 2 + j;
      A_vals[jA] = B_vals[jB] + C_vals[jC];
    }
    else {
      int32_t jA = i * 2 + j;
      A_vals[jA] = C_vals[jC];
    }
    jB = jB + (int32_t)(jB0 == j);
    j = j + 1;
  }
  while (j < 2 && j >= 0) {
    int32_t jC = iC * C2_dimension + j * 5;
    int32_t jA = i * 2 + j;
    A_vals[jA] = C_vals[jC];
    j = j + 1;
  }
}

int compute(taco_tensor_t *A, taco_tensor_t *B, taco_tensor_t *C) {
  double* __restrict__ A_vals = (double*)(A->vals);

  for (int32_t pA = 0; pA < 4; pA++) {
    A_vals[pA] = 0.0;
  }

  computeDeviceKernel0<<<1, 256>>>(A, B, C);
  cudaDeviceSynchronize();
  return 0;
}

I generated this by applying your nice #419 patch and running the following:

bin/taco "A(i,j)=B(i(0,10,5),j(0,10,5))+C(i(0,10,5),j(0,10,5))" -d=A:2,2 -f=B:ds -f=C:dd -cuda -print-nocolor

Search for THIS IS THE PROBLEM to find the problematic return statement.

@rohany
Contributor

rohany commented Mar 17, 2021

I see. In this case we definitely shouldn't be returning.

In my mind, if the for loop being continue'd out of is parallelized over GPU threads, then we would want to return for a continue.

I think that this makes sense for the code generator to do. I think the last thing to figure out is why / when break statements get generated by the CUDA code generator. If we know what the use is there, then we can decide when a Continue should actually be mapped to continue. @stephenchouca do you know what the deal with this is?
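
A rough sketch of that decision (a hypothetical lowerContinue helper and LoopKind enum, not TACO's actual IR or code generator) might look like this:

    // Hypothetical sketch of the lowering rule, not TACO's actual visitor.
    #include <string>

    enum class LoopKind { GpuThreadParallel, Serial };

    // Decide what the CUDA backend should emit for a Continue op, based on
    // the innermost loop that encloses it.
    std::string lowerContinue(LoopKind enclosingLoop) {
      if (enclosingLoop == LoopKind::GpuThreadParallel) {
        // The "next iteration" belongs to another thread, so this thread is done.
        return "return;";
      }
      // Otherwise the loop runs sequentially inside one thread (like the
      // while (jB < pB2_end) loop above), so a plain continue is correct.
      return "continue;";
    }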

@weiya711 added the bug label on Jun 17, 2021
@weiya711
Contributor

@rohany This is the CUDA regression issue I mentioned in the TACO meeting today. From #456 and many of the CUDA builds in the Actions tab, the following tests still fail:

The following tests FAILED:
	183 - */stride.windowing/* (Failed)
	184 - */indexSetVectors.windowing/* (Failed)
	185 - */indexSetMatrices.windowing/* (Failed)

Could you take a look?

@dilevin

dilevin commented Feb 8, 2023

Any progress on this fix? I pulled the most recent version of TACO yesterday and these tests are still failing. Is there a PR or branch with the fix implemented that I could try out?

@rohany
Contributor

rohany commented Feb 8, 2023

Hi David, thank you for your interest! Unfortunately, I don't think we have any students actively working on TACO at the moment, so I'm unsure when this will get fixed. I have other responsibilities and cannot take the time to fix this right now.

@dilevin

dilevin commented Feb 8, 2023

Hi Rohan,

No worries ... is TACO still the state of the art for tensor compilers, or do you have other projects that are meant as successors?

For context, I'm working on some elastodynamics simulation and would love to have a TACO-like backend for compiling to CPU and GPU

@rohany
Contributor

rohany commented Feb 8, 2023

In terms of sparse tensor computations, TACO is still the state of the art (by number of supported features). However, the MLIR sparse dialect (https://mlir.llvm.org/docs/Dialects/SparseTensorOps/) supports a subset of TACO's features in a Google-funded library. I've never used it, but I assume it is quite robust.

For context, I'm working on some elastodynamics simulation and would love to have a TACO-like backend for compiling to CPU and GPU

What sort of kernels are you trying to run?

@dilevin

dilevin commented Feb 8, 2023

The big bottlenecks are assembly-type operations over the FE mesh and iterative solvers like preconditioned CG.

@rohany
Contributor

rohany commented Feb 8, 2023

So, breaking this down a bit more: what kind of assembly are you doing? What are the input and output tensor types? TACO-like tensor compilers can do matrix assembly, but many cases do not parallelize well and would be difficult to target to GPUs directly with TACO.

Depending on how complicated the preconditioner is, my experience with CG solvers is that there aren't too many complicated tensor expressions present (SpMV, axpy, dot). Do you actually need tensor operations, or are matrix operations mostly the target? If the latter, I have some other libraries I can suggest.
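
To make that concrete, the per-iteration work of an unpreconditioned CG solve boils down to kernels like the following sketch (kernel names are illustrative; the dot product and preconditioner are omitted):

    // Illustrative sketch of the two workhorse kernels in a CG iteration:
    // CSR SpMV (y = A*x) and axpy (y = a*x + y). Not tied to any library.
    __global__ void csrSpMV(int nRows, const int *rowPtr, const int *colIdx,
                            const double *vals, const double *x, double *y) {
      int row = blockIdx.x * blockDim.x + threadIdx.x;
      if (row >= nRows) return;
      double sum = 0.0;
      for (int p = rowPtr[row]; p < rowPtr[row + 1]; p++)
        sum += vals[p] * x[colIdx[p]];
      y[row] = sum;
    }

    __global__ void axpy(int n, double a, const double *x, double *y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = a * x[i] + y[i];
    }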

@dilevin

dilevin commented Feb 8, 2023

Right, that's true; mostly dense and sparse matrix operations would already be extremely helpful.

@rohany
Contributor

rohany commented Feb 8, 2023

Have you looked into CuPy at all? It will help you target a single GPU. If you have a larger problem that can utilize multiple GPUs, take a look at https://github.com/nv-legate/legate.sparse (shameless self-plug).

@dilevin

dilevin commented Feb 8, 2023

Oh, I haven't seen this, thanks! I'll take a look :) (sorry for hijacking this issue's thread)
