striding syntax fails tests in cuda builds #426

Open

Infinoid opened this issue Mar 17, 2021 · 13 comments

Labels: bug (Indicates an unexpected problem or unintended behavior)

@Infinoid
Contributor

The windowing stride syntax doesn't work correctly when TACO is configured with -DCUDA=ON.

These 3 test cases fail:

	890 - windowing/stride.windowing/(dense,compressed) (Failed)
	891 - windowing/stride.windowing/(compressed,dense) (Failed)
	892 - windowing/stride.windowing/(compressed,compressed) (Failed)

I dug into this a bit, and the stride syntax causes taco to emit C code like this:

    if (jB2_window % 5 != 0) {
      jB++;
      continue;
    }

But the CUDA codegen doesn't emit continue for a Continue op. Instead it returns, like this:

    if (jB2_window % 5 != 0) {
      jB = jB + 1;
      return;
    }

Returning is wrong in this case: it effectively drops the remainder of the thread's work on the floor, producing incorrect output. Changing that return to a continue fixes the problem. Adding a flag to force it to emit continue in just this one case allows the tests to pass.
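
To make the difference concrete, here is a minimal, hypothetical kernel in the same shape as the generated code (the kernel name and arguments are made up for illustration). Each thread owns one row and walks that row's nonzeros sequentially, so a continue skips just the current nonzero while a return abandons the rest of the row:

    // Minimal sketch, not TACO output: each thread handles one row and walks
    // that row's nonzeros sequentially, keeping only coordinates on the stride.
    __global__ void strideFilter(int nRows, const int *pos, const int *crd,
                                 const double *vals, double *out, int stride) {
      int row = blockIdx.x * blockDim.x + threadIdx.x;
      if (row >= nRows) return;  // legitimate return: this thread has no row
      for (int p = pos[row]; p < pos[row + 1]; p++) {
        if (crd[p] % stride != 0) {
          continue;   // correct: skip just this nonzero
          // return;  // wrong: silently drops the rest of this thread's row
        }
        out[row] += vals[p];
      }
    }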

I cooked up a hacky workaround that does that. I don't think it's quite right: it just forces the CUDA codegen to emit a continue, rather than giving the codegen enough information to decide for itself. But hopefully it illustrates the problem.

Cc: @rohany

@rohany
Contributor

rohany commented Mar 17, 2021

Ah, I never understood why the CUDA code generator decided to emit break for a Continue operation. Do you think you could post the full generated code here? In my mind, if the for loop being continue'd out of is parallelized over GPU threads, then we would want to return for a continue.

cc @fredrikbk: it seems like we should be running CI tests on nodes that have GPUs, rather than finding bugs like these after the fact. Is there a reason we don't (money)?

@Infinoid
Contributor Author

Infinoid commented Mar 17, 2021

Sure.

// Generated by the Tensor Algebra Compiler (tensor-compiler.org)

__global__
void computeDeviceKernel0(taco_tensor_t * __restrict__ A, taco_tensor_t * __restrict__ B, taco_tensor_t * __restrict__ C){
  double* __restrict__ A_vals = (double*)(A->vals);
  int* __restrict__ B2_pos = (int*)(B->indices[1][0]);
  int* __restrict__ B2_crd = (int*)(B->indices[1][1]);
  double* __restrict__ B_vals = (double*)(B->vals);
  int C2_dimension = (int)(C->dimensions[1]);
  double* __restrict__ C_vals = (double*)(C->vals);

  int32_t i125 = blockIdx.x;
  int32_t i126 = (threadIdx.x % (256));
  if (threadIdx.x >= 256) {
    return;
  }

  int32_t i = i125 * 256 + i126;
  int32_t iB = i * 5;
  int32_t iC = i * 5;
  if (i >= 2)
    return;

  int32_t j = 0;
  int32_t jB = taco_binarySearchAfter(B2_crd, (uint64_t) B2_pos[iB], (uint64_t) B2_pos[(iB + 1)], (uint64_t) 0);
  int32_t pB2_end = B2_pos[(iB + 1)];

  while (jB < pB2_end) {
    int32_t jB2_window = B2_crd[jB];
    if (jB2_window % 5 != 0) {
      jB = jB + 1;
      return; // ← THIS IS THE PROBLEM
    }
    int32_t jB0 = jB2_window / 5;
    if (jB0 >= 10)
      break;

    int32_t jC = iC * C2_dimension + j * 5;
    if (jB0 == j) {
      int32_t jA = i * 2 + j;
      A_vals[jA] = B_vals[jB] + C_vals[jC];
    }
    else {
      int32_t jA = i * 2 + j;
      A_vals[jA] = C_vals[jC];
    }
    jB = jB + (int32_t)(jB0 == j);
    j = j + 1;
  }
  while (j < 2 && j >= 0) {
    int32_t jC = iC * C2_dimension + j * 5;
    int32_t jA = i * 2 + j;
    A_vals[jA] = C_vals[jC];
    j = j + 1;
  }
}

int compute(taco_tensor_t *A, taco_tensor_t *B, taco_tensor_t *C) {
  double* __restrict__ A_vals = (double*)(A->vals);

  for (int32_t pA = 0; pA < 4; pA++) {
    A_vals[pA] = 0.0;
  }

  computeDeviceKernel0<<<1, 256>>>(A, B, C);
  cudaDeviceSynchronize();
  return 0;
}

I generated this by applying your nice #419 patch and running the following:

bin/taco "A(i,j)=B(i(0,10,5),j(0,10,5))+C(i(0,10,5),j(0,10,5))" -d=A:2,2 -f=B:ds -f=C:dd -cuda -print-nocolor

Search for THIS IS THE PROBLEM to find the problematic return statement.

@rohany
Contributor

rohany commented Mar 17, 2021

I see. In this case we definitely shouldn't be returning.

In my mind, if the for loop being continue'd out of is parallelized over GPU threads, then we would want to return for a continue.

I think that this makes sense for the code generator to do. I think the last thing to figure out is why / when break statements get generated by the CUDA code generator. If we know what the use is there, then we can decide when a Continue should actually be mapped to continue. @stephenchouca do you know what the deal with this is?
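
A rough sketch of that decision (a hypothetical lowerContinue helper and LoopKind enum, not TACO's actual IR or code generator) might look like this:

    // Hypothetical sketch of the lowering rule, not TACO's actual visitor.
    #include <string>

    enum class LoopKind { GpuThreadParallel, Serial };

    // Decide what the CUDA backend should emit for a Continue op, based on
    // the innermost loop that encloses it.
    std::string lowerContinue(LoopKind enclosingLoop) {
      if (enclosingLoop == LoopKind::GpuThreadParallel) {
        // The "next iteration" belongs to another thread, so this thread is done.
        return "return;";
      }
      // Otherwise the loop runs sequentially inside one thread (like the
      // while (jB < pB2_end) loop above), so a plain continue is correct.
      return "continue;";
    }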

@weiya711 added the bug label on Jun 17, 2021
@weiya711
Contributor

@rohany This is the CUDA regression issue I mentioned in the TACO meeting today. From #456 and many of the CUDA builds in the Actions tab, the following tests still fail:

The following tests FAILED:
	183 - */stride.windowing/* (Failed)
	184 - */indexSetVectors.windowing/* (Failed)
	185 - */indexSetMatrices.windowing/* (Failed)

Could you take a look?

@dilevin

dilevin commented Feb 8, 2023

Any progress on this fix? I pulled the most recent version of TACO yesterday and these tests are still failing. Is there a PR or branch with the fix implemented that I could try out?

@rohany
Contributor

rohany commented Feb 8, 2023

Hi David, thank you for your interest! Unfortunately, I don't think we have any students actively working on TACO at the moment, so I'm unsure when this will get fixed. I have other responsibilities and cannot take the time to fix this right now.

@dilevin

dilevin commented Feb 8, 2023

Hi Rohan,

No worries ... is TACO still the state of the art for tensor compilers, or do you have other projects that are meant as successors?

For context, I'm working on some elastodynamics simulation and would love to have a TACO-like backend for compiling to CPU and GPU

@rohany
Contributor

rohany commented Feb 8, 2023

In terms of sparse tensor computations, TACO is still the state of the art (by number of supported features). However, the MLIR sparse dialect (https://mlir.llvm.org/docs/Dialects/SparseTensorOps/) supports a subset of TACO's features in a Google-funded library. I've never used it, but I assume it is quite robust.

For context, I'm working on some elastodynamics simulation and would love to have a TACO-like backend for compiling to CPU and GPU

What sort of kernels are you trying to run?

@dilevin

dilevin commented Feb 8, 2023

The big bottlenecks are assembly-type operations over the FE mesh and iterative solvers like preconditioned CG.

@rohany
Contributor

rohany commented Feb 8, 2023

So, breaking this down a bit more: what kind of assembly are you doing? What are the input and output tensor types? TACO-like tensor compilers can do matrix assembly, but many cases do not parallelize well and would be difficult to target to GPUs directly with TACO.

Depending on how complicated the preconditioner is, my experience with CG solvers is that there aren't too many complicated tensor expressions present (SpMV, axpy, dot). Do you actually need tensor operations, or are matrix operations mostly the target? If the latter, I have some other libraries I can suggest.
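
To make that concrete, the per-iteration work of an unpreconditioned CG solve boils down to kernels like the following sketch (kernel names are illustrative; the dot product and preconditioner are omitted):

    // Illustrative sketch of the two workhorse kernels in a CG iteration:
    // CSR SpMV (y = A*x) and axpy (y = a*x + y). Not tied to any library.
    __global__ void csrSpMV(int nRows, const int *rowPtr, const int *colIdx,
                            const double *vals, const double *x, double *y) {
      int row = blockIdx.x * blockDim.x + threadIdx.x;
      if (row >= nRows) return;
      double sum = 0.0;
      for (int p = rowPtr[row]; p < rowPtr[row + 1]; p++)
        sum += vals[p] * x[colIdx[p]];
      y[row] = sum;
    }

    __global__ void axpy(int n, double a, const double *x, double *y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = a * x[i] + y[i];
    }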

@dilevin

dilevin commented Feb 8, 2023

Right, that's true; mostly dense and sparse matrix operations would already be extremely helpful.

@rohany
Contributor

rohany commented Feb 8, 2023

Have you looked into CuPy at all? It will help you target a single GPU. If you have a larger problem that can utilize multiple GPUs, take a look at https://github.com/nv-legate/legate.sparse (shameless self-plug).

@dilevin

dilevin commented Feb 8, 2023

Oh, I haven't seen this, thanks! I'll take a look :) (sorry for hijacking this issue's thread)
