striding syntax fails tests in cuda builds #426
Ah, I never understood why the CUDA code generator decided to emit a return there. cc @fredrikbk: it seems like we should be running CI tests on nodes that have GPUs, rather than finding bugs like these after the fact. Is there a reason we don't (money)?
Sure.

// Generated by the Tensor Algebra Compiler (tensor-compiler.org)
__global__
void computeDeviceKernel0(taco_tensor_t * __restrict__ A, taco_tensor_t * __restrict__ B, taco_tensor_t * __restrict__ C){
  double* __restrict__ A_vals = (double*)(A->vals);
  int* __restrict__ B2_pos = (int*)(B->indices[1][0]);
  int* __restrict__ B2_crd = (int*)(B->indices[1][1]);
  double* __restrict__ B_vals = (double*)(B->vals);
  int C2_dimension = (int)(C->dimensions[1]);
  double* __restrict__ C_vals = (double*)(C->vals);

  int32_t i125 = blockIdx.x;
  int32_t i126 = (threadIdx.x % (256));
  if (threadIdx.x >= 256) {
    return; // fine: this thread is outside the launch shape
  }
  int32_t i = i125 * 256 + i126;
  int32_t iB = i * 5;
  int32_t iC = i * 5;
  if (i >= 2)
    return; // also fine: this thread has no rows to process

  int32_t j = 0;
  // find the first stored coordinate in B's row at or after the window start
  int32_t jB = taco_binarySearchAfter(B2_crd, (uint64_t) B2_pos[iB], (uint64_t) B2_pos[(iB + 1)], (uint64_t) 0);
  int32_t pB2_end = B2_pos[(iB + 1)];
  while (jB < pB2_end) {
    int32_t jB2_window = B2_crd[jB];
    if (jB2_window % 5 != 0) {
      // coordinate is not on the stride; skip it and look at the next one
      jB = jB + 1;
      return; // ← THIS IS THE PROBLEM: should be a continue, not a return
    }
    int32_t jB0 = jB2_window / 5;
    if (jB0 >= 10)
      break;
    int32_t jC = iC * C2_dimension + j * 5;
    if (jB0 == j) {
      int32_t jA = i * 2 + j;
      A_vals[jA] = B_vals[jB] + C_vals[jC];
    }
    else {
      int32_t jA = i * 2 + j;
      A_vals[jA] = C_vals[jC];
    }
    jB = jB + (int32_t)(jB0 == j);
    j = j + 1;
  }
  // finish the remaining iterations where B has no stored entry
  while (j < 2 && j >= 0) {
    int32_t jC = iC * C2_dimension + j * 5;
    int32_t jA = i * 2 + j;
    A_vals[jA] = C_vals[jC];
    j = j + 1;
  }
}

int compute(taco_tensor_t *A, taco_tensor_t *B, taco_tensor_t *C) {
  double* __restrict__ A_vals = (double*)(A->vals);
  for (int32_t pA = 0; pA < 4; pA++) {
    A_vals[pA] = 0.0;
  }
  computeDeviceKernel0<<<1, 256>>>(A, B, C);
  cudaDeviceSynchronize();
  return 0;
}
I generated this by applying your nice #419 patch and running the following:

bin/taco "A(i,j)=B(i(0,10,5),j(0,10,5))+C(i(0,10,5),j(0,10,5))" -d=A:2,2 -f=B:ds -f=C:dd -cuda -print-nocolor

Search for the line marked THIS IS THE PROBLEM in the kernel above.
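For reference, the fix discussed below amounts to changing that one statement; a sketch of the corrected fragment of computeDeviceKernel0 (illustrative, not actual taco output):

while (jB < pB2_end) {
  int32_t jB2_window = B2_crd[jB];
  if (jB2_window % 5 != 0) {
    jB = jB + 1;
    continue; // was return: keep scanning this thread's window
  }
  // ... rest of the loop body unchanged ...
}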
I see. In this case we definitely shouldn't be returning.
I think it makes sense for the code generator to do this. The last thing to figure out is why / when the codegen decides to emit a return instead of a continue.
@rohany This is the CUDA regression issue I mentioned in the TACO meeting today. From #456 and many of the CUDA builds in the Actions tab, the following tests still fail:

Could you take a look?
Any progress on this fix? I pulled the most recent version of TACO yesterday and these tests are still failing. Is there a PR or branch with the fix implemented that I could try out?
Hi David, thank you for your interest! Unfortunately, I don't think we have any students actively working on TACO at the moment, so I'm unsure when this will get fixed. I have other responsibilities and cannot take the time to fix this right now.
Hi Rohan, no worries ... is TACO still the state of the art for tensor compilers, or do you have other projects that are meant as successors? For context, I'm working on some elastodynamics simulations and would love to have a TACO-like backend for compiling to CPU and GPU.
In terms of sparse tensor computations, TACO is still the state of the art (in number of supported features). However, the MLIR sparse dialect (https://mlir.llvm.org/docs/Dialects/SparseTensorOps/) supports a subset of TACO's features in a Google-funded library. I've never used it, but I assume it is quite robust.
What sort of kernels are you trying to run?
The big bottlenecks are assembly-type operations over the FE mesh and iterative solvers like preconditioned CG.
So breaking this down a bit more, what kind of assembly are you doing? What are the input and output types of the tensors? TACO-like tensor compilers can do matrix assembly, but many cases do not parallelize well and would be difficult to target GPUs with directly using TACO. As for the solver, my experience with CG is that, depending on how complicated the preconditioner is, there aren't too many complicated tensor expressions present (SpMV, axpy, dot). Do you actually need tensor operations, or are matrix operations mostly the target? If so, I have some other libraries I can suggest.
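As a concrete sketch, the per-iteration tensor expressions in plain CG can all be written in the same command-line notation used earlier in this thread; the tensor names, the ds format for A, and the order-0 scalar syntax alpha() are illustrative here, not tested invocations:

bin/taco "y(i)=A(i,j)*p(j)" -f=A:ds    # SpMV: the one sparse kernel per iteration
bin/taco "w(i)=x(i)+alpha()*p(i)"      # axpy: the dense vector update
bin/taco "s()=r(i)*r(i)"               # dot product: the residual norm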
Right, that's true. Mostly dense and sparse matrix operations would already be extremely helpful.
Have you looked into CuPy at all? It will help you target a single GPU. If you have a larger problem that can utilize multiple GPUs, take a look at https://github.com/nv-legate/legate.sparse (shameless self-plug).
Oh, I haven't seen this, thanks! I'll take a look :) (sorry for hijacking this issue's thread)
The windowing stride syntax doesn't work quite right when configured with -DCUDA=ON. These 3 test cases fail:

I dug into this a bit, and the stride syntax causes taco to emit C code like this:

But the CUDA codegen doesn't emit a continue for a Continue op. Instead it returns, like this:

Returning is wrong in this case: it effectively drops the remainder of the thread's work on the floor, producing incorrect output. Changing that return to a continue fixes the problem. Adding a flag to force it to emit continue in just this one case allows the tests to pass.

I cooked up a hacky workaround which does that. I don't think it's quite right: it just forces the CUDA codegen to emit a continue, rather than giving the CUDA codegen enough info to decide for itself. But hopefully it illustrates the problem.
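To illustrate the decision the codegen actually needs to make, here is a hypothetical sketch (the names and structure are illustrative, not taco's real internals): a Continue that targets the loop flattened onto the thread grid can be lowered to return, but a Continue inside any loop nested in the kernel body must be lowered to continue.

#include <stdio.h>

/* Hypothetical sketch: decide what a Continue op lowers to in CUDA code.
 * loop_depth counts loops nested inside the kernel body; depth 0 means the
 * loop being continued is the one that was mapped onto the thread grid. */
void emit_continue(FILE *out, int in_device_kernel, int loop_depth) {
  if (in_device_kernel && loop_depth == 0) {
    /* The rest of this iteration is the rest of this thread's work,
     * so exiting the thread is the right lowering. */
    fprintf(out, "return;\n");
  } else {
    /* Inside an ordinary loop in the kernel body, like the while loop in
     * computeDeviceKernel0 above: returning drops the remaining iterations
     * on the floor, so a real continue must be emitted. */
    fprintf(out, "continue;\n");
  }
}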
Cc: @rohany