Kernel hang related to noreturn function attributes #113
Comments
I'm on 13.2.1 but I can reproduce the hanging behaviour during tests locally, except that on my particular machine, instead of hanging during gpuarrays/random, it consistently hangs during gpuarrays/broadcasting. I compared Activity Monitor behaviour between … Another thing I've noticed is that when I stop the test with Ctrl+C, I get Distributed.jl warnings telling me that the process was not removed, and indeed, looking at Activity Monitor, I have 5 Julia processes that shouldn't be there. |
That's just the test suite runner capturing your interrupt and exiting. One way to run tests in isolation is, from the Metal.jl repository, do something like |
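A sketch of that kind of invocation (the exact snippet wasn't captured above; this assumes the test runner selects suites by the names passed via Pkg's test_args, and the test name is hypothetical):
using Pkg
Pkg.activate(".")   # from the Metal.jl repository root
# Hypothetical test selection: run only the GPUArrays broadcasting suite.
Pkg.test(; test_args=["gpuarrays/broadcasting"])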
I'm having trouble getting the test to run outside of the test suite runner because it comes from the GPUArrays test suite. I also did a bisect of Julia, and it seems like JuliaLang/julia@a12c2f0 is the commit that caused the issue (or at least caused it to surface). |
To run code from the GPUArrays test suite, you can do something like:
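A minimal sketch of that kind of setup (assumptions: GPUArrays ships its test suite as test/testsuite.jl, and you start from the Metal.jl repository root):
using Pkg
Pkg.activate("test")   # Metal.jl's test environment
using Metal, GPUArrays
# Load the GPUArrays test suite definitions (the TestSuite module).
include(joinpath(dirname(dirname(pathof(GPUArrays))), "test", "testsuite.jl"))
# Then call the relevant TestSuite entry for the failing tests against MtlArray.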
That should set up the required environment. Note that you also might have to start Julia with … |
I found where the test hangs on my machine. The code I'm referring to is right above the line that looks like … [Julia snippets not captured] |
I was hoping the removal of libcmt would magically fix the 1.9 CI issues, but unfortunately the tests still hang. |
Sadly this looks codegen related; the atomic store is probably a good clue. I hope I'll have time to investigate next week, but it's problematic that I can't reproduce this. Which hardware do you have exactly? |
I have a 30-core M2 Max MacBook Pro, and all of my comments have been made using that computer. I also have access to a base-model M1 Mac mini in my lab, where I was able to reproduce the hang. |
I ran the tests several times and saved the logs. Of the three 1.8.5 runs I saved, the first one had a failure in … Then I ran the Metal 0.2.0 tests on 1.9.0-rc1, and for the first time since seeing this issue, a test pass completed on 1.9.0, although with some errors. Gist

Hopeful, I reran it on 1.9.0-rc1 twice, and unfortunately both times the tests hung, with some errors in earlier tests. The second time had a new error in the unified memory example (hint?). Gist 1, Gist 2

All of the gists are from M1 runs. On my M2 Max, all tests consistently pass on 1.8.5, and I never get any errors on 1.9.0-rc1 (other than broadcasting hanging). I don't know how useful these gists will be, but since you can't reproduce, I figure I might as well give you as much as I can.

The last time I had a bug this inconsistent, it ended up being that I wasn't initializing some values, but it wasn't caught in debug mode because all memory gets zero-initialized when running under the debugger. I'll run the tests a few more times in the background this weekend from the master branch, since there have been quite a few changes, to see if anything has changed. |
I haven't had the time to reproduce this yet (probably only next week), but since you have a system on which the tests consistently hang: can you post the MWE that makes it hang in a clean session, and could you try running with …? |
MWE (running from the Metal folder): [snippet not captured] |
That's not really a minimal example; can you reduce it to any of the tests that make the GPU hang? You mentioned …

EDIT: OK, I have access to a system on which this hangs as well. I'll try reducing next week. |
The … Faster-to-run code that does the same as the above: [snippet not captured] |
So it doesn't hang if you run that operation in isolation? Did you try with …? |
I forgot about
Pasting the above code into the REPL after starting Julia in the following ways did not hang: …

However, when Julia was started with …, it hung. I dumped the generated LLVM code for each version. Both 1.8.5 versions were identical; I'm pretty sure both 1.9.0-rc1 versions are identical as well (some function names differ), but I put both in in case I'm wrong. The difference between 1.8.5 and 1.9.0-rc1 is the … Setting …

@code_llvm for 1.8.5 (both)
@code_llvm for 1.9.0-rc1 with shader validation (no hang)
@code_llvm for 1.9.0-rc1 with no shader validation (hang)
|
Thanks! This gets us much closer to something debuggable. FYI, the debug layer needs both … |
Looks like that specific failure doesn't reproduce anymore after JuliaGPU/GPUArrays.jl#454, so let's try bumping GPUArrays to at least work around the immediate issue. |
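For anyone following along, picking up that fix locally just means updating GPUArrays in your environment (a sketch; which release actually contains the fix depends on when you do this):
using Pkg
Pkg.update("GPUArrays")   # pull in the release containing JuliaGPU/GPUArrays.jl#454
Pkg.status("GPUArrays")   # check which version actually resolved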
MWE:

using Metal

function kernel(dest, nelem)
    j = 0
    while j < nelem
        j += 1
        i = Metal.thread_position_in_grid_1d() + (j-1) * Metal.threads_per_grid_1d()
        i > length(dest) && return
        I = @inbounds CartesianIndices(dest)[i]
        @inbounds dest[I] = 42
    end
    return
end
arr = MtlArray{Int64}(undef)
Metal.@sync @metal kernel(arr, 1) |
I've spent some time debugging this, and I don't notice significant differences between the --check-bounds=yes IR on 1.8 and 1.9. Specifically, there were only two differences: the extra noreturn function attributes, and the SDK Version module flag.
The former doesn't seem to be the culprit, I think: after manually stripping those attributes I could still reproduce the hang. The latter may be related, but I wonder why our LLVM back-end messes up here. We have this code, https://github.com/JuliaGPU/llvm-metal/blob/llvm_release_14/llvm/lib/Target/Metal/Metal.cpp#L285-L324, and strangely, if I compile our back-end myself it sets the metadata correctly. I wonder if something's up with the Yggdrasil build. Instead of debugging this, I'm going to try to set this flag from Julia; see maleadt/LLVM.jl#329. I can't test this right now though, as the machine where I could reproduce this has died 🤦

EDIT: Setting the SDK version didn't help. |
Bumping GPUArrays seems to have fixed the hanging for me. If I understand correctly, this gets around the issue by not calling the problematic code, but the underlying problem still exists? |
Yeah there's still an issue. |
Reduced the hang to the following IR:

; ModuleID = 'kernel.ll'
source_filename = "text"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32"
target triple = "air64-apple-macosx13.2.1"
; Function Attrs: cold noreturn nounwind
declare void @llvm.trap() #0
; Function Attrs: noinline
define internal void @throw() #1 {
top:
tail call void @llvm.trap()
unreachable
}
define cc103 void @kernel({ i8 addrspace(1)* } addrspace(1)* %0, i64 addrspace(1)* %1) {
entry:
%2 = load i64, i64 addrspace(1)* %1, align 8
%.not2 = icmp sgt i64 %2, 0
br i1 %.not2, label %oob, label %exit
oob: ; preds = %entry
tail call void @throw()
unreachable
exit: ; preds = %entry
ret void
}
attributes #0 = { cold noreturn nounwind }
attributes #1 = { noinline }
!air.kernel = !{!0}
!air.version = !{!5}
!llvm.module.flags = !{!6}
!0 = !{void ({ i8 addrspace(1)* } addrspace(1)*, i64 addrspace(1)*)* @kernel, !1, !2}
!1 = !{}
!2 = !{!3, !4}
!3 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!4 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!5 = !{i32 2, i32 4, i32 0}
!6 = !{i32 2, !"SDK Version", [2 x i32] [i32 13, i32 2]} After compiling this IR with our Metal back-end: using Metal
function main(path)
    metallib = read(path)
    dev = current_device()
    lib = MTLLibraryFromData(dev, metallib)
    fun = MTLFunction(lib, "kernel")
    pipeline = MTLComputePipelineState(dev, fun)

    f = identity
    ft = typeof(f)
    tt = Tuple{ft, Tuple{MtlDeviceArray{Int64, 0, 1}, Int64}}
    kernel = Metal.HostKernel{ft, tt}(f, pipeline)

    arr = MtlArray{Int64}(undef)
    println("Waiting...")
    Metal.@sync kernel(arr, 1)
end
isinteractive() || main(ARGS...)

This hangs when the metallib was generated by our back-end based on LLVM 14, but not when using the LLVM 13 version. The difference:

; ModuleID = 'bc_module'
source_filename = "text"
@@ -38,7 +38,8 @@
; Function Attrs: cold noreturn nounwind
declare void @llvm.trap() #0
-define internal void @throw() {
+; Function Attrs: noinline
+define internal void @throw() #1 {
top:
tail call void @llvm.trap()
unreachable
@@ -59,6 +60,7 @@
}
attributes #0 = { cold noreturn nounwind }
+attributes #1 = { noinline }
!air.kernel = !{!0}
!air.version = !{!5}

i.e. on LLVM 13 we drop the noinline attribute. |
ObjC loader:

#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
int main(int argc, const char * argv[]) {
    @autoreleasepool {
        if (argc != 2) {
            NSLog(@"Usage: %s [Metal Library Filename]", argv[0]);
            return 1;
        }
        NSString *libraryFilePath = [NSString stringWithUTF8String:argv[1]];
        NSError *error = nil;
        id<MTLDevice> device = MTLCreateSystemDefaultDevice();
        if (!device) {
            NSLog(@"Metal is not supported on this device");
            return 1;
        }
        NSURL *libraryFileURL = [NSURL fileURLWithPath:libraryFilePath];
        id<MTLLibrary> library = [device newLibraryWithURL:libraryFileURL error:&error];
        if (!library) {
            NSLog(@"Failed to create Metal library: %@", error);
            return 1;
        }
        id<MTLFunction> kernelFunction = [library newFunctionWithName:@"kernel"];
        if (!kernelFunction) {
            NSLog(@"Failed to find the 'kernel' function");
            return 1;
        }
        id<MTLComputePipelineState> pipeline = [device newComputePipelineStateWithFunction:kernelFunction error:&error];
        if (!pipeline) {
            NSLog(@"Failed to create compute pipeline state: %@", error);
            return 1;
        }
        id<MTLCommandQueue> commandQueue = [device newCommandQueue];
        id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
        id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
        [computeEncoder setComputePipelineState:pipeline];
        NSUInteger bufferSize = sizeof(int64_t);
        id<MTLBuffer> buffer1 = [device newBufferWithLength:bufferSize options:MTLResourceStorageModeShared];
        id<MTLBuffer> buffer2 = [device newBufferWithBytes:&(int64_t){1} length:sizeof(int64_t) options:MTLResourceStorageModeShared];
        [computeEncoder setBuffer:buffer1 offset:0 atIndex:0];
        [computeEncoder setBuffer:buffer2 offset:0 atIndex:1];
        MTLSize gridSize = MTLSizeMake(1, 1, 1);
        MTLSize threadgroupSize = MTLSizeMake(1, 1, 1);
        [computeEncoder dispatchThreadgroups:gridSize threadsPerThreadgroup:threadgroupSize];
        [computeEncoder endEncoding];
        MTLCommandBufferHandler completionHandler = ^(id<MTLCommandBuffer> cb) {
            NSLog(@"Kernel execution completed");
        };
        [commandBuffer addCompletedHandler:completionHandler];
        [commandBuffer commit];
        NSLog(@"Waiting...");
        [commandBuffer waitUntilCompleted];
    }
    return 0;
} |
Hmm, I can actually reconstruct this IR using a Metal kernel:

#include <metal_stdlib>
using namespace metal;
struct Array {
    device int8_t *data;
};

__attribute__((noinline)) void perform_throw() {
    __builtin_trap();
}

kernel void kernel_fun(device Array *a, device int64_t *b [[ buffer(0) ]]) {
    if (*b > 0)
        perform_throw();
}

... but that one executes correctly. Trying to narrow down the differences, it looks like a metadata-related issue. |
So with the following base IR:
... it works with the following metadata:

!air.kernel = !{!14}
!14 = !{void ({ i8 addrspace(1)* } addrspace(1)*, i64 addrspace(1)*)* @kernel, !15, !16}
!15 = !{}
!16 = !{!17, !20}
!17 = !{i32 0, !"air.indirect_buffer", !"air.location_index", i32 1, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.struct_type_info", !18, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!18 = !{i32 0, i32 8, i32 0, !"char", !"data", !"air.indirect_argument", !19}
!19 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 1, !"air.arg_type_align_size", i32 1}
!20 = !{i32 1, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}

... but fails with what we emit:

!air.kernel = !{!1}
!1 = !{void ({ i8 addrspace(1)* } addrspace(1)*, i64 addrspace(1)*)* @kernel, !2, !3}
!2 = !{}
!3 = !{!4, !5}
!4 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!5 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}

So this is the original metadata issue again (where we emit a simple buffer, for bindless operation, while Metal apparently expects a fleshed-out metadata tree). |
Let's narrow this issue down to the kernel hang seen with … Disabling the workaround and running the MWE above on … |
Looks like this workaround isn't required anymore on 14.5 using an M3; however, disabling it seems to cause other test failures. At least there are no hangs, though. |
Actually, it seems like running the MWE causes the GPU to hang (100% usage as reported by …). |
Although Metal.jl works fine on Julia 1.9 locally, it for some reason fails on CI. Maybe this is related to the juliaecosystem machines running an outdated macOS (12.4, while I'm running 13, but 12.6 was also reported to work fine in #85 (comment)).
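If it helps to confirm what the CI machines are actually running, the macOS version can be queried directly from Julia (plain Julia; sw_vers is a standard macOS tool):
# Print the macOS product version of the current machine, e.g. "12.4".
println(readchomp(`sw_vers -productVersion`))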