Improve CUDA/HIP local argument handling #2298

EwanC · 2024-11-08T10:30:19Z

The current implementation of CUDA/HIP local memory argument handling does not update the offset of any following local memory arguments when a preceding local memory argument is set. Instead the expectation is that clearLocalSize() is called after a kernel command has been appended/enqueued which clears the vector of local memory used by each argument. Then if the local memory arguments must be reset with urKernelSetArgLocal.

This implementation causes problems for command-buffer kernel command update, where a user can pass a subset of local arguments to update, without a guarantee of passing all the local arguments in each update.

In this patch the local argument of CUDA/HIP is refactored so that when a local argument is set, any local arguments which follow are updated to account of changes in size & padding. This removes the need to have a clearLocalSize() method and also for a user to have to set urKernelSetArgLocal before each kernel enqueue/append.

DPC++ PR intel/llvm#16025

Tests UR PR oneapi-src/unified-runtime#2298 with additional SYCL-Graph local memory argument E2E tests. PR also sets the `pnext` and `snext` members of `ur_exp_command_buffer_update_kernel_launch_desc_t ` which were missing when calling into UR.

hdelan · 2024-11-18T16:09:41Z

source/adapters/cuda/kernel.hpp

    args_index_t Indices;
-    args_size_t OffsetPerIndex;
+    /// Aligned size in bytes for each local memory parameter after padding has


Could you maybe add a brief comment explaining how local mem is instrumented for DPC++? ie there's a single local mem buffer, and we pass the offsets into this as kernel args.

This relies on how device code is generated by DPC++ and is not necessarily obvious or intuitive. It also constitutes maybe a certain kind of ABI that should be documented somewhere so that these entry points can be used by non-DPC++ applications.

The spec for urKernelSetArgLocal says of the parameters:

argIndex – [in] argument index in range [0, num args - 1] argSize – [in] size of the local buffer to be allocated by the runtime

So the user only passes the size of the local memory it wants from the pointer at that argument. It's the UR adapter's implementations decision whether it wants to implement that as it's own local memory, or as CUDA is doing each local argument an offset into single local data allocation. That UR API hasn't changes so I don't think there is an ABI break, we've just removed the requirement for a user to make redundant urKernelSetArgLocal calls when targeting the CUDA/HIP backend. Although it's not wrong if a caller does do this.

I'll have a try at documenting this all up so it's clearer though, I probably have a lot of implicit knowledge in my head about what's going on that would be better written down.

Probably the main thing that I think isn't clear is that the CUDA/HIP urKernelSetArgLocal impls will end up passing offsets as kernel args to the device kernel. This means that kernels need to be written in such a way that the args contain offsets into the smem allocation, and that within device code, the user will combine these offsets with the extern ptr that represents the entire smem allocation. This is even true for the base offset into an smem allocation, which is always zero.

If local memory is to be used in the CUDA/HIP UR adapters, the compiler must generate a kernel like this that uses an offset (if urKernelSetArgLocal is to be used).

int main() { T *ptr; sycl::queue{}.submit([&](sycl::handler &cgh) { sycl::local_accessor<T> local_acc(1, cgh); cgh.single_task<foo>([=] { local_acc[0] += local_acc[0]; }); }); }

// // Generated by LLVM NVPTX Back-End // .version 8.5 .target sm_60 .address_size 64 // .weak _ZTS3foo // -- Begin function _ZTS3foo .extern .shared .align 4 .b8 _ZTS3foo_shared_mem[]; // @_ZTS3foo .weak .entry _ZTS3foo( .param .u32 _ZTS3foo_param_0 ) { .reg .b32 %r<3>; .reg .b64 %rd<4>; // %bb.0: // %entry ld.param.s32 %rd1, [_ZTS3foo_param_0]; mov.u64 %rd2, _ZTS3foo_shared_mem; add.s64 %rd3, %rd2, %rd1; ld.shared.u32 %r1, [%rd3]; shl.b32 %r2, %r1, 1; st.shared.u32 [%rd3], %r2; ret; // -- End function }

I think in general we don't have good enough testing for this sort of behaviour in DPC++. From a quick grep I can't find any tests where you have multiple local accessors in the same command group that have different alignments.

Also to clarify - I don't think this is ABI breaking. I just think that we need to define clearly what kind of kernels UR expects when it is handed local mem args through urKernelSetArgLocal. Maybe another alternative is just to say.

It is not recommended to use urKernelSetArgLocal if your device kernels were not generated by DPC++.

Such a limitation would still allow the user to use dynamic or static shared mem allocations without any assumptions baked in by DPC++.

Do you think you could write a snippet of a SYCL/OpenCL-C/SPIRV/PTX kernel that you think wouldn't work currently? I'm not understanding enough to come up with a concrete example, and would help specify what the restrictions are.

My current thinking on how to progress this is:

add E2E DPC++ tests

~~multiple local accessors in the same command group that have different alignments~~. Edit: I found https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/Regression/local-arg-align.cpp already exists which is marked as unsupported

Kernel written in OpenCL-C using multiple local arguments (core SYCL test without graphs and test with graphs)

add UR CTS tests outwith command-buffers for local memory args

urKernelSetArgLocal doesn't need to be called between multiple enqueues of a kernel.

Multiple urKernelSetArgLocal args with different alignments

Document in local mem arg impl and/or restrictions in CUDA/HIP reference doc

CUDA kernel that won't work:

__global__ void kernel(int *a) { extern __shared__ int blah1[]; a[0] = blah1[0]; }

As this produces the PTX:

// // Generated by NVIDIA NVVM Compiler // // Compiler Build ID: CL-34841621 // Cuda compilation tools, release 12.6, V12.6.77 // Based on NVVM 7.0.1 // .version 8.5 .target sm_52 .address_size 64 // .globl _Z6kernelPi .extern .shared .align 16 .b8 blah1[]; .visible .entry _Z6kernelPi( .param .u64 _Z6kernelPi_param_0 ) { .reg .b32 %r<2>; .reg .b64 %rd<3>; ld.param.u64 %rd1, [_Z6kernelPi_param_0]; cvta.to.global.u64 %rd2, %rd1; ld.shared.u32 %r1, [blah1]; st.global.u32 [%rd2], %r1; ret; }

This doesn't work because the UR adapters assume that every use of urKernelSetArgLocal corresponds to an offset param in the kernel. This kernel doesn't use an offset into the shared local memory since it is just using the unoffset base address of the local mem allocation.

There is no way to say "This kernel uses x bytes in dynamic local mem, but my kernel signature doesn't include an offset param with which I want to index into the local mem". By contrast this would be fine:

__global__ void kernel(int *a, unsigned offset) { extern __shared__ int blah1[]; // This should never be used unless offset with a kernel arg int *blah_ptr = &blah1[offset]; a[0] = blah_ptr[0]; }

Once you have a param you can call urKernelSetArgLocal with the index of the offset param (in this case 1). As a matter of fact you don't even need to use the local mem offset at all, once you can be assured that the kernel arg won't be elided by the compiler. So this should be fine as well:

__global__ void kernel(int *a, [[maybe_unused]] unsigned offset) { extern __shared__ int blah1[]; a[0] = blah1[0]; }

openCL kernel that won't work (at least when run by the CUDA adapter):

__kernel void blah( __local float* local_mem)

As the CUDA adapter assumes that the Index arg given to urKernelSetArgLocal will map to an offset param in the kernel signature. In this case the dynamic local mem doesn't have an unsigned Offset param, but instead will have a ptr param which corresponds to the local ptr. Loading it at runtime with an offset param will fail.

Also good find with the local arg alignment tests. Interesting that it's UNSUPPORTED for all targets. We should maybe pass this task to the OAC team?

Ahh, I get you know - I read "instrumented" as "implemented" in your original comment. Could you check if what i've put in scripts/core/CUDA.rst is the kinda thing you had in mind?

Also good find with the local arg alignment tests. Interesting that it's UNSUPPORTED for all targets. We should maybe pass this task to the OAC team?

Good segway to ping @oneapi-src/unified-runtime-cuda-write for a review on this PR

aarongreig

CTS LGTM

test/conformance/exp_command_buffer/update/local_memory_update.cpp

source/adapters/cuda/kernel.hpp

source/adapters/hip/kernel.hpp

Tests UR PR oneapi-src/unified-runtime#2298 with additional SYCL-Graph local memory argument E2E tests. PR also sets the `pnext` and `snext` members of `ur_exp_command_buffer_update_kernel_launch_desc_t ` which were missing when calling into UR.

hdelan · 2024-11-19T16:08:51Z

scripts/core/CUDA.rst

+and make available from the pointer argument.
+
+The CUDA adapter implements local memory arguments to a kernel as a single
+``__shared__`` memory allocation, with each local address space pointer argument


It's not actually a pointer arg, it ends up as a u32 offset arg, which is combined with the ptr within the kernel

I mean you actually mention this elsewhere. So maybe this is OK

Do you think we could add an overload for urKernelSetArgLocalExt that only takes a Size, not a (Size, Index)? Then the user could ask for dynamic shared mem without having to have extra offsets in the kernel args.

scripts/core/CUDA.rst

source/adapters/cuda/kernel.hpp

After setting kernel arguments during update, we need to reset the amount of local memory used.

Iterate on previous solution so that the local argument offsets at following inidices are updated when an earlier local argument is updated

Co-authored-by: Ben Tracy <[email protected]> Co-authored-by: aarongreig <[email protected]>

github-actions bot added cuda CUDA adapter specific issues command-buffer Command Buffer feature addition/changes/specification labels Nov 8, 2024

EwanC mentioned this pull request Nov 8, 2024

[SYCL][Graph] Fix CUDA/HIP local mem argument update bug intel/llvm#16025

Draft

EwanC force-pushed the ewan/cuda_update_local_size branch from 3d64f76 to 977a240 Compare November 8, 2024 14:59

github-actions bot added the conformance Conformance test suite issues. label Nov 8, 2024

EwanC force-pushed the ewan/cuda_update_local_size branch from 99775d4 to e9ecf06 Compare November 14, 2024 09:43

EwanC force-pushed the ewan/cuda_update_local_size branch from e9ecf06 to 6c67530 Compare November 14, 2024 14:17

github-actions bot added the hip HIP adapter specific issues label Nov 14, 2024

EwanC changed the title ~~WIP: Cuda Fix for command-buffer local argument upate~~ Improve CUDA/HIP local argument handling Nov 14, 2024

EwanC force-pushed the ewan/cuda_update_local_size branch from 6c67530 to a13c0d1 Compare November 14, 2024 15:01

EwanC marked this pull request as ready for review November 15, 2024 09:20

EwanC requested review from a team as code owners November 15, 2024 09:20

EwanC requested a review from frasercrmck November 15, 2024 09:20

EwanC force-pushed the ewan/cuda_update_local_size branch 2 times, most recently from f5a75b9 to bed340f Compare November 18, 2024 11:47

hdelan reviewed Nov 18, 2024

View reviewed changes

aarongreig approved these changes Nov 18, 2024

View reviewed changes

Bensuo reviewed Nov 18, 2024

View reviewed changes

source/adapters/cuda/kernel.hpp Outdated Show resolved Hide resolved

source/adapters/hip/kernel.hpp Outdated Show resolved Hide resolved

github-actions bot added loader Loader related feature/bug sanitizer Sanitizer layer issues/changes/specification labels Nov 19, 2024

EwanC force-pushed the ewan/cuda_update_local_size branch 2 times, most recently from 23c24a4 to 0500f4c Compare November 19, 2024 14:32

EwanC force-pushed the ewan/cuda_update_local_size branch from 0500f4c to 3e32dba Compare November 19, 2024 15:20

github-actions bot added the specification Changes or additions to the specification label Nov 19, 2024

hdelan reviewed Nov 19, 2024

View reviewed changes

scripts/core/CUDA.rst Outdated Show resolved Hide resolved

frasercrmck reviewed Nov 19, 2024

View reviewed changes

source/adapters/cuda/kernel.hpp Outdated Show resolved Hide resolved

EwanC and others added 6 commits November 20, 2024 09:54

[CUDA][HIP] Fix for command-buffer local argument update

966a6d4

After setting kernel arguments during update, we need to reset the amount of local memory used.

Improve solution

8276de5

Iterate on previous solution so that the local argument offsets at following inidices are updated when an earlier local argument is updated

Add extra more basic CTS test

5c3c228

FIx comment typos

c2b9add

Co-authored-by: Ben Tracy <[email protected]> Co-authored-by: aarongreig <[email protected]>

Add non command-buffer test

df2922d

Document hip extra arg behavior

582f358

EwanC force-pushed the ewan/cuda_update_local_size branch from 3e32dba to 582f358 Compare November 20, 2024 12:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve CUDA/HIP local argument handling #2298

Improve CUDA/HIP local argument handling #2298

EwanC commented Nov 8, 2024 •

edited

Loading

hdelan Nov 18, 2024 •

edited

Loading

EwanC Nov 18, 2024

hdelan Nov 18, 2024

hdelan Nov 18, 2024

hdelan Nov 18, 2024 •

edited

Loading

EwanC Nov 19, 2024 •

edited

Loading

hdelan Nov 19, 2024 •

edited

Loading

hdelan Nov 19, 2024 •

edited

Loading

hdelan Nov 19, 2024

EwanC Nov 19, 2024

aarongreig left a comment

hdelan Nov 19, 2024

hdelan Nov 19, 2024

hdelan Nov 19, 2024 •

edited

Loading

Improve CUDA/HIP local argument handling #2298

Are you sure you want to change the base?

Improve CUDA/HIP local argument handling #2298

Conversation

EwanC commented Nov 8, 2024 • edited Loading

hdelan Nov 18, 2024 • edited Loading

Choose a reason for hiding this comment

EwanC Nov 18, 2024

Choose a reason for hiding this comment

hdelan Nov 18, 2024

Choose a reason for hiding this comment

hdelan Nov 18, 2024

Choose a reason for hiding this comment

hdelan Nov 18, 2024 • edited Loading

Choose a reason for hiding this comment

EwanC Nov 19, 2024 • edited Loading

Choose a reason for hiding this comment

hdelan Nov 19, 2024 • edited Loading

Choose a reason for hiding this comment

hdelan Nov 19, 2024 • edited Loading

Choose a reason for hiding this comment

hdelan Nov 19, 2024

Choose a reason for hiding this comment

EwanC Nov 19, 2024

Choose a reason for hiding this comment

aarongreig left a comment

Choose a reason for hiding this comment

hdelan Nov 19, 2024

Choose a reason for hiding this comment

hdelan Nov 19, 2024

Choose a reason for hiding this comment

hdelan Nov 19, 2024 • edited Loading

Choose a reason for hiding this comment

EwanC commented Nov 8, 2024 •

edited

Loading

hdelan Nov 18, 2024 •

edited

Loading

hdelan Nov 18, 2024 •

edited

Loading

EwanC Nov 19, 2024 •

edited

Loading

hdelan Nov 19, 2024 •

edited

Loading

hdelan Nov 19, 2024 •

edited

Loading

hdelan Nov 19, 2024 •

edited

Loading