Improve CUDA/HIP local argument handling #2298
Tests UR PR oneapi-src/unified-runtime#2298 with additional SYCL-Graph local memory argument E2E tests. PR also sets the `stype` and `pNext` members of `ur_exp_command_buffer_update_kernel_launch_desc_t`, which were missing when calling into UR.
args_index_t Indices;
args_size_t OffsetPerIndex;
/// Aligned size in bytes for each local memory parameter after padding has
Could you maybe add a brief comment explaining how local mem is instrumented for DPC++? i.e. there's a single local mem buffer, and we pass the offsets into this as kernel args.
This relies on how device code is generated by DPC++ and is not necessarily obvious or intuitive. It also arguably constitutes a kind of ABI that should be documented somewhere so that these entry points can be used by non-DPC++ applications.
The spec for `urKernelSetArgLocal` says of the parameters:

    argIndex – [in] argument index in range [0, num args - 1]
    argSize – [in] size of the local buffer to be allocated by the runtime

So the user only passes the size of the local memory it wants from the pointer at that argument. It's the UR adapter implementation's decision whether it wants to implement that as its own local memory or, as the CUDA adapter does, as an offset for each local argument into a single local data allocation. The UR API hasn't changed, so I don't think there is an ABI break; we've just removed the requirement for a user to make redundant `urKernelSetArgLocal` calls when targeting the CUDA/HIP backend, although it's not wrong if a caller does do this.
I'll have a go at documenting this all so it's clearer though, I probably have a lot of implicit knowledge in my head about what's going on that would be better written down.
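For reference, a minimal host-side sketch of what this looks like for a caller (assuming the current spec's signature where the trailing `pProperties` parameter may be null, and a valid `hKernel` obtained from earlier setup; error handling omitted):

// Ask the runtime for 256 bytes of local memory behind kernel arg 1 and
// 512 bytes behind kernel arg 2 (arg 0 being an ordinary pointer argument
// set elsewhere). The caller only ever passes sizes - whether the adapter
// backs each local arg with its own allocation, or with an offset into one
// shared allocation as the CUDA/HIP adapters do, is an implementation detail.
urKernelSetArgLocal(hKernel, 1, 256, nullptr);
urKernelSetArgLocal(hKernel, 2, 512, nullptr);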
Probably the main thing that I think isn't clear is that the CUDA/HIP `urKernelSetArgLocal` impls will end up passing offsets as kernel args to the device kernel. This means that kernels need to be written in such a way that the args contain offsets into the smem allocation, and that within device code, the user will combine these offsets with the extern ptr that represents the entire smem allocation. This is even true for the base offset into an smem allocation, which is always zero.
If local memory is to be used in the CUDA/HIP UR adapters, the compiler must generate a kernel like this, which uses an offset (if `urKernelSetArgLocal` is to be used):
#include <sycl/sycl.hpp>

using T = int;
class foo;

int main() {
  sycl::queue{}.submit([&](sycl::handler &cgh) {
    sycl::local_accessor<T> local_acc(1, cgh);
    cgh.single_task<foo>([=] { local_acc[0] += local_acc[0]; });
  });
}
//
// Generated by LLVM NVPTX Back-End
//
.version 8.5
.target sm_60
.address_size 64
// .weak _ZTS3foo // -- Begin function _ZTS3foo
.extern .shared .align 4 .b8 _ZTS3foo_shared_mem[];
// @_ZTS3foo
.weak .entry _ZTS3foo(
.param .u32 _ZTS3foo_param_0
)
{
.reg .b32 %r<3>;
.reg .b64 %rd<4>;
// %bb.0: // %entry
ld.param.s32 %rd1, [_ZTS3foo_param_0];
mov.u64 %rd2, _ZTS3foo_shared_mem;
add.s64 %rd3, %rd2, %rd1;
ld.shared.u32 %r1, [%rd3];
shl.b32 %r2, %r1, 1;
st.shared.u32 [%rd3], %r2;
ret;
// -- End function
}
I think in general we don't have good enough testing for this sort of behaviour in DPC++. From a quick grep I can't find any tests where you have multiple local accessors in the same command group that have different alignments.
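For concreteness, a hypothetical sketch of the kind of case meant here (not an existing test; the kernel name and sizes are made up):

#include <sycl/sycl.hpp>

class mixed_align_kernel;

int main() {
  sycl::queue{}.submit([&](sycl::handler &cgh) {
    // Two local accessors whose element types have different alignment
    // requirements, so the second local allocation needs padding after
    // the first one.
    sycl::local_accessor<char> local_chars(7, cgh);
    sycl::local_accessor<double> local_doubles(4, cgh);
    cgh.parallel_for<mixed_align_kernel>(
        sycl::nd_range<1>{sycl::range<1>{32}, sycl::range<1>{32}},
        [=](sycl::nd_item<1> it) {
          size_t lid = it.get_local_id(0);
          local_chars[lid % 7] = static_cast<char>(lid);
          local_doubles[lid % 4] = static_cast<double>(lid);
        });
  });
}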
Also to clarify - I don't think this is ABI breaking. I just think that we need to define clearly what kind of kernels UR expects when it is handed local mem args through `urKernelSetArgLocal`. Maybe another alternative is just to say:

    It is not recommended to use `urKernelSetArgLocal` if your device kernels were not generated by DPC++.

Such a limitation would still allow the user to use dynamic or static shared mem allocations without any assumptions baked in by DPC++.
Do you think you could write a snippet of a SYCL/OpenCL-C/SPIR-V/PTX kernel that you think wouldn't work currently? I'm not understanding enough to come up with a concrete example, and it would help specify what the restrictions are.
My current thinking on how to progress this is:
- Add E2E DPC++ tests:
  - Multiple local accessors in the same command group that have different alignments. Edit: I found https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/Regression/local-arg-align.cpp already exists, which is marked as unsupported.
  - Kernel written in OpenCL-C using multiple local arguments (core SYCL test without graphs and test with graphs).
- Add UR CTS tests outwith command-buffers for local memory args:
  - `urKernelSetArgLocal` doesn't need to be called between multiple enqueues of a kernel.
  - Multiple `urKernelSetArgLocal` args with different alignments.
- Document the local mem arg impl and/or restrictions in the CUDA/HIP reference doc.
CUDA kernel that won't work:
__global__ void kernel(int *a) {
  extern __shared__ int blah1[];
  a[0] = blah1[0];
}
As this produces the PTX:
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-34841621
// Cuda compilation tools, release 12.6, V12.6.77
// Based on NVVM 7.0.1
//
.version 8.5
.target sm_52
.address_size 64
// .globl _Z6kernelPi
.extern .shared .align 16 .b8 blah1[];
.visible .entry _Z6kernelPi(
.param .u64 _Z6kernelPi_param_0
)
{
.reg .b32 %r<2>;
.reg .b64 %rd<3>;
ld.param.u64 %rd1, [_Z6kernelPi_param_0];
cvta.to.global.u64 %rd2, %rd1;
ld.shared.u32 %r1, [blah1];
st.global.u32 [%rd2], %r1;
ret;
}
This doesn't work because the UR adapters assume that every use of `urKernelSetArgLocal` corresponds to an offset param in the kernel. This kernel doesn't use an offset into the shared local memory, since it is just using the base address of the local mem allocation without any offset.
There is no way to say "This kernel uses x bytes in dynamic local mem, but my kernel signature doesn't include an offset param with which I want to index into the local mem". By contrast this would be fine:
__global__ void kernel(int *a, unsigned offset) {
  // blah1 should never be used unless offset with a kernel arg
  extern __shared__ int blah1[];
  int *blah_ptr = &blah1[offset];
  a[0] = blah_ptr[0];
}
Once you have a param you can call `urKernelSetArgLocal` with the index of the offset param (in this case 1). As a matter of fact you don't even need to use the local mem offset at all, as long as you can be sure that the kernel arg won't be elided by the compiler. So this should be fine as well:
__global__ void kernel(int *a, [[maybe_unused]] unsigned offset) {
  extern __shared__ int blah1[];
  a[0] = blah1[0];
}
OpenCL kernel that won't work (at least when run by the CUDA adapter):

__kernel void blah(__local float *local_mem) {
  local_mem[0] = 0.0f;
}

As the CUDA adapter assumes that the `Index` arg given to `urKernelSetArgLocal` will map to an offset param in the kernel signature. In this case the dynamic local mem doesn't have an unsigned `Offset` param, but instead has a pointer param which corresponds to the local ptr. Loading it at runtime with an offset param will fail.
Also good find with the local arg alignment tests. Interesting that it's UNSUPPORTED for all targets. We should maybe pass this task to the OAC team?
Ahh, I get you now - I read "instrumented" as "implemented" in your original comment. Could you check if what I've put in scripts/core/CUDA.rst is the kind of thing you had in mind?

> Also good find with the local arg alignment tests. Interesting that it's UNSUPPORTED for all targets. We should maybe pass this task to the OAC team?

Good segue to ping @oneapi-src/unified-runtime-cuda-write for a review on this PR
CTS LGTM
scripts/core/CUDA.rst:

and make available from the pointer argument.

The CUDA adapter implements local memory arguments to a kernel as a single
``__shared__`` memory allocation, with each local address space pointer argument
It's not actually a pointer arg; it ends up as a u32 offset arg, which is combined with the ptr within the kernel.
I mean you actually mention this elsewhere. So maybe this is OK
Do you think we could add an overload for `urKernelSetArgLocalExt` that only takes a `Size`, not a `(Size, Index)`? Then the user could ask for dynamic shared mem without having to have extra offsets in the kernel args.
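For illustration only, a hypothetical shape for such an entry point (this does not exist in UR today; the name is taken from the suggestion above and the signature is invented):

// Hypothetical: request `size` bytes of dynamic shared memory for the kernel
// as a whole, without associating it with an argument index, and therefore
// without requiring a dedicated offset parameter in the kernel signature.
ur_result_t urKernelSetArgLocalExt(ur_kernel_handle_t hKernel, size_t size);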
After setting kernel arguments during update, we need to reset the amount of local memory used.
Iterate on previous solution so that the local argument offsets at following indices are updated when an earlier local argument is updated.
Co-authored-by: Ben Tracy <[email protected]>
Co-authored-by: aarongreig <[email protected]>
The current implementation of CUDA/HIP local memory argument handling does not update the offset of any following local memory arguments when a preceding local memory argument is set. Instead, the expectation is that `clearLocalSize()` is called after a kernel command has been appended/enqueued, which clears the vector of local memory used by each argument. The local memory arguments must then be reset with `urKernelSetArgLocal`. This implementation causes problems for command-buffer kernel command update, where a user can pass a subset of local arguments to update, without a guarantee of passing all the local arguments in each update.

In this patch the local argument handling of CUDA/HIP is refactored so that when a local argument is set, any local arguments which follow are updated to account for changes in size & padding. This removes the need for a `clearLocalSize()` method, and for a user to have to set `urKernelSetArgLocal` before each kernel enqueue/append.

DPC++ PR intel/llvm#16025
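As a rough illustration of the bookkeeping described above (a simplified sketch, not the adapter's actual code; names and the fixed alignment parameter are invented):

#include <cstddef>
#include <vector>

// Recompute the byte offset of every local-memory argument into the single
// shared-memory allocation whenever any argument's size changes, so that
// later arguments stay correctly padded and aligned.
void recomputeLocalOffsets(const std::vector<size_t> &Sizes,
                           std::vector<size_t> &Offsets, size_t Alignment) {
  Offsets.resize(Sizes.size());
  size_t Offset = 0;
  for (size_t I = 0; I < Sizes.size(); ++I) {
    Offsets[I] = Offset;
    // Pad the running offset up to the next multiple of Alignment
    // (Alignment is assumed to be a power of two).
    Offset = (Offset + Sizes[I] + Alignment - 1) & ~(Alignment - 1);
  }
}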