Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve CUDA/HIP local argument handling #2298

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

EwanC
Copy link
Contributor

@EwanC EwanC commented Nov 8, 2024

The current implementation of CUDA/HIP local memory argument handling does not update the offset of any following local memory arguments when a preceding local memory argument is set. Instead the expectation is that clearLocalSize() is called after a kernel command has been appended/enqueued which clears the vector of local memory used by each argument. Then if the local memory arguments must be reset with urKernelSetArgLocal.

This implementation causes problems for command-buffer kernel command update, where a user can pass a subset of local arguments to update, without a guarantee of passing all the local arguments in each update.

In this patch the local argument of CUDA/HIP is refactored so that when a local argument is set, any local arguments which follow are updated to account of changes in size & padding. This removes the need to have a clearLocalSize() method and also for a user to have to set urKernelSetArgLocal before each kernel enqueue/append.

DPC++ PR intel/llvm#16025

@github-actions github-actions bot added cuda CUDA adapter specific issues command-buffer Command Buffer feature addition/changes/specification labels Nov 8, 2024
@EwanC EwanC force-pushed the ewan/cuda_update_local_size branch from 3d64f76 to 977a240 Compare November 8, 2024 14:59
@github-actions github-actions bot added the conformance Conformance test suite issues. label Nov 8, 2024
@EwanC EwanC force-pushed the ewan/cuda_update_local_size branch from 99775d4 to e9ecf06 Compare November 14, 2024 09:43
EwanC added a commit to reble/llvm that referenced this pull request Nov 14, 2024
Tests UR PR oneapi-src/unified-runtime#2298
with additional SYCL-Graph local memory argument E2E tests.

PR also sets the `pnext` and `snext` members of
`ur_exp_command_buffer_update_kernel_launch_desc_t ` which were missing when
calling into UR.
@EwanC EwanC force-pushed the ewan/cuda_update_local_size branch from e9ecf06 to 6c67530 Compare November 14, 2024 14:17
@github-actions github-actions bot added the hip HIP adapter specific issues label Nov 14, 2024
@EwanC EwanC changed the title WIP: Cuda Fix for command-buffer local argument upate Improve CUDA/HIP local argument handling Nov 14, 2024
@EwanC EwanC force-pushed the ewan/cuda_update_local_size branch from 6c67530 to a13c0d1 Compare November 14, 2024 15:01
EwanC added a commit to reble/llvm that referenced this pull request Nov 14, 2024
Tests UR PR oneapi-src/unified-runtime#2298
with additional SYCL-Graph local memory argument E2E tests.

PR also sets the `pnext` and `snext` members of
`ur_exp_command_buffer_update_kernel_launch_desc_t ` which were missing when
calling into UR.
@EwanC EwanC marked this pull request as ready for review November 15, 2024 09:20
@EwanC EwanC requested review from a team as code owners November 15, 2024 09:20
@EwanC EwanC force-pushed the ewan/cuda_update_local_size branch 2 times, most recently from f5a75b9 to bed340f Compare November 18, 2024 11:47
EwanC added a commit to reble/llvm that referenced this pull request Nov 18, 2024
Tests UR PR oneapi-src/unified-runtime#2298
with additional SYCL-Graph local memory argument E2E tests.

PR also sets the `pnext` and `snext` members of
`ur_exp_command_buffer_update_kernel_launch_desc_t ` which were missing when
calling into UR.
args_index_t Indices;
args_size_t OffsetPerIndex;
/// Aligned size in bytes for each local memory parameter after padding has
Copy link
Contributor

@hdelan hdelan Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you maybe add a brief comment explaining how local mem is instrumented for DPC++? ie there's a single local mem buffer, and we pass the offsets into this as kernel args.

This relies on how device code is generated by DPC++ and is not necessarily obvious or intuitive. It also constitutes maybe a certain kind of ABI that should be documented somewhere so that these entry points can be used by non-DPC++ applications.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec for urKernelSetArgLocal says of the parameters:

argIndex – [in] argument index in range [0, num args - 1]

argSize – [in] size of the local buffer to be allocated by the runtime 

So the user only passes the size of the local memory it wants from the pointer at that argument. It's the UR adapter's implementations decision whether it wants to implement that as it's own local memory, or as CUDA is doing each local argument an offset into single local data allocation. That UR API hasn't changes so I don't think there is an ABI break, we've just removed the requirement for a user to make redundant urKernelSetArgLocal calls when targeting the CUDA/HIP backend. Although it's not wrong if a caller does do this.

I'll have a try at documenting this all up so it's clearer though, I probably have a lot of implicit knowledge in my head about what's going on that would be better written down.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the main thing that I think isn't clear is that the CUDA/HIP urKernelSetArgLocal impls will end up passing offsets as kernel args to the device kernel. This means that kernels need to be written in such a way that the args contain offsets into the smem allocation, and that within device code, the user will combine these offsets with the extern ptr that represents the entire smem allocation. This is even true for the base offset into an smem allocation, which is always zero.

If local memory is to be used in the CUDA/HIP UR adapters, the compiler must generate a kernel like this that uses an offset (if urKernelSetArgLocal is to be used).

int main() {                                                                                                                       
  T *ptr;                                                                                                                          
  sycl::queue{}.submit([&](sycl::handler &cgh) {                                                                                   
    sycl::local_accessor<T> local_acc(1, cgh);                                                                                     
    cgh.single_task<foo>([=] { local_acc[0] += local_acc[0]; });                                                                   
  });                                                                                                                              
}    
//                                                                                                                                 
// Generated by LLVM NVPTX Back-End                                                                                                
//                                                                                                                                 
                                                                                                                                   
.version 8.5                                                                                                                       
.target sm_60                                                                                                                      
.address_size 64                                                                                                                   
                                                                                                                                   
        // .weak        _ZTS3foo                        // -- Begin function _ZTS3foo                                              
.extern .shared .align 4 .b8 _ZTS3foo_shared_mem[];                                                                                
                                        // @_ZTS3foo                                                                               
.weak .entry _ZTS3foo(                                                                                                             
        .param .u32 _ZTS3foo_param_0                                                                                               
)                                                                                                                                  
{                                                                                                                                  
        .reg .b32       %r<3>;                                                                                                     
        .reg .b64       %rd<4>;                                                                                                    
                                                                                                                                   
// %bb.0:                               // %entry                                                                                  
        ld.param.s32    %rd1, [_ZTS3foo_param_0];                                                                                  
        mov.u64         %rd2, _ZTS3foo_shared_mem;                                                                                 
        add.s64         %rd3, %rd2, %rd1;                                                                                          
        ld.shared.u32   %r1, [%rd3];                                                                                               
        shl.b32         %r2, %r1, 1;                                                                                               
        st.shared.u32   [%rd3], %r2;                                                                                               
        ret;                                                                                                                       
                                        // -- End function                                                                         
}               

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in general we don't have good enough testing for this sort of behaviour in DPC++. From a quick grep I can't find any tests where you have multiple local accessors in the same command group that have different alignments.

Copy link
Contributor

@hdelan hdelan Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also to clarify - I don't think this is ABI breaking. I just think that we need to define clearly what kind of kernels UR expects when it is handed local mem args through urKernelSetArgLocal. Maybe another alternative is just to say.

It is not recommended to use urKernelSetArgLocal if your device kernels were not generated by DPC++.

Such a limitation would still allow the user to use dynamic or static shared mem allocations without any assumptions baked in by DPC++.

Copy link
Contributor Author

@EwanC EwanC Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think you could write a snippet of a SYCL/OpenCL-C/SPIRV/PTX kernel that you think wouldn't work currently? I'm not understanding enough to come up with a concrete example, and would help specify what the restrictions are.

My current thinking on how to progress this is:

  • add E2E DPC++ tests

  • add UR CTS tests outwith command-buffers for local memory args

    • urKernelSetArgLocal doesn't need to be called between multiple enqueues of a kernel.
    • Multiple urKernelSetArgLocal args with different alignments
  • Document in local mem arg impl and/or restrictions in CUDA/HIP reference doc

Copy link
Contributor

@hdelan hdelan Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CUDA kernel that won't work:

__global__ void kernel(int *a) {
  extern __shared__ int blah1[];
  a[0] = blah1[0];
}

As this produces the PTX:

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-34841621
// Cuda compilation tools, release 12.6, V12.6.77
// Based on NVVM 7.0.1
//

.version 8.5
.target sm_52
.address_size 64

	// .globl	_Z6kernelPi
.extern .shared .align 16 .b8 blah1[];

.visible .entry _Z6kernelPi(
	.param .u64 _Z6kernelPi_param_0
)
{
	.reg .b32 	%r<2>;
	.reg .b64 	%rd<3>;


	ld.param.u64 	%rd1, [_Z6kernelPi_param_0];
	cvta.to.global.u64 	%rd2, %rd1;
	ld.shared.u32 	%r1, [blah1];
	st.global.u32 	[%rd2], %r1;
	ret;

}

This doesn't work because the UR adapters assume that every use of urKernelSetArgLocal corresponds to an offset param in the kernel. This kernel doesn't use an offset into the shared local memory since it is just using the unoffset base address of the local mem allocation.

There is no way to say "This kernel uses x bytes in dynamic local mem, but my kernel signature doesn't include an offset param with which I want to index into the local mem". By contrast this would be fine:

__global__ void kernel(int *a, unsigned offset) {
  extern __shared__ int
      blah1[]; // This should never be used unless offset with a kernel arg
  int *blah_ptr = &blah1[offset];
  a[0] = blah_ptr[0];
}

Once you have a param you can call urKernelSetArgLocal with the index of the offset param (in this case 1). As a matter of fact you don't even need to use the local mem offset at all, once you can be assured that the kernel arg won't be elided by the compiler. So this should be fine as well:

__global__ void kernel(int *a, [[maybe_unused]] unsigned offset) {
  extern __shared__ int blah1[];
  a[0] = blah1[0];
}

Copy link
Contributor

@hdelan hdelan Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

openCL kernel that won't work (at least when run by the CUDA adapter):

__kernel void blah(
    __local float* local_mem)

As the CUDA adapter assumes that the Index arg given to urKernelSetArgLocal will map to an offset param in the kernel signature. In this case the dynamic local mem doesn't have an unsigned Offset param, but instead will have a ptr param which corresponds to the local ptr. Loading it at runtime with an offset param will fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also good find with the local arg alignment tests. Interesting that it's UNSUPPORTED for all targets. We should maybe pass this task to the OAC team?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I get you know - I read "instrumented" as "implemented" in your original comment. Could you check if what i've put in scripts/core/CUDA.rst is the kinda thing you had in mind?

Also good find with the local arg alignment tests. Interesting that it's UNSUPPORTED for all targets. We should maybe pass this task to the OAC team?

Good segway to ping @oneapi-src/unified-runtime-cuda-write for a review on this PR

Copy link
Contributor

@aarongreig aarongreig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CTS LGTM

source/adapters/cuda/kernel.hpp Outdated Show resolved Hide resolved
source/adapters/hip/kernel.hpp Outdated Show resolved Hide resolved
EwanC added a commit to reble/llvm that referenced this pull request Nov 19, 2024
Tests UR PR oneapi-src/unified-runtime#2298
with additional SYCL-Graph local memory argument E2E tests.

PR also sets the `pnext` and `snext` members of
`ur_exp_command_buffer_update_kernel_launch_desc_t ` which were missing when
calling into UR.
@github-actions github-actions bot added loader Loader related feature/bug sanitizer Sanitizer layer issues/changes/specification labels Nov 19, 2024
@EwanC EwanC force-pushed the ewan/cuda_update_local_size branch 2 times, most recently from 23c24a4 to 0500f4c Compare November 19, 2024 14:32
EwanC added a commit to reble/llvm that referenced this pull request Nov 19, 2024
Tests UR PR oneapi-src/unified-runtime#2298
with additional SYCL-Graph local memory argument E2E tests.

PR also sets the `pnext` and `snext` members of
`ur_exp_command_buffer_update_kernel_launch_desc_t ` which were missing when
calling into UR.
@EwanC EwanC force-pushed the ewan/cuda_update_local_size branch from 0500f4c to 3e32dba Compare November 19, 2024 15:20
@github-actions github-actions bot added the specification Changes or additions to the specification label Nov 19, 2024
and make available from the pointer argument.

The CUDA adapter implements local memory arguments to a kernel as a single
``__shared__`` memory allocation, with each local address space pointer argument
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not actually a pointer arg, it ends up as a u32 offset arg, which is combined with the ptr within the kernel

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean you actually mention this elsewhere. So maybe this is OK

Copy link
Contributor

@hdelan hdelan Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we could add an overload for urKernelSetArgLocalExt that only takes a Size, not a (Size, Index)? Then the user could ask for dynamic shared mem without having to have extra offsets in the kernel args.

scripts/core/CUDA.rst Outdated Show resolved Hide resolved
EwanC and others added 6 commits November 20, 2024 09:54
After setting kernel arguments during update, we
need to reset the amount of local memory used.
Iterate on previous solution so that the local argument
offsets at following inidices are updated when an earlier
local argument is updated
Co-authored-by: Ben Tracy <[email protected]>
Co-authored-by: aarongreig <[email protected]>
@EwanC EwanC force-pushed the ewan/cuda_update_local_size branch from 3e32dba to 582f358 Compare November 20, 2024 12:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
command-buffer Command Buffer feature addition/changes/specification conformance Conformance test suite issues. cuda CUDA adapter specific issues hip HIP adapter specific issues loader Loader related feature/bug sanitizer Sanitizer layer issues/changes/specification specification Changes or additions to the specification
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants