
Add function for recursively printing parameter memory #2560

Closed

Conversation

@charleskawczynski (Contributor)

As Clima has developed increasingly complex models and fused increasingly complex broadcast expressions, we've been running into parameter memory issues more frequently.

One issue I have with the existing printed message is that it does not provide granularity for large objects.

This PR implements a recursive print function/macro, @rprint_parameter_memory(some_object), that users can call (and build tooling around) to print parameter memory usage with high granularity. For example (tentatively implemented in MultiBroadcastFusion):

fmb
size: 72, fmb.pairs::Tuple{…}
size: 16, fmb.pairs.1::Pair{…}
size: 64, fmb.pairs.1.first::CUDA.CuArray{…}
size: 16, fmb.pairs.1.first.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.1.first.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.1.first.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.1.first.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.1.first.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.1.first.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.1.first.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.1.first.dims::NTuple{…}
size: 64, fmb.pairs.1.second::Base.Broadcast.Broadcasted{…}
size: 64, fmb.pairs.1.second.args::Tuple{…}
size: 64, fmb.pairs.1.second.args.1::CUDA.CuArray{…}
size: 16, fmb.pairs.1.second.args.1.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.1.second.args.1.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.1.second.args.1.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.1.second.args.1.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.1.second.args.1.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.1.second.args.1.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.1.second.args.1.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.1.second.args.1.dims::NTuple{…}
size: 24, fmb.pairs.2::Pair{…}
size: 64, fmb.pairs.2.first::CUDA.CuArray{…}
size: 16, fmb.pairs.2.first.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.2.first.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.2.first.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.2.first.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.2.first.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.first.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.2.first.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.first.dims::NTuple{…}
size: 128, fmb.pairs.2.second::Base.Broadcast.Broadcasted{…}
size: 128, fmb.pairs.2.second.args::Tuple{…}
size: 64, fmb.pairs.2.second.args.1::CUDA.CuArray{…}
size: 16, fmb.pairs.2.second.args.1.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.2.second.args.1.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.2.second.args.1.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.2.second.args.1.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.2.second.args.1.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.second.args.1.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.2.second.args.1.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.second.args.1.dims::NTuple{…}
size: 64, fmb.pairs.2.second.args.2::CUDA.CuArray{…}
size: 16, fmb.pairs.2.second.args.2.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.2.second.args.2.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.2.second.args.2.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.2.second.args.2.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.2.second.args.2.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.second.args.2.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.2.second.args.2.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.second.args.2.dims::NTuple{…}
size: 32, fmb.pairs.3::Pair{…}
size: 64, fmb.pairs.3.first::CUDA.CuArray{…}
size: 16, fmb.pairs.3.first.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.3.first.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.3.first.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.3.first.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.3.first.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.first.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.3.first.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.first.dims::NTuple{…}
size: 192, fmb.pairs.3.second::Base.Broadcast.Broadcasted{…}
size: 192, fmb.pairs.3.second.args::Tuple{…}
size: 64, fmb.pairs.3.second.args.1::CUDA.CuArray{…}
size: 16, fmb.pairs.3.second.args.1.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.3.second.args.1.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.3.second.args.1.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.3.second.args.1.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.3.second.args.1.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.1.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.3.second.args.1.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.1.dims::NTuple{…}
size: 64, fmb.pairs.3.second.args.2::CUDA.CuArray{…}
size: 16, fmb.pairs.3.second.args.2.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.3.second.args.2.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.3.second.args.2.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.3.second.args.2.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.3.second.args.2.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.2.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.3.second.args.2.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.2.dims::NTuple{…}
size: 64, fmb.pairs.3.second.args.3::CUDA.CuArray{…}
size: 16, fmb.pairs.3.second.args.3.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.3.second.args.3.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.3.second.args.3.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.3.second.args.3.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.3.second.args.3.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.3.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.3.second.args.3.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.3.dims::NTuple{…}

I'm cc-ing some people who may also be interested in this: @glwagner @simonbyrne @simone-silvestri
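For readers who want the gist of how such a printer can be built, here is a minimal, hypothetical sketch (not the PR's actual implementation) that recurses over fields with fieldnames/getfield and reports sizeof at each level:

```julia
# Hedged sketch only; the real PR's code may differ. Walks an object's
# fields recursively and prints the size of each level, producing paths
# like `t.1::UInt64` similar to the output above.
# Note: a real implementation should guard against self-referential
# (cyclic) mutable objects.
function rprint_parameter_memory(io::IO, obj, name::AbstractString)
    println(io, "size: ", sizeof(obj), ", ", name, "::", typeof(obj))
    for fname in fieldnames(typeof(obj))  # Ints for tuples, Symbols otherwise
        field = getfield(obj, fname)
        rprint_parameter_memory(io, field, string(name, ".", fname))
    end
end
rprint_parameter_memory(obj, name::AbstractString = "obj") =
    rprint_parameter_memory(stdout, obj, name)
```

Primitive types have no fields, so the recursion bottoms out there.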

@charleskawczynski (Contributor, author)

I'm open to changing the format, but I do think this format is fairly simple and explicit.

@maleadt (Member) commented Dec 3, 2024

This feels like a very niche feature that I'm not sure is worth putting in CUDA.jl. We already report on the size of each argument; can't you keep the specific functionality to analyze within individual arguments in your package, or even a dedicated package for this purpose? I'd rather expose some way for you to perform that analysis on the actual kernel arguments (e.g., by saving them in the error that's thrown by @cuda).

@maleadt added labels on Dec 3, 2024: speculative (Not sure about this one yet.), enhancement (New feature or request), cuda kernels (Stuff about writing CUDA kernels.)
@glwagner (Contributor) commented Dec 3, 2024

> (e.g., by saving them in the error that's thrown by @cuda).

Whatever feature is implemented to help solve parameter space problems, a key criterion should be that it does not require launching a kernel to do the debugging. It's much more efficient to design kernel arguments by direct inspection than by trial-and-error kernel launching and digging through stack traces, which is the main issue with the current workflow.

@maleadt (Member) commented Dec 3, 2024

> It's much more efficient to design kernel arguments by direct inspection, rather than by trial-and-error kernel launching

I fail to see what's more convenient about doing @rprint_parameter_memory(some_object) (after you somehow decided the kernel will fail to launch) as opposed to a try ... catch surrounding a call to @cuda and prying the arguments from that error (thereby relying on the source of truth wrt. whether the kernel would run or not).

Can you elaborate on the workflow you want? I'm proposing here that you would be able to catch a KernelError containing all the arguments, for you to call @rprint_parameter_memory (or whatever tool you maintain locally) on, instead of CUDA.jl potentially generating a relatively inscrutable (at least to most users) infodump.

@glwagner (Contributor) commented Dec 3, 2024

The workflow is:

  1. Discover a parameter space error by attempting to run some program. These programs can be complex; for example, reaching the desired kernel may require constructing intermediate objects that themselves involve some computation. A typical time to reach such an error could be 10 or even 20 minutes.

  2. Inspect the error message, which helpfully prints the parameter space usage of the kernel arguments. After this one can take one or two actions: a) split the kernel into components so that each component requires fewer arguments or b) somehow simplify the objects being passed to the kernel.

Executing on 2a doesn't really require any new features; we can simply do the arithmetic to figure out whether the kernels will succeed.

For 2b, we may have to change adapt_structure or, deeper down, experiment with changing the objects themselves. For example, one change we are tempted to experiment with is to allow OffsetArray with offsets that are Int8 (or, more generally, a variable integer type). Predicting the potential savings of that change can be difficult, because some objects involve a mixture of components including many OffsetArrays. Therefore, to test whether such deep changes will succeed, we have to recompile and run our MWE. Since the MWE takes 10-20 minutes, this is slow. On the other hand, if we could simply print the parameter space usage of some large object that we are making changes to, we could iterate more quickly.

Here's an example conversation where we are trying to deduce how to solve a parameter space problem:

CliMA/ClimaOcean.jl#116
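The arithmetic for step 2a can be sketched in a few lines (the 31.996 KiB figure is taken from the error messages quoted later in this thread; the actual limit depends on the architecture and PTX version, and older architectures have a much smaller ~4 KiB limit):

```julia
# Back-of-the-envelope check: sum the sizes of the (converted) kernel
# arguments and compare against the parameter-space limit.
args = (ntuple(_ -> UInt64(1), 2^11),)    # stand-in kernel arguments
param_bytes = sum(sizeof, args)           # 2^11 * 8 = 16384 bytes
limit_bytes = round(Int, 31.996 * 1024)   # limit reported for sm_89 / PTX v8.5
fits = param_bytes <= limit_bytes         # true: this launch would fit
```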

@maleadt (Member) commented Dec 4, 2024

> On the other hand, if we could simply print the parameter space usage of some large object that we are making changes to, we could iterate a bit more quickly.

I see, so this is strictly a development utility that doesn't actually need any support in CUDA.jl?

FWIW, you should be able to make this all type-based, by just inspecting the device-side types that CUDA.jl already reports:

julia> @cuda Returns(nothing)((ntuple(_->UInt64(1), 2^13),))
ERROR: Kernel invocation uses too much parameter memory.
64.016 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.5.

Relevant parameters:
  [1] args::Tuple{NTuple{8192, UInt64}} uses 64.000 KiB

With a recursive size printer that operates on Tuple{NTuple{8192, UInt64}}, you don't even need to call into any CUDA.jl internals (i.e., no calls to cudaconvert).
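A minimal sketch of what such a type-level recursive size printer could look like, assuming a plain fieldtypes traversal (this is an illustration, not CUDA.jl code):

```julia
# Hedged sketch: recurse over a type's fieldtypes, so no object instance
# (and no cudaconvert) is needed. Types without a definite size (abstract
# or partially-specified) are flagged instead of sized.
function rprint_parameter_memory(io::IO, ::Type{T}, name::AbstractString) where {T}
    isconcretetype(T) || return println(io, "size: ?, ", name, "::", T)
    println(io, "size: ", sizeof(T), ", ", name, "::", T)
    for (i, FT) in enumerate(fieldtypes(T))
        rprint_parameter_memory(io, FT, string(name, ".", fieldname(T, i)))
    end
end
```

Feeding it a type like Tuple{NTuple{8192, UInt64}} straight from the error message would then break the 64 KiB down field by field.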

@glwagner (Contributor) commented Dec 4, 2024

> I see, so this is strictly a development utility that doesn't actually need any support in CUDA.jl?

Correct; I think the only point of putting it in CUDA.jl is to make it more visible and keep it up to date with CUDA development. It could easily find a home elsewhere. It doesn't even really need to be packaged, since I don't see much scope for further development; offering it in a package is merely trying to be friendly to other developers, I guess.

> you don't even need to call into any CUDA.jl internals (i.e., no calls to cudaconvert).

I'm not sure I understand, though @charleskawczynski might... the point of cudaconvert is to isolate the objects that get passed into the kernel (e.g., after being passed through adapt_structure), right?

@maleadt (Member) commented Dec 5, 2024

> the point of cudaconvert is to isolate the objects that get passed into kernel (eg after being passed through adapt_structure)

Yes, but in the error message that's reported by CUDA.jl you already get to see the types of the converted arguments:

julia> @cuda Returns(nothing)((ntuple(_->UInt64(1), 2^13), CUDA.rand(1)))
ERROR: Kernel invocation uses too much parameter memory.
64.047 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.5.

Relevant parameters:
  [1] args::Tuple{NTuple{8192, UInt64}, CuDeviceVector{Float32, 1}} uses 64.031 KiB

So your helper could ingest the Type{Tuple{NTuple{8192, UInt64}, CuDeviceVector{Float32, 1}}} and break it down exactly as is done in the OP. That would make the utility fully generic.

@maleadt force-pushed the master branch 15 times, most recently from 5d585c4 to c850163 on December 20, 2024
@maleadt (Member) commented Jan 8, 2025

I think I'll close this, since (in its current form) I don't think the utility function added in this PR belongs in CUDA.jl, but could be something generic in a stand-alone package. Let me know if my understanding is wrong though, I'm not opposed to improving the current parameter OOM reporting in any way.

@maleadt closed this on Jan 8, 2025
@charleskawczynski (Contributor, author)

Sorry I'm late to this; I was on vacation for a while, and I've still been catching up since the end of the holidays.

One issue with doing this fully in the type space is that it's more difficult to map the given data structures to the types, which is why I wrote this using objects and not types.

I can move this to a separate package, but that would mean users need a try/catch around kernel launches to get the more verbose (and, IMO, more useful and granular) information on what is taking up parameter memory. Ideally, users could pass a verbose flag or something to kernel launches to get more granular memory reporting, but I don't know enough CUDA.jl internals to thread that information through.

@maleadt (Member) commented Jan 8, 2025

> One issue with doing this fully in the type space is that it's more difficult to map the given data structures to the types, hence why I wrote this using objects and not types.

Why is that? You should have all the information you need using the fieldnames query. In fact, your current use of getproperty makes this unreliable, since getproperty can be overloaded. Also, there's no guarantee that cudaconvert on a field matches what would have been sent to the compiler, as a higher-level cudaconvert could have resulted in a vastly different type. Really, the only way to exactly match the types being used is to re-use the compiler output.
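To illustrate the getproperty pitfall with a toy, hypothetical type: getproperty can expose "virtual" properties that are not fields, so walking properties need not reflect the struct's actual memory layout, while fieldnames does:

```julia
# Hypothetical example type; not from the PR.
struct Celsius
    degrees::Float64
end

# Overload getproperty to expose a "virtual" property `fahrenheit`
# that is computed on access and backed by no field at all.
Base.getproperty(c::Celsius, s::Symbol) =
    s === :fahrenheit ? getfield(c, :degrees) * 9 / 5 + 32 : getfield(c, s)

c = Celsius(100.0)
c.fahrenheit            # 212.0 -- looks like a field, but isn't one
fieldnames(Celsius)     # (:degrees,) -- the true memory layout
```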

@charleskawczynski (Contributor, author)

Hm, that's a good point, regarding overloading getproperty. Would you be more inclined to merge this if it was recursively defined using fieldnames instead?

@maleadt (Member) commented Jan 8, 2025

But why? Why do you want this to operate on the object level while all the information is available at the type level?

@maleadt (Member) commented Jan 8, 2025

> but that would mean that users will need a try-catch around kernel launches

They would need one now too, because you're requiring objects to be passed into this functionality, whereas the error message already prints the types needed for the analysis (i.e., avoiding such a try/catch).

@charleskawczynski (Contributor, author)

> But why? Why do you want this to operate on the object level while all the information is available at the type level?

The main issue I have with the current error messages is that there is virtually no granularity for large objects. In the original post, there is a single object with a massive type signature, and it's not clear to me which parts of the object we should strip out / tackle to get the biggest bang for our buck.

I believe that this is the same issue that @glwagner has experienced.

@maleadt (Member) commented Jan 8, 2025

> The main issue I have with the current error messages is that there is virtually no granularity for large objects. In the original post, there is a single object with a massive type signature, and it's not clear to me which parts of the object we should strip out / tackle to get the biggest bang for our buck.

Right, so that's where your functionality comes in: feed it the type printed by the error message and get insights into the fields. I'm happy to take improvements to the current error reporting, but printing the size of each field recursively quickly becomes too noisy to be a valid default. That's why I suggested putting this in a separate package: it shouldn't be CUDA-specific in any way (and could be useful to other back-ends' users that way). I'd rather not carry functionality here that isn't used; that's just asking for it to bitrot.

@charleskawczynski (Contributor, author)

That's fair, and I agree that functionality like this should probably not be the default. It'd be ideal if we could thread an option through, but making a separate package might be simpler. Thanks for the discussion, @maleadt.

@maleadt (Member) commented Jan 8, 2025

Alternatively, I'd be fine with adding some kind of hook for external consumers to investigate launch or compilation failures.

@charleskawczynski (Contributor, author)

Hm, I'd be fine with that. How would the hook work? Something like allowing users to set

import CUDA
CUDA.print_granular_parameter_memory() = true

and then check for this?
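As a rough sketch of how such a hook pattern can work in plain Julia (FakeLib and all names here are hypothetical, not CUDA.jl API):

```julia
# Hypothetical library-side code: a default hook that external
# consumers may override to inspect the failing kernel arguments.
module FakeLib
    # Default hook: no extra diagnostics.
    parameter_memory_hook(args...) = nothing

    function report_launch_failure(args...)
        # On a launch/compilation failure, call the hook so an external
        # consumer can analyze the (converted) arguments.
        parameter_memory_hook(args...)
        return "Kernel invocation uses too much parameter memory."
    end
end

# Hypothetical consumer-side override: capture the offending arguments.
const captured = Ref{Any}(nothing)
FakeLib.parameter_memory_hook(args...) = (captured[] = args; nothing)

FakeLib.report_launch_failure(1, 2.0)
captured[]   # (1, 2.0)
```

The captured arguments could then be fed to @rprint_parameter_memory or any other analysis tool without a try/catch at every launch site.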
