
Add function for recursively printing parameter memory #2560

Closed

Conversation

@charleskawczynski (Contributor)

As Clima has developed increasingly complex models and fused increasingly complex broadcast expressions, we've been running into parameter memory issues more frequently.

One issue I have with the existing printed message is that it does not provide granularity for large objects.

This PR implements a recursive print function/macro, @rprint_parameter_memory(some_object), that users can call (and build tooling around) to print parameter memory usage with high granularity. For example (tentatively implemented in MultiBroadcastFusion):

fmb
size: 72, fmb.pairs::Tuple{…}
size: 16, fmb.pairs.1::Pair{…}
size: 64, fmb.pairs.1.first::CUDA.CuArray{…}
size: 16, fmb.pairs.1.first.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.1.first.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.1.first.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.1.first.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.1.first.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.1.first.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.1.first.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.1.first.dims::NTuple{…}
size: 64, fmb.pairs.1.second::Base.Broadcast.Broadcasted{…}
size: 64, fmb.pairs.1.second.args::Tuple{…}
size: 64, fmb.pairs.1.second.args.1::CUDA.CuArray{…}
size: 16, fmb.pairs.1.second.args.1.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.1.second.args.1.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.1.second.args.1.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.1.second.args.1.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.1.second.args.1.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.1.second.args.1.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.1.second.args.1.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.1.second.args.1.dims::NTuple{…}
size: 24, fmb.pairs.2::Pair{…}
size: 64, fmb.pairs.2.first::CUDA.CuArray{…}
size: 16, fmb.pairs.2.first.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.2.first.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.2.first.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.2.first.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.2.first.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.first.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.2.first.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.first.dims::NTuple{…}
size: 128, fmb.pairs.2.second::Base.Broadcast.Broadcasted{…}
size: 128, fmb.pairs.2.second.args::Tuple{…}
size: 64, fmb.pairs.2.second.args.1::CUDA.CuArray{…}
size: 16, fmb.pairs.2.second.args.1.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.2.second.args.1.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.2.second.args.1.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.2.second.args.1.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.2.second.args.1.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.second.args.1.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.2.second.args.1.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.second.args.1.dims::NTuple{…}
size: 64, fmb.pairs.2.second.args.2::CUDA.CuArray{…}
size: 16, fmb.pairs.2.second.args.2.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.2.second.args.2.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.2.second.args.2.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.2.second.args.2.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.2.second.args.2.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.second.args.2.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.2.second.args.2.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.second.args.2.dims::NTuple{…}
size: 32, fmb.pairs.3::Pair{…}
size: 64, fmb.pairs.3.first::CUDA.CuArray{…}
size: 16, fmb.pairs.3.first.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.3.first.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.3.first.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.3.first.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.3.first.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.first.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.3.first.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.first.dims::NTuple{…}
size: 192, fmb.pairs.3.second::Base.Broadcast.Broadcasted{…}
size: 192, fmb.pairs.3.second.args::Tuple{…}
size: 64, fmb.pairs.3.second.args.1::CUDA.CuArray{…}
size: 16, fmb.pairs.3.second.args.1.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.3.second.args.1.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.3.second.args.1.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.3.second.args.1.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.3.second.args.1.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.1.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.3.second.args.1.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.1.dims::NTuple{…}
size: 64, fmb.pairs.3.second.args.2::CUDA.CuArray{…}
size: 16, fmb.pairs.3.second.args.2.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.3.second.args.2.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.3.second.args.2.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.3.second.args.2.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.3.second.args.2.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.2.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.3.second.args.2.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.2.dims::NTuple{…}
size: 64, fmb.pairs.3.second.args.3::CUDA.CuArray{…}
size: 16, fmb.pairs.3.second.args.3.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.3.second.args.3.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.3.second.args.3.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.3.second.args.3.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.3.second.args.3.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.3.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.3.second.args.3.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.3.dims::NTuple{…}

I'm cc-ing some people who may also be interested in this: @glwagner @simonbyrne @simone-silvestri
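For readers who want the gist of how such a printer can be built, here is a minimal, hypothetical sketch (not the PR's actual implementation) that recurses over fields with fieldnames/getfield and reports sizeof at each level:

```julia
# Hedged sketch only; the real PR's code may differ. Walks an object's
# fields recursively and prints the size of each level, producing paths
# like `t.1::UInt64` similar to the output above.
# Note: a real implementation should guard against self-referential
# (cyclic) mutable objects.
function rprint_parameter_memory(io::IO, obj, name::AbstractString)
    println(io, "size: ", sizeof(obj), ", ", name, "::", typeof(obj))
    for fname in fieldnames(typeof(obj))  # Ints for tuples, Symbols otherwise
        field = getfield(obj, fname)
        rprint_parameter_memory(io, field, string(name, ".", fname))
    end
end
rprint_parameter_memory(obj, name::AbstractString = "obj") =
    rprint_parameter_memory(stdout, obj, name)
```

Primitive types have no fields, so the recursion bottoms out there.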

@charleskawczynski (Contributor, author)

I'm open to changing the format, but I do think this format is fairly simple and explicit.

@maleadt (Member) commented Dec 3, 2024

This feels like a very niche feature that I'm not sure is worth putting in CUDA.jl. We already report on the size of each argument; can't you keep the specific functionality to analyze within individual arguments in your package, or even a dedicated package for this purpose? I'd rather expose some way for you to perform that analysis on the actual kernel arguments (e.g., by saving them in the error that's thrown by @cuda).

@maleadt added labels on Dec 3, 2024: speculative (Not sure about this one yet.), enhancement (New feature or request), cuda kernels (Stuff about writing CUDA kernels.)
@glwagner (Contributor) commented Dec 3, 2024

> (e.g., by saving them in the error that's thrown by @cuda).

Whatever feature is implemented to help solve parameter space problems, a key criterion should be that it does not require launching a kernel to do the debugging. It's much more efficient to design kernel arguments by direct inspection than by trial-and-error kernel launching and digging through stack traces, which is the main issue with the current workflow.

@maleadt (Member) commented Dec 3, 2024

> It's much more efficient to design kernel arguments by direct inspection, rather than by trial-and-error kernel launching

I fail to see what's more convenient about doing @rprint_parameter_memory(some_object) (after you somehow decided the kernel will fail to launch) as opposed to a try ... catch surrounding a call to @cuda and prying the arguments from that error (thereby relying on the source of truth wrt. whether the kernel would run or not).

Can you elaborate on the workflow you want? I'm proposing here that you would be able to catch a KernelError containing all the arguments, for you to call @rprint_parameter_memory (or whatever tool you maintain locally) on, instead of CUDA.jl potentially generating a relatively inscrutable (at least to most users) infodump.

@glwagner (Contributor) commented Dec 3, 2024

The workflow is:

  1. Discover a parameter space error by attempting to run some program. These programs can be complex; for example, reaching the desired kernel may require constructing intermediate objects that themselves involve some computation. A typical time to reach such an error could be 10 or even 20 minutes.

  2. Inspect the error message, which helpfully prints the parameter space usage of the kernel arguments. After this one can take one or two actions: a) split the kernel into components so that each component requires fewer arguments or b) somehow simplify the objects being passed to the kernel.

Executing on 2a doesn't really require any new features; we can simply do the arithmetic to figure out whether the kernels will succeed.

For 2b, we may have to change adapt_structure or, deeper down, experiment with changing the objects themselves. For example, one change we are tempted to experiment with is to allow OffsetArray with offsets that are Int8 (or, more generally, a variable integer type). Predicting the potential savings of that change can be difficult, because some objects involve a mixture of components including many OffsetArrays. Therefore, to test whether such deep changes will succeed, we have to recompile and run our MWE. Since the MWE takes 10-20 minutes, this is slow. On the other hand, if we could simply print the parameter space usage of some large object that we are making changes to, we could iterate more quickly.

Here's an example conversation where we are trying to deduce how to solve a parameter space problem:

CliMA/ClimaOcean.jl#116
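The arithmetic for step 2a can be sketched in a few lines (the 31.996 KiB figure is taken from the error messages quoted later in this thread; the actual limit depends on the architecture and PTX version, and older architectures have a much smaller ~4 KiB limit):

```julia
# Back-of-the-envelope check: sum the sizes of the (converted) kernel
# arguments and compare against the parameter-space limit.
args = (ntuple(_ -> UInt64(1), 2^11),)    # stand-in kernel arguments
param_bytes = sum(sizeof, args)           # 2^11 * 8 = 16384 bytes
limit_bytes = round(Int, 31.996 * 1024)   # limit reported for sm_89 / PTX v8.5
fits = param_bytes <= limit_bytes         # true: this launch would fit
```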

@maleadt (Member) commented Dec 4, 2024

> On the other hand, if we could simply print the parameter space usage of some large object that we are making changes to, we could iterate a bit more quickly.

I see, so this is strictly a development utility that doesn't actually need any support in CUDA.jl?

FWIW, you should be able to make this all type-based, by just inspecting the device-side types that CUDA.jl already reports:

julia> @cuda Returns(nothing)((ntuple(_->UInt64(1), 2^13),))
ERROR: Kernel invocation uses too much parameter memory.
64.016 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.5.

Relevant parameters:
  [1] args::Tuple{NTuple{8192, UInt64}} uses 64.000 KiB

With a recursive size printer that operates on Tuple{NTuple{8192, UInt64}}, you don't even need to call into any CUDA.jl internals (i.e., no calls to cudaconvert).
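A minimal sketch of what such a type-level recursive size printer could look like, assuming a plain fieldtypes traversal (this is an illustration, not CUDA.jl code):

```julia
# Hedged sketch: recurse over a type's fieldtypes, so no object instance
# (and no cudaconvert) is needed. Types without a definite size (abstract
# or partially-specified) are flagged instead of sized.
function rprint_parameter_memory(io::IO, ::Type{T}, name::AbstractString) where {T}
    isconcretetype(T) || return println(io, "size: ?, ", name, "::", T)
    println(io, "size: ", sizeof(T), ", ", name, "::", T)
    for (i, FT) in enumerate(fieldtypes(T))
        rprint_parameter_memory(io, FT, string(name, ".", fieldname(T, i)))
    end
end
```

Feeding it a type like Tuple{NTuple{8192, UInt64}} straight from the error message would then break the 64 KiB down field by field.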

@glwagner (Contributor) commented Dec 4, 2024

> I see, so this is strictly a development utility that doesn't actually need any support in CUDA.jl?

Correct; I think the only point of putting it in CUDA.jl is to make it more visible and keep it up to date with CUDA development. It could easily find a home elsewhere. It doesn't even really need to be packaged, since I don't see much scope for further development; offering it in a package is merely trying to be friendly to other developers, I guess.

> you don't even need to call into any CUDA.jl internals (i.e., no calls to cudaconvert).

I'm not sure I understand, though @charleskawczynski might... the point of cudaconvert is to isolate the objects that get passed into the kernel (e.g., after being passed through adapt_structure), right?

@maleadt (Member) commented Dec 5, 2024

> the point of cudaconvert is to isolate the objects that get passed into kernel (eg after being passed through adapt_structure)

Yes, but in the error message that's reported by CUDA.jl you already get to see the types of the converted arguments:

julia> @cuda Returns(nothing)((ntuple(_->UInt64(1), 2^13), CUDA.rand(1)))
ERROR: Kernel invocation uses too much parameter memory.
64.047 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.5.

Relevant parameters:
  [1] args::Tuple{NTuple{8192, UInt64}, CuDeviceVector{Float32, 1}} uses 64.031 KiB

So your helper could ingest the Type{Tuple{NTuple{8192, UInt64}, CuDeviceVector{Float32, 1}}} and break it down exactly as is done in the OP. That would make the utility fully generic.

@maleadt force-pushed the master branch 15 times, most recently from 5d585c4 to c850163 on December 20, 2024
@maleadt (Member) commented Jan 8, 2025

I think I'll close this, since (in its current form) I don't think the utility function added in this PR belongs in CUDA.jl, but could be something generic in a stand-alone package. Let me know if my understanding is wrong though, I'm not opposed to improving the current parameter OOM reporting in any way.

@maleadt closed this on Jan 8, 2025
@charleskawczynski (Contributor, author)

Sorry I'm late to this; I was on vacation for a while, and I've still been catching up since the end of the holidays.

One issue with doing this fully in the type space is that it's more difficult to map the given data structures to the types, which is why I wrote this using objects and not types.

I can move this to a separate package, but that would mean users need a try/catch around kernel launches to get the more verbose (and, IMO, more useful and granular) information on what is taking up parameter memory. Ideally, users could pass a verbose flag or something to kernel launches to get more granular memory reporting, but I don't know enough CUDA.jl internals to thread that information through.

@maleadt (Member) commented Jan 8, 2025

> One issue with doing this fully in the type space is that it's more difficult to map the given data structures to the types, hence why I wrote this using objects and not types.

Why is that? You should have all the information you need using the fieldnames query. In fact, your current use of getproperty makes this unreliable, since getproperty can be overloaded. Also, there's no guarantee that cudaconvert on a field matches what would have been sent to the compiler, as a higher-level cudaconvert could have resulted in a vastly different type. Really, the only way to exactly match the types being used is to re-use the compiler output.
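To illustrate the getproperty pitfall with a toy, hypothetical type: getproperty can expose "virtual" properties that are not fields, so walking properties need not reflect the struct's actual memory layout, while fieldnames does:

```julia
# Hypothetical example type; not from the PR.
struct Celsius
    degrees::Float64
end

# Overload getproperty to expose a "virtual" property `fahrenheit`
# that is computed on access and backed by no field at all.
Base.getproperty(c::Celsius, s::Symbol) =
    s === :fahrenheit ? getfield(c, :degrees) * 9 / 5 + 32 : getfield(c, s)

c = Celsius(100.0)
c.fahrenheit            # 212.0 -- looks like a field, but isn't one
fieldnames(Celsius)     # (:degrees,) -- the true memory layout
```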

@charleskawczynski (Contributor, author)

Hm, that's a good point, regarding overloading getproperty. Would you be more inclined to merge this if it was recursively defined using fieldnames instead?

@maleadt (Member) commented Jan 8, 2025

But why? Why do you want this to operate on the object level while all the information is available at the type level?

@maleadt (Member) commented Jan 8, 2025

> but that would mean that users will need a try-catch around kernel launches

They would need one now too, because you're requiring objects to be passed into this functionality, whereas the error message already prints the types needed for the analysis (i.e., avoiding such a try/catch).

@charleskawczynski (Contributor, author)

> But why? Why do you want this to operate on the object level while all the information is available at the type level?

The main issue I have with the current error messages is that there is virtually no granularity for large objects. In the original post, there is a single object with a massive type signature, and it's not clear to me which parts of the object we should strip out / tackle to get the biggest bang for our buck.

I believe that this is the same issue that @glwagner has experienced.

@maleadt (Member) commented Jan 8, 2025

> The main issue I have with the current error messages is that there is virtually no granularity for large objects. In the original post, there is a single object with a massive type signature, and it's not clear to me which parts of the object we should strip out / tackle to get the biggest bang for our buck.

Right, so that's where your functionality comes in: feed it the type printed by the error message and get insights into the fields. I'm happy to take improvements to the current error reporting, but printing the size of each field recursively quickly becomes too noisy to be a valid default. That's why I suggested putting this in a separate package: it shouldn't be CUDA-specific in any way (and could be useful to other back-ends' users that way). I'd rather not carry functionality here that isn't used; that's just asking for it to bitrot.

@charleskawczynski (Contributor, author)

That's fair, and I agree that functionality like this should probably not be the default. It'd be ideal if we could thread an option through, but making a separate package might be simpler. Thanks for the discussion, @maleadt.

@maleadt (Member) commented Jan 8, 2025

Alternatively, I'd be fine with adding some kind of hook for external consumers to investigate launch or compilation failures.

@charleskawczynski (Contributor, author)

Hm, I'd be fine with that. How would the hook work? Something like allowing users to set

import CUDA
CUDA.print_granular_parameter_memory() = true

and then check for this?
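As a rough sketch of how such a hook pattern can work in plain Julia (FakeLib and all names here are hypothetical, not CUDA.jl API):

```julia
# Hypothetical library-side code: a default hook that external
# consumers may override to inspect the failing kernel arguments.
module FakeLib
    # Default hook: no extra diagnostics.
    parameter_memory_hook(args...) = nothing

    function report_launch_failure(args...)
        # On a launch/compilation failure, call the hook so an external
        # consumer can analyze the (converted) arguments.
        parameter_memory_hook(args...)
        return "Kernel invocation uses too much parameter memory."
    end
end

# Hypothetical consumer-side override: capture the offending arguments.
const captured = Ref{Any}(nothing)
FakeLib.parameter_memory_hook(args...) = (captured[] = args; nothing)

FakeLib.report_launch_failure(1, 2.0)
captured[]   # (1, 2.0)
```

The captured arguments could then be fed to @rprint_parameter_memory or any other analysis tool without a try/catch at every launch site.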
