Add function for recursively printing parameter memory #2560
Conversation
I'm open to changing the format somehow, but I do think that this format is somewhat simple and explicit.
This feels like a very niche feature that I'm not sure is worth putting in CUDA.jl. We already report on the size of each argument; can't you keep the specific functionality to analyze within individual arguments in your package, or even a dedicated package for this purpose? I'd rather expose some way for you to perform that analysis on the actual kernel arguments (e.g., by saving them in the error that's thrown).
Whatever feature is implemented to help solve parameter space problems, a key criterion should be that it does not require launching a kernel to do the debugging. It's much more efficient to design kernel arguments by direct inspection rather than by trial-and-error kernel launching and digging through stacktraces, which is the main issue with the current workflow.
I fail to see what's more convenient about that. Can you elaborate on the workflow you want? I'm proposing here that you would be able to perform that analysis on the actual kernel arguments yourself.
The workflow is:
Executing on 2a doesn't really require any new features; we can simply do the arithmetic to figure out whether the kernels will succeed. For 2b, we may have to make changes. Here's an example conversation where we are trying to deduce how to solve a parameter space problem:
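(For reference, a rough sketch of that "do the arithmetic" check, assuming CUDA.jl's `cudaconvert` to obtain the device-side representation of each argument; the exact budget is slightly under 32 KiB and depends on the device and PTX version, as the error message quoted later in this thread shows:)

```julia
using CUDA

# Rough pre-launch arithmetic: sum the sizes of the device-side
# (converted) kernel arguments and compare against the parameter-space
# budget. The real limit is slightly below 32 KiB (e.g. 31.996 KiB on
# sm_89 / PTX v8.5, per the error message in this thread).
args = (ntuple(_ -> UInt64(1), 2^13), CUDA.rand(1))
devargs = map(cudaconvert, args)   # e.g. CuArray -> CuDeviceArray
total = sum(sizeof, devargs)
println(total, " bytes used vs. a budget of roughly ", 32 * 1024, " bytes")
```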
I see, so this is strictly a development utility that doesn't actually need any support in CUDA.jl? FWIW, you should be able to make this all type-based, by just inspecting the device-side types that CUDA.jl already reports.
With a recursive size printer that operates on the types?
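(For illustration, a type-level printer along those lines could be as small as the following sketch; `rsizeof` is a hypothetical name, not a CUDA.jl API, and it assumes concrete isbits types, as kernel parameters are:)

```julia
# Minimal sketch of a type-level recursive size printer: it only needs
# the converted argument types, not the argument objects themselves.
function rsizeof(::Type{T}, name = "arg", depth = 0) where {T}
    if !isconcretetype(T)
        println("  "^depth, name, "::", T, " (abstract; size unknown)")
        return
    end
    println("  "^depth, name, "::", T, " uses ", sizeof(T), " bytes")
    ftypes = fieldtypes(T)
    for (i, ftype) in enumerate(ftypes)
        if i > 4   # cap the fan-out so huge (N)Tuples stay readable
            println("  "^(depth + 1), "... and ", length(ftypes) - 4, " more fields")
            break
        end
        rsizeof(ftype, string(fieldname(T, i)), depth + 1)
    end
end
```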
Correct; I think the only point of putting it in CUDA.jl is to make it more visible and keep it up to date with CUDA development. It could easily find a home elsewhere. It doesn't even really need to be packaged, since I don't see much scope for further development; offering it in a package is merely trying to be friendly to other developers, I guess.
I'm not sure I understand, though @charleskawczynski might... isn't the point to see what the converted arguments actually look like?
Yes, but in the error message that's reported by CUDA.jl you already get to see the types of the converted arguments:

```julia
julia> @cuda Returns(nothing)((ntuple(_->UInt64(1), 2^13), CUDA.rand(1)))
ERROR: Kernel invocation uses too much parameter memory.
64.047 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.5.
Relevant parameters:
  [1] args::Tuple{NTuple{8192, UInt64}, CuDeviceVector{Float32, 1}} uses 64.031 KiB
```

So your helper could ingest the type printed there.
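(With the `rsizeof` sketch from above, one could feed it exactly the type printed by that error, without ever launching a kernel:)

```julia
using CUDA  # for the CuDeviceVector type in the printed signature

# Paste the type straight from the error message; no kernel launch needed.
rsizeof(Tuple{NTuple{8192, UInt64}, CuDeviceVector{Float32, 1}})
```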
I think I'll close this, since (in its current form) I don't think the utility function added in this PR belongs in CUDA.jl; it could be something generic in a stand-alone package. Let me know if my understanding is wrong, though. I'm not opposed to improving the current parameter OOM reporting in any way.
Sorry I'm late to this; I was on vacation for a while, and I've still been catching up since the end of the holidays. One issue with doing this fully in the type space is that it's more difficult to map the given data structures to the types, which is why I wrote this using objects and not types. I can move this to a separate package, but that would mean users will need a try/catch around kernel launches to get the more verbose and, IMO, more useful and granular information on what is taking up parameter memory. Ideally users could pass a flag to opt into this instead.
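(The try/catch workaround in question might look like the following sketch; `analyze_parameter_memory` is a placeholder standing in for this PR's analysis, not a real API:)

```julia
using CUDA

# Hypothetical object-level helper standing in for the analysis in this PR.
analyze_parameter_memory(args) =
    foreach(a -> println(typeof(a), " uses ", sizeof(a), " bytes"), args)

args = (ntuple(_ -> UInt64(1), 2^13), CUDA.rand(1))
try
    @cuda Returns(nothing)(args)            # the over-sized launch from above
catch err
    # inspect the device-side representations of the actual arguments
    analyze_parameter_memory(cudaconvert.(args))
    rethrow()
end
```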
Why is that? You should have all the information you need using the types alone.
Hm, that's a good point regarding overloading.
But why? Why do you want this to operate on the object level while all the information is available at the type level?
It would now, too, because you're requiring objects to be passed into this functionality, while the error message already prints the types needed for the analysis (i.e., avoiding such a try/catch).
The main issue I have with the error messages in their current state is that there is virtually no granularity for large objects. In the original post, there is a single object with a massive type signature, and it's not clear to me which parts of the object we should strip out or tackle to get the biggest bang for our buck. I believe this is the same issue that @glwagner has experienced.
Right, so that's where your functionality comes in: feed it the type printed by the error message, and get insights into the fields. I'm happy to take improvements to the current error reporting, but printing the size of each field recursively quickly becomes too noisy for this to be a valid default. Which is why I suggested putting this in a separate package, since it shouldn't be CUDA-specific in any way (and could be useful to other back-ends' users that way). I'd rather not carry functionality here that isn't used; that's just asking for it to bitrot anyway.
That's fair, and I agree that functionality like this should probably not be the default. It'd be ideal if we could thread an option through, but making a separate package might be simpler. Thanks for the discussion, @maleadt.
Alternatively, I'd be fine with adding some kind of hook for external consumers to investigate launch or compilation failures.
Hm, I'd be fine with that. How would the hook work? Something like allowing users to set

```julia
import CUDA
CUDA.print_granular_parameter_memory() = true
```

and then check for this?
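(A minimal, self-contained sketch of that opt-in pattern; the module stands in for CUDA.jl and all names are illustrative:)

```julia
# Sketch of the proposed hook: a zero-argument function with a `false`
# default that downstream users overwrite to enable the extra output.
module FakeCUDA
print_granular_parameter_memory() = false

function report_parameter_oom(argtypes)
    println("Kernel invocation uses too much parameter memory.")
    if print_granular_parameter_memory()
        println("granular breakdown of ", argtypes, " would go here")
    end
end
end

# Downstream opt-in: overwrite the hook's zero-argument method.
FakeCUDA.print_granular_parameter_memory() = true

FakeCUDA.report_parameter_oom((Int, Float32))
```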
As Clima has developed increasingly complex models and fused increasingly complex broadcast expressions, we've been running into parameter memory issues more frequently.
One issue I have with the existing printed message is that it does not provide granularity for large objects.
This PR implements a recursive print function/macro, `@rprint_parameter_memory(some_object)`, that users can use (and build options around) to print parameter memory usage with high granularity. For example, I've tentatively implemented this in MultiBroadcastFusion.

I'm cc-ing some people who may also be interested in this: @glwagner @simonbyrne @simone-silvestri
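(For reference, a minimal sketch of what such an object-level recursive printer could look like; this is illustrative only, not the PR's actual implementation, and it assumes isbits-style objects like kernel parameters, since `sizeof` on fields such as Strings would error:)

```julia
# Illustrative object-level recursive size printer (not the PR's code):
# walks an object's fields and prints the size each one contributes.
function rprint_parameter_memory(obj, name = "obj", depth = 0)
    println("  "^depth, name, "::", typeof(obj), " uses ", sizeof(obj), " bytes")
    for fname in fieldnames(typeof(obj))
        rprint_parameter_memory(getfield(obj, fname), string(fname), depth + 1)
    end
end

# Example: break down a nested object the way kernel arguments nest.
rprint_parameter_memory((a = ntuple(_ -> UInt64(1), 4), b = 1.0f0))
```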