zfp
binary CUDA support broken?
#178
Comments
I'm surprised that
If you drop
which matches what you get with the serial backend:
I'm assuming that you have built with Let me also add that it may be worthwhile running The next zfp release will focus on improving CUDA and HIP support, including new/missing capabilities, performance improvements, as well as bug fixes.
Thank you for the information on running the tests. They betrayed the root cause of my observed failures: This is usually not a problem due to CUDA's binary compatibility guarantees. However, this requires that the CUDA kernels be compiled to binary form for each extant compute capability, plus a PTX for future capabilities. Perhaps you can address this (along with improved error handling and propagation) as part of your planned GPU improvements.
I agree that improved error reporting is a high priority. We've thought about something akin to Other than that, I'm not sure what zfp could/should do about issues external to the library itself, such as CUDA incompatibilities. Are you suggesting embedding information on what CUDA version zfp was built against and then checking for compatibility? That could be quite onerous in practice given dependencies on OpenMP, CUDA, HIP, SYCL, Cython, NumPy, etc., many of which are evolving faster than zfp itself. But I'm open to suggestions. One thing that might help that we have partially implemented on a feature branch is GPU warmup, where a simple kernel is launched when the execution policy is set. If this fails, so does the
The big thing on zfp's side in my book is to build multiple binary versions plus a PTX version of each CUDA kernel as described on that NVIDIA page I linked. That means that distributed zfp packages will be compatible with many graphics cards and driver versions. I'm not sure how to do that in your build system, but it's how the big CUDA-dependent packages (e.g. ML frameworks) are set up, and it works well. There shouldn't be any need to do runtime detection or selection; the driver and CUDA runtime will handle that if you make appropriate versions available.
I must admit to not being familiar with the issues, and it would take some time to go through the NVIDIA documentation to figure out what needs to be done. Do you know of any projects where this is handled that we could use as a template?
I did some research and very recent versions of CMake know how to handle this automatically: https://cmake.org/cmake/help/v3.24/prop_tgt/CUDA_ARCHITECTURES.html#prop_tgt:CUDA_ARCHITECTURES . I understand requiring these recent versions is not an attractive proposition but it is a reference for the proper logic in CMake which is license-compatible. I would encourage defining For now, it is sufficient to use something like The defaults for these flags and the allowed and suggested combinations depend on the CUDA runtime built against and what capabilities your code needs. CMake's new options know how to handle these factors best. Since you support a wide variety of CUDA versions, I'm not sure it would be wise to explicitly list these flags. Maybe it would be good for now to simply reference some information in the documentation to allow people to select for best performance in their particular environment. I imagine an HPC administrator for instance would want to select all the architectures for hardware known to be installed in their machines to get maximum performance, instead of letting the CUDA compiler pick arbitrarily and constantly paying JIT costs and potentially being unable to take full advantage of their system. Once you are comfortable mandating CMake 3.23, then the default experience can be nicer as a good set of architectures will be selected by default and the interested administrator can easily list compute capabilities instead of having to cross-reference the compiler documentation and manage supported flags directly. |
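To make the "multiple binaries plus a PTX" idea concrete, here is a minimal sketch in Python that expands CMake-style architecture entries into nvcc `-gencode` flags. The helper name `gencode_flags` and the architecture list are hypothetical and purely illustrative; the exact flags suggested in the comment above are not preserved in this thread, and a sensible list depends on the CUDA toolkit version in use. The sketch assumes the CMake convention that `NN-real` requests a cubin, `NN-virtual` requests PTX, and a bare `NN` requests both.

```python
def gencode_flags(architectures):
    """Expand entries like "70", "80-real", "90-virtual" into nvcc -gencode flags.

    Hypothetical helper for illustration only; not part of zfp or CMake.
    """
    flags = []
    for entry in architectures:
        arch, _, kind = entry.partition("-")
        if not arch.isdigit() or kind not in ("", "real", "virtual"):
            raise ValueError(f"unrecognized architecture entry: {entry!r}")
        if kind in ("", "real"):
            # Real architecture: embed a cubin for this compute capability.
            flags.append(f"-gencode=arch=compute_{arch},code=sm_{arch}")
        if kind in ("", "virtual"):
            # Virtual architecture: embed PTX that the driver can JIT-compile.
            flags.append(f"-gencode=arch=compute_{arch},code=compute_{arch}")
    return flags


# Illustrative list: cubins for 7.0 and 8.0, plus PTX for 8.0 as a fallback.
example = gencode_flags(["70-real", "80"])
print(" ".join(example))
```

Listing only the newest architecture as virtual keeps the binary smaller while still leaving a PTX fallback for GPUs released after the build.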
Thanks for sharing. We will be taking a closer look over the coming weeks. I don't think we're ready to mandate CMake 3.23 yet, but we could perhaps support this via CMake conditionals. Just to be sure I understand the use case: compiling for different CUDA architectures makes sense for distributing binaries (e.g., RPMs, Spack packages) and when building and installing zfp on file systems shared by multiple architectures. If you're building zfp from source for a particular architecture, then you'd end up paying a penalty by inflating binaries and startup time for no benefit, so the default ought to be to build only for the current architecture. When does the JIT compilation occur? At load time or when the first CUDA kernel is launched? It all gets baked into the same binary, right? I imagine something similar would be needed for HIP and SYCL.
NVIDIA's documentation here answers these questions. Note that what they call "cubin" the CMake docs call "real" and what they call "PTX" CMake calls "virtual". They say all kernel versions are baked into the same binary, and object loading/possible JIT compilation occurs when a particular CUDA kernel is first launched. Not sure how different kernels are treated. Never used HIP or SYCL so I can't provide advice there, sorry. I presume these would always have to JIT. What seems to be the case but they don't say there (and what was the root cause of this bug) is that PTX versions are specific to the CUDA library version. If the PTX compiled by a newer CUDA library version needs to be loaded by a driver shipped with an older library version, the JIT may fail (though the other way around will always work). CMake's But I do want to say that depending on host details like that upsets reproducibility and makes me unhappy as a packager. I understand the motivations for making I did some testing and compared to using whatever the CUDA compiler defines as the default architecture, passing the 9 architecture flags I linked earlier inflates compilation time by 2.5x (21s to 54s) and the size of |
I think that the fixes in this patch will also address this #232 |
I can't figure out how to test CUDA support on my system.
I generated a simple test file:
python3 -c 'import struct; import math; open("test.bin", "wb").write(struct.pack("<256f", *list(math.sin(v/128*6.28)*1000 for v in range(256))))'
Then I tried to compress it:
zfp -x cuda -i test.bin -z testz.bin -f -1 256 -r 10 -s -h
Which reports the following info:
type=float nx=256 ny=1 nz=1 nw=1 raw=1024 zfp=16 ratio=64 rate=0.5 rmse=707.3 nrmse=0.3536 maxe=1000 psnr=3.01
Sometimes it reports nan values or random large floats, despite being given the same input data, but the statistics suggest it's not actually compressing and just storing zeros (related to #105?). If I remove the -h flag then it just tells me "compression failed". Everything works fine with the other execution modes (serial and omp).
I'm trying to build and package the 1.0.0 release, which doesn't come with tests, so I'm not sure if CUDA is broken in general or this is specific to the command line tool. I am building against CUDA 11.6.
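The suspicion that the output is all zeros can be checked against the reported statistics. The following is a minimal sketch in Python that regenerates the same signal and compares it with an all-zero reconstruction. It assumes nrmse is rmse divided by the value range and psnr is 20·log10(range / (2·rmse)); these definitions are an assumption, though they are consistent with the numbers printed above. The function name `error_stats` is hypothetical.

```python
import math

# Same signal as the test file: 256 floats, sine wave with amplitude 1000.
values = [math.sin(v / 128 * 6.28) * 1000 for v in range(256)]


def error_stats(original, decoded):
    """Error metrics between two sequences (assumed definitions, see lead-in)."""
    errors = [a - b for a, b in zip(original, decoded)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    value_range = max(original) - min(original)
    return {
        "rmse": rmse,
        "nrmse": rmse / value_range,
        "maxe": max(abs(e) for e in errors),
        "psnr": 20 * math.log10(value_range / (2 * rmse)),
    }


# Pretend the decompressor produced nothing but zeros.
stats = error_stats(values, [0.0] * len(values))
for name, value in stats.items():
    print(f"{name}={value:.4g}")

# Nominal payload at 10 bits per value, ignoring any header or padding.
nominal_bytes = len(values) * 10 // 8
print(f"nominal payload at rate 10: {nominal_bytes} bytes")
```

If these values line up with the rmse, nrmse, maxe, and psnr in the tool's output, that supports the reading that the CUDA path returned no real data. The reported zfp=16 is also far below the nominal payload size computed at the end.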