We recently added the `margo-info` utility to help diagnose network transport support in Margo builds. It would be helpful to have a similar utility that can also diagnose the ability to access device (i.e., accelerator) memory.
CUDA support would be the first target, to confirm whether CUDA memory access works with various provider/build combinations. This can be validated without performing any network communication, so a single-process utility would be sufficient. It just needs to attempt to create a bulk handle for a CUDA memory region.
TBD whether we can make a utility like this generic enough to be useful. One challenge is that you cannot allocate or reference CUDA memory without making CUDA calls, which means that this hypothetical utility would have a CUDA dependency, probably on both the runtime library and the compiler.
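One way to keep such a utility generic might be to hide each memory type behind a small backend table, so that the CUDA-specific calls live in a single optional translation unit. The following is only a rough sketch of that idea, not an existing Margo interface: `mem_backend`, `probe_backend`, `report_backend`, and the `HAVE_CUDA` macro are all hypothetical names, and the host "registration" function is a stand-in. A real backend would presumably allocate with `cudaMalloc()` and attempt `margo_bulk_create()` (or a memory-type-aware variant) in its place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Outcome of probing one memory type. */
typedef enum {
    PROBE_OK,
    PROBE_ALLOC_FAILED,
    PROBE_REGISTER_FAILED,
    PROBE_NOT_BUILT
} probe_status;

/* Hypothetical backend interface: one entry per memory type.
 * All three pointers are NULL when support was not compiled in. */
typedef struct {
    const char* name;
    void* (*alloc)(size_t size);
    void (*release)(void* ptr);
    int (*try_register)(void* ptr, size_t size); /* 0 on success */
} mem_backend;

/* Host backend, used here only to exercise the probe logic. */
static void* host_alloc(size_t size) { return malloc(size); }
static void host_release(void* ptr) { free(ptr); }

/* Stand-in for registration (assumption): a real tool would try to
 * create a bulk handle for the region here and then free it. */
static int host_register(void* ptr, size_t size)
{
    return (ptr != NULL && size > 0) ? 0 : -1;
}

const mem_backend host_backend
    = {"host", host_alloc, host_release, host_register};

#ifdef HAVE_CUDA
/* Assumed to be defined in a separate file built against CUDA. */
extern const mem_backend cuda_backend;
#else
/* Placeholder so the tool can still report "not built". */
const mem_backend cuda_backend = {"cuda", NULL, NULL, NULL};
#endif

/* Allocate a region, attempt registration, and release the region. */
probe_status probe_backend(const mem_backend* b, size_t size)
{
    assert(b != NULL);
    if (!b->alloc || !b->release || !b->try_register)
        return PROBE_NOT_BUILT;
    void* ptr = b->alloc(size);
    if (!ptr) return PROBE_ALLOC_FAILED;
    int rc = b->try_register(ptr, size);
    b->release(ptr);
    return rc == 0 ? PROBE_OK : PROBE_REGISTER_FAILED;
}

const char* probe_status_str(probe_status s)
{
    switch (s) {
    case PROBE_OK: return "ok";
    case PROBE_ALLOC_FAILED: return "allocation failed";
    case PROBE_REGISTER_FAILED: return "registration failed";
    case PROBE_NOT_BUILT: return "not built";
    }
    return "unknown";
}

/* Print a one-line report for a backend. */
void report_backend(const mem_backend* b, size_t size)
{
    printf("%-6s %s\n", b->name, probe_status_str(probe_backend(b, size)));
}
```

A driver would call `report_backend()` once per backend. With this layout, a build without CUDA still produces a line for the CUDA backend, which at least distinguishes "support not compiled in" from "registration failed at runtime".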
See ofiwg/libfabric#8444 for a possible failure mode to look for; Verbs+CUDA doesn't work unless you are using the MOFED software stack, but there is no clear error message indicating this.
See mochi-hpc/mochi-thallium#7 for an example of the kind of scenario that we would like to diagnose.