Getting HG_FAULT when performing RDMA on data living in CUDA memory #664
Comments
All 3 use cases should be supported as far as I know. The error indicates that there is a memory registration issue, not a transfer issue. Can you try again with the MR cache turned off?
Thank you for your answer! Unfortunately, disabling the MR cache with
I didn't spot any major difference with
I don't really understand what the device ID is referring to. There are 8 GPUs on the DGX-1 cluster I'm using, ranks are [0-7], I guess
I ran my reproducer on another machine where it works (Theta). I'm attaching the logs below.
The first line that is different from the
The mystery remains...
Closing for now, please re-open the libfabric issue if needed.
Hello :)
Describe the bug
I'm trying to use RDMA to transfer a remote CPU variable to a local variable living in CUDA memory. First of all, is that use case supported? More generally, are the following scenarios supported:
If the latter scenario is not supported, then this issue is irrelevant.
This is the error I'm getting:
I initialized Mercury with device memory support, and MOFED is installed on the machines I'm using. I've tested on a DGX-1 cluster (part of the Grid'5000 testbed) and on a node on Cooley: both experiments yield the same error.
To Reproduce
This example uses the Thallium API. I can try to rewrite it if needed.
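A minimal host-only sketch of the data flow the reproducer exercises, not the Thallium code itself. It assumes `std::copy` as a stand-in for both the RDMA pull and the device-to-host copy. The helper names (`make_remote_array`, `pull_into_zeroed_buffer`, `copy_to_host`) are hypothetical, and no Mercury, Thallium, or CUDA call is involved, so this only illustrates the expected result and cannot trigger the HG_FAULT.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: builds the remote CPU array {0, 1, ..., n-1}.
std::vector<int> make_remote_array(std::size_t n) {
    std::vector<int> remote(n);
    std::iota(remote.begin(), remote.end(), 0);
    return remote;
}

// Hypothetical stand-in for the RDMA pull. In the real program the
// destination is a zero-initialized buffer in CUDA memory; here a plain
// host vector models it and std::copy models the transfer.
std::vector<int> pull_into_zeroed_buffer(const std::vector<int>& remote) {
    std::vector<int> devArray(remote.size(), 0);
    std::copy(remote.begin(), remote.end(), devArray.begin());
    return devArray;
}

// Hypothetical stand-in for the final device-to-host copy that is done
// only so the result can be printed.
std::vector<int> copy_to_host(const std::vector<int>& devArray) {
    return std::vector<int>(devArray.begin(), devArray.end());
}
```

Chaining the three helpers with a size of 10 gives the array the real program is expected to print once the transfer succeeds.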
The remote variable is an array of increasing integers stored on the CPU. The local variable is an array of the same size containing zeros and stored in CUDA memory. At the end of the program, I expect devArray to contain {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. Finally, I move devArray to the CPU in order to print it (the hostArray variable). The program doesn't reach that line though: it throws the HG_FAULT before that.
Platform (please complete the following information):
Additional context
Here are some additional logs, collected with FI_LOG_LEVEL=debug HG_LOG_LEVEL=debug HG_SUBSYS_LOG=na:
DGX-1 cluster: gemini.txt
Cooley: cooley.txt
Please note that I also noticed these lines on Cooley:
Thank you!