Getting HG_FAULT when performing RDMA on data living in CUDA memory #664

Closed
thomas-bouvier opened this issue Mar 10, 2023 · 3 comments

@thomas-bouvier commented Mar 10, 2023

Hello :)

Describe the bug

I'm trying to use RDMA to transfer a remote CPU variable to a local variable living in CUDA memory. First of all, is that use case supported? More generally, are the following scenarios supported:

  • Remote CUDA variable to local CUDA variable (I guess so)
  • Remote CUDA variable to local CPU variable
  • Remote CPU variable to local CUDA variable (my current use-case)

If the latter scenario is not supported, then this issue is irrelevant.

This is the error I'm getting:

[1,0]<stderr>:libfabric:57702:1678406492::verbs:mr:ofi_mr_cache_search():334<debug> search 0x40fe8e00000 (len: 602112)
[1,0]<stderr>:libfabric:57702:1678406492::verbs:mr:util_mr_cache_create():266<debug> create 0x40fe8e00000 (len: 602112)
[1,0]<stderr>:libfabric:57702:1678406492::verbs:mr:util_mr_free_entry():107<debug> free 0x40fe8e00000 (len: 602112)
[1,0]<stderr>:libfabric:57702:1678406492::ofi_rxm:domain:rxm_mr_regattr():445<warn> Unable to register MSG MR
[1,0]<stderr>:Function returned HG_FAULT
[1,0]<stderr>:terminate called after throwing an instance of 'thallium::margo_exception'
[1,0]<stderr>:  what():  [/mnt/spack/linux-debian11-broadwell/gcc-10.2.1/mochi-thallium-main-wppyqz2fdrp24omimgaom4fiau7fotop/include/thallium/engine.hpp:1132][margo_bulk_create] Function returned HG_FAULT

I initialized Mercury with device memory support, and MOFED is installed on the machines I'm using. I've tested on a DGX-1 cluster (part of the Grid'5000 testbed) and on a Cooley node: both experiments yield the same error.

To Reproduce

This example uses the Thallium API. I can try to rewrite it if needed.

The remote variable is an array of increasing integers stored on the CPU. The local variable is an array of the same size containing zeros and stored in CUDA memory. At the end of the program, I expect devArray to contain {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

At the end, I copy devArray back to the CPU to check its contents (the hostArray variable). The program never reaches that line though: the HG_FAULT is thrown before that.

#include <cstring>    // memset
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

#include <cuda_runtime.h>
#include <thallium.hpp>

namespace tl = thallium;

int main(int argc, char** argv) {
    // Request device (GPU) memory support when initializing Mercury/NA.
    struct hg_init_info hii;
    memset(&hii, 0, sizeof(hii));
    hii.na_init_info.request_mem_device = true;
    tl::engine myEngine("verbs", THALLIUM_SERVER_MODE, true, 1, &hii);

    // RPC handler: expose a CPU-resident source array and push it over RDMA
    // to the caller's bulk handle (which describes CUDA memory).
    std::function<void(const tl::request&, tl::bulk&)> f =
        [&myEngine](const tl::request& req, tl::bulk& b) {
            int myArray[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
            std::vector<std::pair<void*, std::size_t>> segments;
            segments.emplace_back(&myArray[0], 10 * sizeof(int));

            tl::bulk bulk = myEngine.expose(segments, tl::bulk_mode::read_only);
            bulk >> b.on(req.get_endpoint()); // push: CPU source -> remote (CUDA) destination

            req.respond();
            myEngine.finalize();
        };
    myEngine.define("do_rdma", f);

    // Single-process setup: the engine calls the RPC on itself.
    tl::remote_procedure remote_do_rdma = myEngine.define("do_rdma");
    tl::endpoint server_endpoint = myEngine.lookup(myEngine.self());

    // Local destination: a zero-filled array living in CUDA device memory.
    int myArray[10] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    int* devArray;
    cudaMalloc((void**) &devArray, 10 * sizeof(int));
    cudaMemcpy(devArray, myArray, 10 * sizeof(int), cudaMemcpyHostToDevice);
    std::vector<std::pair<void*, std::size_t>> segments;
    segments.emplace_back(&devArray[0], 10 * sizeof(int));

    // Declare that this bulk region lives in CUDA memory on device 0.
    struct hg_bulk_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.mem_type = (hg_mem_type_t) HG_MEM_TYPE_CUDA;
    attr.device = 0;

    tl::bulk local_bulk = myEngine.expose(segments, tl::bulk_mode::write_only, attr);
    remote_do_rdma.on(server_endpoint)(local_bulk);
    remote_do_rdma.on(server_endpoint)(local_bulk);

    // Copy the array back to the host and check that it was filled by the RDMA.
    int hostArray[10];
    cudaMemcpy(hostArray, devArray, 10 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; ++i) {
        if (i != hostArray[i])
            std::cout << "Not working!" << std::endl;
    }
    std::cout << "done" << std::endl;

    cudaFree(devArray);
    return 0;
}

Platform (please complete the following information):

Input spec
--------------------------------
mochi-thallium@main
   ^argobots
   ^libfabric+cuda fabrics=rxm,tcp,verbs
   ^mercury~boostsys~checksum+ofi

Additional context

Here are some additional logs collected with FI_LOG_LEVEL=debug HG_LOG_LEVEL=debug HG_SUBSYS_LOG=na:

DGX-1 cluster: gemini.txt
Cooley: cooley.txt

Please note that I also noticed these lines on Cooley:

libfabric:12942:1678470176:ofi_rxm:verbs:core:vrb_check_hints():241<info> Supported: FI_MSG, FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_LOCAL_COMM, FI_REMOTE_COMM
libfabric:12942:1678470176:ofi_rxm:verbs:core:vrb_check_hints():241<info> Requested: FI_MSG, FI_RMA, FI_READ, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_HMEM

Thank you!

@soumagne (Member)

All 3 use cases should be supported as far as I know. The error indicates a memory registration issue, not a transfer issue. Can you try again with the MR cache turned off (FI_MR_CACHE_MONITOR=disabled), or with the CUDA cache monitor turned on (FI_MR_CUDA_CACHE_MONITOR_ENABLED=1)? Also, have you verified that your CUDA device ID is 0?
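
(For that last point, here is a minimal sketch of how the device ID can be checked with the standard CUDA runtime API; it mirrors the allocation in the reproducer above and is an illustration added for clarity, not part of the original exchange.)

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Allocate a buffer the same way the reproducer does.
    int* devArray = nullptr;
    cudaMalloc((void**) &devArray, 10 * sizeof(int));

    // Ask the CUDA runtime which device owns this pointer; attr.device
    // in the reproducer should match this value.
    cudaPointerAttributes pattr{};
    if (cudaPointerGetAttributes(&pattr, devArray) == cudaSuccess)
        std::printf("devArray lives on CUDA device %d\n", pattr.device);

    // Also report the device that is current for this thread.
    int current = -1;
    cudaGetDevice(&current);
    std::printf("current CUDA device: %d\n", current);

    cudaFree(devArray);
    return 0;
}

Note that on a multi-GPU node such as a DGX-1, CUDA_VISIBLE_DEVICES also affects which physical GPU device 0 maps to.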

@thomas-bouvier (Author)

Thank you for your answer!

Unfortunately, disabling the MR cache with FI_MR_CACHE_MAX_COUNT=0 didn't change anything. The issue seems to be caused by something else.
gemini_mr_disabled.txt

I didn't spot any major difference with FI_MR_CUDA_CACHE_MONITOR_ENABLED=1 either.
gemini_cuda_monitor_enabled.txt

I don't really understand what the device ID refers to. There are 8 GPUs on the DGX-1 cluster I'm using, numbered [0-7], so I guess 0 should work?

I ran my reproducer on another machine where it works (Theta); I'm attaching the logs below. The first line that differs from the gemini logs above is L2043, where cuda_mm_subscribe() is called. This results in the following: Assigned CUDA buffer ID 26632 to buffer 0x7ff144400000. The corresponding line in the gemini logs (my non-working setup) is L721, and I don't see any buffer ID being assigned anywhere from that point on.
theta.txt

The mystery remains...

@soumagne (Member) commented May 3, 2024

Closing for now; please re-open the libfabric issue if needed.

soumagne closed this as completed on May 3, 2024