This issue was moved to a discussion.

Refcount Error in Graph Code? #166

Closed
apowers313 opened this issue Oct 13, 2024 · 9 comments
Labels
support All things related to the project that can't be categorized

Comments

@apowers313 commented Oct 13, 2024

I'm running into a problem creating a graph. A recreation of the problem is here; the quick summary is that the code does something like:

cuda_code = """ 
    extern "C" __global__ void simple(char *str) {
        printf("this is a test\\n");
        printf("ptr: %p\\n", str);
        printf("passed argument was: %s\\n", str);
    }
    """

def mkgraph():
    # initialize device
    # load PTX, get function
    # allocate memory
    # create memcpy node, copies UTF-8 encoded bytes to GPU
    # create kernel node
    # add dependency from kernel node to memcpy node
    # *** run graph first time ***
    # return graph
g = mkgraph()
# *** run graph second time ***

The first graph execution works. I can also instantiate and execute the graph multiple times before the function returns and it works fine and prints out the correct string.

The second graph execution reports the exact same memory address as the first instantiation. The "this is a test" message prints fine, the pointer address is unchanged, and then the final line is "passed argument was:" followed by garbage.

*** LAUNCHING GRAPH IN FUNCTION ***
this is a test
ptr: 0x7f4b19800000
passed argument was: hello from host
*** LAUNCHING GRAPH OUTSIDE FUNCTION ***
this is a test
ptr: 0x7f4b19800000
passed argument was: ?t?TK�

My best guess is that a refcount gets decremented when the function returns and the graph isn't holding its own reference to the memory, so it gets freed or something? Is this a bug in the CUDA Python code, or is there something I'm missing?
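The symptom is consistent with a dangling host pointer. A minimal ctypes sketch of the general principle (not the CUDA bindings themselves): a raw address obtained from a Python object is only valid while that object is alive.

```python
import ctypes

# Hypothetical stand-in for the host string passed to the memcpy node.
buf = ctypes.create_string_buffer(b"hello from host")
addr = ctypes.addressof(buf)   # the raw pointer a C API would capture

data = ctypes.string_at(addr)  # valid read: buf is still alive
print(data)                    # b'hello from host'

del buf
# addr now dangles. Reading it is undefined behavior -- exactly the kind
# of garbage output seen in the second graph launch above.
```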

@apowers313 (Author) commented Oct 13, 2024

As further evidence that this is a refcount bug, I've updated the example code to isolate the bug to the Buffer that is passed to the memcpy graph node as an argument. If you set do_bug to True no reference is kept and the bug shows up. If you set do_bug to False a top-level reference is kept in ref_keeper and the bug goes away.
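The do_bug flag behavior matches ordinary CPython lifetime rules: a local created inside a function is freed on return unless a longer-lived reference exists. A minimal sketch, where Buffer and ref_keeper are hypothetical stand-ins for the objects in the example code:

```python
import gc
import weakref

class Buffer:
    """Hypothetical stand-in for the host-side staging buffer."""
    def __init__(self, data):
        self.data = data

ref_keeper = []  # module-level list that can keep buffers alive

def mkgraph(keep_alive):
    buf = Buffer(b"hello from host")
    if keep_alive:
        ref_keeper.append(buf)  # extra reference outlives the function
    # a weakref lets us observe whether buf survives the return
    return weakref.ref(buf)

wr_dead = mkgraph(keep_alive=False)
gc.collect()
print(wr_dead() is None)   # True: the buffer was freed when mkgraph returned

wr_kept = mkgraph(keep_alive=True)
gc.collect()
print(wr_kept() is None)   # False: ref_keeper keeps the buffer alive
```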

@leofang leofang added the triage Needs the team's attention label Oct 15, 2024
@leofang (Member) commented Oct 15, 2024

@vzhurba01 could you take a look? My guess is this hits an implicit requirement of the Python bindings: Python objects such as str_arg_buffer must be kept alive together with the graph nodes like memcpy_node.

@vzhurba01 (Collaborator)

Yeah, this is one of our implicit requirements.
Since these bindings are 1-to-1 with CUDA's C API, they inherit the same rules as C. So in this case the host memory needs to stay alive for all invocations of the graph.
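One way to honor that rule on the Python side (a hypothetical pattern, not an API of these bindings) is to make whatever object owns the graph also own references to every host buffer its nodes touch, so the two lifetimes cannot diverge:

```python
class GraphHandle:
    """Hypothetical wrapper tying host buffers' lifetimes to the graph's."""
    def __init__(self, graph_exec, host_buffers):
        self._graph_exec = graph_exec
        # Holding these references guarantees the host memory read by the
        # graph's memcpy nodes stays valid for every future launch.
        self._host_buffers = list(host_buffers)

def mkgraph():
    str_arg_buffer = bytearray(b"hello from host")
    graph_exec = object()  # placeholder for the instantiated CUDA graph
    # ... build memcpy/kernel nodes that reference str_arg_buffer ...
    return GraphHandle(graph_exec, [str_arg_buffer])

g = mkgraph()  # the buffers now survive for as long as g does
```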

@leofang leofang added support All things related to the project that can't be categorized and removed triage Needs the team's attention labels Oct 16, 2024
@leofang (Member) commented Oct 16, 2024

Thanks, Vlad! I've created #175 to track the need to document our requirements. @apowers313 does this answer your question?

@leofang (Member) commented Oct 16, 2024

By the way, I forgot to say: we are building a Pythonic abstraction over the low-level bindings (#70), and CUDA graphs will be covered in a future beta release (#111). @apowers313 it'd be nice if you could also share your use cases with us so that we can better understand your needs, and @vzhurba01 we should probably take lifetime management into account in the API design 🙂

@apowers313 (Author)

Thanks for the quick reply; hopefully the documentation helps future users avoid this footgun.

I'm building a system that strings together multiple feature extractors that depend on each other's inputs and outputs, and it's far more efficient to build a single graph of feature extractors than to do separate memory transfers for each of them.

Along the way I realized that there's a simple architecture for automatically handling inputs, outputs, and kernel dependencies, so I built an FFI library that automatically builds graphs for kernel calls and hides a lot of details from users: https://github.com/atoms-org/cuda-ffi (still a work in progress, but ~80% complete)

@leofang (Member) commented Oct 17, 2024

Along the way I realized that there's a simple architecture for automatically handling inputs, outputs, and kernel dependencies, so I built an FFI library that automatically builds graphs for kernel calls and hides a lot of details from users: https://github.com/atoms-org/cuda-ffi (still a work in progress, but ~80% complete)

@apowers313 Thanks a lot for sharing — I wish we could have learned about your needs sooner to help you save some time 😅 (cc @aterrel for vis). I am pleased to share that an official solution for exactly this need is being built. It's called cuda.core, currently available as a beta release and installable by building from source from this repo; Python wheels and conda packages will be offered in the near future. See the example code here. We've been busy wrapping up the first beta release (tracked here).

As mentioned earlier, we'll cover CUDA graphs in a future release. If possible, please give it a try and let us know if you have any feedback or questions!

@leofang (Member) commented Oct 17, 2024

In the meantime, if you are OK with a field-tested, third-party solution (but with NVIDIA support) for executing your C++ kernels in Python, I would encourage you to try out CuPy's cupy.RawKernel/cupy.RawModule. It's a somewhat higher-level abstraction than what cuda.core aims to offer. CuPy also supports building a CUDA graph in Python through stream capture, though that feature is still experimental.

@leofang (Member) commented Oct 17, 2024

(Let me move this issue to the Discussion board.)

@NVIDIA NVIDIA locked and limited conversation to collaborators Oct 17, 2024
@leofang leofang converted this issue into discussion #181 Oct 17, 2024
