This issue was moved to a discussion.

Refcount Error in Graph Code? #166

Closed
apowers313 opened this issue Oct 13, 2024 · 9 comments
Labels
support All things related to the project that can't be categorized

Comments

@apowers313 commented Oct 13, 2024

I'm running into a problem creating a graph. A recreation of the problem is here; the quick summary is that the code does something like:

cuda_code = """ 
    extern "C" __global__ void simple(char *str) {
        printf("this is a test\\n");
        printf("ptr: %p\\n", str);
        printf("passed argument was: %s\\n", str);
    }
    """

def mkgraph():
    # initialize device
    # load PTX, get function
    # allocate memory
    # create memcpy node, copies UTF-8 encoded bytes to GPU
    # create kernel node
    # add dependency from kernel node to memcpy node
    # *** run graph first time ***
    # return graph
g = mkgraph()
# *** run graph second time ***

The first graph execution works. I can also instantiate and execute the graph multiple times before the function returns and it works fine and prints out the correct string.

The second graph execution reports the exact same memory address as the first instantiation. The "this is a test" message prints fine, the pointer address is unchanged, and then the final line is "passed argument was:" followed by garbage.

*** LAUNCHING GRAPH IN FUNCTION ***
this is a test
ptr: 0x7f4b19800000
passed argument was: hello from host
*** LAUNCHING GRAPH OUTSIDE FUNCTION ***
this is a test
ptr: 0x7f4b19800000
passed argument was: ?t?TK�

My best guess is that a refcount gets decremented when the function returns and the graph isn't holding its own reference to the memory, so it gets freed or something? Is this a bug in the CUDA Python code, or is there something I'm missing?
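The symptom is consistent with a dangling host pointer. A minimal ctypes sketch of the general principle (not the CUDA bindings themselves): a raw address obtained from a Python object is only valid while that object is alive.

```python
import ctypes

# Hypothetical stand-in for the host string passed to the memcpy node.
buf = ctypes.create_string_buffer(b"hello from host")
addr = ctypes.addressof(buf)   # the raw pointer a C API would capture

data = ctypes.string_at(addr)  # valid read: buf is still alive
print(data)                    # b'hello from host'

del buf
# addr now dangles. Reading it is undefined behavior -- exactly the kind
# of garbage output seen in the second graph launch above.
```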

@apowers313 (Author) commented Oct 13, 2024

As further evidence that this is a refcount bug, I've updated the example code to isolate the bug to the Buffer that is passed to the memcpy graph node as an argument. If you set do_bug to True no reference is kept and the bug shows up. If you set do_bug to False a top-level reference is kept in ref_keeper and the bug goes away.
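The do_bug flag behavior matches ordinary CPython lifetime rules: a local created inside a function is freed on return unless a longer-lived reference exists. A minimal sketch, where Buffer and ref_keeper are hypothetical stand-ins for the objects in the example code:

```python
import gc
import weakref

class Buffer:
    """Hypothetical stand-in for the host-side staging buffer."""
    def __init__(self, data):
        self.data = data

ref_keeper = []  # module-level list that can keep buffers alive

def mkgraph(keep_alive):
    buf = Buffer(b"hello from host")
    if keep_alive:
        ref_keeper.append(buf)  # extra reference outlives the function
    # a weakref lets us observe whether buf survives the return
    return weakref.ref(buf)

wr_dead = mkgraph(keep_alive=False)
gc.collect()
print(wr_dead() is None)   # True: the buffer was freed when mkgraph returned

wr_kept = mkgraph(keep_alive=True)
gc.collect()
print(wr_kept() is None)   # False: ref_keeper keeps the buffer alive
```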

@leofang leofang added the triage Needs the team's attention label Oct 15, 2024
@leofang (Member) commented Oct 15, 2024

@vzhurba01 could you take a look? My guess is this hits an implicit requirement of the Python bindings: Python objects such as str_arg_buffer must be kept alive together with the graph nodes like memcpy_node.

@vzhurba01 (Collaborator)

Yeah, this is one of our implicit requirements.
Since these bindings are 1-to-1 with CUDA's C API, they inherit the same rules as C. So in this case the host memory needs to stay alive for all invocations of the graph.
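One way to honor that rule on the Python side (a hypothetical pattern, not an API of these bindings) is to make whatever object owns the graph also own references to every host buffer its nodes touch, so the two lifetimes cannot diverge:

```python
class GraphHandle:
    """Hypothetical wrapper tying host buffers' lifetimes to the graph's."""
    def __init__(self, graph_exec, host_buffers):
        self._graph_exec = graph_exec
        # Holding these references guarantees the host memory read by the
        # graph's memcpy nodes stays valid for every future launch.
        self._host_buffers = list(host_buffers)

def mkgraph():
    str_arg_buffer = bytearray(b"hello from host")
    graph_exec = object()  # placeholder for the instantiated CUDA graph
    # ... build memcpy/kernel nodes that reference str_arg_buffer ...
    return GraphHandle(graph_exec, [str_arg_buffer])

g = mkgraph()  # the buffers now survive for as long as g does
```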

@leofang leofang added support All things related to the project that can't be categorized and removed triage Needs the team's attention labels Oct 16, 2024
@leofang (Member) commented Oct 16, 2024

Thanks, Vlad! I've created #175 to track the need to document our requirements. @apowers313 does this answer your question?

@leofang (Member) commented Oct 16, 2024

By the way, I forgot to say: we are building a Pythonic abstraction over the low-level bindings (#70), and CUDA graphs will be covered in a future beta release (#111). @apowers313 it'd be nice if you could also share your use cases with us so that we can better understand your needs, and @vzhurba01 we should probably take lifetime management into account in the API design 🙂

@apowers313 (Author)

Thanks for the quick reply; hopefully the documentation helps future users avoid this footgun.

I'm building a system that strings together multiple feature extractors that depend on each other's inputs and outputs, and it's far more efficient to build a single graph of feature extractors than to do separate memory transfers for each of them.

Along the way I realized that there's a simple architecture for automatically handling inputs, outputs, and kernel dependencies, so I built an FFI library that automatically builds graphs for kernel calls and hides a lot of details from users: https://github.com/atoms-org/cuda-ffi (still a work in progress, but ~80% complete)

@leofang (Member) commented Oct 17, 2024

Along the way I realized that there's a simple architecture for automatically handling inputs, outputs, and kernel dependencies, so I built an FFI library that automatically builds graphs for kernel calls and hides a lot of details from users: https://github.com/atoms-org/cuda-ffi (still a work in progress, but ~80% complete)

@apowers313 Thanks a lot for sharing — I wish we could have learned about your needs sooner to help you save some time 😅 (cc @aterrel for vis). I am pleased to share that an official solution for exactly this need is being built. It's called cuda.core, currently available as a beta release and installable by building from source from this repo; Python wheels and conda packages will be offered in the near future. See the example code here. We've been busy wrapping up the first beta release (tracked here).

As mentioned earlier, we'll cover CUDA graphs in a future release. If possible, please give it a try and let us know if you have any feedback or questions!

@leofang (Member) commented Oct 17, 2024

In the meantime, if you are OK with a field-tested, third-party solution (but with NVIDIA support) for executing your C++ kernels in Python, I would encourage you to try out CuPy's cupy.RawKernel/cupy.RawModule. It's a somewhat higher-level abstraction than what cuda.core aims to offer. CuPy also supports building a CUDA graph in Python through stream capture, though that feature is still experimental.

@leofang (Member) commented Oct 17, 2024

(Let me move this issue to the Discussion board.)

@NVIDIA NVIDIA locked and limited conversation to collaborators Oct 17, 2024
@leofang leofang converted this issue into discussion #181 Oct 17, 2024
