Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Building using versions of cuda-python with the new layout breaks runtime compatibility with older versions #274

Open
vyasr opened this issue Dec 6, 2024 · 4 comments
Assignees
Labels
bug Something isn't working cuda.bindings Everything related to the cuda.bindings module triage Needs the team's attention

Comments

@vyasr
Copy link

vyasr commented Dec 6, 2024

RAPIDS started rolling out new pinnings to use cuda-python 11.8.5/12.6.2 at build-time. Once we did so, we started observing some cascading issues like this one. Specifically, once rmm was built using the new-layout cuda-python versions, we started seeing the following error in downstream repositories that were still pinned to older versions of cuda-python due to #215 and #226:

  File "device_buffer.pyx", line 1, in init rmm.pylibrmm.device_buffer
ModuleNotFoundError: No module named 'cuda.bindings'

The last frame in the traceback always points to the initialization of the rmm.pylibrmm.device_buffer module. This indicates to me that parts of cuda-python that are cimported into this module are embedding the cuda.bindings namespace into the module initialization in a way that is likely defeating the trampoline modules that were added to cuda-python for backwards compatibility, thus making the rmm modules compiled against new-layout cuda-python incompatible with runtime usage of old-layout cuda-python.

My guess is that some of the same issues that are causing us to have to manually do the __pyx_capi__ definition are causing this. Cython simply isn't designed for mismatching layouts in this way, and my guess is that some of the objects that it defines internally in the cuda-bindings module are being copied directly over to the legacy cuda modules even though that isn't what was intended, resulting in those modules having internal objects that still specify the new-layout module names and then break consumers in the mixed build/runtime version case.

I'm not sure this will be fixable without further interactions with Cython internals like the __pyx_capi__ change. It may not be worthwhile, and at this point it might make sense for cuda-python to simply advocate a clean break.

@leofang leofang added triage Needs the team's attention bug Something isn't working cuda.bindings Everything related to the cuda.bindings module labels Dec 7, 2024
@leofang
Copy link
Member

leofang commented Dec 13, 2024

Spoke to Vyas while I was on site. We did not have a clear vision whether this is even fixable or just something we have to document/keep track, and be sure to not introduce another ABI break in a minor release. (So far, we've had major breaking changes twice: 11.7.1 & 12.6.2). Now that our Cython expert @shwina is back, I'd like to have one final confirmation before we declare this as a "won't fix"...

@vyasr
Copy link
Author

vyasr commented Dec 17, 2024

At a high level my suspicion is that we are explicitly doing things that Cython is not designed to support by creating trampolines that do not reflect the original layout, and as such I suspect that we can only expect limited success in trying to make each of these cases work.

@shwina
Copy link
Contributor

shwina commented Dec 17, 2024

Admittedly I don't exactly understand the problem. I tried reproducing outside of cuda-python + RMM + cuML and got to the point where I did need the __pyx_capi__ hack, but I'm not sure how to reproduce the exact scenario we're dealing with here.

It may not be worthwhile, and at this point it might make sense for cuda-python to simply advocate a clean break.

I agree with this.

@vyasr
Copy link
Author

vyasr commented Dec 17, 2024

I think you should be able to reproduce this issue by building a project that uses cuda-python's Cython via the new API (so from cuda.bindings cimport ...) and then at runtime install cuda-python with a version pre-12.6.0 that does not have cuda.bindings. If that doesn't reproduce, then maybe there's something more specific in what RAPIDS was doing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuda.bindings Everything related to the cuda.bindings module triage Needs the team's attention
Projects
None yet
Development

No branches or pull requests

3 participants