CI failures and ConnectionPool errors #1236

Comments
The only side-effect of […]

Sure enough, this also triggers the issue:

```diff
diff --git a/dask_cuda/is_spillable_object.py b/dask_cuda/is_spillable_object.py
index cb85248..959d9f3 100644
--- a/dask_cuda/is_spillable_object.py
+++ b/dask_cuda/is_spillable_object.py
@@ -48,6 +48,8 @@ def cudf_spilling_status() -> Optional[bool]:
     - None if the current version of cudf doesn't support spilling, or
     - None if cudf isn't available.
     """
+    import cudf
+    return False
     try:
         from cudf.core.buffer.spill_manager import get_global_manager
     except ImportError:
```
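For reference, this is roughly how `cudf_spilling_status()` reads with that patch applied. The docstring and the fallback logic below the early `return` are paraphrased from the hunk above rather than copied from the real `dask_cuda/is_spillable_object.py`, so treat the exact wording as approximate:

```python
from typing import Optional


def cudf_spilling_status() -> Optional[bool]:
    """Approximate reconstruction of dask_cuda.is_spillable_object.cudf_spilling_status.

    Returns:
        - True if cuDF's built-in spilling is enabled, or
        - False if it is disabled, or
        - None if the current version of cudf doesn't support spilling, or
        - None if cudf isn't available.
    """
    # The two lines added by the patch: import cudf and bail out early,
    # so cuDF spilling is always reported as disabled.
    import cudf  # noqa: F401
    return False

    # Unreached with the patch applied; the original check presumably
    # continues along these lines.
    try:
        from cudf.core.buffer.spill_manager import get_global_manager
    except ImportError:
        return None
    return get_global_manager() is not None
```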
You're right @madsbk. This is essentially being triggered by having

dask_cuda/tests/test_proxy.py, lines 27 to 35 at ec80f97 (embedded snippet)

Removing the ProxifyHostFile instantiation also prevents this from happening. IOW, it seems this is another instance of a global context that gets leaked to undesirable places.
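The referenced snippet isn't reproduced in this thread, but the pattern being described is presumably something along these lines (a hypothetical sketch; the constructor arguments are illustrative and not copied from test_proxy.py):

```python
# Hypothetical shape of the module-level global being described: the
# ProxifyHostFile (and whatever global context it registers) is created as
# soon as the test module is imported and lives for the whole test session.
from dask_cuda.proxify_host_file import ProxifyHostFile

# Constructor arguments are illustrative only and may not match the real
# signature used in the referenced snippet.
host_file = ProxifyHostFile(device_memory_limit=1024, memory_limit=1024)
```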
I'm thinking instantiating […]

I think that is a good idea!
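The rest of that suggestion is cut off above, but one plausible reading is to construct the ProxifyHostFile lazily, e.g. inside a pytest fixture, so it is created and torn down per test rather than living as a module-level global. A sketch of that idea, with purely illustrative constructor arguments:

```python
import pytest

from dask_cuda.proxify_host_file import ProxifyHostFile


@pytest.fixture
def proxify_host_file():
    # Build the host file only when a test asks for it, instead of at module
    # import time; constructor arguments are illustrative only.
    hf = ProxifyHostFile(device_memory_limit=1024, memory_limit=1024)
    yield hf
    # Drop everything it holds so no global state leaks past the test.
    hf.clear()
```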
I guess the problem is somehow that there are things in the […]

Ah wait, something strange. So […]
We continue to have lots of tests failing in CI (see for example yesterday's and today's nightly runs), more commonly in nightly builds, though they also happen, less often, in PR builds.

Investigating a bit further, one of the issues I see is that we get lots of errors such as the one below when a cluster is shutting down.

In most cases those errors seem harmless, but I can't say yet whether they are at least partly responsible for the failing tests. I was able to get one test to reproduce the error above locally in a consistent manner, see below.

I was able to work around that error with the patch below, which essentially forcefully disables cuDF spilling.

It looks to me like `from cudf.core.buffer.spill_manager import get_global_manager` (simply importing it, without even calling `get_global_manager()` before returning, is already enough to fail) is again somehow battling Distributed to destroy resources. I've spent literally no time looking at the cuDF code to verify whether there's something obvious there, but I was hoping @madsbk would be able to tell whether there's something we should be concerned about when we are working with Distributed's finalizers, or @wence-, who has been to the depths of Distributed's finalizers and back, might have some magic power in pinpointing the problem right away.
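The failing test and traceback referred to above aren't preserved in this extract, but the scenario being described is roughly "start a CUDA cluster with JIT-unspill enabled, run something trivial, and let it shut down". A minimal, purely illustrative shape of such a test (not the actual failing test) might look like this:

```python
from dask_cuda import LocalCUDACluster
from distributed import Client


def test_cluster_shutdown_is_clean():
    # jit_unspill=True routes worker spilling through ProxifyHostFile, which
    # is where dask_cuda consults cuDF's spilling status.
    with LocalCUDACluster(n_workers=1, jit_unspill=True) as cluster:
        with Client(cluster) as client:
            assert client.submit(lambda x: x + 1, 1).result() == 2
    # The ConnectionPool errors described above appear while the cluster and
    # its workers are torn down, i.e. after the body of the test has passed.
```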
before returning, it will already fail) is again battling Disitributed somehow to destroy resources. I've spent literally no time looking at the cuDF code to verify whether there's something obvious there, but I was hoping @madsbk would be able to tell whether there's something we should be concerned there when we are working with Distributed's finalizers, or @wence- who has been to the depths of Distributed's finalizers and back to have some magic power in pinpointing the problem right away.The text was updated successfully, but these errors were encountered: