CI failures and `ConnectionPool` errors #1236

pentschev · 2023-09-25T12:51:19Z

We continue to have lots of tests failing in CI (see for example yesterday's and today's nightly runs), more commonly in nightly builds, but those also happen in PR builds although less often.

In a bit of further investigation, one of the issues I see is we get lots of errors such as the one below when a cluster is shutting down.

2023-09-24 05:36:14,555 - distributed.worker - ERROR - Unexpected exception during heartbeat. Closing worker.
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.10/site-packages/distributed/worker.py", line 1253, in heartbeat
    response = await retry_operation(
  File "/opt/conda/envs/test/lib/python3.10/site-packages/distributed/utils_comm.py", line 454, in retry_operation
    return await retry(
  File "/opt/conda/envs/test/lib/python3.10/site-packages/distributed/utils_comm.py", line 433, in retry
    return await coro()
  File "/opt/conda/envs/test/lib/python3.10/site-packages/distributed/core.py", line 1344, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/distributed/core.py", line 1543, in connect
    raise RuntimeError("ConnectionPool is closed")

In most cases those errors seem harmless, but I can't say yet whether they are at least partly responsible for the failing tests. I was able to get one test to locally reproduce the error above in a consistent manner, see below.

# pytest -vs dask_cuda/tests/test_proxy.py -k test_communicating_proxy_objects[ucx-None]
==================================================== test session starts ====================================================
platform linux -- Python 3.9.18, pytest-7.4.2, pluggy-1.3.0 -- /opt/conda/envs/test/bin/python3.9
cachedir: .pytest_cache
rootdir: /repo
configfile: pyproject.toml
plugins: cov-4.1.0
collected 1175 items / 1174 deselected / 1 skipped / 1 selected

tests/test_proxy.py::test_communicating_proxy_objects[ucx-None] 2023-09-25 12:47:31,680 - distributed.worker - ERROR - Unexpected exception during heartbeat. Closing worker.
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/worker.py", line 1253, in heartbeat
    response = await retry_operation(
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/utils_comm.py", line 454, in retry_operation
    return await retry(
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/utils_comm.py", line 433, in retry
    return await coro()
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/core.py", line 1344, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/core.py", line 1543, in connect
    raise RuntimeError("ConnectionPool is closed")
RuntimeError: ConnectionPool is closed
2023-09-25 12:47:31,692 - tornado.application - ERROR - Exception in callback <bound method Worker.heartbeat of <Worker 'ucx://127.0.0.1:54751', name: 0, status: closed, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>>
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
    await val
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/worker.py", line 1253, in heartbeat
    response = await retry_operation(
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/utils_comm.py", line 454, in retry_operation
    return await retry(
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/utils_comm.py", line 433, in retry
    return await coro()
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/core.py", line 1344, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/core.py", line 1543, in connect
    raise RuntimeError("ConnectionPool is closed")
RuntimeError: ConnectionPool is closed
PASSED

I was able to work around that error with the patch below, which essentially forcefully disabled cuDF spilling.

diff --git a/dask_cuda/is_spillable_object.py b/dask_cuda/is_spillable_object.py
index cb85248..f30800b 100644
--- a/dask_cuda/is_spillable_object.py
+++ b/dask_cuda/is_spillable_object.py
@@ -48,6 +48,7 @@ def cudf_spilling_status() -> Optional[bool]:
         - None if the current version of cudf doesn't support spilling, or
         - None if cudf isn't available.
     """
+    return False
     try:
         from cudf.core.buffer.spill_manager import get_global_manager
     except ImportError:

It looks to me like from cudf.core.buffer.spill_manager import get_global_manager (if you simply import it without calling the get_global_manager() before returning, it will already fail) is again battling Disitributed somehow to destroy resources. I've spent literally no time looking at the cuDF code to verify whether there's something obvious there, but I was hoping @madsbk would be able to tell whether there's something we should be concerned there when we are working with Distributed's finalizers, or @wence- who has been to the depths of Distributed's finalizers and back to have some magic power in pinpointing the problem right away.

The text was updated successfully, but these errors were encountered:

madsbk · 2023-09-26T06:36:42Z

The only side-effect of get_global_manager(), when spilling is disabled, is the import of cudf.

Sure enough, this also triggers the issue:

diff --git a/dask_cuda/is_spillable_object.py b/dask_cuda/is_spillable_object.py
index cb85248..959d9f3 100644
--- a/dask_cuda/is_spillable_object.py
+++ b/dask_cuda/is_spillable_object.py
@@ -48,6 +48,8 @@ def cudf_spilling_status() -> Optional[bool]:
         - None if the current version of cudf doesn't support spilling, or
         - None if cudf isn't available.
     """
+    import cudf
+    return False
     try:
         from cudf.core.buffer.spill_manager import get_global_manager
     except ImportError:

pentschev · 2023-09-26T08:40:25Z

You're right @madsbk . This is essentially being triggered by having

dask-cuda/dask_cuda/tests/test_proxy.py

Lines 27 to 35 in ec80f97

    
           # Make the "disk" serializer available and use a directory that are 
        
           # remove on exit. 
        
           if ProxifyHostFile._spill_to_disk is None: 
        
               tmpdir = tempfile.TemporaryDirectory() 
        
               ProxifyHostFile( 
        
                   worker_local_directory=tmpdir.name, 
        
                   device_memory_limit=1024, 
        
                   memory_limit=1024, 
        
               )

in the context of the test file, commenting out the ProxifyHostFile instantiation also prevents this from happening. IOW, it seems this is another instance of a global context that gets leaked to undesirable places.

I'm thinking instantiating ProxifyHostFile should be a fixture that is only set for tests that actually need it. Are there any reasons why we shouldn't be doing that?

madsbk · 2023-09-26T08:58:30Z

I'm thinking instantiating ProxifyHostFile should be a fixture that is only set for tests that actually need it. Are there any reasons why we shouldn't be doing that?

I think that is a good idea!

wence- · 2023-09-26T10:52:25Z

I guess the problem is somehow that there are things in the ProxifyHostFile that are keeping objects alive, and since it is a module-level variable it never gets cleaned up until too late.

wence- · 2023-09-26T10:55:20Z

Ah wait, something strange. So ProxifyHostFile._spill_to_disk is a singleton object, that just sets up some parameters? The ProxifyHostFile object that is being constructed here is thrown away, some there is something strange.

pentschev mentioned this issue Sep 25, 2023

Increase test timeouts further to reduce CI failures #1234

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI failures and `ConnectionPool` errors #1236

CI failures and `ConnectionPool` errors #1236

pentschev commented Sep 25, 2023

madsbk commented Sep 26, 2023 •

edited

Loading

pentschev commented Sep 26, 2023

madsbk commented Sep 26, 2023 •

edited

Loading

wence- commented Sep 26, 2023

wence- commented Sep 26, 2023

CI failures and ConnectionPool errors #1236

CI failures and ConnectionPool errors #1236

Comments

pentschev commented Sep 25, 2023

madsbk commented Sep 26, 2023 • edited Loading

pentschev commented Sep 26, 2023

madsbk commented Sep 26, 2023 • edited Loading

wence- commented Sep 26, 2023

wence- commented Sep 26, 2023

CI failures and `ConnectionPool` errors #1236

CI failures and `ConnectionPool` errors #1236

madsbk commented Sep 26, 2023 •

edited

Loading

madsbk commented Sep 26, 2023 •

edited

Loading