Ensure client_desires_keys does not corrupt Scheduler state #8827

fjetter · 2024-08-12T13:27:16Z

I ran into this over in dask/dask#11248 where I was somehow triggering this condition in a very different way.

The test test_futures_of_cancelled_raises is actually broken on main. It does raise a CancelledError, but not for the reason we'd like. Under the hood, this behavior triggers an AssertionError during transitioning, which closes the network connection to the scheduler and raises a FutureCancelledError, but not a "This task has been cancelled" CancelledError 🙄

The reason for this is that client_desires_keys instantiates a new TaskState object if the key is unknown. This is pretty harmful behavior in general, but so far it has been required to make Variable, Queue, etc. work the way they do. Variables and the like communicate with the scheduler via an unordered RPC, so the key is often not registered yet when the call arrives. This premature initialization only worked because the state corruption would typically last for just a brief moment.
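To make the difference concrete, here is a minimal toy sketch of the two behaviors. It is not the real scheduler code: ToyTaskState, ToyScheduler and both method names are hypothetical, and returning the unknown keys to the caller is an assumption of this sketch rather than something this PR is known to do.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTaskState:
    """Hypothetical stand-in for a scheduler-side task record."""

    key: str
    state: str = "released"
    who_wants: set = field(default_factory=set)


class ToyScheduler:
    """Hypothetical scheduler that only tracks tasks by key."""

    def __init__(self):
        self.tasks = {}

    def desires_keys_permissive(self, keys, client):
        # Behavior described above: an unknown key silently gets a fresh
        # task record that no graph submission ever created.
        for key in keys:
            ts = self.tasks.get(key)
            if ts is None:
                ts = self.tasks[key] = ToyTaskState(key)
            ts.who_wants.add(client)

    def desires_keys_strict(self, keys, client):
        # Sketch of the fix: never create state for unknown keys.
        # Assumption: unknown keys are handed back to the caller.
        unknown = []
        for key in keys:
            ts = self.tasks.get(key)
            if ts is None:
                unknown.append(key)
                continue
            ts.who_wants.add(client)
        return unknown


permissive = ToyScheduler()
permissive.desires_keys_permissive(["x"], client="client-1")

strict = ToyScheduler()
unknown_keys = strict.desires_keys_strict(["x"], client="client-1")
```

In the permissive variant the scheduler ends up holding a task it knows nothing about, while the strict variant leaves its state untouched.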

However, the case that cannot be corrected easily (something I'd like to fix, but it is more work than I currently care to invest) is that Future objects inform the scheduler about their existence whenever they are instantiated. This is important to allow the submission of persisted collections (e.g. used by publish/get_dataset) or when storing and recovering futures in a variable.
This mechanism is typically disabled on ordinary clusters because there is no client in the context of the scheduler. However, when running in async mode, the async test client is shared with the scheduler, so the future in the scheduler context also calls client_desires_keys even though it was already released...

Alas, this is only a small patch but it should make things more reliable. Eventually this should be cleaned up...

closes #7498

github-actions · 2024-08-12T14:18:24Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

25 files ±0, 25 suites ±0, duration 10h 15m 23s ⏱️ (+8m 13s)
4 101 tests (-4): 3 978 ✅ (-9), 113 💤 (±0), 10 ❌ (+5)
47 390 runs (-48): 45 248 ✅ (-56), 2 132 💤 (+3), 10 ❌ (+5)

For more details on these failures, see this check.

Results for commit fc085a7. ± Comparison against base commit f12cc4f.

This pull request removes 8 tests (first list below) and adds 4 tests (second list below). Note that renamed tests count towards both.

distributed.tests.test_client ‑ test_future_auto_inform
distributed.tests.test_client ‑ test_future_defaults_to_default_client
distributed.tests.test_client ‑ test_rebalance_raises_on_explicit_missing_data
distributed.tests.test_client ‑ test_serialize_future
distributed.tests.test_client ‑ test_serialize_future_without_client
distributed.tests.test_scheduler ‑ test_client_desires_keys_creates_ts
distributed.tests.test_spans ‑ test_client_desires_keys_creates_tg
distributed.tests.test_spans ‑ test_client_desires_keys_creates_ts

distributed.tests.test_client ‑ test_worker_clients_do_not_claim_ownership_of_serialize_futures[False]
distributed.tests.test_client ‑ test_worker_clients_do_not_claim_ownership_of_serialize_futures[True]
distributed.tests.test_queues ‑ test_set_cancelled_future
distributed.tests.test_variable ‑ test_set_cancelled_future

♻️ This comment has been updated with latest results.

fjetter · 2024-08-12T14:58:02Z

FAILED distributed/tests/test_client.py::test_futures_of_cancelled_raises - AssertionError: Regex pattern did not match.
Regex: '(reason: unknown|testreason)'
Input: 'inc-2ac8f30d27eb33df3f76277392063a38 cancelled for reason: scheduler-connection-lost.\nClient lost the connection to the scheduler. Please check your connection and re-run your work.'

aha... will have to keep digging, it seems
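As a quick check of why the assertion fails, the pattern and the message can be compared directly. pytest.raises(match=...) uses re.search, so the sketch below does the same. The second message is a made-up example of the wording the test expects, not real output.

```python
import re

# Pattern and message copied from the failure output above.
pattern = r"(reason: unknown|testreason)"
message = (
    "inc-2ac8f30d27eb33df3f76277392063a38 cancelled for reason: "
    "scheduler-connection-lost.\n"
    "Client lost the connection to the scheduler. "
    "Please check your connection and re-run your work."
)

# The actual message names a different reason, so nothing matches.
actual_matches = re.search(pattern, message) is not None

# Hypothetical message with the reason the test was written for.
expected_style_matches = (
    re.search(pattern, "some-key cancelled for reason: unknown") is not None
)
```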

fjetter · 2024-08-12T15:24:06Z

ok, the failure I was writing about above was due to the warning being raised. We're raising on all warnings.
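For readers unfamiliar with that setup, here is a small standalone sketch of how an "error" warning filter turns a warning into an exception that aborts the call. The function and warning text are hypothetical, and the actual test-suite configuration is not shown here.

```python
import warnings


def noisy_operation():
    # Hypothetical function that warns before completing its work.
    warnings.warn("hypothetical warning", UserWarning)
    return "finished"


# With warnings ignored, the call completes normally.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    relaxed_outcome = noisy_operation()

# With the "error" filter, the warning is raised as an exception instead.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        strict_outcome = noisy_operation()
    except UserWarning as exc:
        strict_outcome = f"raised: {exc}"
```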

I also found a couple of tests that explicitly implemented the behavior I am now forbidding. I ended up deleting those tests, especially since we're now discouraging users from instantiating Futures themselves.

hendrikmakait

Thanks, @fjetter. Overall these changes make sense to me. I'm not sure I fully understand the subtleties of the change, but CI looks good, so that's enough for me.

fjetter · 2024-08-20T09:35:49Z

distributed/tests/test_client.py

+
+@pytest.mark.slow()
+@pytest.mark.parametrize("do_wait", [True, False])
+def test_worker_clients_do_not_claim_ownership_of_serialize_futures(c, do_wait):


I think this test describes the subtleties involved in this change. The cases that now raise a CancelledError could previously still work, depending on timing. If the futures were unpacked on the worker before the client-side release reached the scheduler, the futures would still be referenced.
In an async test, this would rather trivially be true: the scheduler also deserializes the future, and since the test client and the scheduler run in the same thread, the get_client discovery would detect the client even while inside the scheduler. That is the "bug/feature" I had to get rid of here.

#7498 describes this race condition in detail
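The previous timing dependence can be illustrated with a toy replay of scheduler-side events. This is a simplified model under stated assumptions, not the real scheduler: the event names, the single key "x" and the bookkeeping are all hypothetical.

```python
def replay(events):
    """Replay events for a single key "x" and report whether it survives."""
    # Toy bookkeeping: key -> set of parties that still want the result.
    who_wants = {"x": {"client"}}
    for event in events:
        if event == "client_release":
            wants = who_wants.get("x")
            if wants is not None:
                wants.discard("client")
                if not wants:
                    # Nobody wants the key any more, so it is forgotten.
                    del who_wants["x"]
        elif event == "worker_unpack":
            # Old-style behavior: unpacking a future on the worker
            # registers interest, but only if the key is still known.
            if "x" in who_wants:
                who_wants["x"].add("worker-client")
    return "x" in who_wants


# The worker unpacks the future before the release arrives: key survives.
unpack_first = replay(["worker_unpack", "client_release"])

# The release arrives first: the key is gone by the time the worker unpacks.
release_first = replay(["client_release", "worker_unpack"])
```

The same two events give different outcomes purely because of their arrival order, which is the kind of nondeterminism the test above is meant to rule out.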

Ensure client_desires_keys must not create TaskState object

90a39e6

fjetter force-pushed the client_desires_must_not_instantiate_task branch from 8c77c11 to 90a39e6 Compare August 15, 2024 12:43

fjetter added 2 commits August 15, 2024 14:45

Fix ws.executing assignment

6322cf3

Fix ws.executing assignment

99372d9

fjetter commented Aug 15, 2024 on distributed/client.py (outdated, resolved)

fjetter added 2 commits August 15, 2024 16:09

remove inform keyword

8c423b6

remove filterwarnings

93f4c6f

fjetter self-assigned this Aug 15, 2024

hendrikmakait self-requested a review August 16, 2024 09:11

hendrikmakait approved these changes Aug 16, 2024

Review threads (outdated, resolved): distributed/queues.py, distributed/tests/test_queues.py, distributed/variable.py

fjetter and others added 4 commits August 20, 2024 10:32

Apply suggestions from code review

f7bbd44

Co-authored-by: Hendrik Makait

Add test case for gh7498

b53ec14

Merge branch 'client_desires_must_not_instantiate_task' of https://gi…

d7eed61

…thub.com/fjetter/distributed into client_desires_must_not_instantiate_task

Add test for dask#7498

4cfcc22

fjetter commented Aug 20, 2024

deal with pytest deprecation

fc085a7

fjetter merged commit fe79a36 into dask:main Aug 20, 2024
22 of 32 checks passed

fjetter deleted the client_desires_must_not_instantiate_task branch August 20, 2024 11:29

This was referenced Aug 27, 2024

Use Task class instead of tuple #8797

Merged

"Recursive" futures are deleted from the cluster #8854

Closed
