Better exception if scheduler disconnects from client #8690

Closed
fjetter opened this issue Jun 12, 2024 · 1 comment · Fixed by #8705
Labels: diagnostics, enhancement, scheduler, stability

fjetter (Member) commented Jun 12, 2024

If the connection between the scheduler and the client is lost (e.g. if the scheduler dies), the client enters a reconnect loop to reestablish the connection. If the scheduler is still alive, users will not notice this failure unless they are working with previously created Futures. Those Futures are cancelled automatically as soon as the client initiates a reconnect (see here).

The next time such a Future is used, it raises a CancelledError(<key>) without further context, and it is often unclear to users what exactly this means.

Instead, the user should receive an informative message telling them to check on their scheduler.

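import pytest

from distributed import CancelledError, wait
from distributed.utils_test import gen_cluster, inc


@gen_cluster(client=True)
async def test_client_scheduler_lost_sane_exception(c, s, a, b):
    fut = c.submit(inc, 1)
    await wait(fut)

    # Closing the scheduler drops the client connection and cancels fut
    await s.close()

    # Desired behavior: the error message names the lost connection
    with pytest.raises(CancelledError, match='connection to scheduler'):
        await fut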

This issue is particularly troublesome if the user is not working with futures directly but with a persisted collection that embeds them, because a single cancelled future renders the entire collection unusable.

fjetter added the enhancement, diagnostics, stability, and scheduler labels and removed the needs triage label on Jun 12, 2024

fjetter (Member, Author) commented Jun 12, 2024

A rather straightforward way to improve this is to let the Future.cancel method, which is invoked in the reconnect method, accept an exception or message that is then properly forwarded and raised.
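The idea of forwarding a reason through cancel can be sketched in isolation. The following is a toy illustration only, not distributed's implementation: SketchFuture, the reason parameter, and on_reconnect are hypothetical names invented for this example, and the message text is an assumption chosen to match the test above.

```python
from concurrent.futures import CancelledError


class SketchFuture:
    """Toy stand-in for a client-side future (hypothetical, not distributed.Future)."""

    def __init__(self, key):
        self.key = key
        self._cancelled = False
        self._reason = None
        self._value = None

    def set_result(self, value):
        self._value = value

    def cancel(self, reason=None):
        # Hypothetical `reason` argument: remember why the future was cancelled.
        self._cancelled = True
        self._reason = reason

    def result(self):
        if self._cancelled:
            # Forward the stored reason so the user sees the cause, not just the key.
            if self._reason is None:
                raise CancelledError(self.key)
            raise CancelledError(f"{self.key} cancelled: {self._reason}")
        return self._value


def on_reconnect(futures):
    # Stand-in for the reconnect path: cancel every known future,
    # but pass along the cause of the cancellation.
    for fut in futures:
        fut.cancel(reason="lost connection to scheduler; check that it is still running")


fut = SketchFuture("inc-1")
fut.set_result(2)
on_reconnect([fut])

try:
    fut.result()
except CancelledError as exc:
    error_message = str(exc)
    print(error_message)
```

With a reason attached at cancellation time, the error raised later carries both the key and the cause, which is what the proposed test matches against.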
