Use a weakref finalizer to stop notifier thread #143

Open · wants to merge 3 commits into base: branch-0.36
Conversation

@wence- (Contributor) commented Dec 6, 2023

When creating our own application context with the Python notifier thread enabled, we previously had to arrange to stop the thread before dropping the reference to the context we created. The consequence of not doing so is that the application hangs at exit, waiting on a thread join that never completes.

Applying the principle of least surprise, an object should clean up all of its resources when it goes out of scope. To ensure this, change Application.stop_notifier_thread to a staticmethod and attach it as a weakref finalizer when starting the notifier thread. This way, we neither have to handle the case where there is no notifier thread, nor remember to call the stop function manually.

This additionally allows us to clean up the top-level reset function since there is no longer a need to stop the notifier "manually" during reset.
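The pattern described above can be sketched as follows. This is a minimal, illustrative stand-in, not the actual ucxx implementation (the class layout, attribute names, and the event-based notifier loop are assumptions): the stop function is a staticmethod so the finalizer's callback does not capture `self` and keep the object alive forever, and `weakref.finalize` runs it automatically when the last reference to the context is dropped.

```python
import threading
import weakref


class Application:
    """Illustrative sketch of a context that owns a notifier thread."""

    def __init__(self):
        self._stop_event = threading.Event()
        self._notifier_thread = None
        self._finalizer = None

    def start_notifier_thread(self):
        # Non-daemon thread: if never joined, it blocks interpreter exit,
        # which is exactly the hang described above.
        self._notifier_thread = threading.Thread(target=self._stop_event.wait)
        self._notifier_thread.start()
        # Register the staticmethod as a weakref finalizer. The callback
        # and its arguments must not reference `self`, otherwise the
        # finalizer would keep the object alive and never fire.
        self._finalizer = weakref.finalize(
            self,
            Application.stop_notifier_thread,
            self._stop_event,
            self._notifier_thread,
        )

    @staticmethod
    def stop_notifier_thread(stop_event, thread):
        # Signal the notifier loop to stop, then join the thread.
        stop_event.set()
        thread.join()


app = Application()
app.start_notifier_thread()
finalizer = app._finalizer
t = app._notifier_thread
del app  # dropping the last reference runs the finalizer automatically
```

With this in place, callers no longer need to remember an explicit stop call: dropping the context is enough.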

@wence- requested a review from a team as a code owner on December 6, 2023 11:16
@pentschev (Member):

This looks like a great solution, thanks for working on it @wence-. I see some tests are failing, though, because they explicitly call ucxx.stop_notifier_thread(), so that needs fixing after the removal; unfortunately, I can't suggest changes to files that weren't touched via the GitHub interface.

@wence- (Contributor, Author) commented Dec 6, 2023

I reinstated the top-level call, and will then check whether removing it also works.

@wence- (Contributor, Author) commented Dec 6, 2023

How do we want to handle stopping the notifier in distributed-ucxx? It seems like there it might make sense to still stop the notifier explicitly, rather than waiting for the global application context to go out of scope in atexit teardown.

@pentschev (Member):

> How do we want to handle stopping the notifier in distributed-ucxx? It seems like there it might make sense to still stop the notifier explicitly, rather than waiting for the global application context to go out of scope in atexit teardown.

Yeah, that was the only way I found to stop the notifier thread "on time" for distributed; if you have other ideas on how to make that better, they're certainly welcome. If your concern is only the top-level call, we should be OK removing it and then calling the method directly on the ApplicationContext inside distributed-ucxx, i.e. ucxx.core._get_ctx().stop_notifier_thread().

@pentschev (Member):

It seems that all progress-thread tests are hanging during the first one (probably after it completes); maybe there's a self-reference still in place. Let me know if you'd like a rubber duck to help debug this at some point.

@wence- force-pushed the wence/fix/appctx-notifier-cleanup branch from 67b17b4 to e5170e7 on December 6, 2023 16:25
@wence- (Contributor, Author) commented Dec 6, 2023

I figured out the problem. The test_benchmark_cluster test relies on an explicit call to ucxx.stop_notifier_thread(). The fixture that automatically resets ucxx (and hence stops the notifier) doesn't work for this setup because ucxx has been initialised in a subprocess.

@wence- (Contributor, Author) commented Dec 6, 2023

So I just dropped the removal of stop_notifier_thread everywhere.

@wence- (Contributor, Author) commented Dec 6, 2023

Hmm, just the one test run failed, but no obvious error.

@pentschev (Member):

> Hmm, just the one test run failed, but no obvious error.

Seems like this was an exit code 124, which means the pytest process timed out. Are you able to reproduce that locally? It seems like some resource may be preventing the process from exiting.

@wence- (Contributor, Author) commented Dec 7, 2023

> Hmm, just the one test run failed, but no obvious error.

> Seems like this was an exit code 124 which means the pytest process timed out. Are you able to reproduce that locally? It seems like there may be some resource preventing the process from returning.

I have not yet been able to reproduce it.
