graceful stop worker when max_requests/reload_on_* #2615
This PR includes #2484.
opentelemetry uses atexit not only to stop its daemon thread, but also to flush buffered metrics/logs/traces.
@methane thanks for PR. Could you please add the reproducer to a file in
Although this PR fixes the worker hang, hey still reports EOF errors. If a thread takes too long to finish, the master process sends the KILL signal anyway.
For the record, atexit is not related at all. I can see many issues without atexit.
@xrmx I know you're probably quite busy, but may I humbly request a re-review of this PR & merge/release if it looks good to you? Unfortunately for us, the issue this PR fixes causes a hard failure in production when our services using uWSGI are upgraded to Python versions containing https://bugs.python.org/issue44434, so the upgrade from 3.9.7 to 3.9.8 is a hard breaking change (and potentially 3.10+ upgrades) and we've been holding them back to 3.9.7 or lower. We're at the point where that upgrade is needed. I imagine we're probably not the only ones facing that situation since several others have reported the threading issue as well. And kudos to @methane for doing the deep investigation to provide this fix! Thank you so much, to both of you!
Squashed and merged in #2626. Thanks!
While working to reproduce unbit#2615, I saw many strange "defunct" (zombie) workers. The master called waitpid(-1, ...), but it returned 0 even though there were some zombies. Eventually the master sends the KILL signal (MERCY) and the worker is restarted. I believe these strange zombies are born from pthread_cancel(): subthreads call pthread_cancel() on the main thread, and that leaves the process in a strange state. pthread_cancel() is very hard to use and debug; I could not even attach to the strange zombie with gdb --pid. I think it is not maintainable. In the end we can remove six_feet_under_lock and make wait_for_threads() static.
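For readers unfamiliar with the "waitpid returned 0" symptom: a non-blocking waitpid() reports 0 when child processes exist but none of them is currently reapable. The sketch below shows only that normal behavior, using Python's `os.waitpid` rather than uWSGI's master loop. The helper name `poll_child` is hypothetical, and it waits on one specific pid instead of -1 to keep the result deterministic. It is POSIX-only.

```python
import os
import signal
import time


def poll_child():
    """Fork a child, poll it without blocking, then kill and reap it."""
    pid = os.fork()
    if pid == 0:
        # Child: stay alive until the parent kills it.
        time.sleep(60)
        os._exit(0)

    # Parent: the child is still running, so a non-blocking wait has
    # nothing to reap and reports pid 0.
    first = os.waitpid(pid, os.WNOHANG)

    # Roughly what a supervisor does when its patience runs out:
    # send SIGKILL, then reap the child with a blocking wait.
    os.kill(pid, signal.SIGKILL)
    reaped_pid, status = os.waitpid(pid, 0)
    return first, reaped_pid == pid, os.WIFSIGNALED(status)
```

The surprising part of the report above is that the master got 0 while workers already showed up as defunct, which this sketch does not reproduce.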
A worker stops when it reaches max_requests or reload_on_*.
uwsgi/core/utils.c
Lines 1216 to 1251 in 39f3ade
goodbye_cruel_world() is not graceful: it causes atexit handlers not to be called.
If an atexit handler is what stops the daemon threads, the worker won't stop until it is killed by the master.
Reproduce
I used hey to send requests, but any other load testing tool is OK.
./hey -c 80 -z 1m 'http://127.0.0.1:8000/'
master branch
atexit is not called.
After this patch
atexit is called
Related issues
open-telemetry/opentelemetry-python#3640