Schedulers stuck in cpool_delete and ets locks #8613
I've added backtraces for all normal scheduler threads to the gist. It seems the cause is a combination of ETS locks and allocations in ets_select_2. The table in question: it is a mnesia_index table.
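For context, a hypothetical sketch of the kind of call that ends up in the ets_select_2 BIF: ets:select/2 driven by a match specification. The module name, table name, contents and match spec below are illustrative assumptions, not taken from this report.

```erlang
-module(select_example).
-export([run/0]).

%% Illustrative only: create a small ETS table and select from it with a
%% match specification, the operation implemented by the ets_select_2 BIF.
run() ->
    Tab = ets:new(example_index, [bag, public]),
    true = ets:insert(Tab, [{1, a}, {2, b}, {3, c}]),
    %% Return whole objects whose key is greater than 1.
    ets:select(Tab, [{{'$1', '$2'}, [{'>', '$1', 1}], ['$_']}]).
```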
Thanks for the excellent bug report! We found a bug that caused scheduler 8 to get stuck forever (in the loop at beam/erl_alloc_util.c:3474), which in turn caused the other schedulers to get stuck waiting for it forever. We have not been able to determine how we could end up in this scenario, but with the fix in #8627 scheduler 8 would have been able to continue, which in turn would have resolved the whole situation.
Thanks for the fast answer! What do you think: would disabling the carrier pool be a valid workaround in the meantime?
Yes, then the pool won't be used at all |
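As a hedged aside, assuming the workaround concerns the abandoned-carrier pool, one way to see which allocator settings are actually in effect on a running node is to dump the allocator info; eheap_alloc below is just one example allocator.

```erlang
%% Print the settings and state reported for the eheap allocator; the
%% per-instance options in the output include the acul ("abandon carrier
%% utilization limit") value that controls whether carriers can be
%% abandoned to the shared carrier pool.
io:format("~p~n", [erlang:system_info({allocator, eheap_alloc})]).
```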
I've set a breakpoint on ethr_atomic_read_acqb (it's actually inlined) for thread 8 and stepped one instruction at a time.
Have you used any special allocator settings or instrumentation on the system? We are trying to understand how it could end up in that buggy loop.
No, nothing special. We didn't even run our own integration tests at that time, but this service constantly does some background work (ETS lookups, math, and binary construction). For VM metrics we use erlang:system_info/1, erlang:memory/0 and erlang:statistics/1. We don't override the alloc settings or other defaults in vm.args for this setup. Interestingly, we recently moved this service from OTP 23.3.4 to 26.2.3.
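For reference, a rough sketch of the kind of VM-metrics polling described above; the module and the particular selection of metrics are assumptions, but the BIFs are the ones named in the comment (erlang:system_info/1, erlang:memory/0,1 and erlang:statistics/1).

```erlang
-module(vm_metrics_example).
-export([collect/0]).

%% Gather a few common VM metrics into a map; which metrics the reporter's
%% system actually polls is unknown, this is only an illustration.
collect() ->
    #{schedulers_online => erlang:system_info(schedulers_online),
      run_queue         => erlang:statistics(run_queue),
      memory            => erlang:memory(),      %% full memory breakdown
      memory_ets        => erlang:memory(ets),
      reductions        => element(1, erlang:statistics(reductions))}.
```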
Couldn't there be a delete marker? |
Delete markers are set on … If …
* rickard/cpool_delete-fix/GH-8613/OTP-19154: [erts] Fix cpool_delete()
Describe the bug
3 of 24 schedulers are stuck in the cpool_delete function (but at different spin points). The system became unresponsive (remsh doesn't work, etc.), with CPU usage around 300%.
Backtraces for threads 7, 8 and 16:
To Reproduce
I think it's hard to reproduce due to the asynchronous nature of this state.
Expected behavior
The system operates as usual; remsh and other interfaces work as expected.
Affected versions
OTP version 26.2.3
Additional context
Additional info from gdb will be available as a gist. If you need anything else (except the core dump), let me know. The system will stay up and available for observation for some time.