-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to pause proc timer to erlang:suspend_process/2 #8670
base: master
Are you sure you want to change the base?
Conversation
CT Test ResultsTests are running... https://github.com/erlang/otp/actions/runs/9936660812 Results for commit e3236d4 To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass. See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally. Artifacts
// Erlang/OTP Github Action Bot |
Thanks for the PR! Most of us are away on vacation so we have limited time to look at this. We'll get back to it once we're back from vacation. (One thing that strikes me at a glance though is that the commits feel a bit too split up: there's no way to reach the code in the first five commits until the last commit. Feel free to squash, or leave it as-is until we're back :-)) |
e3236d4
to
36bef84
Compare
36bef84
to
b180b17
Compare
erts/emulator/beam/erl_process.c
Outdated
Process *rp = erts_proc_lookup(BIF_ARG_1); | ||
erts_proc_lock(rp, ERTS_PROC_LOCK_STATUS); | ||
|
||
schedule_pause_proc_timer(rp, ERTS_PROC_LOCK_STATUS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this call, the status lock of the process would not have to be involved in the paused timer operations. This since all other accesses are made while holding the main lock of the process. It is needed since the scheduled_pause_proc_timer()
function calls erts_try_reuse_paused_proc_timer()
. This need for the status lock has caused a lot of modifications that otherwise would not be necessary.
This looks to me as a premature optimization that most likely also is unnecessary. I would guess that this scenario where you have already suspended the process but without the timer paused and you also want to suspend it with the timer paused is very uncommon. If this optimization actually should be something that is needed it can be achieved using the thread progress cleanup functionality as the actual timers do and atomic operations without the need for the status lock being held.
Please remove the usage of erts_try_reuse_paused_proc_timer()
, send this as an asynchronous signal using erts_proc_sig_send_rpc_request()
instead of scheduling a sys-task and revert the changes made to sys-tasks. We have been working on reducing the sys-task usage as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rickard-green Notice that, afair, this wasn't the point were we run into issues. If I'm not wrong, cancelling a proc timer can be done from any scheduler, so we didn't need a rpc-request signal for that. I think I changed it to use a sys-task in this version only for symmetry, but the initial issue was creating the proc timer while resuming the suspended process (see my other comment).
ERTS_MSUSPEND_STATE_FLG_ACTIVE); | ||
ASSERT(!(mstate & ERTS_MSUSPEND_STATE_FLG_ACTIVE)); | ||
pause_proc_timer = !!(mstate & ERTS_MSUSPEND_STATE_FLG_PAUSE_TIMER); | ||
erts_suspend(c_p, ERTS_PROC_LOCK_MAIN, NULL, pause_proc_timer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pause the timer before or after calling erts_suspend()
instead of letting erts_suspend()
(via suspend_process()
) schedule the pausing of the timer. Here you are in the context where you can do it right away.
erts/emulator/beam/erl_process.c
Outdated
@@ -7175,6 +7217,10 @@ suspend_process(Process *c_p, Process *p) | |||
} | |||
|
|||
p->rcount++; /* count number of suspend */ | |||
|
|||
if (pause_proc_timer) { | |||
schedule_pause_proc_timer(p, locks); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not schedule pausing of proc timer here.
There are only a few places where you want to pause the proc timer and you typically do not want to schedule it, but perform the operation right away like in the new function activate_suspend_monitor()
.
Scheduling the operation here introduced a new argument to both schedule_process()
and erts_suspend()
which can be reverted.
@@ -9051,8 +9119,8 @@ resume_process_1(BIF_ALIST_1) | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like a resume_process_2()
BIF with an "unpause timer" option is needed. As it is now, the pausing will be left even after all suspends with pause have been resumed if someone has suspended without pause of timer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also dislike that the timer gets paused even after all those interested in pausing it, resumed the process. This should be changed. Do we need to add a new option though? I was thinking of adding a new counter in the process structure, similar to rcount
but with the count of processes that have suspended the proc_timer. When resume_process()
we know if the caller paused the proc timer as well, so we know if the new counter needs decreasing. When it reaches 0, we could unpause the timer. Would that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An option is better since:
- You don't need to add anything to the process structure. We try to avoid adding things to the process structure when it is not frequently used since all processes will have to allocate memory for it even when not used. Since you can have a very large amount of processes, this overhead can become significant. There might be exceptions for important things or when it is not possible to do something another way. This is a pure debug feature that can be done some other way, though.
- The unpause could potentially be done too early. The resume operation corresponding to the suspend operation with pause could be done later than other suspend/resume operations which potentially would unpause the timer too early.
I think you should have a counter in the ErtsMonitorSuspend
structure, though, and throw a badarg
error exception in the erlang:resume_process()
BIF if the amount of suspend with pause and suspend without pause do not match with the amount of resume with unpause and resume without unpause.
erts/emulator/beam/erl_hl_timer.c
Outdated
|
||
int | ||
erts_try_reuse_paused_proc_timer(Process *c_p) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this function. See the comment in erts_internal_suspend_process_2()
.
erts/emulator/beam/erl_hl_timer.c
Outdated
typedef struct { | ||
ErtsTmrHead head; /* NEED to be first! */ | ||
Sint64 time_left_in_msec; | ||
erts_atomic32_t count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The count does not need to be an atomic if accessed on while having the main lock.
erts/emulator/beam/erl_process.c
Outdated
@@ -9006,11 +9067,11 @@ erts_internal_suspend_process_2(BIF_ALIST_2) | |||
if (rp == ERTS_PROC_LOCK_BUSY) | |||
send_sig = !0; | |||
else { | |||
send_sig = !suspend_process(BIF_P, rp); | |||
send_sig = !suspend_process(BIF_P, rp, pause_proc_timer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pause the timer before or after calling suspend_process()
. Here you are in the context where you can do it right away.
erts/emulator/beam/erl_process.c
Outdated
@@ -7194,16 +7240,19 @@ resume_process(Process *p, ErtsProcLocks locks) | |||
return; | |||
|
|||
state = erts_atomic32_read_nob(&p->state); | |||
if (state & ERTS_PSFLG_PAUSED_TIMER) { | |||
schedule_unpause_proc_timer(p->common.id, locks); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rickard-green The initial version of this PR was using erts_proc_sig_send_rpc_request()
. The reason I changed it to use sys-tasks was that at this point, we'd get a lock-order violation like this:
Lock order violation occurred when locking 'proc_msgq:<0.269.0>[proclock]'(none)!
Currently these locks are locked by the scheduler 1 thread:
'proc_main:<0.269.0>[proclock](beam/erl_process.c:10109)'(none)
'proc_status:<0.269.0>[proclock](beam/erl_process.c:9300)'(none)
So the problem remains on how to avoid this. We need to somehow send the rpc-request signal before the status lock is acquired
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed that this one would have a problem as well.
The problem is that the unpausing is triggered from the low level primitive resume_process()
which handles all resumes regardless of whether corresponding suspend came with a pause or not. Unpausing due to a resume that do not correspond to a suspend with a pause feels like a bug to me.
The only way to pause timers is via the erlang:suspend_process()
BIF. If the unpausing is done from corresponding resume in the erlang:resume_process()
BIF, the problem goes away.
…ions We want a way to "pause" a proc timer when suspending a process, and "resume" it later. We will do this as follows. * Pausing a proc timer means: 1. Cancelling the current timer in `common.timer`, if any. 2. Storing in `common.timer` instead how much time was left in the timer. To compute the time left, we need to inspect the current timer. 3. Flagging in the `Process` struct that the timer is paused. This is so that we know later that we need to resume the timer when resuming the process. * Resuming a proc timer then amounts to: 1. Creating a new proc timer based on the time left that was stored in `common.timer`. 2. Clear all the flags in the `Process` struct * When cancelling a proc timer, we now need to check if it is paused (in which case, it can just be ignored) So here we introduce a `ErtsPausedProcTimer` type, that will contain the time remaining on a paused timer, and a header that is shared with all other timer types (so that we can safely insert it in `common.timer`). We also introduce the functions to pause a proc timer and resume it. At this point nothing calls these functions, this happens in the next commits.
We add a new `pause_proc_timer` option to the `erlang:suspend_process/2` BIF. When given, the process is not only suspended, but its proc timer, if set, will be paused. This means that if the paused process is waiting on a `receive`, it will not timeout even if suspended for long. Each time pause_proc_timer is given, a counter is bumped in the suspend-process monitor. In order to decrease it, the (new) BIF `erlang:resume_process/2` needs to be called with the option `resume_proc_timer`. When the count reaches zero, the timer is resumed (even though the process may still be suspended). We add testcases for this functionality
b180b17
to
188faf1
Compare
@rickard-green I've pushed a new version that address all previous issues, and indeed is satisfyingly simpler:
|
We add a
pause_proc_timer
option toerlang:suspend_process/2
. When provided, if the suspendee is waiting on areceive
statement, the timer associated to theafter
clause will be paused, and resumed when the process is eventually resumed.The main motivation for this is for a debugger to be able to pause all user processes without leading to timeouts.