Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to pause proc timer to erlang:suspend_process/2 #8670

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jcpetruzza
Copy link

We add a pause_proc_timer option to erlang:suspend_process/2. When provided, if the suspendee is waiting on a receive statement, the timer associated to the after clause will be paused, and resumed when the process is eventually resumed.

The main motivation for this is for a debugger to be able to pause all user processes without leading to timeouts.

@CLAassistant
Copy link

CLAassistant commented Jul 15, 2024

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

CT Test Results

Tests are running... https://github.com/erlang/otp/actions/runs/9936660812

Results for commit e3236d4

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

  • No CT logs found
  • No HTML docs found
  • No Windows Installer found

// Erlang/OTP Github Action Bot

@jhogberg jhogberg added team:VM Assigned to OTP team VM stalled waiting for input by the Erlang/OTP team labels Jul 15, 2024
@jhogberg
Copy link
Contributor

jhogberg commented Jul 15, 2024

Thanks for the PR! Most of us are away on vacation so we have limited time to look at this. We'll get back to it once we're back from vacation.

(One thing that strikes me at a glance though is that the commits feel a bit too split up: there's no way to reach the code in the first five commits until the last commit. Feel free to squash, or leave it as-is until we're back :-))

@rickard-green rickard-green removed the stalled waiting for input by the Erlang/OTP team label Aug 19, 2024
Process *rp = erts_proc_lookup(BIF_ARG_1);
erts_proc_lock(rp, ERTS_PROC_LOCK_STATUS);

schedule_pause_proc_timer(rp, ERTS_PROC_LOCK_STATUS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this call, the status lock of the process would not have to be involved in the paused timer operations. This since all other accesses are made while holding the main lock of the process. It is needed since the scheduled_pause_proc_timer() function calls erts_try_reuse_paused_proc_timer(). This need for the status lock has caused a lot of modifications that otherwise would not be necessary.

This looks to me as a premature optimization that most likely also is unnecessary. I would guess that this scenario where you have already suspended the process but without the timer paused and you also want to suspend it with the timer paused is very uncommon. If this optimization actually should be something that is needed it can be achieved using the thread progress cleanup functionality as the actual timers do and atomic operations without the need for the status lock being held.

Please remove the usage of erts_try_reuse_paused_proc_timer(), send this as an asynchronous signal using erts_proc_sig_send_rpc_request() instead of scheduling a sys-task and revert the changes made to sys-tasks. We have been working on reducing the sys-task usage as much as possible.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rickard-green Notice that, afair, this wasn't the point were we run into issues. If I'm not wrong, cancelling a proc timer can be done from any scheduler, so we didn't need a rpc-request signal for that. I think I changed it to use a sys-task in this version only for symmetry, but the initial issue was creating the proc timer while resuming the suspended process (see my other comment).

ERTS_MSUSPEND_STATE_FLG_ACTIVE);
ASSERT(!(mstate & ERTS_MSUSPEND_STATE_FLG_ACTIVE));
pause_proc_timer = !!(mstate & ERTS_MSUSPEND_STATE_FLG_PAUSE_TIMER);
erts_suspend(c_p, ERTS_PROC_LOCK_MAIN, NULL, pause_proc_timer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pause the timer before or after calling erts_suspend() instead of letting erts_suspend() (via suspend_process()) schedule the pausing of the timer. Here you are in the context where you can do it right away.

@@ -7175,6 +7217,10 @@ suspend_process(Process *c_p, Process *p)
}

p->rcount++; /* count number of suspend */

if (pause_proc_timer) {
schedule_pause_proc_timer(p, locks);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not schedule pausing of proc timer here.

There are only a few places where you want to pause the proc timer and you typically do not want to schedule it, but perform the operation right away like in the new function activate_suspend_monitor().

Scheduling the operation here introduced a new argument to both schedule_process() and erts_suspend() which can be reverted.

@@ -9051,8 +9119,8 @@ resume_process_1(BIF_ALIST_1)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like a resume_process_2() BIF with an "unpause timer" option is needed. As it is now, the pausing will be left even after all suspends with pause have been resumed if someone has suspended without pause of timer.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also dislike that the timer gets paused even after all those interested in pausing it, resumed the process. This should be changed. Do we need to add a new option though? I was thinking of adding a new counter in the process structure, similar to rcount but with the count of processes that have suspended the proc_timer. When resume_process() we know if the caller paused the proc timer as well, so we know if the new counter needs decreasing. When it reaches 0, we could unpause the timer. Would that work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An option is better since:

  • You don't need to add anything to the process structure. We try to avoid adding things to the process structure when it is not frequently used since all processes will have to allocate memory for it even when not used. Since you can have a very large amount of processes, this overhead can become significant. There might be exceptions for important things or when it is not possible to do something another way. This is a pure debug feature that can be done some other way, though.
  • The unpause could potentially be done too early. The resume operation corresponding to the suspend operation with pause could be done later than other suspend/resume operations which potentially would unpause the timer too early.

I think you should have a counter in the ErtsMonitorSuspend structure, though, and throw a badarg error exception in the erlang:resume_process() BIF if the amount of suspend with pause and suspend without pause do not match with the amount of resume with unpause and resume without unpause.


int
erts_try_reuse_paused_proc_timer(Process *c_p)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this function. See the comment in erts_internal_suspend_process_2().

typedef struct {
ErtsTmrHead head; /* NEED to be first! */
Sint64 time_left_in_msec;
erts_atomic32_t count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The count does not need to be an atomic if accessed on while having the main lock.

@@ -9006,11 +9067,11 @@ erts_internal_suspend_process_2(BIF_ALIST_2)
if (rp == ERTS_PROC_LOCK_BUSY)
send_sig = !0;
else {
send_sig = !suspend_process(BIF_P, rp);
send_sig = !suspend_process(BIF_P, rp, pause_proc_timer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pause the timer before or after calling suspend_process(). Here you are in the context where you can do it right away.

@@ -7194,16 +7240,19 @@ resume_process(Process *p, ErtsProcLocks locks)
return;

state = erts_atomic32_read_nob(&p->state);
if (state & ERTS_PSFLG_PAUSED_TIMER) {
schedule_unpause_proc_timer(p->common.id, locks);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rickard-green The initial version of this PR was using erts_proc_sig_send_rpc_request(). The reason I changed it to use sys-tasks was that at this point, we'd get a lock-order violation like this:

Lock order violation occurred when locking 'proc_msgq:<0.269.0>[proclock]'(none)!                                                                                                                                                                                                                                           
Currently these locks are locked by the scheduler 1 thread:                                                                                                                                                                                                                                                                 
  'proc_main:<0.269.0>[proclock](beam/erl_process.c:10109)'(none)                                                                                                                                                                                                                                                           
  'proc_status:<0.269.0>[proclock](beam/erl_process.c:9300)'(none) 

So the problem remains on how to avoid this. We need to somehow send the rpc-request signal before the status lock is acquired

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed that this one would have a problem as well.

The problem is that the unpausing is triggered from the low level primitive resume_process() which handles all resumes regardless of whether corresponding suspend came with a pause or not. Unpausing due to a resume that do not correspond to a suspend with a pause feels like a bug to me.

The only way to pause timers is via the erlang:suspend_process() BIF. If the unpausing is done from corresponding resume in the erlang:resume_process() BIF, the problem goes away.

…ions

We want a way to "pause" a proc timer when suspending a process, and
"resume" it later. We will do this as follows.

  * Pausing a proc timer means:
    1. Cancelling the current timer in `common.timer`, if any.
    2. Storing in `common.timer` instead how much time was left in the
       timer. To compute the time left, we need to inspect the current
       timer.
    3. Flagging in the `Process` struct that the timer is paused. This is
       so that we know later that we need to resume the timer when resuming
       the process.
  * Resuming a proc timer then amounts to:
    1. Creating a new proc timer based on the time left that was stored
       in `common.timer`.
    2. Clear all the flags in the `Process` struct
  * When cancelling a proc timer, we now need to check if it is paused
    (in which case, it can just be ignored)

So here we introduce a `ErtsPausedProcTimer` type, that will contain the
time remaining on a paused timer, and a header that is shared with all
other timer types (so that we can safely insert it in `common.timer`).

We also introduce the functions to pause a proc timer and resume it.
At this point nothing calls these functions, this happens in the next
commits.
We add a new `pause_proc_timer` option to the `erlang:suspend_process/2`
BIF. When given, the process is not only suspended, but its proc timer,
if set, will be paused. This means that if the paused process is waiting
on a `receive`, it will not timeout even if suspended for long.

Each time pause_proc_timer is given, a counter is bumped in the
suspend-process monitor. In order to decrease it, the (new) BIF
`erlang:resume_process/2` needs to be called with the option
`resume_proc_timer`. When the count reaches zero, the timer is
resumed (even though the process may still be suspended).

We add testcases for this functionality
@jcpetruzza
Copy link
Author

@rickard-green I've pushed a new version that address all previous issues, and indeed is satisfyingly simpler:

  • Each call to suspend_process() with the pause_proc_timer option requires an eventual resume_process() with the resume_proc_timer option.
  • The timer gets resumed whenever the count of pause_proc_timer requests gets to zero, even if the process remains suspended
  • For this, we use a counter in the ErtsPausedProcTimer structure and a new counter in the suspend monitor
  • We rely again only on rpc signals instead of sys-tasks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team:VM Assigned to OTP team VM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants