builds.py: let mq lock be reclaimed after 30min #337

Merged: 2 commits merged into master from lock-timeout on Mar 1, 2024

Conversation

@lsf37 (Member) commented Feb 27, 2024

A typical job takes about 5-10 min. If the lock has been held for 30 min in a single job, we assume something has gone wrong and the lock should be released so other jobs (or humans needing the machine) can proceed.
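
For illustration only (not the actual machine-queue implementation), here is a minimal sketch of the intended semantics: any waiter may take over the lock once the current holder has exceeded a hold timeout. The class and method names are made up for this example.

```python
import time

# Sketch only: a lock that waiters may reclaim once the holder has held it
# longer than `timeout` seconds (30 min here; a normal job takes 5-10 min).
class ReclaimableLock:
    def __init__(self, timeout: float = 30 * 60):
        self.timeout = timeout
        self.holder = None
        self.acquired_at = None

    def acquire(self, who: str) -> bool:
        now = time.monotonic()
        stale = (self.holder is not None
                 and now - self.acquired_at > self.timeout)
        if self.holder is None or stale:
            # Free, or held too long by a presumably stuck job: take it over.
            self.holder, self.acquired_at = who, now
            return True
        return False

    def release(self, who: str) -> None:
        if self.holder == who:
            self.holder = self.acquired_at = None
```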

@lsf37 lsf37 requested a review from wom-bat February 27, 2024 12:48
@lsf37 lsf37 self-assigned this Feb 27, 2024
@lsf37 lsf37 changed the title from "builds.py: let mq lock be reclaimed after 15min" to "builds.py: let mq lock be reclaimed after 30min" Feb 27, 2024
Review comment on seL4-platforms/builds.py (outdated, resolved)
@axel-h (Member) commented Feb 27, 2024

I wonder if we should combine this with also using https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes

@lsf37 (Member, Author) commented Feb 27, 2024

I wonder if we should combine this with also using https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes

That's a different use case, and a bit harder to pick a good value for: that workflow timeout covers the overall time for all jobs, which is much more variable than a single job once it has the lock, because the global time also includes lock wait time and varies with machine queue contention. The time for a single job once it has the lock is reasonably predictable.

Might still be worth looking into and reducing the default for some of them.

The main case this specific timeout fixes is when a job takes the lock and the machine queue server then becomes unreachable on the network while the job is trying to release it. AFAICS there is no fix for that apart from a timeout. It doesn't happen often, but it has still happened maybe 4 times or so in the last two years (mostly around network outages, unsurprisingly).
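
To illustrate that failure mode, a rough sketch (a hypothetical helper, not the actual builds.py code; the exact mq.sh invocation is an assumption): the client can only attempt a best-effort release, and if the machine queue server is unreachable at that point, only a server-side hold timeout can eventually free the lock.

```python
import subprocess

def try_release(machine: str) -> bool:
    """Best-effort lock release; returns False if the machine queue server
    cannot be reached (e.g. during a network outage). The mq.sh subcommand
    shown here is assumed for illustration only."""
    try:
        subprocess.run(["mq.sh", "sem", "-signal", machine],
                       check=True, timeout=60)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        return False

# If this and the post-step cleanup both fail, nothing on the client side can
# release the lock; the 30 min reclaim timeout in this PR is the backstop.
```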

@axel-h (Member) left a comment

Looks good to me.

@wom-bat (Member) left a comment

Either change the title to '20 minutes' or change the timeout to 1800 seconds. Either would be enough for this to pass.

Review comments on seL4-platforms/builds.py (outdated, resolved)
@kent-mcleod (Member) commented

We used to handle this by setting the timeout on the job that was running. The step to release the locks would trigger when the job completed, was cancelled, or timed out. Would that not work for us? Is releasing the locks triggered as a :post action? Do post actions run when earlier steps time out?

@lsf37 (Member, Author) commented Feb 28, 2024

We used to handle this by setting the timeout on the job that was running. The step to release the locks would trigger when the job completed, was cancelled, or timed out. Would that not work for us? Is releasing the locks triggered as a :post action? Do post actions run when earlier steps time out?

They do, and all of this is already done in the code. This timeout is for when that has failed: the network is unavailable, so the normal lock release fails, the post-step clear-all lock release also fails, and the lock is still held on the server when it comes back on the network.

Updated commit message:

A typical job takes about 5-10 min. If the lock has been held for 30 min
in a single job, we assume something has gone wrong and the lock should
be released so other jobs (or humans needing the machine) can proceed.

This step is only necessary when other timeouts have failed. It is to
guard against the case where the lock release after timeout has failed,
and the post-step lock release also has failed. This can happen when the
machine queue server is temporarily unreachable on the network, and then
comes back with the lock still in place. In this case, there is nothing
the scripts here can do to release that lock.

Signed-off-by: Gerwin Klein <[email protected]>
@lsf37 (Member, Author) commented Feb 28, 2024

Either change the title to '20 minutes' or change the timeout to 1800 seconds. Either would be enough for this to pass.

Done. I've also updated the commit message to explain that this is a timeout of last resort.

@lsf37 lsf37 requested a review from wom-bat February 29, 2024 10:46
@lsf37 (Member, Author) commented Feb 29, 2024

@wom-bat you'll need to change your review to approve, otherwise merging is blocked. Our usual policy is to approve PRs from people with write access when only small things remain to fix, trusting that they will fix them before merging, so that PRs don't get stalled.

@wom-bat (Member) left a comment

Looks good now

@wom-bat wom-bat merged commit d136cb0 into master Mar 1, 2024
7 checks passed
@wom-bat wom-bat deleted the lock-timeout branch March 1, 2024 01:03