builds.py: let mq lock be reclaimed after 30min #337

Merged: 2 commits merged into master from lock-timeout on Mar 1, 2024

Conversation

@lsf37 (Member) commented Feb 27, 2024

A typical job takes about 5-10 min. If the lock has been held for 30 min in a single job, we assume something has gone wrong and the lock should be released so other jobs (or humans needing the machine) can proceed.
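
For illustration only (not the actual machine-queue implementation), here is a minimal sketch of the intended semantics: any waiter may take over the lock once the current holder has exceeded a hold timeout. The class and method names are made up for this example.

```python
import time

# Sketch only: a lock that waiters may reclaim once the holder has held it
# longer than `timeout` seconds (30 min here; a normal job takes 5-10 min).
class ReclaimableLock:
    def __init__(self, timeout: float = 30 * 60):
        self.timeout = timeout
        self.holder = None
        self.acquired_at = None

    def acquire(self, who: str) -> bool:
        now = time.monotonic()
        stale = (self.holder is not None
                 and now - self.acquired_at > self.timeout)
        if self.holder is None or stale:
            # Free, or held too long by a presumably stuck job: take it over.
            self.holder, self.acquired_at = who, now
            return True
        return False

    def release(self, who: str) -> None:
        if self.holder == who:
            self.holder = self.acquired_at = None
```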

@lsf37 lsf37 requested a review from wom-bat February 27, 2024 12:48
@lsf37 lsf37 self-assigned this Feb 27, 2024
@lsf37 lsf37 changed the title from "builds.py: let mq lock be reclaimed after 15min" to "builds.py: let mq lock be reclaimed after 30min" Feb 27, 2024
Review comment on seL4-platforms/builds.py (outdated, resolved)
@axel-h (Member) commented Feb 27, 2024

I wonder if we should combine this with also using https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes

@lsf37 (Member, Author) commented Feb 27, 2024

I wonder if we should combine this with also using https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes

That's a different use case, and a bit harder to pick a good value for: that workflow timeout covers the overall time for all jobs, which is much more variable than a single job once it has the lock, because the global time also includes lock wait time and varies with machine queue contention. The time for a single job once it has the lock is reasonably predictable.

Might still be worth looking into and reducing the default for some of them.

The main case this specific timeout fixes is when a job takes the lock and the machine queue server then becomes unreachable on the network while the job is trying to release it. AFAICS there is no fix for that apart from a timeout. It doesn't happen often, but it has still happened maybe 4 times or so in the last two years (mostly around network outages, unsurprisingly).
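
To illustrate that failure mode, a rough sketch (a hypothetical helper, not the actual builds.py code; the exact mq.sh invocation is an assumption): the client can only attempt a best-effort release, and if the machine queue server is unreachable at that point, only a server-side hold timeout can eventually free the lock.

```python
import subprocess

def try_release(machine: str) -> bool:
    """Best-effort lock release; returns False if the machine queue server
    cannot be reached (e.g. during a network outage). The mq.sh subcommand
    shown here is assumed for illustration only."""
    try:
        subprocess.run(["mq.sh", "sem", "-signal", machine],
                       check=True, timeout=60)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        return False

# If this and the post-step cleanup both fail, nothing on the client side can
# release the lock; the 30 min reclaim timeout in this PR is the backstop.
```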

@axel-h (Member) left a comment

Looks good to me.

@wom-bat (Member) left a comment

Either change the title to '20 minutes' or change the timeout to 1800 seconds. Either would be enough for this to pass.

Review comments on seL4-platforms/builds.py (outdated, resolved)
@kent-mcleod (Member) commented

We used to handle this by setting the timeout on the job that was running. The step to release the locks would trigger when the job completed, was cancelled, or timed out. Would that not work for us? Is releasing the locks triggered as a :post action? Do post actions run when earlier steps time out?

@lsf37 (Member, Author) commented Feb 28, 2024

We used to handle this by setting the timeout on the job that was running. The step to release the locks would trigger when the job completed, was cancelled, or timed out. Would that not work for us? Is releasing the locks triggered as a :post action? Do post actions run when earlier steps time out?

They do, and all of this is already done in the code. This timeout is for when that has failed: the network is unavailable, so the normal lock release fails, the post-step clear-all lock release also fails, and the lock is still held on the server when it comes back on the network.

Updated commit message:

A typical job takes about 5-10 min. If the lock has been held for 30 min
in a single job, we assume something has gone wrong and the lock should
be released so other jobs (or humans needing the machine) can proceed.

This step is only necessary when other timeouts have failed. It is to
guard against the case where the lock release after timeout has failed,
and the post-step lock release also has failed. This can happen when the
machine queue server is temporarily unreachable on the network, and then
comes back with the lock still in place. In this case, there is nothing
the scripts here can do to release that lock.

Signed-off-by: Gerwin Klein <[email protected]>
@lsf37 (Member, Author) commented Feb 28, 2024

Either change the title to '20 minutes' or change the timeout to 1800 seconds. Either would be enough for this to pass.

Done. I've also updated the commit message to explain that this is a timeout of last resort.

@lsf37 lsf37 requested a review from wom-bat February 29, 2024 10:46
@lsf37 (Member, Author) commented Feb 29, 2024

@wom-bat you'll need to change your review to approve, otherwise merging is blocked. Our usual policy is to approve PRs from people with write access when only small things remain to fix, trusting that they will fix them before merging, so that PRs don't get stalled.

@wom-bat (Member) left a comment

Looks good now

@wom-bat wom-bat merged commit d136cb0 into master Mar 1, 2024
7 checks passed
@wom-bat wom-bat deleted the lock-timeout branch March 1, 2024 01:03