-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
builds.py: let mq lock be reclaimed after 30min #337
Conversation
I wonder we shoud combine this with also using https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes |
That's a different use case, and a bit harder to give a good value for, because that workflow timeout is the overall time for all jobs -- that is much more variable than a single job once it got the lock, because the global one also includes lock wait time and varies depending on machine queue contention. The time for a single job once it has the lock is reasonably predictable. Might still be worth looking into and reducing the default for some of them. The main case that this specific timeout is a fix for is the condition where a job takes the lock, and then the machine queue server becomes unavailable on the network when it's trying to release it. AFAICS there is no fix for that apart from a timeout. It doesn't happen that often, but it's still happened maybe 4 times or so in the last two years (mostly around network outages, unsurprisingly). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either change the title to '20 minutes' or change the timeout to 1800 seconds. Either would be enough for this to pass.
We used to handle this by setting the timeout on the job that was running. The step to release the locks would trigger when the job completed or was cancelled or was timed out. Would that not work for us? Is releasing the locks triggered as a :post action? Do post actions run when earlier steps timeout? |
They do, and all of this is already done in the code. This is for when that has failed, because the network is unavailable, so the normal lock release fails and then the post step clear-all lock release fails, and the lock is still held on the server when it comes back on the network. |
A typical job takes about 5-10 min. If the lock has been held for 30 min in a single job, we assume something has gone wrong and the lock should be released so other jobs (or humans needing the machine) can proceed. This step is only necessary when other timeouts have failed. It is to guard against the case where the lock release after timeout has failed, and the post-step lock release also has failed. This can happen when the machine queue server is temporarily unreachable on the network, and then comes back with the lock still in place. In this case, there is nothing the scripts here can do to release that lock. Signed-off-by: Gerwin Klein <[email protected]>
Done. I've also updated the commit message to explain that this is a timeout of last resort. |
@wom-bat you'll need to change your review to approve, otherwise merging is blocked. We usually have the policy to approve PRs for people with write access where the PR only has small things to fix up and trust that people will do these before merging to avoid getting stalled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now
A typical job takes about 5-10 min. If the lock has been held for 30 min in a single job, we assume something has gone wrong and the lock should be released so other jobs (or humans needing the machine) can proceed.