Feature Request: Automated Recovery #719
Comments
We are occasionally seeing errors like these as well. For the cache-related ones, an option could also be to continue despite the failure, since writing to the cache probably doesn't constitute a critical failure that should require the entire workflow to fail.
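For the cache-write case specifically, one possible mitigation is a sketch like the following, assuming the BuildKit version in use supports the documented `ignore-error` attribute on cache exports (worth verifying for the backend you use); the image name is a hypothetical placeholder:

```yaml
# Sketch: make a failed cache export non-fatal via the ignore-error attribute.
# Verify that your BuildKit/buildx version and cache backend support it.
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: ghcr.io/example/app:latest   # hypothetical image name
    cache-from: type=gha
    cache-to: type=gha,mode=max,ignore-error=true
```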
We run into this somewhat often. Is a
Some additional info: https://github.com/ddelange/pycuda/actions/runs/3972373867/jobs/6830922090
These images are each around 2 GB+, so a retry might error again if there's no 'resume', i.e. a way to avoid re-pushing the layers that already succeeded.
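One possible workaround for the "no resume" concern, sketched under a couple of assumptions: registries generally skip blobs they already have, and a retry against the same warm builder rebuilds from the BuildKit cache, so a second `push: true` step mostly re-attempts the upload rather than redoing the build. The step ids, platforms, and tag below are illustrative, and the pattern assumes the same builder from `docker/setup-buildx-action` is reused across steps:

```yaml
# Sketch: warm the builder cache first, then retry only the push step.
- uses: docker/setup-buildx-action@v3

- name: Build (no push, warms the builder cache)
  uses: docker/build-push-action@v5
  with:
    context: .
    platforms: linux/amd64,linux/arm64
    push: false

- name: Push (first attempt)
  id: push1
  continue-on-error: true
  uses: docker/build-push-action@v5
  with:
    context: .
    platforms: linux/amd64,linux/arm64
    push: true
    tags: ghcr.io/example/app:latest   # hypothetical tag

- name: Push (retry; reuses cached layers, registry skips blobs it already has)
  if: steps.push1.outcome != 'success'
  uses: docker/build-push-action@v5
  with:
    context: .
    platforms: linux/amd64,linux/arm64
    push: true
    tags: ghcr.io/example/app:latest   # hypothetical tag
```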
I would like to +1 this! For us, this happens maybe once a day, just during continuous deployment (so not counting PRs). The errors we're seeing look like transient infrastructure errors.
Having an internal option to retry the step would be fantastic. We don't want to retry the whole job, as that could mean re-running things that should not be retried, like the test suite.
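To make the request concrete, here is a sketch of what such an option could look like. The `retries` and `retry-delay` inputs are hypothetical and do not exist in docker/build-push-action today; they only illustrate the requested behaviour:

```yaml
# Purely hypothetical inputs: `retries` and `retry-delay` are not real
# docker/build-push-action inputs; shown only to illustrate the feature request.
- name: Build and push Docker image
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: ghcr.io/example/app:latest   # hypothetical tag
    retries: 3            # hypothetical: max attempts on transient errors
    retry-delay: 30s      # hypothetical: wait between attempts
```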
We see this too, both on simple container image promotion for Fluent Bit releases (when pushing the three supported architectures to ghcr.io in parallel, at least one of them usually fails) and when building the multi-arch images, which is a huge time sink: the QEMU build takes a long time, then the push fails, and we have to restart the whole lot again.
I'm seeing this very often now, especially with parallel builds. I have tried pinned versions of BuildKit. It would be good to have more resilient retries (with the understanding that 100% reliability is obviously not achievable).
This comment was marked as off-topic.
Does it not work on your side? Edit: oh, you mean when fetching the cache, right?
This comment was marked as off-topic.
I marked my comments above as off-topic since it was an isolated incident related to GCP Artifact Registry. However, I'd still +1 this feature request. During that incident, retries would have been extremely useful.
Seeing this regularly.
+1 on this feature. I get push-related errors around twice a day and retrying usually fixes them.
Is this feature under development, or is it still being considered? 👀
+1, we also experience transient issues that could be handled by retries.
+1 for backoff retries.
Just wanted to share my solution to this problem. It's the most naive approach: wait a minute and try again if the first attempt failed.

```yaml
- name: Build and push Docker image
  continue-on-error: true
  id: buildx1
  uses: docker/build-push-action
  # ...and so on. Then:

- name: Wait to retry
  if: steps.buildx1.outcome != 'success'
  run: |
    sleep 60

- name: Build and push Docker image
  if: steps.buildx1.outcome != 'success'
  uses: docker/build-push-action
  # ...and so on
```

This has reduced my random failures to almost zero, I must say.
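A short note on why this pattern keeps the job status meaningful: `continue-on-error: true` on the first attempt stops a transient failure from failing the job immediately, while the retry step has no `continue-on-error`, so if the second attempt also fails the job still fails. If the first attempt succeeds, its `outcome` is `'success'` and both follow-up steps are skipped.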
Yes, this... We are also still seeing this. A retry solves it, but it's quite annoying.
Apart from implementing retry logic, is there anything else we can do? This is from a job that only ran two workflows in parallel: one succeeded, one failed.
I have a large GitHub Actions workflow that pushes over 600 images to the GitHub Container Registry.
This mostly works fine, except that I have to set max-parallel based on how many images I expect to be building at a time, and even then I sometimes hit APIs too fast or get a rare error.
For example, I see several different transient errors (log excerpts elided).
These are all temporary errors that disappear the moment I re-run the job. What I wish for is that in such cases (timeouts, server errors, too-many-requests errors) some sort of automated backoff retry system existed, with configurable limits.
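In the meantime, here is a rough sketch of what configurable backoff retries can look like at the shell level, bypassing the action and invoking buildx directly. The attempt count, delays, and image tag are illustrative assumptions:

```yaml
# Sketch: exponential backoff around a direct buildx push.
# max_attempts, the delays, and the tag are illustrative values.
- name: Build and push with backoff retries
  run: |
    max_attempts=4
    delay=15
    for attempt in $(seq 1 "$max_attempts"); do
      if docker buildx build --push -t ghcr.io/example/app:latest .; then
        exit 0
      fi
      if [ "$attempt" -lt "$max_attempts" ]; then
        echo "Attempt $attempt failed; retrying in ${delay}s..."
        sleep "$delay"
        delay=$((delay * 2))
      fi
    done
    echo "All $max_attempts attempts failed" >&2
    exit 1
```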