Feature Request: Automated Recovery #719
Comments
We are occasionally seeing errors like these as well. For the cache-related ones, an option could also be to continue despite the failure, since writing to the cache probably doesn't constitute a critical failure that should require the entire workflow to fail.
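For the cache-write case specifically, one possible mitigation is a sketch like the following, assuming the BuildKit version in use supports the documented `ignore-error` attribute on cache exports (worth verifying for the backend you use); the image name is a hypothetical placeholder:

```yaml
# Sketch: make a failed cache export non-fatal via the ignore-error attribute.
# Verify that your BuildKit/buildx version and cache backend support it.
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: ghcr.io/example/app:latest   # hypothetical image name
    cache-from: type=gha
    cache-to: type=gha,mode=max,ignore-error=true
```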
We run into this somewhat often. Is a
Some additional info: https://github.com/ddelange/pycuda/actions/runs/3972373867/jobs/6830922090
These images are each around 2 GB+, so a retry might error again if there's no 'resume', i.e. a way to avoid re-pushing the layers that already succeeded.
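One possible workaround for the "no resume" concern, sketched under a couple of assumptions: registries generally skip blobs they already have, and a retry against the same warm builder rebuilds from the BuildKit cache, so a second `push: true` step mostly re-attempts the upload rather than redoing the build. The step ids, platforms, and tag below are illustrative, and the pattern assumes the same builder from `docker/setup-buildx-action` is reused across steps:

```yaml
# Sketch: warm the builder cache first, then retry only the push step.
- uses: docker/setup-buildx-action@v3

- name: Build (no push, warms the builder cache)
  uses: docker/build-push-action@v5
  with:
    context: .
    platforms: linux/amd64,linux/arm64
    push: false

- name: Push (first attempt)
  id: push1
  continue-on-error: true
  uses: docker/build-push-action@v5
  with:
    context: .
    platforms: linux/amd64,linux/arm64
    push: true
    tags: ghcr.io/example/app:latest   # hypothetical tag

- name: Push (retry; reuses cached layers, registry skips blobs it already has)
  if: steps.push1.outcome != 'success'
  uses: docker/build-push-action@v5
  with:
    context: .
    platforms: linux/amd64,linux/arm64
    push: true
    tags: ghcr.io/example/app:latest   # hypothetical tag
```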
I would like to +1 this! For us, this happens maybe once a day, just during continuous deployment (so not counting PRs). The errors we're seeing look like transient infrastructure errors.
Having an internal option to retry the step would be fantastic. We don't want to retry the whole job, as that could mean re-running things that should not be retried, like the test suite.
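To make the request concrete, here is a sketch of what such an option could look like. The `retries` and `retry-delay` inputs are hypothetical and do not exist in docker/build-push-action today; they only illustrate the requested behaviour:

```yaml
# Purely hypothetical inputs: `retries` and `retry-delay` are not real
# docker/build-push-action inputs; shown only to illustrate the feature request.
- name: Build and push Docker image
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: ghcr.io/example/app:latest   # hypothetical tag
    retries: 3            # hypothetical: max attempts on transient errors
    retry-delay: 30s      # hypothetical: wait between attempts
```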
We see this too, both on simple container image promotion for Fluent Bit releases (when pushing the three supported architectures to ghcr.io in parallel, at least one of them usually fails) and when building the multi-arch images, which is a huge time sink: the QEMU build takes a long time, then the push fails, and we have to restart the whole lot again.
I'm seeing this very often now, especially with parallel builds. I have tried pinned versions of BuildKit. It would be good to have more resilient retries (with the understanding that 100% reliability is obviously not achievable).
This comment was marked as off-topic.
Does it not work on your side? Edit: oh, you mean when fetching the cache, right?
This comment was marked as off-topic.
I marked my comments above as off-topic since it was an isolated incident related to GCP Artifact Registry. However, I'd still +1 this feature request. During that incident, retries would have been extremely useful.
Seeing this regularly.
+1 on this feature. I get push-related errors around twice a day and retrying usually fixes them.
Is this feature under development, or is it still being considered? 👀
+1, we also experience transient issues that could be handled by retries.
+1 for backoff retries.
Just wanted to share my solution to this problem. It's the most naive approach: wait a minute and try again if the first attempt failed.

```yaml
- name: Build and push Docker image
  continue-on-error: true
  id: buildx1
  uses: docker/build-push-action
  # ...and so on. Then:

- name: Wait to retry
  if: steps.buildx1.outcome != 'success'
  run: |
    sleep 60

- name: Build and push Docker image
  if: steps.buildx1.outcome != 'success'
  uses: docker/build-push-action
  # ...and so on
```

This has reduced my random failures to almost zero, I must say.
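A short note on why this pattern keeps the job status meaningful: `continue-on-error: true` on the first attempt stops a transient failure from failing the job immediately, while the retry step has no `continue-on-error`, so if the second attempt also fails the job still fails. If the first attempt succeeds, its `outcome` is `'success'` and both follow-up steps are skipped.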
Yes, this... We are also still seeing this. A retry solves it, but it's quite annoying.
Apart from implementing retry logic, is there anything else we can do? This is from a job that only ran two workflows in parallel: one succeeded, one failed.
I have a large GitHub Actions workflow that pushes over 600 images to the GitHub Container Registry.
This mostly works fine, except that I have to set max-parallel based on how many images I expect to be building at a time, and even then I sometimes hit APIs too fast or get a rare error.
For example, I see several different transient errors (log excerpts elided).
These are all temporary errors that disappear the moment I re-run the job. What I wish for is that in such cases (timeouts, server errors, too-many-requests errors) some sort of automated backoff retry system existed, with configurable limits.
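In the meantime, here is a rough sketch of what configurable backoff retries can look like at the shell level, bypassing the action and invoking buildx directly. The attempt count, delays, and image tag are illustrative assumptions:

```yaml
# Sketch: exponential backoff around a direct buildx push.
# max_attempts, the delays, and the tag are illustrative values.
- name: Build and push with backoff retries
  run: |
    max_attempts=4
    delay=15
    for attempt in $(seq 1 "$max_attempts"); do
      if docker buildx build --push -t ghcr.io/example/app:latest .; then
        exit 0
      fi
      if [ "$attempt" -lt "$max_attempts" ]; then
        echo "Attempt $attempt failed; retrying in ${delay}s..."
        sleep "$delay"
        delay=$((delay * 2))
      fi
    done
    echo "All $max_attempts attempts failed" >&2
    exit 1
```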