
race condition in 'az acr import' can lead to 'manifest unknown' error in target registry #29974

Open
HenryvanderVegte opened this issue Sep 25, 2024 · 8 comments
Labels
Auto-Assign Auto assign by bot Auto-Resolve Auto resolve by bot bug This issue requires a change to an existing behavior in the product in order to be resolved. Container Registry az acr customer-reported Issues that are reported by GitHub users external to the Azure organization. Service Attention This issue is responsible by Azure service team. Similar-Issue

Comments

@HenryvanderVegte

Describe the bug

When running az acr import like

az acr import --name targetacr --source sourceacr.azurecr.io/myimage:latest --image myimage:latest --force

to copy an image by tag (here 'latest') from sourceacr to targetacr, there is a race condition if the manifest behind the tag in the source registry changes while the az acr import command is in progress.

In that case, the 'az acr import' command completes without any errors. However, docker pull fails with

PS C:\Users> docker pull targetacr.azurecr.io/myimage:latest

What's next:
    View a summary of image vulnerabilities and recommendations → docker scout quickview targetacr.azurecr.io/myimage:latest
Error response from daemon: manifest for targetacr.azurecr.io/myimage:latest not found: manifest unknown: manifest sha256:02f3*** is not found

Looking at the registry in the Azure portal, I can see the tag and its digest:

[screenshot 1: tag and digest listed in the portal]

but receive a 404 NotFound error when trying to fetch the manifest:

[screenshot 2: 404 NotFound response when fetching the manifest]

I believe this is the same issue that was described in #21944.

As described in #21944, this is very dangerous if the ACR is used by a Kubernetes cluster, since it causes pod startup failures with ImagePullBackOff errors.
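To make the suspected interleaving concrete, here is a toy shell sketch of the race. It is not ACR internals, just plain variables standing in for the two registries, and the ordering (resolve tag first, copy manifests later) is an assumption:

```shell
#!/bin/sh
# Toy model: a "registry" is a tag->digest mapping plus the set of
# manifests it actually stores. All names and digests are illustrative.

src_tag_latest="sha256:02f3"        # step 1: latest -> 02f3 in source
src_manifests="sha256:02f3"

# Import starts: it resolves the source tag to a digest first
resolved="$src_tag_latest"

# Step 3 happens mid-import: source re-tags latest, old manifest goes away
src_tag_latest="sha256:3107"
src_manifests="sha256:3107"

# Import finishes: it records the resolved tag mapping in the target,
# but can only copy the manifests still present in the source
tgt_tag_latest="$resolved"
tgt_manifests="$src_manifests"

# Result: the target tag references a digest whose manifest is missing
case "$tgt_manifests" in
  *"$tgt_tag_latest"*) echo "consistent" ;;
  *) echo "dangling tag: $tgt_tag_latest (manifest unknown)" ;;
esac
# prints: dangling tag: sha256:02f3 (manifest unknown)
```

This reproduces exactly the observed symptom: the target lists the tag and digest, but the referenced manifest does not exist.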

Related command

Here's a timeline of all commands that brought the ACR into the bad state:

1) myimage:142506623 with digest 02f3... is pushed to the source ACR and tagged with latest

2024-09-24T09:38:19.0032592Z docker push ***/myimage:142506623
2024-09-24T09:38:20.7350441Z az acr import --name sourceacr --source sourceacr.azurecr.io/myimage:142506623 --image sourceacr.azurecr.io/myimage:latest --force --no-wait

2) az acr import to target registry starts

2024-09-24T09:40:11.4813456Z az acr import --name targetacr --source sourceacr.azurecr.io/myimage:latest --image myimage:latest --force

3) myimage:142506638 with digest 3107... is pushed to the source ACR and tagged with latest

2024-09-24T09:40:39.2704699Z docker push ***/myimage:142506638 
2024-09-24T09:40:40.7808698Z az acr import --name sourceacr --source sourceacr.azurecr.io/myimage:142506638 --image sourceacr.azurecr.io/myimage:latest --force --no-wait

4) az acr import to target registry completes

2024-09-24T09:41:42.3197203Z INFO: ===> Completed in 90.84s: [az acr import --name targetacr --source sourceacr.azurecr.io/myimage:latest --image myimage:latest --force]

The az acr import in 4) completed without any errors, but from that point on the target registry was in a bad state.

This probably makes no difference, but we use a pull token to authenticate against the source registry when transferring the image:

az acr import --name targetacr --source sourceacr.azurecr.io/myimage:latest --image myimage:latest --force --password *** --username myPullToken

Errors

docker pull on the target ACR fails with:

PS C:\Users> docker pull targetacr.azurecr.io/myimage:latest

What's next:
    View a summary of image vulnerabilities and recommendations → docker scout quickview targetacr.azurecr.io/myimage:latest
Error response from daemon: manifest for targetacr.azurecr.io/myimage:latest not found: manifest unknown: manifest sha256:02f3*** is not found

az acr import to copy the image from the target ACR to a different ACR fails with:

az acr import --name testacr --source targetacr.azurecr.io/myimage:latest --image myimage:latest --force --password *** --username myPullToken

(InvalidParameters) Operation registries-*** failed. Resource /subscriptions/***/resourceGroups/***/providers/Microsoft.ContainerRegistry/registries/testacr Invalid message NotFound Not Found {"errors":[{"code":"MANIFEST_UNKNOWN","message":"manifest sha256:02f3*** is not found","detail":{"Name":"myimage","Revision":"sha256:02f3***"}}]}

Code: InvalidParameters
Message: Operation registries-*** failed. Resource /subscriptions/***/resourceGroups/***/providers/Microsoft.ContainerRegistry/registries/testacr Invalid message NotFound Not Found {"errors":[{"code":"MANIFEST_UNKNOWN","message":"manifest sha256:02f3*** is not found","detail":{"Name":"myimage","Revision":"sha256:02f3***"}}]}
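A quick way to confirm the bad state is to ask the target registry which digest the tag points at, then try to pull by that digest. This is a sketch using the illustrative registry/repository names from this issue and assumes an authenticated az and docker session:

```shell
#!/bin/sh
# Consistency check sketch; adjust registry and repository names
# to your environment.
check_target_tag() {
  # Ask the registry which digest the tag currently points at
  digest=$(az acr repository show --name targetacr \
             --image myimage:latest --query digest -o tsv) || return 1
  # In the bad state this pull fails with "manifest unknown",
  # even though the tag itself is listed in the registry
  docker pull "targetacr.azurecr.io/myimage@${digest}"
}
```

In a healthy registry the pull succeeds; in the state described above it fails with the same MANIFEST_UNKNOWN error.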

Issue script & Debug output

Captured debug output via

az acr import --debug --name testacr --source targetacr.azurecr.io/myimage:latest --image myimage:latest --force --password *** --username myPullToken

but I'm afraid it might contain sensitive information. I will provide it if required.

Expected behavior

az acr import should leave the target registry in a consistent state: the tag should point at either the old or the new manifest, and the referenced manifest must exist in the registry.

If the image associated with 'latest' changes while the command is running, it should do one of the following:

  1. fail the az acr import command and not update anything
  2. update the target registry with the image that was associated with 'latest' when the import started
  3. update the target registry with the new 'latest' image

Environment Summary

azure-cli 2.245.5

Additional context

No response

@HenryvanderVegte HenryvanderVegte added the bug This issue requires a change to an existing behavior in the product in order to be resolved. label Sep 25, 2024

Hi @HenryvanderVegte,

2.245.5 is not the latest Azure CLI (2.64.0).

If you haven't already attempted to do so, please upgrade to the latest Azure CLI version by following https://learn.microsoft.com/en-us/cli/azure/update-azure-cli.

@azure-client-tools-bot-prd azure-client-tools-bot-prd bot added the Auto-Resolve Auto resolve by bot label Sep 25, 2024
@microsoft-github-policy-service microsoft-github-policy-service bot added the customer-reported Issues that are reported by GitHub users external to the Azure organization. label Sep 25, 2024
@yonzhan
Collaborator

yonzhan commented Sep 25, 2024

Thank you for opening this issue, we will look into it.

@microsoft-github-policy-service microsoft-github-policy-service bot added Auto-Assign Auto assign by bot Service Attention This issue is responsible by Azure service team. Container Registry az acr labels Sep 25, 2024

Here are some similar issues that might help you. Please check if they can solve your problem.

Contributor

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @toddysm, @luisdlp, @northtyphoon.

@kichalla

cc @nathana1

@terencet-dev

@HenryvanderVegte, is this still an issue you're encountering? Have you upgraded your AzureCLI yet?

@HenryvanderVegte
Author

Hey @terencet-dev,

We worked around the issue by copying images by digest instead of by tag, but since that is only a workaround, the underlying issue very likely still exists.
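The digest-based workaround could look roughly like this. It is a sketch with the illustrative names from this issue: the tag is resolved to an immutable digest exactly once, and the import then references that digest, so a concurrent re-tag in the source can no longer race the copy:

```shell
#!/bin/sh
# Workaround sketch: import by immutable digest instead of by tag.
# Registry/repository names are illustrative; requires pull access
# to sourceacr and sufficient permissions on targetacr.
import_by_digest() {
  # Resolve the moving tag to a fixed digest exactly once, up front
  digest=$(az acr repository show --name sourceacr \
             --image myimage:latest --query digest -o tsv) || return 1
  # Import the pinned digest and apply the tag in the target registry
  az acr import --name targetacr \
    --source "sourceacr.azurecr.io/myimage@${digest}" \
    --image myimage:latest --force
}
```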

Have you upgraded your AzureCLI yet?

I might be wrong, but this looks like a service-side issue to me: even old versions of the Azure CLI should not be able to trigger a race condition in the container registry.

CC @mahilleb-msft / @Michael-Sinz

@HenryvanderVegte
Author

@yonzhan / @kichalla / @terencet-dev, is there any update you could share? And could you assign the issue to someone?

Thanks!
