
Sandbox container image being GC'd in 1.29 #1597

Closed
nightmareze1 opened this issue Jan 29, 2024 · 65 comments · Fixed by #1605

Comments

@nightmareze1

nightmareze1 commented Jan 29, 2024

AMI: amazon-eks-node-1.29-v20240117

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": unexpected status from HEAD request to https://602401143452.dkr.ecr.eu-west-2.amazonaws.com/v2/eks/pause/manifests/3.5: 401 Unauthorized

1 day after upgrading EKS to 1.29

@cartermckinnon
Member

cartermckinnon commented Jan 29, 2024

It sounds like something deleted your pause container image.

I would check:

  1. Make sure that the --pod-infra-container-image flag passed to kubelet matches the sandbox_image in /etc/containerd/config.toml. This will prevent kubelet from deleting it during its image garbage collection process.
  2. Look for RemoveImage CRI calls in your containerd logs. It's likely that some other CRI client (not kubelet) is deleting the image.
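
The first check can be sketched as a few shell helpers. This is only a sketch: the function names are hypothetical, it assumes the kubelet flag is passed in `--flag=value` form, and it assumes `sandbox_image` is a double-quoted string on its own line in the config file.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the AMI) for comparing the two settings.

# Print the sandbox_image value from a containerd config file ($1).
sandbox_image_from_config() {
  sed -n 's/^[[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$1" | head -n 1
}

# Print the --pod-infra-container-image value from a kubelet command line ($1).
# Assumes the --flag=value form; the "--flag value" form is not handled.
pod_infra_image_from_cmdline() {
  printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^--pod-infra-container-image=//p' | head -n 1
}

# Succeed only when both values are non-empty and identical.
images_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example use on a node (paths and unit name are assumptions):
#   cfg=$(sandbox_image_from_config /etc/containerd/config.toml)
#   flag=$(pod_infra_image_from_cmdline "$(tr '\0' ' ' < "/proc/$(pgrep -o -x kubelet)/cmdline")")
#   images_match "$cfg" "$flag" || echo "mismatch: '$cfg' vs '$flag'"
#   journalctl -u containerd | grep RemoveImage   # second check
```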

@nightmareze1
Author

[~]# systemctl status kubelet

          └─3729 /usr/bin/kubelet --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime-endpoint unix:///run/containerd/containerd.sock --image-credential-provider-config /etc/eks/image-credential-provider/config.json --image-credential-provider-bin-dir /etc/eks/image-credential-provider --pod-infra-container-image=602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5 --v=2 

[~]# cat /etc/containerd/config.toml |grep 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5

sandbox_image = "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5"

@jrsparks86

We have also noticed this issue after updating to 1.29. If we rotate out the nodes, it recovers for some time, then comes back a day later.

@nightmareze1
Author

I'm using a temporary workaround proposed by someone in the issue created in the aws-node repo (I modified it a little, but it works):

# Install crictl
curl -fsL -o crictl.tar.gz https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.29.0/crictl-v1.29.0-linux-amd64.tar.gz
tar zxf crictl.tar.gz
chmod u+x crictl
mv crictl /usr/bin/crictl

# The quoted 'EOF' keeps $(...) and $IMAGE_TOKEN literal in the generated script
cat <<'EOF' > /etc/eks/eks_creds_puller.sh
#!/usr/bin/env bash
IMAGE_TOKEN=$(aws ecr get-login-password --region eu-west-2)
crictl --runtime-endpoint=unix:///run/containerd/containerd.sock pull --creds "AWS:$IMAGE_TOKEN" 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5
EOF

chmod u+x /etc/eks/eks_creds_puller.sh

# Re-pull the pause image every 5 minutes (note: this replaces the current user's existing crontab)
echo "*/5 * * * * /etc/eks/eks_creds_puller.sh >> /var/log/eks_creds_puller 2>&1" | crontab -

@nightmareze1 changed the title from "Pods stuck in ContainerCreating due to pull error unauthorized" to "Pods stuck in ContainerCreating due to pull error 401 Unauthorized" on Jan 29, 2024
@ohrab-hacken

I am experiencing the same issue. The --pod-infra-container-image flag is set on kubelet. I found that the disk on my node fills up after some time, and the kubelet garbage collector then deletes the pause image instead of other images. Once the pause image is deleted, the node stops working.
I found the reason for the full disk. In my case, I had ttlSecondsAfterFinished: 7200 for Dagster jobs, and they consumed all the disk space. I changed it to ttlSecondsAfterFinished: 120; jobs are now cleaned up more frequently, and we don't have this issue any more.
It's strange, because I didn't have this issue on 1.28 and I didn't change any Dagster configuration between the version upgrades. My guess is that the kubelet image garbage collector works differently in 1.28 and 1.29.
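
To see how close a node is to triggering image garbage collection, a rough check along these lines may help. It is a sketch with hypothetical helper names; it assumes the kubelet default `imageGCHighThresholdPercent` of 85 (a node's kubelet config may override it) and that images live on the filesystem holding `/var/lib/containerd`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for estimating whether image GC is likely to run.

# Print the used percentage (integer, no % sign) of the filesystem holding $1.
usage_percent() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage ($1) is at or above the threshold ($2, default 85,
# which is the kubelet default for imageGCHighThresholdPercent).
at_or_above_threshold() {
  [ "$1" -ge "${2:-85}" ]
}

# Example use on a node (the path is an assumption):
#   if at_or_above_threshold "$(usage_percent /var/lib/containerd)"; then
#     echo "disk usage is high enough that image GC is likely to run"
#   fi
```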

@ghost

ghost commented Jan 30, 2024

We're experiencing the same issue as well.

@wiseelf

wiseelf commented Jan 30, 2024

I'm having that same issue after upgrading to 1.29 on both AL2 and Bottlerocket nodes.

@havilchis

The kubelet flag --pod-infra-container-image is deprecated in 1.27+ [https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/]. In the current implementation, GC reads the properties of the image as set by the container runtime.

In the case of containerd, the GC should avoid images tagged with the property "pinned: true".

And containerd should flag the sandbox_image as pinned [https://github.com/containerd/containerd/pull/7944].

I believe this issue is related to containerd and the sandbox_image.

Although it is set in config.toml, it is not flagged as "pinned: true".

I do not know if this is a general issue in containerd, but at least in my EKS cluster on 1.29 the sandbox image appears as "pinned: false":

./crictl images | grep pause | grep us-east-1 | grep pause
602401143452.dkr.ecr-fips.us-east-1.amazonaws.com/eks/pause                    3.5                          6996f8da07bd4       299kB
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause                         3.5                          6996f8da07bd4       299kB

./crictl inspecti 6996f8da07bd4 | grep pinned
    "pinned": false

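The same check can be wrapped in a small helper so it can be scripted. This is a sketch: `is_pinned` is a hypothetical name, and it simply looks for a `"pinned": true` field in the JSON that `crictl inspecti` prints.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `crictl inspecti` output on stdin and succeed
# only if it reports "pinned": true.
is_pinned() {
  grep -Eq '"pinned"[[:space:]]*:[[:space:]]*true'
}

# Example use on a node (assumes crictl is on PATH and configured for containerd):
#   if crictl inspecti 6996f8da07bd4 | is_pinned; then
#     echo "pause image is pinned"
#   else
#     echo "pause image is NOT pinned and may be garbage collected"
#   fi
```
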
@cartermckinnon
Member

It definitely seems like image pinning is the problem here. I'm trying to put a fix together 👍

@cartermckinnon
Member

I think the issue here is the version of containerd being used by Amazon Linux does not have pinned image support, which was added in 1.7.3: containerd/containerd@v1.7.2...v1.7.3

I'm verifying that this hasn't been cherry-picked by the AL team. We'll probably have to do a hotfix in the immediate term.

@cartermckinnon
Member

AL intends to push containerd-1.7.11 to the package repositories soon, but I'll go ahead and put together a hotfix on our end.

@cartermckinnon
Member

cartermckinnon commented Jan 30, 2024

I think the best band-aid for now is to periodically pull the sandbox image (if necessary); that's what #1601 does. @mmerkes @suket22 PTAL.
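
As an illustration of the "pull only if necessary" idea (not a copy of what #1601 implements), a sketch might look like this. `ensure_image` and the `CRICTL` override are hypothetical; the credentials string is assumed to be in the `AWS:<token>` form produced with `aws ecr get-login-password`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull an image through the CRI only if it is absent.
# CRICTL can be overridden (for example with a stub) when testing.
CRICTL="${CRICTL:-crictl}"

ensure_image() {
  local image="$1" creds="$2"
  # `crictl images -q IMAGE` prints matching image IDs, or nothing if absent.
  if [ -n "$("$CRICTL" images -q "$image" 2>/dev/null)" ]; then
    echo "present: $image"
  else
    "$CRICTL" pull --creds "$creds" "$image"
  fi
}

# Example use on a node (region and image taken from this thread):
#   token=$(aws ecr get-login-password --region eu-west-2)
#   ensure_image 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5 "AWS:$token"
```

Checking first avoids an ECR round trip on every run; the pull only happens after the image has actually been removed.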

@cartermckinnon changed the title from "Pods stuck in ContainerCreating due to pull error 401 Unauthorized" to "Sandbox container image being GC'd in 1.29" on Jan 30, 2024
@Idan-Lazar

any updates?

@StefanoMantero

We're experiencing the same issue as well, pretty random tho, any updates ?

@dekelummanu

+1

@spatelwearpact

None of our applications or jobs are running in the cluster now! This is literally the highest priority issue with 1.29!

@Tenzer

Tenzer commented Jan 31, 2024

A small workaround I've done on our end to help alleviate the issue, is to give the nodes in the cluster a bigger disk. This means it will take longer time for the nodes to use enough disk space to trigger the garbage collection which deletes the pause image.

@wiseelf

wiseelf commented Jan 31, 2024

> A small workaround I've done on our end to help alleviate the issue, is to give the nodes in the cluster a bigger disk. This means it will take longer time for the nodes to use enough disk space to trigger the garbage collection which deletes the pause image.

I did the same; it just increases the time before the issue occurs and adds expense. I agree that it is a top-priority issue, because it is impossible to downgrade to 1.28 without recreating the cluster.

@cartermckinnon
Member

cartermckinnon commented Jan 31, 2024

The way we pull the image is part of the problem: this label is only applied (with containerd 1.7.3+) at pull time in the cri-containerd server, so ctr pull won't do the trick.

@dims
Member

dims commented Jan 31, 2024

cc @henry118

@cartermckinnon
Member

cartermckinnon commented Jan 31, 2024

While we work to get a fix out, swapping out the sandbox container image to one that doesn't require ECR credentials is another workaround:

  • registry.k8s.io/pause:3.9
  • public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest

@mlagoma

mlagoma commented Jan 31, 2024

> While we work to get a fix out, swapping out the sandbox container image to one that doesn't require ECR credentials is another workaround:
>
>   • registry.k8s.io/pause:3.9
>   • public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest

Greetings, does anybody have any guidance on how I can make this modification to my EKS cluster? Is it part of the Dockerfile build of the container image? The kube deployment manifest (which uses my container image)? Somewhere else? Better to just wait it out for the fix?

@dims
Member

dims commented Jan 31, 2024

@mlagoma /etc/containerd/config.toml is the configuration file for containerd. You will see an entry (key/value) for sandbox_image, which usually points to an image in ECR. @cartermckinnon was talking about switching that.

However, it is better to talk to AWS support and get help if you are not comfortable.
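
For those who do want to try the swap by hand, here is a rough sketch. The `set_sandbox_image` helper is hypothetical; it assumes `sandbox_image` sits on its own line in the config file and that containerd has to be restarted to pick up the change.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the sandbox_image line of a containerd config.
#   $1 = config file, $2 = new image reference
set_sandbox_image() {
  local file="$1" image="$2" tmp rc
  tmp=$(mktemp) || return 1
  # '|' is the sed delimiter because image references contain '/' and ':'.
  sed "s|^\([[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*\).*|\1\"$image\"|" "$file" > "$tmp" &&
    cat "$tmp" > "$file"   # overwrite contents, keeping the file's owner and mode
  rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Example use on a node, as root (back up the file first):
#   cp /etc/containerd/config.toml /etc/containerd/config.toml.bak
#   set_sandbox_image /etc/containerd/config.toml registry.k8s.io/pause:3.9
#   systemctl restart containerd
```

If kubelet is started with --pod-infra-container-image, that flag should be changed to the same image, as noted earlier in the thread. A later comment reports that the node's init scripts rewrite this value on reboot, so a manual edit may not persist.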

@dims
Member

dims commented Feb 5, 2024

> @doramar97 downgrade workers with image v1.28 and it's better to wait a few weeks with the update, because they don't test anything (i.e. they test it in production with customers)

you are welcome to do what works for you. please bear with us as this was a tricky one.

@dims
Member

dims commented Feb 5, 2024

> any updates on bottlerocket ?

@marcin99 if you need a solid ETA for production, it's better to approach via support escalation channels. suffice to say, it's in progress.

@tzneal
Contributor

tzneal commented Feb 5, 2024

> @marcin99 I'm not sure that I can downgrade EKS version without replacing the cluster with a new one, It is a production cluster and i'm looking for a reliable fix until they will issue a fix.

The Bottlerocket team confirmed that the DaemonSet prevention solution I posted above works for Bottlerocket as well.

@marcin99

marcin99 commented Feb 5, 2024

@doramar97 you don't need to downgrade the cluster version, but you can use the image from the previous version for the workers

@RamazanBiyik77

> This issue should be fixed in AMI release v20240202. We were able to include containerd-1.7.11 which properly reports the sandbox_image as pinned to kubelet, after the changes in #1605.

How can I apply these changes to my existing AMI?

I can confirm that my sandbox image is still 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5.

@RamazanBiyik77

Okay, found it in the AWS EKS Compute section. There was a notification for a new AMI release.

@odellcraig

After reading through the thread, I see that this is fixed with v20240202. To apply this change, do you have to update the launch template to point at the new AMI? I see that a new EKS cluster I created yesterday via Terraform is using the latest AMI (ami-0a5010afd9acfaa26 - amazon-eks-node-1.29-v20240227), but a cluster I created about a month ago, before this change, is still on ami-0c482d7ce1aa0dd44 (amazon-eks-node-1.29-v20240117). Is there a way to tell my existing clusters to use the latest AMI?

@bryantbiggs
Contributor

@odellcraig you do that via the release_version

@odellcraig

@bryantbiggs Thank you.

For anyone using Terraform and eks_managed_node_groups you can specify using:

eks_managed_node_groups = {
    initial = {
      ami_release_version = "1.29.0-20240227" # this is the latest version as of this comment
      name           = "..."
      instance_types = [...]
      min_size       = ...
      max_size       = ...
      desired_size   = ...
...
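
Outside Terraform, a similar update can be requested with the AWS CLI. The sketch below only prints the command so it can be reviewed first. It assumes an EKS managed node group that does not use a custom AMI; `nodegroup_update_cmd` and the cluster and node group names in the example are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the AWS CLI command that asks EKS to
# move a managed node group to a given AMI release version.
#   $1 = cluster name, $2 = node group name, $3 = release version
nodegroup_update_cmd() {
  local cluster="$1" nodegroup="$2" release="$3"
  printf 'aws eks update-nodegroup-version --cluster-name %s --nodegroup-name %s --release-version %s\n' \
    "$cluster" "$nodegroup" "$release"
}

# Example (placeholder names; the release version is the one quoted above):
#   nodegroup_update_cmd my-cluster initial 1.29.0-20240227
```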

@korncola

korncola commented Jun 3, 2024

can you please fix the damn issue after half a year? Still happens with EKS managed nodegroup and AMI
amazon/amazon-eks-node-1.29-v20240522

Error in kubelet on node:
unexpected status from HEAD request to https://602401143452.dkr.ecr.eu-central-1.amazonaws.com/v2/eks/pause/manifests/3.5: 403 Forbidden"

migration to EKS halted here

@shamallah

Same error with amazon-eks-node-1.29-v20240315
failed" error="failed to pull and unpack image \"602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5\": failed to copy: httpReadSeeker: failed open: unexpected status code https://602401143452.dkr.ecr.eu-central-1.amazonaws.com/v2/eks/pause/blobs/sha256:6996f8da07bd405c6f82a549ef041deda57d1d658ec20a78584f9f436c9a3bb7: 403 Forbidden"

@tzneal
Contributor

tzneal commented Jun 3, 2024

Are the permissions on your node role correct per https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html? Specifically, does it have the AmazonEC2ContainerRegistryReadOnly policy?

@shamallah

> AmazonEC2ContainerRegistryReadOnly policy?

The policy is attached.

@korncola

korncola commented Jun 3, 2024

Policy AmazonEC2ContainerRegistryReadOnly is attached here also.
Can't you just use a REAL public repo instead of this half-baked, half private/public repo in the configs and init scripts? Hacking the bootstrapping script with public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest works, but only until reboot, because the init scripts will always place this damn non-working URL in /etc/containerd/config.toml

@cartermckinnon
Member

@korncola can you open a ticket with AWS support so we can look into the specifics of your environment?

@korncola

korncola commented Jun 3, 2024

thanks @cartermckinnon , will do that.
But still: Why no true public repo?!

I created a cluster via both Terraform and the GUI and triple-checked the policies. I also disabled all SCPs. Still the same error.
Node groups with the AL2023 or AL2 image had no success either.

@cartermckinnon
Member

ECR Public is only hosted in a few regions; so we still use regional ECR repositories for lower latency and better availability. ECR Public also has a monthly bandwidth limit for anonymous pulls that cannot be increased; so if you're using it in production, make sure you're not sending anonymous requests.

@korncola

korncola commented Jun 3, 2024

> [...] and better availability. [...]

Yeah, I see the availability in this and the other tickets...

> ECR Public also has a monthly bandwidth limit for anonymous pulls that cannot be increased;

As I said above, use a real public service...
And AWS owns that service, so make it worth it...
These are bad excuses for this design decision. Sorry for my rant, but I don't get these decisions. When I look at the scripts with all the hardcoded account IDs to compose an ECR repo URL, with scripts in scripts in scripts, I mean come on, you can do better at AWS.

But as always in the end, I will have a certain typo or whatever on my side causing my ECR pull error and you will all laugh at me :-)

@bryantbiggs
Contributor

@korncola let's keep it professional. The best course of action is to work with the team through the support ticket. Many factors go into these decisions that users are not usually aware of. The team is very responsive in terms of investigating and getting a fix rolled out (as needed).

@korncola

korncola commented Jun 3, 2024

yep you are right 👍 team here is very helpful and responsive, thank you for the support here! Will report when issue is resolved, so others can use that info.

@mlagoma

mlagoma commented Jun 4, 2024

If I understand correctly, the same or a similar (in that it will definitely occur over time) bug was perhaps reintroduced/introduced? So should it be advised to not upgrade nodes? Or is this a separate issue (e.g. anonymous pulls)?

No sign of the issue on older version (1.29.0-20240202)

@cartermckinnon
Member

No, at this point we don’t have evidence of a new bug or a regression.

I’m going to lock this thread to avoid confusion, please open a new issue for follow-ups.

@awslabs awslabs locked as resolved and limited conversation to collaborators Jun 4, 2024