Sandbox container image being GC'd in 1.29 #1597
Comments
It sounds like something deleted your pause container image. I would check:
|
[~]# systemctl status kubelet
[~]# cat /etc/containerd/config.toml |grep 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5
|
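The second check above hard-codes one region's image name. A minimal sketch of a region-independent variant follows; it assumes the config contains a line of the form `sandbox_image = "..."`, and the function name `sandbox_image_from_config` is hypothetical, not part of any AMI tooling.

```shell
#!/usr/bin/env bash
# Print the sandbox (pause) image configured for containerd.
# Assumes a line of the form: sandbox_image = "registry/repo:tag"
# Argument: path to the containerd config (defaults to the path used in this thread).
sandbox_image_from_config() {
  local config="${1:-/etc/containerd/config.toml}"
  sed -n 's/^[[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$config" | head -n 1
}
```

On a node, something like `crictl inspecti "$(sandbox_image_from_config)"` should then fail if the configured image is no longer present locally.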
We also have noticed this issue after updating to 1.29. If we rotate out the nodes it recovers for some time then comes back a day later. |
I'm using a temporary workaround proposed by someone in the issue created in the aws-node repo (I modified it a little, but it works)
|
I am experiencing the same issue. |
We're experiencing the same issue as well. |
I'm having that same issue after upgrading to 1.29 on both AL2 and Bottlerocket nodes. |
The Kubelet flag In the case of containerd, the GC should skip images marked "pinned: true", and containerd should mark the sandbox_image as pinned [https://github.com/containerd/containerd/pull/7944]. I believe this issue is related to containerd and the sandbox_image: although it is set in config.toml, it is not flagged as "pinned: true". I do not know whether this is a general containerd issue, but at least in my EKS cluster on 1.29 the sandbox image appears as "pinned: false".
|
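One way to check the pinned state on a node is sketched below. It assumes the image status JSON carries a boolean `pinned` field, as recent `crictl` versions print; the helper name `image_is_pinned` is hypothetical. A missing field is treated the same as `false`.

```shell
#!/usr/bin/env bash
# Succeeds (exit 0) if the image status JSON read from stdin reports "pinned": true.
# Assumption: the JSON comes from something like `crictl inspecti <image>`.
image_is_pinned() {
  grep -Eq '"pinned"[[:space:]]*:[[:space:]]*true'
}
```

Example use on a node, with `$IMAGE` set to the configured sandbox image: `crictl inspecti "$IMAGE" | image_is_pinned && echo pinned || echo "not pinned"`.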
It definitely seems like image pinning is the problem here. I'm trying to put a fix together 👍 |
I think the issue here is the version of I'm verifying that this hasn't been cherry-picked by the AL team. We'll probably have to do a hotfix in the immediate term. |
AL intends to push |
any updates? |
We're experiencing the same issue as well, and it's pretty random. Any updates? |
+1 |
None of our applications or jobs are running in the cluster now! This is literally the highest priority issue with 1.29! |
A small workaround I've used on our end to help alleviate the issue is to give the nodes in the cluster a bigger disk. This means it takes longer for the nodes to use enough disk space to trigger the garbage collection that deletes the pause image. |
I did the same; it just delays the issue and brings additional expense. I agree that it is a top-priority issue, because it is impossible to downgrade to 1.28 without recreating the cluster. |
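To see why a bigger disk only buys time: the kubelet starts image garbage collection once disk usage crosses its high threshold (`imageGCHighThresholdPercent`, 85% by default). A rough sketch of the headroom arithmetic follows; the function name is hypothetical and the sizes are made-up inputs, not measurements from this thread.

```shell
#!/usr/bin/env bash
# Print how many GiB can still be written before disk usage reaches the
# image GC high threshold (integer arithmetic, rounds down).
# Arguments: disk size (GiB), currently used (GiB), threshold percent (default 85).
gc_headroom_gib() {
  local size_gib="$1" used_gib="$2" threshold_pct="${3:-85}"
  echo $(( size_gib * threshold_pct / 100 - used_gib ))
}
```

For instance, `gc_headroom_gib 20 10` prints `7` and `gc_headroom_gib 100 10` prints `75`: the larger disk delays the first GC run but does not prevent it. A negative result means usage is already past the threshold.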
The way we pull the image is part of the problem, this label is only applied (with |
cc @henry118 |
While we work to get a fix out, swapping out the sandbox container image to one that doesn't require ECR credentials is another workaround:
|
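The snippet from the original comment is not preserved here. As a hedged sketch of what swapping the image could look like, the helper below rewrites the `sandbox_image` line in a containerd config; the function name is hypothetical, and the replacement image is whatever public pause image you choose (`registry.k8s.io/pause:3.9` is used below only as a placeholder).

```shell
#!/usr/bin/env bash
# Rewrite the sandbox_image setting in a containerd config file in place.
# Arguments: new image reference, config path (defaults to the path used in this thread).
# Assumes the image reference contains no '|' or '&' characters.
set_sandbox_image() {
  local image="$1" config="${2:-/etc/containerd/config.toml}"
  local tmp status=0
  tmp="$(mktemp)" || return 1
  # Keep everything up to the '=' and replace only the quoted value.
  sed "s|^\([[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*\)\".*\"|\1\"${image}\"|" "$config" > "$tmp" \
    && cat "$tmp" > "$config" || status=$?
  rm -f "$tmp"
  return "$status"
}
```

On a node this would be followed by restarting containerd, for example `set_sandbox_image registry.k8s.io/pause:3.9 && systemctl restart containerd`. The kubelet may also need to be pointed at the same image, depending on how the node is configured.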
Greetings, does anybody have any guidance on how I can make this modification to my EKS cluster? Is it part of the Dockerfile build of the container image? The kube deployment manifest (which uses my container image)? Somewhere else? Better to just wait it out for the fix? |
@mlagoma However, it is better to talk to AWS support and get help if you are not comfortable. |
you are welcome to do what works for you. please bear with us as this was a tricky one. |
@marcin99 if you need a solid ETA for production, it's better to approach via support escalation channels. suffice to say, it's in progress. |
The Bottlerocket team confirmed that the DaemonSet prevention solution I posted above works for Bottlerocket as well. |
@doramar97 you don't need to downgrade the cluster version, but you can use the image from the previous version for the workers. |
How can I apply these changes to my existing AMI? I can confirm that my sandbox image is still 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5. |
Okay, found it in the AWS EKS Compute section. There was a notification for a new AMI release. |
After reading through the thread, I see that this is fixed with v20240202. To apply this change, do you have to update the launch template to point at the new AMI? I see that a new EKS cluster I created yesterday via Terraform is using the latest AMI (ami-0a5010afd9acfaa26 - amazon-eks-node-1.29-v20240227), but a cluster I created about a month ago, before this change, is still on ami-0c482d7ce1aa0dd44 (amazon-eks-node-1.29-v20240117). Is there a way to tell my existing clusters to use the latest AMI? |
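The AMI names quoted above end in a release date, so a node's AMI can be compared against the v20240202 release mentioned in this thread. A small sketch, assuming the name ends in `-vYYYYMMDD`; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Succeeds if an EKS AMI name such as "amazon-eks-node-1.29-v20240227" carries a
# release date on or after the given minimum (default 20240202, per this thread).
# Returns 2 if the name has no trailing -vYYYYMMDD style suffix.
ami_release_at_least() {
  local name="$1" minimum="${2:-20240202}"
  local release="${name##*-v}"
  case "$release" in
    ''|*[!0-9]*) return 2 ;;
  esac
  [ "$release" -ge "$minimum" ]
}
```

With the two names above, `ami_release_at_least amazon-eks-node-1.29-v20240227` succeeds and `ami_release_at_least amazon-eks-node-1.29-v20240117` fails.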
@odellcraig you do that via the |
@bryantbiggs Thank you. For anyone using Terraform and eks_managed_node_groups you can specify using:
|
can you please fix the damn issue after half a year? Still happens with EKS managed nodegroup and AMI Error in kubelet on node: migration to EKS halted here |
Same error with amazon-eks-node-1.29-v20240315 |
Are the permissions on your node role correct per https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html? Specifically, does it have the AmazonEC2ContainerRegistryReadOnly policy? |
The policy is attached. |
Policy AmazonEC2ContainerRegistryReadOnly is attached here also. |
@korncola can you open a ticket with AWS support so we can look into the specifics of your environment? |
Thanks @cartermckinnon, will do that. I created a cluster via Terraform and via the GUI, and triple-checked the policies. I also disabled all SCPs. Still the same error. |
ECR Public is only hosted in a few regions; so we still use regional ECR repositories for lower latency and better availability. ECR Public also has a monthly bandwidth limit for anonymous pulls that cannot be increased; so if you're using it in production, make sure you're not sending anonymous requests. |
Yeah, I see the availability in this and the other tickets...
As I said above, use a real public service... But as always, in the end I will have a typo or whatever on my side causing my ECR pull error, and you will all laugh at me :-) |
@korncola let's keep it professional. The best course of action is to work with the team through the support ticket. There are many factors that go into decisions that users are not usually aware of. The team is very responsive in terms of investigating and getting a fix rolled out (as needed). |
yep you are right 👍 team here is very helpful and responsive, thank you for the support here! Will report when issue is resolved, so others can use that info. |
If I understand correctly, the same or a similar (in that it will definitely occur over time) bug was perhaps reintroduced/introduced? So should it be advised to not upgrade nodes? Or is this a separate issue (e.g. anonymous pulls)? No sign of the issue on older version (1.29.0-20240202) |
No, at this point we don’t have evidence of a new bug or a regression. I’m going to lock this thread to avoid confusion, please open a new issue for follow-ups. |
AMI: amazon-eks-node-1.29-v20240117
1 day after upgrading EKS to 1.29