aarch64 nodes fail to pull eks/pause during node init #2778
Comments
Hi @pat-s ! I just attempted to reproduce this behavior in I confirmed the AMI is the same as you are using: I created a cluster using
Did you create this ASG by hand? Can you confirm the above policies are attached to the IAM role?
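For readers managing the node role with Terraform, the policy check asked about above could be expressed roughly as follows. This is a minimal sketch, not the reporter's configuration: the variable `node_role_name` and the resource labels are hypothetical. Only `AmazonEC2ContainerRegistryReadOnly` is named in this issue; the other two entries are AWS managed policies commonly attached to EKS worker node roles and are included here as an assumption.

```hcl
# Hypothetical variable: name of the IAM role behind the ASG's instance profile.
variable "node_role_name" {
  type = string
}

locals {
  # AmazonEC2ContainerRegistryReadOnly is the policy named in this issue.
  # The other two are an assumption (typical EKS worker node policies),
  # not something confirmed by the thread.
  node_policy_arns = [
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
  ]
}

# Attach each managed policy to the node role.
resource "aws_iam_role_policy_attachment" "node" {
  for_each   = toset(local.node_policy_arns)
  role       = var.node_role_name
  policy_arn = each.value
}
```

Running `terraform plan` against an existing role would then show whether any of these attachments are missing.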
@zmrow Thanks for checking so quickly! Interesting that you couldn't reproduce it!
No. Via Terraform using https://github.com/terraform-aws-modules/terraform-aws-eks. It's a PROD cluster and not a "new config". We've been using BR for months. The only thing I changed in the ASG config was the instance type and AMI. BR aarch fails with the mentioned issue; using an AL2 AMI instead works (with the same ASG config). I just copy-pasted our existing ASG config (see below) which works fine for
base_arm64_bottlerocket = {
  name         = "base_arm64_br-${var.ENV}"
  min_size     = var.ASG_MIN_base_arm64_br
  desired_size = var.ASG_MIN_base_arm64_br
  max_size     = var.ASG_MAX_base_arm64_br
  bootstrap_extra_args = <<-EOT
    "container-log-max-size" = "500M"
    [settings.kernel.sysctl]
    "user.max_user_namespaces" = "16384"
  EOT
  ami_id                     = data.aws_ami.aws-bottlerocket-arm64.id
  use_mixed_instances_policy = true
  mixed_instances_policy = {
    instances_distribution = {
      on_demand_percentage_above_base_capacity = 0
      spot_allocation_strategy                 = "capacity-optimized"
    }
    override = [
      {
        instance_type     = "t4g.medium"
        weighted_capacity = "1"
      },
      {
        instance_type     = "t4g.large"
        weighted_capacity = "2"
      },
    ]
  }
}
Jup.
This is an unusual error because the pause container should be pulled before kubelet starts. That implies that Do you have access to the nodes either via SSM or SSH? Both would also require
On the off chance that
Based on the current EKS AMI bootstrap.sh, that might be:
The actual value should be in one of these files on the AL2 nodes:
Thanks for the detailed help!
Not in the current setup, I think I would need to add some bootstrap args to the ASG first AFAIK. Could do it, if you tell me what I should check for then :) I am wondering why it works on the After testing with
If that would apply, all ASGs in our setup would not be able to pull.
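As an illustration of the "bootstrap args" mentioned above, one way to get shell access on a Bottlerocket node is the admin host container, which is disabled by default. The fragment below is a sketch under assumptions: it reuses the node group shape from the config posted earlier and assumes the module passes `bootstrap_extra_args` through to the Bottlerocket user data as TOML. SSH access through the admin container additionally needs an EC2 key pair on the instance, which is not shown.

```hcl
base_arm64_bottlerocket = {
  # ... other node group arguments unchanged ...
  bootstrap_extra_args = <<-EOT
    "container-log-max-size" = "500M"

    # Assumption: enable the Bottlerocket admin host container for debugging.
    [settings.host-containers.admin]
    enabled = true
  EOT
}
```

Keys placed before the first `[table]` header stay in whatever table the module's template opened; everything after a header belongs to that new table.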
I went and checked my cluster and it appears that my cluster is using the
I attempted to reproduce this as well, and could not. I was able to successfully pull the
Hi @pat-s, I'm glad you were able to find a workaround -- we haven't been able to repro this. I'm closing this for now. If you run into this again, please feel free to re-open this issue.
@mchaker All good, I understand. Still wondering what might be the culprit on my end but as long as there is a workaround and others don't have the issue, all is good! Thanks for doing a deep check anyhow, I think it was worth it! 🤝
Sorry for coming back to this but I just ran into the same issue again for Could it be that many tags (excluding
@pat-s no worries, thanks for following up. This appears like it could be a similar issue to that of aws/amazon-vpc-cni-k8s#2030. Since it's been some time since this thread has been open, could you provide some information about your current environment (Bottlerocket version, EKS addons, etc)?
Yes, I pinned the image due to awslabs/amazon-eks-ami#1597 as I also faced this issue.
Bottlerocket OS 1.19.1 (aws-k8s-1.29)
Latest addon versions from EKS (node, proxy, dns)
Sidebar on the main issue at hand - Bottlerocket 1.19.1 includes a patch that will pin pause containers such that they are not garbage collected on k8s-1.29 variants #3757. So far we haven't had reports back that this patch isn't working, could you confirm that you can remove the manual pin on your side and see your pause/sandbox containers running?
Earlier in this thread it was called out that EKS is using I'll continue looking into it and attempt a repro to try to get to the bottom of this.
Hey @pat-s, I haven't been able to reproduce this. I've been working out of us-west-2, but both Looking back in this thread, it appears you've already attached the suggested IAM policies... do you have any other data or logs other than the "unauthorized" log? Can you share whether this is a recurring issue on the aarch64 aws-k8s-1.29 variant or whether it is intermittent / one-time? Otherwise, if you're unblocked via the 3.5 tag, I'll keep this issue closed.
Thanks for digging and the persistence in this old thread! I am really wondering what is causing this, and it seems to be an issue for our config only.
I ran BR 1.19.1 before the issue (re-)appeared. I don't think it is related as the issue is with the generic pull of the image / auth denial and the node does not even come up on a fresh start.
That's the only one I see. It also appears early in the pod creation, so there is nothing else. I've now removed the bootstrap args entirely again and this is what I get using
If I set My only idea left is that it might be a region-related issue. I've hardcoded
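The exact value that was pinned is not preserved in this thread. As a rough sketch of what pinning the pause image through the bootstrap args might look like: Bottlerocket exposes the setting `settings.kubernetes.pod-infra-container-image`, and the `3.5` tag is the one mentioned above. The `<registry>` part is a placeholder to be replaced with the ECR registry for your account and region, and the fragment assumes the module renders `bootstrap_extra_args` inside the `[settings.kubernetes]` table, as the existing `container-log-max-size` entry suggests.

```hcl
base_arm64_bottlerocket = {
  # ... other node group arguments unchanged ...
  bootstrap_extra_args = <<-EOT
    # Placeholder registry: substitute the ECR registry for your region.
    "pod-infra-container-image" = "<registry>/eks/pause:3.5"
    "container-log-max-size" = "500M"
    [settings.kernel.sysctl]
    "user.max_user_namespaces" = "16384"
  EOT
}
```

The pin has to appear before the `[settings.kernel.sysctl]` header; placed after it, TOML would treat the key as a sysctl entry instead of a Kubernetes setting.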
I just ran into the same issue with a few nodes (in
The only node group that didn't have affected nodes in the cluster was an EKS-managed node group running default AWS Linux AMI. This is also all on
I also saw some logs that I think were from when it first started where it said it couldn't connect, Ran an instance refresh on all the nodegroups to terminate/recreate all nodes and everything is working fine now.
Image I'm using:
bottlerocket-aws-k8s-1.24-aarch64-v1.12.0-6ef1139f
What I expected to happen:
Nodes are able to pull the eks/pause image and join the cluster.
What actually happened:
Nodes fail to pull the eks/pause image during the init container of the aws-node pod.
How to reproduce the problem:
Create an ASG with an arm64 Graviton instance (e.g. t4g) and the linked BR image.
Additional information:
bottlerocket-aws-k8s-1.24-x86_64-v1.12.0-6ef1139f and the exact same ASG configuration work fine
amazon-eks-arm64-node-1.24-v20230127) work fine
AmazonEC2ContainerRegistryReadOnly attached