aarch64 nodes fail to pull eks/pause during node init #2778

Closed
pat-s opened this issue Feb 4, 2023 · 16 comments
Labels
status/research This issue is being researched type/bug Something isn't working

Comments

@pat-s

pat-s commented Feb 4, 2023

Image I'm using:

bottlerocket-aws-k8s-1.24-aarch64-v1.12.0-6ef1139f

What I expected to happen:

Nodes are able to pull the eks/pause image and join the cluster.

What actually happened:

Nodes fail to pull the eks/pause image during the init container of the aws-node pod.

error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.1-eksbuild.1": pulling from host 602401143452.dkr.ecr.eu-central-1.amazonaws.com failed with status code [manifests 3.1-eksbuild.1]: 401 Unauthorized 

How to reproduce the problem:

Create an ASG with an arm64 Graviton instance (e.g. t4g) and the linked BR image.

Additional information:

  • x86_64 nodes running bottlerocket-aws-k8s-1.24-x86_64-v1.12.0-6ef1139f and the exact same ASG configuration work fine
  • arm64 graviton instances running AL2 (amazon-eks-arm64-node-1.24-v20230127) work fine
  • I have reproduced the behavior multiple times now to make sure it's not an ASG or k8s issue
  • The ASG group's IAM role has the AmazonEC2ContainerRegistryReadOnly attached
  • EKS addons:
    • CNI: v1.12.1-eksbuild.2
    • proxy: v1.24.9-eksbuild.1
    • DNS: v1.8.7-eksbuild.3
@pat-s pat-s added status/needs-triage Pending triage or re-evaluation type/bug Something isn't working labels Feb 4, 2023
@pat-s pat-s changed the title aarch64 nodes fail to pull eks/pause during node init aarch64 nodes fail to pull eks/pause during node init Feb 4, 2023
@zmrow
Contributor

zmrow commented Feb 4, 2023

Hi @pat-s ! I just attempted to reproduce this behavior in eu-central-1 and couldn't! (I chose that region because that was in the error message you pasted above)

I confirmed the AMI is the same as you are using: ami-04ddbb09d9e7726b2 (bottlerocket-aws-k8s-1.24-aarch64-v1.12.0-6ef1139f)

I created a cluster using eksctl and the below config and both nodes came up and joined the cluster properly.

---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: bottlerocket-arm
  region: eu-central-1
  version: '1.24'

nodeGroups:
  - name: ng-bottlerocket
    instanceType: t4g.small
    desiredCapacity: 2
    amiFamily: Bottlerocket
    iam:
       attachPolicyARNs:
          - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
          - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
          - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
          - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    ssh:
        allow: true
        publicKeyName: <MY KEY HERE>
    bottlerocket:
      settings:
        motd: "Hello from eksctl!"

Did you create this ASG by hand? Can you confirm the above policies are attached to the IAM role?

@zmrow zmrow added status/research This issue is being researched and removed status/needs-triage Pending triage or re-evaluation labels Feb 4, 2023
@pat-s
Author

pat-s commented Feb 4, 2023

@zmrow Thanks for checking so quickly!

Interesting that you couldn't reproduce it!

Did you create this ASG by hand?

No. Via Terraform using https://github.com/terraform-aws-modules/terraform-aws-eks. It's a PROD cluster and not a "new config". We've been using BR for months. The only thing I changed in the ASG config was the instance type and AMI.

BR aarch64 fails with the mentioned issue, while using an AL2 AMI instead works (with the same ASG config). I just copied our existing ASG config (see below), which works fine for x86_64 and t3.* nodes.

    base_arm64_bottlerocket = {
      name         = "base_arm64_br-${var.ENV}"
      min_size     = var.ASG_MIN_base_arm64_br
      desired_size = var.ASG_MIN_base_arm64_br
      max_size     = var.ASG_MAX_base_arm64_br

      bootstrap_extra_args = <<-EOT
      "container-log-max-size" = "500M"
      [settings.kernel.sysctl]
      "user.max_user_namespaces" = "16384"
    EOT
      ami_id               = data.aws_ami.aws-bottlerocket-arm64.id

      use_mixed_instances_policy = true
      mixed_instances_policy = {
        instances_distribution = {
          on_demand_percentage_above_base_capacity = 0
          spot_allocation_strategy                 = "capacity-optimized"
        }

        override = [
          {
            instance_type     = "t4g.medium"
            weighted_capacity = "1"
          },
          {
            instance_type     = "t4g.large"
            weighted_capacity = "2"
          },
        ]
      }
    }

Can you confirm the above policies are attached to the IAM role?

Yep.

image

@bcressey
Contributor

bcressey commented Feb 4, 2023

error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.1-eksbuild.1": ... 401 Unauthorized 

This is an unusual error because the pause container should be pulled before kubelet starts.

That implies that host-ctr might be failing to pull the image, and somehow still "succeeding" so that kubelet starts.

Do you have access to the nodes either via SSM or SSH? Both would also require host-ctr to be working in order to pull the respective images from ECR.

@bcressey
Contributor

bcressey commented Feb 4, 2023

On the off chance that eks-node-policy-prod is somehow restricting ECR pulls, you could potentially try overriding the default pause container image to match what the AL2 arm64 instance is using.

      bootstrap_extra_args = <<-EOT
      "container-log-max-size" = "500M"
      "pod-infra-container-image" = "<ECR IMAGE OVERRIDE>"
      [settings.kernel.sysctl]
      "user.max_user_namespaces" = "16384"
    EOT

Based on the current EKS AMI bootstrap.sh, that might be:

602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5

The actual value should be in one of these files on the AL2 nodes:

/etc/systemd/system/kubelet.service.d/10-kubelet-args.conf
/etc/eks/containerd/containerd-config.toml
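
A minimal sketch of pulling the pause image reference out of those two files, assuming the reference appears in them as a plain `<registry>/eks/pause:<tag>` string. The key names used inside the files are not shown in this thread, so the sketch matches on the image path only; `find_pause_refs` and the sample line are hypothetical.

```python
import re
from pathlib import Path

# Matches "<registry host>/eks/pause:<tag>" anywhere in a piece of text.
PAUSE_REF = re.compile(r"[\w.-]+/eks/pause:[\w.-]+")

def find_pause_refs(text: str) -> list[str]:
    """Return the distinct eks/pause image references found in text."""
    return sorted(set(PAUSE_REF.findall(text)))

# Hypothetical sample line, only to show the shape of the result.
sample = 'sandbox_image = "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5"\n'
print(find_pause_refs(sample))

# On an AL2 node, scan the two files named above if they are present.
for name in (
    "/etc/systemd/system/kubelet.service.d/10-kubelet-args.conf",
    "/etc/eks/containerd/containerd-config.toml",
):
    path = Path(name)
    if path.is_file():
        print(name, find_pause_refs(path.read_text()))
```

Whatever reference this prints on a working AL2 arm64 node is the value to try in the "pod-infra-container-image" override.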

@pat-s
Author

pat-s commented Feb 4, 2023

Thanks for the detailed help!

Do you have access to the nodes either via SSM or SSH? Both would also require host-ctr to be working in order to pull the respective images from ECR.

Not in the current setup, I think I would need to add some bootstrap args to the ASG first AFAIK. Could do it, if you tell me what I should check for then :)

I am wondering why it works on the x86_64 nodes with the exact same config.

After testing with 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5 in the bootstrap args, I am happy to confirm that this did the trick! 🎉
Could it be that eks/pause:3.1-eksbuild.1 is simply missing the arm64 arch variant?

On the off chance that eks-node-policy-prod is somehow restricting ECR pulls

If that applied, none of the ASGs in our setup would be able to pull.

@zmrow
Contributor

zmrow commented Feb 6, 2023

Could it be that eks/pause:3.1-eksbuild.1 is simply missing the arm64 arch variant?

I went and checked my cluster and it appears that my cluster is using the eks/pause:3.1-eksbuild.1 image?

Feb 06 17:37:30 ... host-ctr[1148]: time="2023-02-06T17:37:30Z" level=info msg="pulled image successfully" img="ecr.aws/arn:aws:ecr:eu-central-1:602401143452:repository/eks/pause:3.1-eksbuild.1"

@mchaker
Contributor

mchaker commented Feb 13, 2023

I attempted to reproduce this as well, and could not.

I was able to successfully pull the eks/pause:3.1-eksbuild.1 image:

Feb 13 19:29:18 ... host-ctr[1117]: time="2023-02-13T19:29:18Z" level=info msg="pulled image successfully" img="ecr.aws/arn:aws:ecr:eu-central-1:602401143452:repository/eks/pause:3.1-eksbuild.1"
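
The host-ctr log names the image in an ARN-style form, while the kubelet error at the top of this issue uses the registry hostname form. A minimal sketch for mapping one onto the other so the two kinds of log line can be compared; the mapping is inferred only from the two strings that appear in this thread, and `to_arn_style` is a hypothetical helper, not part of host-ctr.

```python
import re

# Hostname-style ECR reference: <12-digit account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
_ECR_HOST = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/(?P<rest>.+)$"
)

def to_arn_style(image: str) -> str:
    """Rewrite a hostname-style ECR reference into the form seen in the host-ctr log."""
    m = _ECR_HOST.match(image)
    if m is None:
        raise ValueError(f"not an ECR hostname-style reference: {image!r}")
    return f"ecr.aws/arn:aws:ecr:{m['region']}:{m['account']}:repository/{m['rest']}"

print(to_arn_style("602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.1-eksbuild.1"))
```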

@mchaker
Contributor

mchaker commented Feb 13, 2023

Hi @pat-s , I'm glad you were able to find a workaround -- we haven't been able to repro this. I'm closing this for now. If you run into this again, please feel free to re-open this issue.

@mchaker mchaker closed this as completed Feb 13, 2023
@pat-s
Author

pat-s commented Feb 14, 2023

@mchaker All good, I understand. Still wondering what might be the culprit on my end but as long as there is a workaround and others don't have the issue, all is good! Thanks for doing a deep check anyhow, I think it was worth it! 🤝

@pat-s
Author

pat-s commented Feb 21, 2024

Sorry for coming back to this, but I just ran into the same issue again with eks/pause:3.9 on an arm64 node. I switched back to eks/pause:3.5.

Could it be that many tags (excluding :3.5 (and possibly others)) do not have an arm64 variant? And will that continue? Again, the only thing I am changing is the tag.

602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5
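
One way to test the missing-arm64-variant idea is to look at the tag's manifest list. A minimal sketch, assuming the manifest has already been fetched as JSON by some registry client and follows the OCI image index / Docker manifest list layout; `architectures` and the sample document are hypothetical and do not reflect the real eks/pause manifest.

```python
import json

def architectures(manifest_json: str) -> set[str]:
    """Return the CPU architectures listed in an image index.

    Each entry under "manifests" is expected to carry a "platform" object.
    A single-platform manifest has no "manifests" key, so the result is empty.
    """
    doc = json.loads(manifest_json)
    return {
        entry["platform"]["architecture"]
        for entry in doc.get("manifests", [])
        if "platform" in entry
    }

# Hypothetical sample data with the same shape as a multi-arch image index.
sample = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"digest": "sha256:aaa", "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:bbb", "platform": {"architecture": "arm64", "os": "linux"}},
    ],
})

print("arm64" in architectures(sample))
```

As far as I know, a missing architecture would typically show up as a platform-matching error rather than 401 Unauthorized, so this check is mainly a way to rule the idea out.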

@ginglis13
Contributor

ginglis13 commented Feb 22, 2024

@pat-s no worries, thanks for following up. This appears like it could be a similar issue to that of aws/amazon-vpc-cni-k8s#2030. Since it's been some time since this thread has been open, could you provide some information about your current environment (Bottlerocket version, EKS addons, etc)?

@pat-s
Author

pat-s commented Feb 22, 2024

Yes, I pinned the image due to awslabs/amazon-eks-ami#1597 as I also faced this issue.
However, when pinning, only 3.5 works and I noticed there are actually newer versions available in general. Of course, in general I would be happy not having to add any manual pin.
We are running a mixed-arch cluster and I remember it being an issue only for arm64 nodes, though until awslabs/amazon-eks-ami#1597 I was fine without any pins.

      bootstrap_extra_args       = <<-EOT
      "container-log-max-size" = "500M"
      "pod-infra-container-image" = "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5"
      [settings.kernel.sysctl]
      "user.max_user_namespaces" = "16384"
    EOT

Bottlerocket OS 1.19.1 (aws-k8s-1.29)

Latest addon versions from EKS (node, proxy, dns)

@ginglis13
Contributor

Of course, in general I would be happy not having to add any manual pin.

Sidebar on the main issue at hand - Bottlerocket 1.19.1 includes a patch that will pin pause containers such that they are not garbage collected on k8s-1.29 variants #3757. So far we haven't had reports back that this patch isn't working, could you confirm that you can remove the manual pin on your side and see your pause/sandbox containers running?

only 3.5 works and I noticed there are actually newer versions available in general

Earlier in this thread it was called out that EKS is using pause:3.5 in their AMI: #2778 (comment) and I double checked to see if they've bumped up to a newer tag of the pause image, but looking here they're still defaulting to 3.5: https://github.com/awslabs/amazon-eks-ami/blob/8d7b5f89f511ef018905c8e24a6c1917e3b8bbdb/files/bootstrap.sh#L218

I'll continue looking into it and attempt a repro to try to get to the bottom of this.

@ginglis13
Contributor

Hey @pat-s, I haven't been able to reproduce this. I've been working out of us-west-2, but both 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5 and 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.9 have worked for me... I also double checked with some EKS contacts and it seems the EKS default of 3.5 is arbitrary; there is no specific reason that version shows in their bootstrap script.

Looking back in this thread, it appears you've already applied the suggested IAM policies... do you have any other data or logs other than the "unauthorized" log? Can you share whether this is a recurring issue on the aarch64 aws-k8s-1.29 variant or whether it is intermittent / one-time? Otherwise, if you're unblocked via the 3.5 tag, I'll keep this issue closed.

@pat-s
Author

pat-s commented Feb 27, 2024

Thanks for digging and the persistence in this old thread! I am really wondering what is causing this, and it seems to be an issue for our config only.

Sidebar on the main issue at hand - Bottlerocket 1.19.1 includes a patch that will pin pause containers such that they are not garbage collected on k8s-1.29 variants #3757. So far we haven't had reports back that this patch isn't working, could you confirm that you can remove the manual pin on your side and see your pause/sandbox containers running?

I ran BR 1.19.1 before the issue (re-)appeared. I don't think it is related as the issue is with the generic pull of the image / auth denial and the node does not even come up on a fresh start.

do you have any other data or logs other than the "unauthorized" log

That's the only one I see. It also appears early in the pod creation, so there is nothing else.

I've now removed the bootstrap args entirely again and this is what I get using bottlerocket-aws-k8s-1.29-aarch64-v1.19.2-29cc92cc:

   Warning  FailedCreatePodSandBox  6s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.1-eksbuild.1": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials

If I set 02401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.9, the node doesn't even join the cluster.

My only idea left is that it might be a region-related issue. I've hardcoded 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5 again, which is the only one that works. I am not so happy with this approach, as I need to regularly check if this has resolved itself at some point to avoid using an image at some point.
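
Before treating this as region-related, it may be worth sanity-checking the reference string itself. A minimal sketch, assuming only that AWS account IDs are 12 digits and that the node's region is known; `check_ecr_ref` is a hypothetical helper, and whether either finding explains the failure is not established in this thread.

```python
import re

# Lenient on the account part so that malformed IDs can be reported, not just rejected.
ECR_REF = re.compile(
    r"^(?P<account>\d+)\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repo>[^:]+):(?P<tag>.+)$"
)

def check_ecr_ref(image: str, node_region: str) -> list[str]:
    """Return a list of suspicious points in an ECR image reference."""
    m = ECR_REF.match(image)
    if m is None:
        return ["not an ECR hostname-style image reference"]
    problems = []
    if len(m["account"]) != 12:
        problems.append(
            f"account ID {m['account']} has {len(m['account'])} digits, expected 12"
        )
    if m["region"] != node_region:
        problems.append(
            f"image region {m['region']} differs from node region {node_region}"
        )
    return problems

# The 3.9 reference exactly as written above, checked against an eu-central-1 node.
for problem in check_ecr_ref(
    "02401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.9", "eu-central-1"
):
    print(problem)
```

The 3.9 reference as written above trips both checks, while the eu-central-1 3.5 reference passes cleanly, so retrying 3.9 with the full account ID and the local region would be a fairer comparison.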

@ForbiddenEra

My only idea left is that it might be a region-related issue. I've hardcoded 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5 again, which is the only one that works. I am not so happy with this approach, as I need to regularly check if this has resolved itself at some point to avoid using an image at some point.

I just ran into the same issue with a few nodes (in ca-central) and it only affected 4-5 random nodes out of 12 nodes spread across three AZs. Nothing correlated between them; all using the same AMI (which is actually the latest 1.29 Ubuntu Jammy AMI, not BR) and affected instances were randomly across different self-managed node groups, etc. No commonality between the affected nodes.

nodes failing:
az-b : service/t3.m
az-a : service/t3.m
az-d : control/t3.l
az-a : data/t3.l

not failing:
az-b : data/t3.l
az-b : control/t3.l
az-a : control/t3.l
az-d : data/t3.l

unknown:
az-d: service/t3.m

service, control and data are three different self-managed node groups.
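
To make the "no commonality" observation easier to check, here is a small tally of the nodes listed above by AZ and by node group. It is only a sketch over the eight nodes with a known state; the one "unknown" node is left out, and `failure_share` is a hypothetical helper.

```python
from collections import Counter

# (availability zone, node group, instance type), copied from the lists above.
failing = [
    ("az-b", "service", "t3.m"),
    ("az-a", "service", "t3.m"),
    ("az-d", "control", "t3.l"),
    ("az-a", "data", "t3.l"),
]
healthy = [
    ("az-b", "data", "t3.l"),
    ("az-b", "control", "t3.l"),
    ("az-a", "control", "t3.l"),
    ("az-d", "data", "t3.l"),
]

def failure_share(index: int) -> dict[str, tuple[int, int]]:
    """Map each value of one attribute to (failing count, total count)."""
    bad = Counter(node[index] for node in failing)
    total = bad + Counter(node[index] for node in healthy)
    return {key: (bad[key], total[key]) for key in sorted(total)}

by_az = failure_share(0)
by_group = failure_share(1)
print(by_az)
print(by_group)
```

Every AZ and every node group has at least one failing node, which fits the picture of failures that are not tied to a single AZ or group.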

The only node group that didn't have affected nodes in the cluster was an EKS-managed node group running the default Amazon Linux AMI.

This is also all on amd64 BTW:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": unexpected status from HEAD request to https://602401143452.dkr.ecr.ca-central-1.amazonaws.com/v2/eks/pause/manifests/3.5: 401 Unauthorized

I also saw some logs, I think from when the pod first started, saying it couldn't connect (a TCP dial timeout or something similar), but the same pod had an entry like the one above afterwards.

Ran an instance refresh on all the nodegroups to terminate/recreate all nodes and everything is working fine now.
