
[EKS] [bug]: efs/ebs csi-drivers sometimes do not remove taints from nodes #2470

Open
mamoit opened this issue Nov 12, 2024 · 4 comments
Labels: EKS (Amazon Elastic Kubernetes Service), Proposed (Community submitted issue)

Comments


mamoit commented Nov 12, 2024

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
Please tackle what seems to be an ebs and efs csi-driver race condition.
This roadmap request was opened as suggested by AWS support.

Which service(s) is this request for?
EKS, more specifically EBS and EFS CSI drivers.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Sometimes a node remains tainted by the ebs/efs csi-driver startup taints even after the drivers have successfully initialized.

In our current setup, we are using EKS, a karpenter managed nodepool, and ebs and efs csi-drivers installed through helm.
We're running:

  • EKS v1.30
  • karpenter chart v1.0.2
  • ebs-csi-driver chart v2.36.0
  • efs-csi-driver chart v3.0.8

This is the lifecycle of a problematic node:

  1. Karpenter starts up a new node with both efs.csi.aws.com/agent-not-ready:NoExecute and ebs.csi.aws.com/agent-not-ready:NoExecute taints configured as startupTaints (see the snippet after this list).
  2. Both the ebs and efs csi drivers get scheduled on the node and state in their logs that they have removed their respective taint from the node.
  3. Sometimes the taints are indeed removed, but sometimes they aren't. When one or both of the taints don't get removed, we end up with a node that can't have workloads scheduled on it and that can't be disrupted by karpenter either, since karpenter refuses to disrupt the node with the following message:
Cannot disrupt Node: state node isn't initialized
  4. We're left with a stuck node.
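For reference, the startup taints in step 1 are declared on the Karpenter NodePool roughly as follows. This is a minimal sketch, not our exact manifest: the NodePool name and the omitted fields are placeholders, only the startupTaints entries reflect our configuration.

```yaml
# Illustrative NodePool excerpt (karpenter.sh/v1); name and omitted fields are placeholders.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default  # placeholder
spec:
  template:
    spec:
      startupTaints:
        - key: ebs.csi.aws.com/agent-not-ready
          effect: NoExecute
        - key: efs.csi.aws.com/agent-not-ready
          effect: NoExecute
      # requirements, nodeClassRef, etc. omitted for brevity
```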

I've attached logs for the 2 daemonset pods that were scheduled on an instance where neither the ebs nor the efs taints were removed.

Are you currently working around this issue?
We are currently unable to work around this issue in an unattended way and we're left with tainted, unconsolidatable nodes in our pool.

The easiest manual workaround is to cordon the node and untaint it so that karpenter can remove it from the pool.
Unfortunately, due to the scale of our deployments, it is not feasible to perform this manual task multiple times a day.
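For anyone hitting the same thing, these are the two taint entries that stay behind on the stuck node and that we delete by hand after cordoning. The node name below is a placeholder, not taken from our clusters:

```yaml
# Illustrative excerpt of a stuck node; the two agent-not-ready taints are what we remove manually.
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-0-1.example.internal  # placeholder node name
spec:
  taints:
    - key: ebs.csi.aws.com/agent-not-ready
      effect: NoExecute
    - key: efs.csi.aws.com/agent-not-ready
      effect: NoExecute
```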

Additional context
This issue was opened as suggested by AWS support.
We raised this issue with our TAM on the 11th of November.

There are 2 related issues on GitHub, one for EBS and another for EFS; the latter is currently closed, but the symptoms seem very similar:
kubernetes-sigs/aws-ebs-csi-driver#2199
kubernetes-sigs/aws-efs-csi-driver#1491

Attachments
ebs-csi-node.log
efs-csi-node.log

mamoit added the "Proposed (Community submitted issue)" label on Nov 12, 2024
mikestef9 added the "EKS (Amazon Elastic Kubernetes Service)" label on Nov 12, 2024

ConnorJC3 commented Nov 22, 2024

Reposting the response from the EBS CSI bug report:

Hi, based on our testing, the issue described here is caused by a bug with Karpenter replacing the taint from the EBS CSI Driver after the driver has already removed the taint.

This bug was already fixed on the Karpenter side (fix: kubernetes-sigs/karpenter#1658 cherry-pick: kubernetes-sigs/karpenter#1705), but the fix was not included until Karpenter release v1.0.4.

Please upgrade to Karpenter v1.0.4 or later and see if that fixes the issue for you, thanks!


mamoit commented Nov 22, 2024

After checking the audit logs of a random affected node in one of our clusters I can see that:

  1. the ebs-csi-driver removes its taint
  2. then the efs-csi-driver removes its taint
  3. then karpenter comes 3 seconds later and patches both taints back onto the node.

So this is not an ebs or efs csi-driver issue, but a karpenter issue.
The curveball is that we're running karpenter v1.0.7 (of karpenter-provider-aws), so the fix should already be included unless there was a regression of some sort.


Feel free to continue the discussion here, since this is apparently not an ebs/efs csi-driver issue.


mamoit commented Nov 22, 2024

This issue seems to be correlated with karpenter pod crashes, where one of the 2 karpenter pods that we're running crashes with:

{"level":"ERROR","time":"2024-11-22T14:21:51.541Z","logger":"controller","message":"Failed to update lock optimitically: Put \"https://172.20.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/karpenter-leader-election\": context deadline exceeded, falling back to slow path","commit":"901a5dc"}
{"level":"INFO","time":"2024-11-22T14:21:56.718Z","logger":"controller","message":"Waited for 1.062077844s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/karpenter-leader-election","commit":"901a5dc"}
{"level":"ERROR","time":"2024-11-22T14:22:02.491Z","logger":"controller","message":"error retrieving resource lock kube-system/karpenter-leader-election: client rate limiter Wait returned an error: context deadline exceeded","commit":"901a5dc"}
{"level":"INFO","time":"2024-11-22T14:22:06.312Z","logger":"controller","message":"failed to renew lease kube-system/karpenter-leader-election: timed out waiting for the condition","commit":"901a5dc"}
{"level":"ERROR","time":"2024-11-22T14:22:13.171Z","logger":"controller","message":"Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io \"karpenter-leader-election\": the object has been modified; please apply your changes to the latest version and try again","commit":"901a5dc"}
panic: leader election lost

goroutine 194 [running]:
github.com/samber/lo.must({0x27339e0, 0xc0137363d0}, {0x0, 0x0, 0x0})
        github.com/samber/[email protected]/errors.go:53 +0x1df
github.com/samber/lo.Must0(...)
        github.com/samber/[email protected]/errors.go:72
sigs.k8s.io/karpenter/pkg/operator.(*Operator).Start.func1()
        sigs.k8s.io/[email protected]/pkg/operator/operator.go:258 +0x7c
created by sigs.k8s.io/karpenter/pkg/operator.(*Operator).Start in goroutine 1
        sigs.k8s.io/[email protected]/pkg/operator/operator.go:256 +0xe5

The other karpenter pod also crashed at the same time, but there is no stack trace.


mamoit commented Nov 27, 2024

This may after all be caused by an OOM kill on karpenter.

We bumped the memory request of the deployment and will keep an eye on it for the next couple of days.
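For reference, the bump was along these lines in our values for the karpenter Helm chart. The numbers are illustrative, and the controller.resources keys should be checked against the values.yaml of your chart version:

```yaml
# Illustrative karpenter Helm values override; memory and CPU figures are placeholders.
controller:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      memory: 2Gi
```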
