[EKS] [bug]: efs/ebs csi-drivers sometimes do not remove taints from nodes #2470
Comments
Reposting the response from the EBS CSI bug report: Hi, based on our testing, the issue described here is caused by a bug where Karpenter re-applies the taint from the EBS CSI Driver after the driver has already removed it. This bug was already fixed on the Karpenter side (fix: kubernetes-sigs/karpenter#1658, cherry-pick: kubernetes-sigs/karpenter#1705), but the fix only landed in a later Karpenter release. Please upgrade to a Karpenter version that includes it.
Reposting the response from the EBS CSI bug report: After checking the audit logs of a random affected node in one of our clusters I can see that:
So this is neither an EBS nor an EFS CSI driver issue, but a Karpenter issue. Feel free to continue the discussion here.
This issue seems to be correlated with Karpenter pod crashes, where one of the two Karpenter pods that we're running crashes with:
The other Karpenter pod also crashed at the same time, but there is no stack trace.
This may be caused, after all, by an OOM on Karpenter. We bumped the memory request of the deployment and will keep an eye on it for the next couple of days.
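For reference, a minimal sketch of how a memory bump like this can be applied with the Kubernetes Python client; the karpenter namespace and deployment name, the controller container name, and the 1Gi figure are illustrative assumptions, not our actual settings:

```python
from kubernetes import client, config

# Sketch only: the "karpenter" namespace/deployment, the "controller" container
# name, and the 1Gi value are assumed, not our real configuration.
config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "controller",
        "resources": {
            "requests": {"memory": "1Gi"},
            "limits": {"memory": "1Gi"},
        },
    }]}}}
}
apps.patch_namespaced_deployment(name="karpenter", namespace="karpenter", body=patch)
```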
Tell us about your request
Please tackle what seems to be a race condition in the EBS and EFS CSI drivers.
This roadmap request was opened as suggested by AWS support.
Which service(s) is this request for?
EKS, more specifically the EBS and EFS CSI drivers.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Sometimes a node remains tainted with the EBS/EFS CSI driver startup taints even after the drivers have successfully initialized.
In our current setup we are using EKS, a Karpenter-managed NodePool, and the EBS and EFS CSI drivers installed through Helm.
We're running:
EKS: v1.30
Karpenter: v1.0.2
EBS CSI driver Helm chart: v2.36.0
EFS CSI driver Helm chart: v3.0.8
This is the lifecycle of a problematic node: it is provisioned by Karpenter with the efs.csi.aws.com/agent-not-ready:NoExecute and ebs.csi.aws.com/agent-not-ready:NoExecute taints configured as startupTaints, the CSI node pods start, but the taints are never removed. I've attached logs for the two daemonset pods that were scheduled on an instance where neither the EBS nor the EFS taint was removed.
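For illustration only, a minimal sketch (not something we run; the taint keys are the startupTaints listed above) of listing nodes that are still carrying these taints with the Kubernetes Python client:

```python
from kubernetes import client, config

# Illustrative sketch: list nodes that still carry the CSI startup taints.
STARTUP_TAINT_KEYS = {"efs.csi.aws.com/agent-not-ready", "ebs.csi.aws.com/agent-not-ready"}

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    stuck = [t.key for t in (node.spec.taints or []) if t.key in STARTUP_TAINT_KEYS]
    if stuck:
        print(f"{node.metadata.name}: still tainted with {stuck}")
```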
Are you currently working around this issue?
We are currently unable to work around this issue in an unattended way, so we are left with tainted, unconsolidatable nodes in our pool.
The easiest manual workaround is to cordon the node and remove the taints so that Karpenter can remove it from the pool (a sketch of automating this is shown below).
Unfortunately, given the scale of our deployments it is not feasible to perform this manual task multiple times a day.
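A minimal sketch of what automating that manual workaround could look like with the Kubernetes Python client; node selection, RBAC, and error handling are omitted, and the taint keys are the startupTaints listed above:

```python
from kubernetes import client, config

CSI_STARTUP_TAINT_KEYS = {"efs.csi.aws.com/agent-not-ready", "ebs.csi.aws.com/agent-not-ready"}

config.load_kube_config()
v1 = client.CoreV1Api()

def cordon_and_untaint(node_name: str) -> None:
    """Cordon the node and drop the stale CSI startup taints so Karpenter can deprovision it."""
    node = v1.read_node(node_name)
    remaining = [t for t in (node.spec.taints or []) if t.key not in CSI_STARTUP_TAINT_KEYS]
    # Strategic merge patch: mark the node unschedulable and replace the taint list
    # with the filtered one (the taints field has no merge key, so the list is replaced).
    v1.patch_node(node_name, {"spec": {"unschedulable": True, "taints": remaining}})
```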
Additional context
This issue was opened as suggested by AWS support.
We have raised this issue with our TAM on the 11th of November.
There are two issues on GitHub, one for EBS and another for EFS; the latter is currently closed, but the symptoms look very similar:
kubernetes-sigs/aws-ebs-csi-driver#2199
kubernetes-sigs/aws-efs-csi-driver#1491
Attachments
ebs-csi-node.log
efs-csi-node.log