Karpenter is always disrupting node via drift after upgrading to 0.37.3 from 0.32.10 #7049

Open
ariretiarno opened this issue Sep 21, 2024 · 6 comments
Labels
bug Something isn't working triage/needs-information Marks that the issue still needs more information to properly triage

Comments

@ariretiarno

Description

Observed Behavior:
Karpenter is always disrupting nodes via drift, even though the disruption configuration is expireAfter: Never and consolidateAfter: Never.
This issue started happening after I upgraded from 0.32.10 to 0.37.3.

And it happens to all nodes in the cluster.

Logs


{"level":"INFO","time":"2024-09-20T09:31:25.580Z","logger":"controller","message":"disrupting via drift replace, terminating 1 nodes (1 pods) ip-172-31-125-79.ap-southeast-1.compute.internal/t3a.xlarge/on-demand and replacing with on-demand node from types t3a.xlarge","commit":"378e8b1","controller":"disruption","command-id":"a49138f2-51e9-4bb1-95b1-1973aa9d694f"}



Expected Behavior:
Karpenter should not disrupt nodes when I have defined:

disruption:
    expireAfter: Never
    consolidateAfter: Never

Reproduction Steps (Please include YAML):
NodePool

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: jenkins-master
spec:
  disruption:
    expireAfter: Never
    consolidateAfter: Never
  template:
    metadata: {}
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - t
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - t3a.xlarge
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "2"
      - key: evermos.com/serviceClass
        operator: In
        values:
        - jenkins-master
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: karpenter.k8s.aws/instance-cpu
        operator: In
        values:
        - "2"
        - "4"
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      startupTaints:
      - effect: NoExecute
        key: node.cilium.io/agent-not-ready
        value: "true"
      taints:
      - effect: NoSchedule
        key: jenkins-master
        value: "true"

NodeClass

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      encrypted: true
      volumeSize: 50Gi
      volumeType: gp3
  role: KarpenterNodeRole-evermos-dev
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: evermos-dev
  subnetSelectorTerms:
  - tags:
      Name: private-a-dev

Versions:

  • Chart Version: 0.37.3
  • Kubernetes Version (kubectl version): v1.28.12-eks-a18cd3a
@ariretiarno ariretiarno added bug Something isn't working needs-triage Issues that need to be triaged labels Sep 21, 2024
@rschalo
Contributor

rschalo commented Sep 23, 2024

How often are you seeing drift occur? Also, do you have Karpenter logs from when drift occurred? Should see something generated from: https://github.com/kubernetes-sigs/karpenter/blob/v0.37.3/pkg/controllers/nodeclaim/disruption/drift.go#L82
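
For reference, the drift logs linked above are only emitted at debug level, which can be turned on through the Karpenter Helm chart's logLevel value. A minimal sketch of the values override, assuming an otherwise default chart installation:

# values.yaml override for the Karpenter Helm chart (sketch)
logLevel: debug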

@ariretiarno
Author

The drift event happens every 24h and affects all nodes, even though I haven't made any changes to the NodePool/EC2NodeClass.
I'm using Loki and here is my query: {app="karpenter"} |~ "disrupting" |~ "drift"
(screenshot: Loki query results showing the drift disruption logs)

Anyway, I didn't find any related logs from this: https://github.com/kubernetes-sigs/karpenter/blob/v0.37.3/pkg/controllers/nodeclaim/disruption/drift.go#L82

Is there any possibility the logs come from this code? https://github.com/kubernetes-sigs/karpenter/blob/04a921c00ad837c5c82fe190f4b6c39f4dffe6fa/pkg/controllers/disruption/controller.go#L151
If so, where is the drift coming from?

@ariretiarno
Author

Anyway, on version 0.37.3 drift can still be enabled/disabled, but when I plan to upgrade to 1.0.0 that feature was dropped and drift cannot be disabled, so this is my concern when I upgrade Karpenter to 1.0.0.
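
For context, a minimal sketch of how that pre-v1 switch can be set through the Helm chart values, assuming a 0.33.x–0.37.x chart where the gate is exposed as settings.featureGates.drift and defaults to enabled:

# values.yaml override for the Karpenter Helm chart (sketch)
settings:
  featureGates:
    drift: false   # turns off the drift disruption mechanism cluster-wide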

@lucasfnds

This problem is also occurring here.
It occurred on all my nodes, across different NodeClasses and NodePools, on both spot and on-demand nodes.

Chart Version: 0.37.0
Kubernetes Version (kubectl version): v1.30.3-eks-a18cd3a

karpenter-logs.log

@ariretiarno
Author

Any updates, @rschalo?

@jmdeal
Contributor

jmdeal commented Oct 3, 2024

The logs @rschalo was looking for are only available if you had debug logging enabled; it doesn't look like either of you did. Given that both of your examples happened in close proximity to an EKS-optimized AMI release, I suspect the Nodes were drifted due to that. To know for sure, we would need to see either the debug logs, the status conditions on the drifted NodeClaims, or the reason from the karpenter_nodeclaims_drifted metric.

Karpenter should not disrupt nodes when I have defined:

disruption:
   expireAfter: Never
   consolidateAfter: Never

I would also like to clarify that this isn't accurate. Drift is a separate disruption mechanism from consolidation and expiration; these fields are not intended to have any effect on it. As you found, you can disable drift globally through the feature gate pre-v1, and post-v1 you can effectively disable it per NodePool via disruption budgets.
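
A minimal sketch of what that per-NodePool budget could look like on a v1 NodePool, assuming the karpenter.sh/v1 API and that only drift should be blocked:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: jenkins-master
spec:
  disruption:
    budgets:
    # allow zero nodes to be disrupted when the reason is drift;
    # other disruption reasons keep their defaults
    - nodes: "0"
      reasons:
      - Drifted
  template:
    # ... unchanged from the v1beta1 NodePool above ...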

@jmdeal jmdeal added triage/needs-information Marks that the issue still needs more information to properly triage and removed needs-triage Issues that need to be triaged labels Oct 3, 2024