Karpenter is always disrupting node via drift after upgrading to 0.37.3 from 0.32.10 #7049

Open
ariretiarno opened this issue Sep 21, 2024 · 6 comments
Labels
bug Something isn't working triage/needs-information Marks that the issue still needs more information to properly triage

Comments

@ariretiarno

Description

Observed Behavior:
Karpenter is always disrupting nodes via drift, even though the disruption configuration is expireAfter: Never and consolidateAfter: Never.
This issue started happening after I upgraded from 0.32.10 to 0.37.3.

And it happens to all nodes in the cluster.

Logs


{"level":"INFO","time":"2024-09-20T09:31:25.580Z","logger":"controller","message":"disrupting via drift replace, terminating 1 nodes (1 pods) ip-172-31-125-79.ap-southeast-1.compute.internal/t3a.xlarge/on-demand and replacing with on-demand node from types t3a.xlarge","commit":"378e8b1","controller":"disruption","command-id":"a49138f2-51e9-4bb1-95b1-1973aa9d694f"}



Expected Behavior:
Karpenter should not disrupt nodes when I have defined:

disruption:
    expireAfter: Never
    consolidateAfter: Never

Reproduction Steps (Please include YAML):
NodePool

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: jenkins-master
spec:
  disruption:
    expireAfter: Never
    consolidateAfter: Never
  template:
    metadata: {}
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - t
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - t3a.xlarge
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "2"
      - key: evermos.com/serviceClass
        operator: In
        values:
        - jenkins-master
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: karpenter.k8s.aws/instance-cpu
        operator: In
        values:
        - "2"
        - "4"
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      startupTaints:
      - effect: NoExecute
        key: node.cilium.io/agent-not-ready
        value: "true"
      taints:
      - effect: NoSchedule
        key: jenkins-master
        value: "true"

NodeClass

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      encrypted: true
      volumeSize: 50Gi
      volumeType: gp3
  role: KarpenterNodeRole-evermos-dev
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: evermos-dev
  subnetSelectorTerms:
  - tags:
      Name: private-a-dev

Versions:

  • Chart Version: 0.37.3
  • Kubernetes Version (kubectl version): v1.28.12-eks-a18cd3a
@ariretiarno ariretiarno added bug Something isn't working needs-triage Issues that need to be triaged labels Sep 21, 2024
@rschalo
Contributor

rschalo commented Sep 23, 2024

How often are you seeing drift occur? Also, do you have Karpenter logs from when drift occurred? Should see something generated from: https://github.com/kubernetes-sigs/karpenter/blob/v0.37.3/pkg/controllers/nodeclaim/disruption/drift.go#L82
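
For reference, the drift logs linked above are only emitted at debug level, which can be turned on through the Karpenter Helm chart's logLevel value. A minimal sketch of the values override, assuming an otherwise default chart installation:

# values.yaml override for the Karpenter Helm chart (sketch)
logLevel: debug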

@ariretiarno
Author

The drift event happens every 24h and affects all nodes, even though I haven't made any changes to the NodePool/EC2NodeClass.
I'm using Loki and here is my query: {app="karpenter"} |~ "disrupting" |~ "drift"
(screenshot: Loki query results showing the drift disruption logs)

Anyway, I didn't find any related logs from this: https://github.com/kubernetes-sigs/karpenter/blob/v0.37.3/pkg/controllers/nodeclaim/disruption/drift.go#L82

Is there any possibility the logs come from this code? https://github.com/kubernetes-sigs/karpenter/blob/04a921c00ad837c5c82fe190f4b6c39f4dffe6fa/pkg/controllers/disruption/controller.go#L151
If so, where is the drift coming from?

@ariretiarno
Author

Anyway, on version 0.37.3 drift can still be enabled/disabled, but when I plan to upgrade to 1.0.0 that feature was dropped and drift cannot be disabled, so this is my concern when I upgrade Karpenter to 1.0.0.
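
For context, a minimal sketch of how that pre-v1 switch can be set through the Helm chart values, assuming a 0.33.x–0.37.x chart where the gate is exposed as settings.featureGates.drift and defaults to enabled:

# values.yaml override for the Karpenter Helm chart (sketch)
settings:
  featureGates:
    drift: false   # turns off the drift disruption mechanism cluster-wide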

@lucasfnds

This problem is also occurring here.
It occurred on all my nodes, across different NodeClasses and NodePools, on both spot and on-demand nodes.

Chart Version: 0.37.0
Kubernetes Version (kubectl version): v1.30.3-eks-a18cd3a

karpenter-logs.log

@ariretiarno
Author

Any updates, @rschalo?

@jmdeal
Contributor

jmdeal commented Oct 3, 2024

The logs @rschalo was looking for are only available if you had debug logging enabled; it doesn't look like either of you did. Given that both of your examples happened in close proximity to an EKS-optimized AMI release, I suspect the Nodes were drifted due to that. To know for sure, we would need to see either the debug logs, the status conditions on the drifted NodeClaims, or the reason from the karpenter_nodeclaims_drifted metric.

Karpenter should not disrupt nodes when I have defined:

disruption:
   expireAfter: Never
   consolidateAfter: Never

I would also like to clarify that this isn't accurate. Drift is a separate disruption mechanism from consolidation and expiration; these fields are not intended to have any effect on it. As you found, you can disable drift globally through the feature gate pre-v1, and post-v1 you can effectively disable it per NodePool via disruption budgets.
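
A minimal sketch of what that per-NodePool budget could look like on a v1 NodePool, assuming the karpenter.sh/v1 API and that only drift should be blocked:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: jenkins-master
spec:
  disruption:
    budgets:
    # allow zero nodes to be disrupted when the reason is drift;
    # other disruption reasons keep their defaults
    - nodes: "0"
      reasons:
      - Drifted
  template:
    # ... unchanged from the v1beta1 NodePool above ...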

@jmdeal jmdeal added triage/needs-information Marks that the issue still needs more information to properly triage and removed needs-triage Issues that need to be triaged labels Oct 3, 2024