Cannot disrupt NodeClaim: state node doesn't contain both a node and a nodeclaim #7046
Comments
I have exactly the same bug.

Could you please share your nodeclass and nodepool configurations, as well as any other steps to reproduce?

I have the same bug.

Seems the same for me too.
@rschalo We've also experienced this issue. More information: we started suspecting there was something wrong with the amiSelectorTerms, but we couldn't figure it out. We tried using id (for an AL2 AMI) and then switched to alias (with AL2023), but it made no difference.
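For reference, these are the two amiSelectorTerms forms we tried (a sketch; the AMI id below is a placeholder, not our real one):

```yaml
# EC2NodeClass excerpt: the two amiSelectorTerms forms we tried
# (a sketch; the AMI id is a placeholder)
spec:
  amiSelectorTerms:
  - id: ami-0123456789abcdef0   # explicit id form (AL2 in our case)
  # the alias form we switched to afterwards:
  # - alias: al2023@latest
```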
Can someone provide the kubelet logs from a node that fails to register?
One thing to note (not sure if this is the issue), but 0.37+ versions of Karpenter add a new readiness check on the EC2NodeClass CRD. Was this updated?
https://karpenter.sh/v1.0/upgrading/upgrade-guide/#upgrading-to-0370
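For anyone checking, the readiness surfaces as a status condition on the nodeclass, roughly like this sketch (values illustrative):

```yaml
# kubectl get ec2nodeclass <name> -o yaml   (sketch; values illustrative)
# On 0.37+, a nodeclass that is not Ready blocks nodeclaim launches.
status:
  conditions:
  - type: Ready
    status: "True"   # anything other than "True" here means launches are blocked
```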
I see this when I am trying to migrate from Cluster Autoscaler (almost a fresh installation). Event message for the node:

```
Events:
  Normal  DisruptionBlocked  4m32s (x211 over 7h4m)  karpenter  Cannot disrupt Node: state node doesn't contain both a node and a nodeclaim
```

Below is the log from Karpenter:

```
{"level":"ERROR","time":"2024-09-27T20:19:23.141Z","logger":"webhook.ConversionWebhook","message":"Reconcile error","commit":"688ea21","knative.dev/traceid":"b844441e-37e2-4c12-bdd3-8b3395383977","knative.dev/key":"nodeclaims.karpenter.sh","duration":"167.207687ms","error":"failed to update webhook: Operation cannot be fulfilled on customresourcedefinitions.apiextensions.k8s.io \"nodeclaims.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"}
```

I see that the Karpenter pods are up and running without any issue! I tried to patch and update the CRDs, but no luck.
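For anyone debugging the same error, this is roughly the CRD stanza the conversion webhook keeps trying to update (a sketch; the service details assume a default Helm install and may differ in your cluster):

```yaml
# kubectl get crd nodeclaims.karpenter.sh -o yaml   (sketch)
spec:
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions:
      - v1beta1
      - v1
      clientConfig:
        service:
          name: karpenter        # assumption: default chart service name
          namespace: kube-system # assumption: default install namespace
          port: 8443
```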
In our case the problem was a missing toleration for a taint, as described here: kubernetes-sigs/aws-ebs-csi-driver#2158
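In case it saves someone a click, a minimal sketch of the fix from that issue, assuming the aws-ebs-csi-driver Helm chart's node.tolerations value (the taint key is a placeholder for your own custom taint):

```yaml
# aws-ebs-csi-driver Helm values excerpt (sketch; taint key is a placeholder)
node:
  tolerations:
  - key: example.com/custom-taint
    operator: Exists
    effect: NoSchedule
```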
UPDATE: We figured out what went wrong for us: our cluster is still using the aws-auth ConfigMap, and we missed updating the role name there.
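For anyone else still on aws-auth, the node role that Karpenter launches instances with has to be mapped like this (a sketch; the account id and role name are placeholders):

```yaml
# aws-auth ConfigMap in kube-system (sketch; the ARN is a placeholder)
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/KarpenterNodeRole-my-cluster
      username: system:node:{{EC2PrivateDNSName}}
      groups:
      - system:bootstrappers
      - system:nodes
```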
@midestefanis Can you confirm whether the issue that @roi-zentner ran into is relevant to yours? If not, are you able to share other info about how to reproduce the issue you're seeing?
I have the right aws-auth and still get:

```
Normal  DisruptionBlocked  2m37s (x716 over 23h)  karpenter  Cannot disrupt Node: state node doesn't contain both a node and a nodeclaim
```

And this node is not even managed by Karpenter.
I have the same issue.

Same issue; I get this error on nodes NOT managed by Karpenter.
Hi All, I've attempted to reproduce with a fresh install of
On the nodeclaim while it was waiting for the node to spin up. Are node objects and instances being created for these nodeclaims that have this?
Also, we used to emit events for non-managed nodes, but that was addressed in kubernetes-sigs/karpenter#1644, which has been merged to main.
I think the above log is a red herring for this issue. I agree we should change our event messages here to be more descriptive of what's actually happening, rather than describing internal Karpenter state. https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/state/statenode.go#L177-L185
I just ran into this. In my case the nodeclaim would appear and the instance would be provisioned, but it remained in Unknown status, never getting the node info or joining the cluster. The issue for me was the tag I picked for subnetSelectorTerms: I used kubernetes.io/cluster/clustername: shared, and as soon as I changed that selector to a different tag the node joined the cluster.
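Roughly what the change looked like (a sketch; the cluster name and tag values are placeholders):

```yaml
# EC2NodeClass excerpt (sketch; tag keys and values are placeholders)
spec:
  subnetSelectorTerms:
  # this selector was in place while nodes failed to join:
  # - tags:
  #     kubernetes.io/cluster/clustername: shared
  # switching to a dedicated discovery tag fixed it:
  - tags:
      karpenter.sh/discovery: my-cluster
```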
In my case the node comes up and joins the cluster, but the nodeclaim remains in Unknown status. I'm using Loki to store the Kubernetes event log, with these queries:

I get:

Note that I had issues with the conversion webhook being broken, so I removed it from the CRDs, but now I get:
EDIT: oh no, I might have found the issue in the nodeclaim status:

```yaml
status:
  conditions:
  - lastTransitionTime: "2024-10-24T11:41:18Z"
    message: Resource "nvidia.com/gpu" was requested but not registered
    reason: ResourceNotRegistered
    status: Unknown
    type: Initialized
  - lastTransitionTime: "2024-10-24T11:41:03Z"
    message: Initialized=Unknown
    reason: UnhealthyDependents
    status: Unknown
    type: Ready
```

I'm using:

nodepool.yaml

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    compatibility.karpenter.sh/v1beta1-nodeclass-reference: '{"kind":"EC2NodeClass","name":"large-disk","apiVersion":"karpenter.k8s.aws/v1beta1"}'
    karpenter.sh/nodepool-hash: "4010951020068392240"
    karpenter.sh/nodepool-hash-version: v3
    karpenter.sh/stored-version-migrated: "true"
  creationTimestamp: "2024-05-29T10:11:27Z"
  generation: 3
  name: mynodepool
  resourceVersion: "445358182"
  uid: 423a86e4-1596-4768-b0bd-c1bd4fbbd051
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 5m
    consolidationPolicy: WhenEmpty
  template:
    metadata:
      labels:
        nvidia.com/gpu: A10G
    spec:
      expireAfter: Never
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: large-disk
      requirements:
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values:
        - g5
      - key: karpenter.k8s.aws/instance-memory
        operator: Gt
        values:
        - "20000"
      - key: karpenter.k8s.aws/instance-memory
        operator: Lt
        values:
        - "60000"
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      taints:
      - effect: NoSchedule
        key: nvidia.com/gpu
        value: "true"
  weight: 40
status:
  conditions:
  - lastTransitionTime: "2024-09-30T12:22:53Z"
    message: ""
    reason: NodeClassReady
    status: "True"
    type: NodeClassReady
  - lastTransitionTime: "2024-09-30T12:22:53Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-09-30T12:22:51Z"
    message: ""
    reason: ValidationSucceeded
    status: "True"
    type: ValidationSucceeded
  resources:
    cpu: "40"
    ephemeral-storage: 1023670Mi
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 162288212Ki
    nodes: "5"
    nvidia.com/gpu: "4"
    pods: "290"
    vpc.amazonaws.com/pod-eni: "17"
```

nodeclaim.yaml

```yaml
apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
  annotations:
    compatibility.karpenter.k8s.aws/cluster-name-tagged: "true"
    compatibility.karpenter.k8s.aws/kubelet-drift-hash: "15379597991425564585"
    karpenter.k8s.aws/ec2nodeclass-hash: "6440581379273964080"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
    karpenter.k8s.aws/tagged: "true"
    karpenter.sh/nodepool-hash: "4010951020068392240"
    karpenter.sh/nodepool-hash-version: v3
    karpenter.sh/stored-version-migrated: "true"
  creationTimestamp: "2024-10-24T11:40:35Z"
  finalizers:
  - karpenter.sh/termination
  generateName: mynodepool-
  generation: 1
  labels:
    karpenter.k8s.aws/instance-category: g
    karpenter.k8s.aws/instance-cpu: "8"
    karpenter.k8s.aws/instance-cpu-manufacturer: amd
    karpenter.k8s.aws/instance-ebs-bandwidth: "3500"
    karpenter.k8s.aws/instance-encryption-in-transit-supported: "true"
    karpenter.k8s.aws/instance-family: g5
    karpenter.k8s.aws/instance-generation: "5"
    karpenter.k8s.aws/instance-gpu-count: "1"
    karpenter.k8s.aws/instance-gpu-manufacturer: nvidia
    karpenter.k8s.aws/instance-gpu-memory: "24576"
    karpenter.k8s.aws/instance-gpu-name: a10g
    karpenter.k8s.aws/instance-hypervisor: nitro
    karpenter.k8s.aws/instance-local-nvme: "450"
    karpenter.k8s.aws/instance-memory: "32768"
    karpenter.k8s.aws/instance-network-bandwidth: "5000"
    karpenter.k8s.aws/instance-size: 2xlarge
    karpenter.sh/capacity-type: on-demand
    karpenter.sh/nodepool: mynodepool
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    node-role: inference
    node.kubernetes.io/instance-type: g5.2xlarge
    nvidia.com/gpu: A10G
    topology.k8s.aws/zone-id: euw1-az3
    topology.kubernetes.io/region: eu-west-1
    topology.kubernetes.io/zone: eu-west-1c
  name: mynodepool-wz4vx
  ownerReferences:
  - apiVersion: karpenter.sh/v1
    blockOwnerDeletion: true
    kind: NodePool
    name: mynodepool
    uid: 423a86e4-1596-4768-b0bd-c1bd4fbbd051
  resourceVersion: "445301688"
  uid: e0b8e064-e3f5-46a7-a78c-c1c1e89907ad
spec:
  expireAfter: Never
  nodeClassRef:
    group: karpenter.k8s.aws
    kind: EC2NodeClass
    name: large-disk
  requirements:
  - key: karpenter.sh/nodepool
    operator: In
    values:
    - mynodepool
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values:
    - g5
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - g5.2xlarge
  - key: karpenter.k8s.aws/instance-memory
    operator: Gt
    values:
    - "20000"
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: node-role
    operator: In
    values:
    - inference
  - key: nvidia.com/gpu
    operator: In
    values:
    - A10G
  resources:
    requests:
      cpu: 210m
      memory: 20720Mi
      nvidia.com/gpu: "1"
      pods: "9"
  taints:
  - effect: NoSchedule
    key: nvidia.com/gpu
    value: "true"
status:
  allocatable:
    cpu: 7910m
    ephemeral-storage: "403926258176"
    memory: 29317Mi
    nvidia.com/gpu: "1"
    pods: "58"
    vpc.amazonaws.com/pod-eni: "17"
  capacity:
    cpu: "8"
    ephemeral-storage: 450G
    memory: 30310Mi
    nvidia.com/gpu: "1"
    pods: "58"
    vpc.amazonaws.com/pod-eni: "17"
  conditions:
  - lastTransitionTime: "2024-10-24T11:50:37Z"
    message: ""
    reason: ConsistentStateFound
    status: "True"
    type: ConsistentStateFound
  - lastTransitionTime: "2024-10-24T11:41:18Z"
    message: Resource "nvidia.com/gpu" was requested but not registered
    reason: ResourceNotRegistered
    status: Unknown
    type: Initialized
  - lastTransitionTime: "2024-10-24T11:40:37Z"
    message: ""
    reason: Launched
    status: "True"
    type: Launched
  - lastTransitionTime: "2024-10-24T11:41:03Z"
    message: Initialized=Unknown
    reason: UnhealthyDependents
    status: Unknown
    type: Ready
  - lastTransitionTime: "2024-10-24T11:41:03Z"
    message: ""
    reason: Registered
    status: "True"
    type: Registered
  imageID: ami-0eae4d86f31ea2ae1
  nodeName: i-0573b18a64d7a4ea5.eu-west-1.compute.internal
  providerID: aws:///eu-west-1c/i-0573b18a64d7a4ea5
```
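For context on the ResourceNotRegistered condition above: nvidia.com/gpu is an extended resource advertised by the NVIDIA device plugin, so until that plugin runs on the node, the node object never reports it and the nodeclaim cannot initialize. A sketch of what a healthy GPU node's own status should contain (quantities are illustrative):

```yaml
# kubectl get node <node-name> -o yaml   (sketch; quantities illustrative)
# This is the node object, not the nodeclaim: if nvidia.com/gpu is missing
# here, the device plugin has not registered the resource.
status:
  allocatable:
    nvidia.com/gpu: "1"
  capacity:
    nvidia.com/gpu: "1"
```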
Hi All, this log line is part of the normal lifecycle of nodeclaim disruption. We've adjusted it to be clearer in kubernetes-sigs/karpenter#1644 and kubernetes-sigs/karpenter#1766. If there is other behavior being observed that may be incorrect, please open a new issue.
I have the same issue: the node is created but does not join the cluster. Message: Cannot disrupt Node: state node doesn't contain both a node and a nodeclaim
I'm having the same issue: the node is created but never joins the cluster, and it is forever stuck in Unknown state.

Nodeclaim:
I fixed it using Pod Identity.
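In case it helps, a sketch of the association I mean, expressed as an eksctl ClusterConfig excerpt (field names assume eksctl's pod identity support; the role ARN is a placeholder):

```yaml
# eksctl ClusterConfig excerpt (sketch; the role ARN is a placeholder)
iam:
  podIdentityAssociations:
  - namespace: kube-system
    serviceAccountName: karpenter
    roleARN: arn:aws:iam::111122223333:role/KarpenterControllerRole
```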
Is there any update on this issue?
@bshre12 Have you assigned a role to your Karpenter service account?
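For reference, with IRSA the service account would look roughly like this (a sketch; the role ARN is a placeholder):

```yaml
# The service account Karpenter runs under carries the controller role
# via the IRSA annotation (sketch; the role ARN is a placeholder)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: karpenter
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/KarpenterControllerRole
```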
Description
Observed Behavior:
Karpenter is not spinning up nodes.

Expected Behavior:
New nodes are provisioned.

Reproduction Steps (Please include YAML):

Versions: 1.0.2
Kubernetes Version (`kubectl version`): 1.30

Karpenter logs:

NodeClaims are showing this: