
Delay in Pod Scheduling After Scale-Up Action #7063

Open
Idan-Lazar opened this issue Sep 24, 2024 · 6 comments
Labels
bug (Something isn't working) · needs-triage (Issues that need to be triaged)

Comments

@Idan-Lazar

Description

Observed Behavior:
Pods are scheduled approximately 30 seconds after a scale-up action.

Expected Behavior:
I want the pods to be scheduled immediately after the scale-up action.

Reproduction Steps (Please include YAML):

Versions:

  • Chart Version: karpenter-1.0.0
  • Kubernetes Version (kubectl version):
    Client Version: v1.28.2
    Server Version: v1.29.7-eks-a18cd3a

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@Idan-Lazar Idan-Lazar added bug Something isn't working needs-triage Issues that need to be triaged labels Sep 24, 2024
@jigisha620
Contributor

How quickly are you expecting the pods to get scheduled? A pod only starts running once capacity has been provisioned for it. That can take time, but it's not something Karpenter controls.
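
If it helps to see where the time goes, a couple of read-only commands can break down the provisioning timeline (this sketch assumes Karpenter's CRDs are installed and that it runs in the karpenter namespace):

kubectl get nodeclaims -o wide                                 # launch/registration status per nodeclaim
kubectl get events -A --sort-by='.lastTimestamp' | tail -n 30  # recent scheduling and provisioning events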

@Idan-Lazar
Author

Before Karpenter, it took less than a second.
For example, assume I have a deployment with two Nginx pods: when a pod is terminated, a new pod only shows up after about 30 seconds. That's a long time, and without Karpenter it doesn't take nearly as long.
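
For reference, a minimal deployment matching that description might look like this (the name, image tag, and resource requests are illustrative, not taken from the reporter's cluster):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.27      # illustrative tag
          resources:
            requests:
              cpu: 100m          # requests matter: Karpenter sizes nodes from them
              memory: 128Mi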

@Idan-Lazar
Author

Any help?

@avisaradir

avisaradir commented Nov 17, 2024

Has this issue been resolved?
Is there any new information about this?
@jigisha620

@escardoso

Please, share with us the output of this command:
kubectl -n karpenter logs -l app.kubernetes.io/name=karpenter
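
If the output is long, filtering the JSON lines for errors can help. This sketch assumes jq is installed and that every log line is JSON (any non-JSON lines would need to be skipped):

kubectl -n karpenter logs -l app.kubernetes.io/name=karpenter | jq -c 'select(.level == "ERROR")'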

@avisaradir

avisaradir commented Nov 19, 2024

{
    "level": "INFO",    "time": "2024-11-18T08:38:34.340Z",
    "logger": "controller",
    "message": "created nodeclaim",
    "commit": "6174c75",
    "controller": "provisioner",
    "namespace": "",
    "name": "",
    "reconcileID": "dbcd0d53-4c33-4181-b682-f6fdbdbef785",
    "NodePool": {
        "name": "default"
    },
    "NodeClaim": {
        "name": "default-6w7nw"
    },
    "requests": {
        "cpu": "1210m",
        "memory": "1520Mi",
        "pods": "11"
    },
    "instance-types": "c5.2xlarge, c5.4xlarge, c5.xlarge, c5d.2xlarge, c5d.4xlarge and 49 other(s)"
}
{
    "level": "INFO",
    "time": "2024-11-18T08:38:37.640Z",
    "logger": "controller",
    "message": "launched nodeclaim",
    "commit": "6174c75",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "default-6w7nw"
    },
    "namespace": "",
    "name": "default-6w7nw",
    "reconcileID": "0f250b92-96d4-41c4-bf2b-5f7e028fd883",
    "provider-id": "aws:///eu-west-1/i-123123123",
    "instance-type": "t3a.xlarge",
    "zone": "eu-west-1",
    "capacity-type": "spot",
    "allocatable": {
        "cpu": "3920m",
        "ephemeral-storage": "44Gi",
        "memory": "14162Mi",
        "pods": "58"
    }
}
{
    "level": "INFO",
    "time": "2024-11-18T08:39:03.557Z",
    "logger": "controller",
    "message": "registered nodeclaim",
    "commit": "6174c75",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "default-6w7nw"
    },
    "namespace": "",
    "name": "default-6w7nw",
    "reconcileID": "7766d610-26ea-4bd1-994e-5a844040a53b",
    "provider-id": "aws:///eu-west-1/i-123123123",
    "Node": {
        "name": "ip.compute.internal"
    }
}
{
    "level": "INFO",
    "time": "2024-11-18T08:40:11.295Z",
    "logger": "controller",
    "message": "initialized nodeclaim",
    "commit": "6174c75",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "default-6w7nw"
    },
    "namespace": "",
    "name": "default-6w7nw",
    "reconcileID": "ff87761b-3773-488d-a8fc-1b12710adcca",
    "provider-id": "aws:///eu-west-1/i-123123123",
    "Node": {
        "name": "ip.compute.internal"
    },
    "allocatable": {
        "cpu": "3920m",
        "ephemeral-storage": "47233297124",
        "hugepages-1Gi": "0",
        "hugepages-2Mi": "0",
        "memory": "15209888Ki",
        "pods": "58"
    }
}
{
    "level": "INFO",
    "time": "2024-11-18T09:01:21.119Z",
    "logger": "controller",
    "message": "disrupting nodeclaim(s) via delete, terminating 1 nodes (2 pods) ip..compute.internal/t3a.xlarge/spot",
    "commit": "6174c75",
    "controller": "disruption",
    "namespace": "",
    "name": "",
    "reconcileID": "415bfbea-64a6-4c54-b8bf-9a280ec18230",
    "command-id": "ab478961-25fe-4c41-92c7-c6afe0ef63ec",
    "reason": "underutilized"
}
{
    "level": "INFO",
    "time": "2024-11-18T09:01:21.855Z",
    "logger": "controller",
    "message": "tainted node",
    "commit": "6174c75",
    "controller": "node.termination",
    "controllerGroup": "",
    "controllerKind": "Node",
    "Node": {
        "name": "ip.compute.internal"
    },
    "namespace": "",
    "name": "ip.compute.internal",
    "reconcileID": "c04d4880-47a6-40d3-b9d9-8f8a5827e2b7",
    "taint.Key": "karpenter.sh/disrupted",
    "taint.Value": "",
    "taint.Effect": "NoSchedule"
}
{
    "level": "INFO",
    "time": "2024-11-18T09:02:30.726Z",
    "logger": "controller",
    "message": "deleted node",
    "commit": "6174c75",
    "controller": "node.termination",
    "controllerGroup": "",
    "controllerKind": "Node",
    "Node": {
        "name": "ip.compute.internal"
    },
    "namespace": "",
    "name": "ip.compute.internal",
    "reconcileID": "11156a4f-9ef0-4edd-8ee5-e60c4991bc86"
}
{
    "level": "INFO",
    "time": "2024-11-18T09:02:30.947Z",
    "logger": "controller",
    "message": "deleted nodeclaim",
    "commit": "6174c75",
    "controller": "nodeclaim.termination",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "default-6w7nw"
    },
    "namespace": "",
    "name": "default-6w7nw",
    "reconcileID": "82642555-6618-4d54-9016-40b2b53f1aeb",
    "Node": {
        "name": "ip.compute.internal"
    },
    "provider-id": "aws:///eu-west-1/i-123123123"
}
{
    "level": "ERROR",
    "time": "2024-11-19T00:18:48.979Z",
    "logger": "webhook.ConversionWebhook",
    "message": "Reconcile error",
    "commit": "6174c75",
    "knative.dev/traceid": "b4dea58c-e33c-4b7a-a339-4420f34cae7e",
    "knative.dev/key": "nodeclaims.karpenter.sh",
    "duration": "210.125657ms",
    "error": "failed to update webhook: Operation cannot be fulfilled on customresourcedefinitions.apiextensions.k8s.io \"nodeclaims.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"
}
{
    "level": "ERROR",
    "time": "2024-11-19T00:18:49.051Z",
    "logger": "webhook.ConversionWebhook",
    "message": "Reconcile error",
    "commit": "6174c75",
    "knative.dev/traceid": "5aff0e4c-c897-426a-b0d4-b88517355c68",
    "knative.dev/key": "ec2nodeclasses.karpenter.k8s.aws",
    "duration": "290.712448ms",
    "error": "failed to update webhook: Operation cannot be fulfilled on customresourcedefinitions.apiextensions.k8s.io \"ec2nodeclasses.karpenter.k8s.aws\": the object has been modified; please apply your changes to the latest version and try again"
}
{
    "level": "INFO",
    "time": "2024-11-07T18:19:17.749Z",
    "logger": "controller.controller-runtime.metrics",
    "message": "Starting metrics server",
    "commit": "6174c75"
}
{
    "level": "INFO",
    "time": "2024-11-07T18:19:17.749Z",
    "logger": "controller.controller-runtime.metrics",
    "message": "Serving metrics server",
    "commit": "6174c75",
    "bindAddress": ":8080",
    "secure": false
}
{
    "level": "INFO",
    "time": "2024-11-07T18:19:17.749Z",
    "logger": "controller",
    "message": "starting server",
    "commit": "6174c75",
    "name": "health probe",
    "addr": "[::]:8081"
}
{
    "level": "INFO",
    "time": "2024-11-07T18:19:17.850Z",
    "logger": "controller",
    "message": "attempting to acquire leader lease karpenter/karpenter-leader-election...",
    "commit": "6174c75"
}
{
    "level": "ERROR",
    "time": "2024-11-12T18:18:48.846Z",
    "logger": "webhook.ConversionWebhook",
    "message": "Reconcile error",
    "commit": "6174c75",
    "knative.dev/traceid": "c4fd79ce-c51b-4cec-b32f-d066c7512b14",
    "knative.dev/key": "nodeclaims.karpenter.sh",
    "duration": "137.005499ms",
    "error": "failed to update webhook: Operation cannot be fulfilled on customresourcedefinitions.apiextensions.k8s.io \"nodeclaims.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"
}
{
    "level": "ERROR",
    "time": "2024-11-12T18:18:48.941Z",
    "logger": "webhook.ConversionWebhook",
    "message": "Reconcile error",
    "commit": "6174c75",
    "knative.dev/traceid": "321db1ef-52c9-433e-96e1-c550edd8324b",
    "knative.dev/key": "ec2nodeclasses.karpenter.k8s.aws",
    "duration": "232.197101ms",
    "error": "failed to update webhook: Operation cannot be fulfilled on customresourcedefinitions.apiextensions.k8s.io \"ec2nodeclasses.karpenter.k8s.aws\": the object has been modified; please apply your changes to the latest version and try again"
}
{
    "level": "ERROR",
    "time": "2024-11-12T18:18:48.947Z",
    "logger": "webhook.ConversionWebhook",
    "message": "Reconcile error",
    "commit": "6174c75",
    "knative.dev/traceid": "6aa1dbdb-ec36-4b02-bfe8-f071391844fb",
    "knative.dev/key": "nodepools.karpenter.sh",
    "duration": "238.083613ms",
    "error": "failed to update webhook: Operation cannot be fulfilled on customresourcedefinitions.apiextensions.k8s.io \"nodepools.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"
}
{
    "level": "ERROR",
    "time": "2024-11-13T09:30:05.621Z",
    "logger": "webhook",
    "message": "http: TLS handshake error from 10.0.0.0:44764: read tcp 10.0.0.0:8443->10.0.0.0:44764: read: connection reset by peer\n",
    "commit": "6174c75"
}
{
    "level": "ERROR",
    "time": "2024-11-19T00:18:48.905Z",
    "logger": "webhook.ConversionWebhook",
    "message": "Reconcile error",
    "commit": "6174c75",
    "knative.dev/traceid": "9434c88d-c341-455d-8b23-ee050fc0e0f4",
    "knative.dev/key": "nodepools.karpenter.sh",
    "duration": "177.944577ms",
    "error": "failed to update webhook: Operation cannot be fulfilled on customresourcedefinitions.apiextensions.k8s.io \"nodepools.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"
}

This is the output of the command above; I removed sensitive data.
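
For what it's worth, the nodeclaim timestamps in those logs line up with the ~30 second delay reported at the top of this issue:

created 08:38:34.340 → launched 08:38:37.640 (~3s) → registered 08:39:03.557 (~26s) → initialized 08:40:11.295 (~68s more)

So roughly 29 seconds pass between Karpenter creating the nodeclaim and the node registering with the cluster, which suggests most of the wait is instance boot and kubelet registration rather than Karpenter's provisioning decision.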
