
aws-load-balancer-controller v2.7.0 : no endpoints available for service aws-load-balancer-webhook-service #1058

Open
dev-travelex opened this issue Feb 9, 2024 · 5 comments · May be fixed by #1172
Labels
bug Something isn't working

Comments

@dev-travelex

dev-travelex commented Feb 9, 2024

Describe the bug
Trying to deploy the Ingress controller via the Helm chart, but getting this error:

Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"

Logs from ALB controller:

{"level":"info","ts":"2024-02-09T12:55:48Z","msg":"version","GitVersion":"v2.7.0","GitCommit":"ed00c8199179b04fa2749db64e36d4a48ab170ac","BuildDate":"2024-02-01T01:05:49+0000"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"setup","msg":"adding health check for controller"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"setup","msg":"adding readiness check for webhook"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/mutate-v1-pod"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/mutate-v1-service"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/validate-elbv2-k8s-aws-v1beta1-ingressclassparams"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/validate-elbv2-k8s-aws-v1beta1-targetgroupbinding"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/validate-networking-v1-ingress"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"setup","msg":"starting podInfo repo"}
{"level":"info","ts":"2024-02-09T12:55:50Z","logger":"controller-runtime.webhook.webhooks","msg":"Starting webhook server"}
{"level":"info","ts":"2024-02-09T12:55:50Z","msg":"Starting server","kind":"health probe","addr":"[::]:61779"}
{"level":"info","ts":"2024-02-09T12:55:50Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":"2024-02-09T12:55:50Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":"2024-02-09T12:55:50Z","logger":"controller-runtime.webhook","msg":"Serving webhook server","host":"","port":9443}
{"level":"info","ts":"2024-02-09T12:55:50Z","logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
I0209 12:55:50.353502       1 leaderelection.go:248] attempting to acquire leader lease kube-system/aws-load-balancer-controller-leader...

Steps to reproduce
Deployed via helm chart

resource "helm_release" "aws-load-balancer-controller" {
  name       = "aws-load-balancer-controller"
  depends_on = [null_resource.post-policy, kubernetes_service_account.alb_controller]
  atomic     = true
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-load-balancer-controller"
  namespace  = "kube-system"
  timeout    = 600

  set {
    name  = "clusterName"
    value = data.aws_eks_cluster.cluster.name
  }

  set {
    name  = "serviceAccount.create"
    value = false
  }
  set {
    name  = "serviceAccount.name"
    value = "aws-load-balancer-controller"
  }

  set {
    name  = "vpcId"
    value = data.aws_vpc.best.id
  }
}


Expected outcome

Environment

  • Chart name: aws-load-balancer-controller
  • Chart version: 2.7.0
  • Kubernetes version:
  • Using EKS (yes/no), if so version? 1.28

Additional Context:

@dev-travelex dev-travelex added the bug Something isn't working label Feb 9, 2024
@M00nF1sh
Contributor

M00nF1sh commented Feb 9, 2024

@dev-travelex
Would you mind running the following and sharing the output?
kubectl -n kube-system describe deployment/aws-load-balancer-controller
kubectl -n kube-system describe endpoints/aws-load-balancer-webhook-service

@dev-travelex
Author

I went back to 2.6.2 after getting that error, then deployed 2.7.0 again after seeing your comment. This time I didn't get the webhook error. Nonetheless, here are the logs:

kubectl -n kube-system describe deployment/aws-load-balancer-controller 


Name:                   aws-load-balancer-controller
Namespace:              kube-system
CreationTimestamp:      Sat, 10 Feb 2024 01:20:22 +0000
Labels:                 app.kubernetes.io/instance=aws-load-balancer-controller
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=aws-load-balancer-controller
                        app.kubernetes.io/version=v2.7.0
                        helm.sh/chart=aws-load-balancer-controller-1.7.0
Annotations:            deployment.kubernetes.io/revision: 2
                        meta.helm.sh/release-name: aws-load-balancer-controller
                        meta.helm.sh/release-namespace: kube-system
Selector:               app.kubernetes.io/instance=aws-load-balancer-controller,app.kubernetes.io/name=aws-load-balancer-controller
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/instance=aws-load-balancer-controller
                    app.kubernetes.io/name=aws-load-balancer-controller
  Annotations:      prometheus.io/port: 8080
                    prometheus.io/scrape: true
  Service Account:  aws-load-balancer-controller
  Containers:
   aws-load-balancer-controller:
    Image:       public.ecr.aws/eks/aws-load-balancer-controller:v2.7.0
    Ports:       9443/TCP, 8080/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --cluster-name=nonprod-bliss-eks
      --ingress-class=alb
      --aws-vpc-id=vpc-0c6f43c09d80645f7
    Liveness:     http-get http://:61779/healthz delay=30s timeout=10s period=10s #success=1 #failure=2
    Readiness:    http-get http://:61779/readyz delay=10s timeout=10s period=10s #success=1 #failure=2
    Environment:  <none>
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
  Volumes:
   cert:
    Type:               Secret (a volume populated by a Secret)
    SecretName:         aws-load-balancer-tls
    Optional:           false
  Priority Class Name:  system-cluster-critical
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  aws-load-balancer-controller-6d59ff6457 (0/0 replicas created)
NewReplicaSet:   aws-load-balancer-controller-b55586fb8 (2/2 replicas created)
Events:
  Type    Reason             Age    From                   Message
  ----    ------             ----   ----                   -------
  Normal  ScalingReplicaSet  22m    deployment-controller  Scaled up replica set aws-load-balancer-controller-6d59ff6457 to 2
  Normal  ScalingReplicaSet  6m51s  deployment-controller  Scaled up replica set aws-load-balancer-controller-b55586fb8 to 1
  Normal  ScalingReplicaSet  5m55s  deployment-controller  Scaled down replica set aws-load-balancer-controller-6d59ff6457 to 1 from 2
  Normal  ScalingReplicaSet  5m55s  deployment-controller  Scaled up replica set aws-load-balancer-controller-b55586fb8 to 2 from 1
  Normal  ScalingReplicaSet  5m3s   deployment-controller  Scaled down replica set aws-load-balancer-controller-6d59ff6457 to 0 from 1

kubectl -n kube-system describe endpoints/aws-load-balancer-webhook-service 

Name:         aws-load-balancer-webhook-service
Namespace:    kube-system
Labels:       app.kubernetes.io/component=webhook
              app.kubernetes.io/instance=aws-load-balancer-controller
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=aws-load-balancer-controller
              app.kubernetes.io/version=v2.7.0
              helm.sh/chart=aws-load-balancer-controller-1.7.0
              prometheus.io/service-monitor=false
Annotations:  <none>
Subsets:
  Addresses:          100.68.90.208,100.68.90.24
  NotReadyAddresses:  <none>
  Ports:
    Name            Port  Protocol
    ----            ----  --------
    webhook-server  9443  TCP

@dev-travelex
Author

dev-travelex commented Feb 10, 2024

The logs show it has been stuck at this for more than 24 hours now. Would that impact the actual deployment of a load balancer?

I0210 23:42:56.770284 1 leaderelection.go:248] attempting to acquire leader lease kube-system/aws-load-balancer-controller-leader...

@SebastianScherer88

SebastianScherer88 commented Apr 17, 2024

@dev-travelex

I have the same issue with releases of the aws-load-balancer-controller chart >= 1.7. I had pinned the load balancer controller pod image to v2.4.2 in both cases, so this is likely a chart-level manifest/configuration issue, in my opinion.

Solution 1: downgrading the chart to 1.6.2 fixed it for me.
Solution 2: upgrading the image.tag value that pins the load balancer controller image to >= 2.7, while keeping the chart version >= 1.7, also works.
Solution 3 (untested): setting the readinessProbe Helm chart value to empty might also work (see below for why).
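For reference, solution 3 would amount to a values override like the following. This is an untested sketch: readinessProbe is assumed to be the chart value that controls the rendered probe, and nulling it removes the probe entirely.

```yaml
# Hypothetical values override for chart >= 1.7 (untested):
# null out the readiness probe the chart injects, so older (< 2.7)
# controller images that lack the probe endpoint can become Ready.
readinessProbe: null
```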

A bit more detail on the problem (in my case) and why the solutions work:

It seems like the pods belonging to the load balancer controller deployment aren't coming up: they are failing their readiness probes with charts >= 1.7:

NAME                                          READY   STATUS    RESTARTS   AGE
aws-load-balancer-controller-9b866d9c-4n665   0/1     Running   0          6m36s
aws-load-balancer-controller-9b866d9c-fxzsz   0/1     Running   0          6m36s

Pod logs

{"level":"info","ts":1713346839.8052669,"msg":"version","GitVersion":"v2.4.2","GitCommit":"77370be7f8e13787a3ec0cfa99de1647010f1055","BuildDate":"2022-05-24T22:33:27+0000"}
{"level":"info","ts":1713346839.862494,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1713346839.8755138,"logger":"setup","msg":"adding health check for controller"}
{"level":"info","ts":1713346839.8758347,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-v1-pod"}
{"level":"info","ts":1713346839.876017,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding"}
{"level":"info","ts":1713346839.8762178,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-elbv2-k8s-aws-v1beta1-targetgroupbinding"}
{"level":"info","ts":1713346839.8763244,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-networking-v1-ingress"}
{"level":"info","ts":1713346839.876444,"logger":"setup","msg":"starting podInfo repo"}
I0417 09:40:41.877204       1 leaderelection.go:243] attempting to acquire leader lease kube-system/aws-load-balancer-controller-leader...
{"level":"info","ts":1713346841.877265,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1713346841.8773456,"logger":"controller-runtime.webhook.webhooks","msg":"starting webhook server"}
{"level":"info","ts":1713346841.877691,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1713346841.877793,"logger":"controller-runtime.webhook","msg":"serving webhook server","host":"","port":9443}
{"level":"info","ts":1713346841.8782332,"logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}

load balancer controller pod description:

Name:                 aws-load-balancer-controller-9b866d9c-4n665
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      aws-load-balancer-controller
Node:                 ip-10-0-60-11.eu-west-2.compute.internal/10.0.60.11
Start Time:           Wed, 17 Apr 2024 09:40:39 +0000
Labels:               app.kubernetes.io/instance=aws-load-balancer-controller
                      app.kubernetes.io/name=aws-load-balancer-controller
                      pod-template-hash=9b866d9c
Annotations:          prometheus.io/port: 8080
                      prometheus.io/scrape: true
Status:               Running
IP:                   10.0.43.232
IPs:
  IP:           10.0.43.232
Controlled By:  ReplicaSet/aws-load-balancer-controller-9b866d9c
Containers:
  aws-load-balancer-controller:
    Container ID:  containerd://e42b03eede8cd18a7912b80e9779c638b5817812eb4b9788800f58636d6d2135
    Image:         public.ecr.aws/eks/aws-load-balancer-controller:v2.4.2
    Image ID:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller@sha256:321e6ff4b55a0bb8afc090cd2f7ef5b0be8cd356e407ce525ee8b24866382808
    Ports:         9443/TCP, 8080/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --cluster-name=mlops-demo
      --ingress-class=alb
    State:          Running
      Started:      Wed, 17 Apr 2024 09:40:39 +0000
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:61779/healthz delay=30s timeout=10s period=10s #success=1 #failure=2
    Readiness:      http-get http://:61779/readyz delay=10s timeout=10s period=10s #success=1 #failure=2
    Environment:
      AWS_STS_REGIONAL_ENDPOINTS:   regional
      AWS_DEFAULT_REGION:           eu-west-2
      AWS_REGION:                   eu-west-2
      AWS_ROLE_ARN:                 arn:aws:iam::743582000746:role/aws-load-balancer-controller
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2d6k5 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aws-load-balancer-tls
    Optional:    false
  kube-api-access-2d6k5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  10m                     default-scheduler  Successfully assigned kube-system/aws-load-balancer-controller-9b866d9c-4n665 to ip-10-0-60-11.eu-west-2.compute.internal
  Normal   Pulled     10m                     kubelet            Container image "public.ecr.aws/eks/aws-load-balancer-controller:v2.4.2" already present on machine
  Normal   Created    10m                     kubelet            Created container aws-load-balancer-controller
  Normal   Started    10m                     kubelet            Started container aws-load-balancer-controller
  Warning  Unhealthy  4m54s (x35 over 9m54s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 404

As a result, no Services (regardless of load balancer type) can be created when applying either Ingress or generic Service manifests, which seems expected and is what this issue describes:

kubectl apply -f eks-cluster-w-alb/manifests/alb-example/ # deploy remaining stack
deployment.apps/echoserver unchanged
namespace/alb-example unchanged
Error from server (InternalError): error when creating "eks-cluster-w-alb/manifests/alb-example/ingress.yaml": Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1-ingress?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
Error from server (InternalError): error when creating "eks-cluster-w-alb/manifests/alb-example/service.yaml": Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
Error from server (InternalError): error when creating "eks-cluster-w-alb/manifests/alb-example/test-service.yaml": Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"

I've noticed this change introducing a readiness probe for the controller, made in what looks like the 1.7.0 chart release:
[screenshot of the chart diff]
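Reconstructed from the pod description above (port, path, and thresholds as shown by kubectl describe), the probe the >= 1.7 chart renders looks roughly like this. This is a sketch inferred from the describe output, not the chart source:

```yaml
readinessProbe:
  httpGet:
    path: /readyz        # endpoint only served by controller images >= 2.7
    port: 61779
  initialDelaySeconds: 10
  timeoutSeconds: 10
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 2
```

Against a v2.4.2 image, this probe returns a 404, which matches the "Readiness probe failed: HTTP probe failed with statuscode: 404" event above.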

This readiness probe seems to be supported only with controller images >= 2.7. In other words, the following combinations of (chart version, controller image version) seem to be compatible:

  • (<1.7, *)
  • (>=1.7, >=2.7)
    • you might be able to use <2.7 controller image versions if you disable the readinessProbe helm value, removing the unsupported probe endpoint for older images
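Applied to the Terraform helm_release from the original report, solution 2 would amount to adding a block like the one below. image.tag is the chart value that pins the controller image; v2.7.0 here is illustrative, and any image >= 2.7 should do.

```hcl
  # Pin a controller image >= 2.7 so it serves the probe endpoint
  # expected by chart versions >= 1.7
  set {
    name  = "image.tag"
    value = "v2.7.0"
  }
```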

@rumaniel

I encountered the same issue. After reviewing @SebastianScherer88's solution, I saw I was on chart 1.8 with image 2.6. I upgraded the image to 2.7, and it resolved the problem. Thanks!
