apiserver "Failed to process a Pod event", "failed with error error trying to connect; dns error: failed to lookup address information: Name does not resolve" #332
The API server is attempting to list agent pods in the update operator namespace, where the labels look like brupop.bottlerocket.aws/component=agent.
Can you give an overview of what's in the brupop-bottlerocket-aws namespace?
And to see what labels are under one of the agent pods, can you describe one of them? (Example commands are sketched below.)
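For reference, a couple of commands along those lines (a minimal sketch, assuming the default brupop-bottlerocket-aws namespace; the pod name is just an example):

```sh
# List the agent pods by the brupop component label, showing their full label set
kubectl get pods -n brupop-bottlerocket-aws \
  -l brupop.bottlerocket.aws/component=agent --show-labels

# Describe one of the agent pods to inspect its labels, mounts, and events
kubectl describe pod -n brupop-bottlerocket-aws brupop-agent-lr5m6
```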
hi @jpmcb,

$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/brupop-agent-lr5m6 1/1 Running 0 112s
pod/brupop-apiserver-7cb6bd59bf-5fzx2 1/1 Running 0 112s
pod/brupop-apiserver-7cb6bd59bf-8rd59 1/1 Running 0 112s
pod/brupop-apiserver-7cb6bd59bf-pndh9 1/1 Running 0 112s
pod/brupop-controller-deployment-6484476846-h8j8q 1/1 Running 0 111s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/brupop-apiserver ClusterIP 172.20.10.204 <none> 443/TCP 112s
service/brupop-controller-server ClusterIP 172.20.107.67 <none> 80/TCP 112s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/brupop-agent 1 1 1 1 1 <none> 112s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/brupop-apiserver 3/3 3 3 112s
deployment.apps/brupop-controller-deployment 1/1 1 1 112s
NAME DESIRED CURRENT READY AGE
replicaset.apps/brupop-apiserver-7cb6bd59bf 3 3 3 112s
replicaset.apps/brupop-controller-deployment-6484476846 1 1 1 112s

$ kubectl describe pod/brupop-agent-lr5m6
Name: brupop-agent-lr5m6
Namespace: brupop-bottlerocket-aws
Priority: 0
Service Account: brupop-agent-service-account
Node: ip-10-20-112-166.ec2.internal/10.20.112.166
Start Time: Wed, 02 Nov 2022 12:13:29 -0400
Labels: brupop.bottlerocket.aws/component=agent
controller-revision-hash=76489c4794
pod-template-generation=1
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.20.125.225
IPs:
IP: 10.20.125.225
Controlled By: DaemonSet/brupop-agent
Containers:
brupop:
Container ID: containerd://28b7ec0e8e6b439bc32e6e74fc2d170243d0595622c5c6df1895aa865322320e
Image: public.ecr.aws/bottlerocket/bottlerocket-update-operator:v0.2.2
Image ID: public.ecr.aws/bottlerocket/bottlerocket-update-operator@sha256:a6de31e1b3553e0c5b5401ec7f7cc435c150481f5c4827c061e523106b9748c0
Port: <none>
Host Port: <none>
Command:
./agent
State: Running
Started: Wed, 02 Nov 2022 12:13:31 -0400
Ready: True
Restart Count: 0
Limits:
memory: 50Mi
Requests:
cpu: 10m
memory: 50Mi
Environment:
MY_NODE_NAME: (v1:spec.nodeName)
Mounts:
/bin/apiclient from bottlerocket-apiclient (rw)
/etc/brupop-tls-keys from bottlerocket-tls-keys (rw)
/run/api.sock from bottlerocket-api-socket (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4k5wf (ro)
/var/run/secrets/tokens/ from bottlerocket-agent-service-account-token (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
bottlerocket-api-socket:
Type: HostPath (bare host directory volume)
Path: /run/api.sock
HostPathType: Socket
bottlerocket-apiclient:
Type: HostPath (bare host directory volume)
Path: /bin/apiclient
HostPathType: File
bottlerocket-agent-service-account-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3600
bottlerocket-tls-keys:
Type: Secret (a volume populated by a Secret)
SecretName: brupop-tls
Optional: false
kube-api-access-4k5wf:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m14s default-scheduler Successfully assigned brupop-bottlerocket-aws/brupop-agent-lr5m6 to ip-10-20-112-166.ec2.internal
Warning FailedMount 3m14s kubelet MountVolume.SetUp failed for volume "bottlerocket-tls-keys" : secret "brupop-tls" not found
Normal Pulled 3m13s kubelet Container image "public.ecr.aws/bottlerocket/bottlerocket-update-operator:v0.2.2" already present on machine
Normal Created 3m13s kubelet Created container brupop
Normal Started 3m12s kubelet Started container brupop
hi @karmingc, I'm not sure if this causes your issue. I noticed that you only have one node (one agent) running on your EKS cluster. Unfortunately, Brupop only works on clusters with more than two or three nodes (we will add more documentation on this). Meanwhile, can you provide more details on the behavior of the apiserver? Were the apiserver and controller stuck in a Pending status? If so, I think that might be the reason I mentioned above; otherwise, I'll do more investigation on this. Thanks!
@gthao313 It was done manually, but I do have more nodes on my cluster... I just tried labelling other nodes, testing with 2 and 3 nodes with the label:

$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/brupop-agent-9qnpt 1/1 Running 0 4m2s
pod/brupop-agent-ljjkv 1/1 Running 0 4m2s
pod/brupop-agent-nwqhs 1/1 Running 0 24s
pod/brupop-apiserver-7cb6bd59bf-bmzsg 1/1 Running 0 4m2s
pod/brupop-apiserver-7cb6bd59bf-h5vxf 1/1 Running 0 4m2s
pod/brupop-apiserver-7cb6bd59bf-ncn25 1/1 Running 0 4m2s
pod/brupop-controller-deployment-6484476846-hgbpg 1/1 Running 0 4m1s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/brupop-apiserver ClusterIP 172.20.132.135 <none> 443/TCP 4m2s
service/brupop-controller-server ClusterIP 172.20.190.242 <none> 80/TCP 4m2s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/brupop-agent 3 3 3 3 3 <none> 4m2s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/brupop-apiserver 3/3 3 3 4m2s
deployment.apps/brupop-controller-deployment 1/1 1 1 4m2s
NAME DESIRED CURRENT READY AGE
replicaset.apps/brupop-apiserver-7cb6bd59bf 3 3 3 4m3s
replicaset.apps/brupop-controller-deployment-6484476846 1 1 1 4m3s

logs from the apiserver pods:

{"v":0,"name":"apiserver","msg":"failed with error error trying to connect: dns error: failed to lookup address information: Name does not resolve","level":50,"hostname":"brupop-apiserver-79bdb58bc6-p2zz6","pid":1,"time":"2022-11-02T17:26:51.736042453+00:00","target":"kube_client::client::builder","line":164,"file":"/src/.cargo/registry/src/github.com-1ecc6299db9ec823/kube-client-0.71.0/src/client/builder.rs"}
{"v":0,"name":"apiserver","msg":"Failed to process a Pod event","level":50,"hostname":"brupop-apiserver-79bdb58bc6-p2zz6","pid":1,"time":"2022-11-02T17:26:51.736064673+00:00","target":"apiserver::api","line":120,"file":"apiserver/src/api/mod.rs","err":"failed to perform initial object list: HyperError: error trying to connect: dns error: failed to lookup address information: Name does not resolve"} |
@karmingc Can you verify whether your pods consistently have this error? This seems incorrect to me, and it may be related to certificates.
Have you installed cert-manager before installing the update operator?
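For example, one way to verify cert-manager is actually running in the cluster (a sketch; this assumes cert-manager was installed into its default cert-manager namespace):

```sh
# cert-manager controllers should be Running
kubectl get pods -n cert-manager

# the cert-manager CRDs (Certificate/Issuer) should exist
kubectl get crd certificates.cert-manager.io issuers.cert-manager.io
```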
ah.. we already have cert-manager installed in our cluster
You should have the brupop-tls secret created by cert-manager in the update operator namespace. So something like:
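For example (a minimal sketch; resource names follow the default manifest and the output shown below):

```sh
# cert-manager should have created the brupop-tls Secret from the Certificate resource
kubectl get secret brupop-tls -n brupop-bottlerocket-aws

# the Certificate itself should report READY=True
kubectl get certificate brupop-apiserver-certificate -n brupop-bottlerocket-aws
```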
Are you deploying through the default manifest found in the repository? I wonder if your deployment via Argo isn't including the cert-manager bits.
Yes, we are directly deploying the manifest here: https://github.com/bottlerocket-os/bottlerocket-update-operator/blob/develop/yamlgen/deploy/bottlerocket-update-operator.yaml. It should be; at least I'm seeing those resources in the namespace:

$ kubectl get certificates
NAME READY SECRET AGE
brupop-apiserver-certificate True brupop-tls 10m
$ kubectl get secrets
NAME TYPE DATA AGE
brupop-agent-service-account-token-drwmx kubernetes.io/service-account-token 3 10m
brupop-apiserver-service-account-token-h66gf kubernetes.io/service-account-token 3 10m
brupop-controller-service-account-token-9c5cr kubernetes.io/service-account-token 3 10m
brupop-tls kubernetes.io/tls 3 10m
default-token-pdj95 kubernetes.io/service-account-token 3 10m
Hmmm curious! You might try deleting the entire brupop-bottlerocket-aws namespace and re-deploying.
Wouldn't that delete the secrets too? Anyway, I tried that and the same error is shown... I did, however, restart the apiserver deployment and the mounting issue is gone.

$ kubectl describe pod/brupop-apiserver-c4f75879b-4qmw8
Name: brupop-apiserver-c4f75879b-4qmw8
Namespace: brupop-bottlerocket-aws
Priority: 0
Service Account: brupop-apiserver-service-account
Node: ip-10-20-127-68.ec2.internal/10.20.127.68
Start Time: Wed, 02 Nov 2022 16:20:40 -0400
Labels: brupop.bottlerocket.aws/component=apiserver
pod-template-hash=c4f75879b
Annotations: kubectl.kubernetes.io/restartedAt: 2022-11-02T16:20:19-04:00
kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.20.125.102
IPs:
IP: 10.20.125.102
Controlled By: ReplicaSet/brupop-apiserver-c4f75879b
Containers:
brupop:
Container ID: containerd://dcaac08a75b737780e2ea280008a8029fbd5638c761ef05866b1302742e738c5
Image: public.ecr.aws/bottlerocket/bottlerocket-update-operator:v0.2.2
Image ID: public.ecr.aws/bottlerocket/bottlerocket-update-operator@sha256:a6de31e1b3553e0c5b5401ec7f7cc435c150481f5c4827c061e523106b9748c0
Port: 8443/TCP
Host Port: 0/TCP
Command:
./apiserver
State: Running
Started: Wed, 02 Nov 2022 16:20:41 -0400
Ready: True
Restart Count: 0
Liveness: http-get https://:8443/ping delay=5s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get https://:8443/ping delay=5s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/brupop-tls-keys from bottlerocket-tls-keys (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2jh4w (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
bottlerocket-tls-keys:
Type: Secret (a volume populated by a Secret)
SecretName: brupop-tls
Optional: false
kube-api-access-2jh4w:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 60s default-scheduler Successfully assigned brupop-bottlerocket-aws/brupop-apiserver-c4f75879b-4qmw8 to ip-10-20-127-68.ec2.internal
Normal Pulled 59s kubelet Container image "public.ecr.aws/bottlerocket/bottlerocket-update-operator:v0.2.2" already present on machine
Normal Created 59s kubelet Created container brupop
Normal Started 59s kubelet Started container brupop
You may try deploying the Kubernetes dnsutils pod into the brupop namespace and see if something is wrong with DNS resolution to/from the agent and apiserver.
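For example, following the upstream Kubernetes DNS-debugging docs (a sketch; the manifest URL and the specific lookups are assumptions, not brupop-specific tooling):

```sh
# Deploy the dnsutils test pod into the brupop namespace
kubectl apply -n brupop-bottlerocket-aws -f https://k8s.io/examples/admin/dns/dnsutils.yaml

# Resolve the brupop apiserver Service and the in-cluster Kubernetes API endpoint
kubectl exec -n brupop-bottlerocket-aws -it dnsutils -- nslookup brupop-apiserver
kubectl exec -n brupop-bottlerocket-aws -it dnsutils -- nslookup kubernetes.default.svc
```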
What would be the hostname to test for the agent? I tried a couple and they didn't seem to error out so far..

❯ kubectl exec -i -t dnsutils -- nslookup brupop-apiserver.brupop-bottlerocket-aws
Server: 172.20.0.10
Address: 172.20.0.10#53
Name: brupop-apiserver.brupop-bottlerocket-aws.svc.cluster.local
Address: 172.20.176.214
$ kubectl exec -i -t dnsutils -- nslookup brupop-controller-server.brupop-bottlerocket-aws
Server: 172.20.0.10
Address: 172.20.0.10#53
Name: brupop-controller-server.brupop-bottlerocket-aws.svc.cluster.local
Address: 172.20.135.6
$ kubectl exec -ti dnsutils -- cat /etc/resolv.conf
search brupop-bottlerocket-aws.svc.cluster.local svc.cluster.local cluster.local ec2.internal
nameserver 172.20.0.10
options ndots:2

edit:

$ kubectl get service kube-dns -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 172.20.0.10 <none> 53/UDP,53/TCP 486d
DNS is working. This is a problem where, for some reason, our Kubernetes client in the API server isn't able to reach the base Kubernetes API at kubernetes.default.svc. What is the shape of your network? What CNI are you using? Are you using the node's host network? I'm wondering if this is related to our usage of rustls as mentioned here: kube-rs/kube#1071

Edit: are you able to upgrade to our 1.0.0 release? There were many small changes that went into it (and the logs look much better). We upgraded our Kubernetes client dependency code in that release, so it would be interesting to see if this persists on the newest release.
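One way to watch the DNS traffic from inside an apiserver pod is an ephemeral debug container (a sketch; this requires ephemeral-container support, and the pod name and debug image here are assumptions):

```sh
# Attach a throwaway container with network tooling to a running apiserver pod
# and capture its DNS traffic on port 53
kubectl debug -n brupop-bottlerocket-aws -it pod/brupop-apiserver-7cb6bd59bf-5fzx2 \
  --image=nicolaka/netshoot --target=brupop -- tcpdump -i any port 53
```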
That would indeed seem to be the case... I added another container within the apiserver pods and used tcpdump to monitor the DNS lookups.

using dig:

15:20:30.587796 IP brupop-apiserver-ddcfb7d55-hj28q.49204 > kube-dns.kube-system.svc.cluster.local.53: 15341+ [1au] A? kubernetes.default.svc. (63)
15:20:30.588848 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.49204: 15341 NXDomain* 0/1/1 (138)
15:20:30.665433 IP brupop-apiserver-ddcfb7d55-hj28q.39373 > kube-dns.kube-system.svc.cluster.local.53: 55645+ PTR? 10.0.20.172.in-addr.arpa. (42)
15:20:30.666014 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.39373: 55645*- 1/0/0 PTR kube-dns.kube-system.svc.cluster.local. (118)

using curl:

15:24:02.707393 IP brupop-apiserver-ddcfb7d55-hj28q.51918 > kube-dns.kube-system.svc.cluster.local.53: 9464+ A? kubernetes.default.svc.brupop-bottlerocket-aws.svc.cluster.local. (82)
15:24:02.707447 IP brupop-apiserver-ddcfb7d55-hj28q.51918 > kube-dns.kube-system.svc.cluster.local.53: 1779+ AAAA? kubernetes.default.svc.brupop-bottlerocket-aws.svc.cluster.local. (82)
15:24:02.708136 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.51918: 1779 NXDomain*- 0/1/0 (175)
15:24:02.708200 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.51918: 9464 NXDomain*- 0/1/0 (175)
15:24:02.708269 IP brupop-apiserver-ddcfb7d55-hj28q.53675 > kube-dns.kube-system.svc.cluster.local.53: 44054+ A? kubernetes.default.svc.svc.cluster.local. (58)
15:24:02.708296 IP brupop-apiserver-ddcfb7d55-hj28q.53675 > kube-dns.kube-system.svc.cluster.local.53: 64785+ AAAA? kubernetes.default.svc.svc.cluster.local. (58)
15:24:02.712202 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.53675: 44054 NXDomain*- 0/1/0 (151)
15:24:02.712204 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.53675: 64785 NXDomain*- 0/1/0 (151)
15:24:02.712355 IP brupop-apiserver-ddcfb7d55-hj28q.59823 > kube-dns.kube-system.svc.cluster.local.53: 50321+ A? kubernetes.default.svc.cluster.local. (54)
15:24:02.712387 IP brupop-apiserver-ddcfb7d55-hj28q.59823 > kube-dns.kube-system.svc.cluster.local.53: 12396+ AAAA? kubernetes.default.svc.cluster.local. (54)
15:24:02.713617 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.59823: 50321*- 1/0/0 A 172.20.0.1 (106)
15:24:02.713618 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.59823: 12396*- 0/1/0 (147)

which probably suggests that the current mechanism to reach the base Kubernetes API is not going through the list of DNS search domains for hostname lookup, similar to using dig. I also updated to v1.0.0; the logs are clearer, but the problem persists.

edit: this was done after updating the ndots value as well.
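For illustration, the behavior above hinges on the ndots option in the pod's /etc/resolv.conf (a sketch; the search list and nameserver are taken from the dnsutils output above, and ndots:5 is the usual Kubernetes default, shown for contrast with the ndots:2 seen in this cluster):

```
# /etc/resolv.conf inside a pod
search brupop-bottlerocket-aws.svc.cluster.local svc.cluster.local cluster.local ec2.internal
nameserver 172.20.0.10
# With ndots:5, "kubernetes.default.svc" (only 2 dots) is expanded through the search list
# first, so kubernetes.default.svc.cluster.local resolves. With ndots:2 it is tried as an
# absolute name first (NXDOMAIN), and a resolver that never falls back to the search list
# will fail to find the API server.
options ndots:5
```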
@karmingc are you using hostNetwork: true on your deployments? I found something possibly similar where (if you're using hostNetwork: true) you may also need to set:

hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet

If you're not using hostNetwork, this likely doesn't apply.
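For reference, a hedged sketch of where those fields would sit in the apiserver Deployment's pod spec (field placement only; not copied from the brupop manifest):

```yaml
spec:
  template:
    spec:
      hostNetwork: true
      # Host-network pods need this policy to keep using the cluster DNS search path
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: brupop
          image: public.ecr.aws/bottlerocket/bottlerocket-update-operator:v0.2.2
```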
Turns out the problem was our fault. We have a webhook mutator that injects an ndots value of 2:
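For context, the kind of pod dnsConfig a mutating webhook like that might inject looks roughly like this (illustrative values only; the actual mutator configuration wasn't shared):

```yaml
dnsConfig:
  options:
    - name: ndots
      # overrides the kubelet default of 5; short names like kubernetes.default.svc
      # then stop resolving for clients that don't retry via the search list
      value: "2"
```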
and I was sure I had checked that we had disabled it for this operator when we saw the DNS issues, but nope, it was being injected 🤦 Once we disabled the webhook mutator for this operator, it resolved our DNS issues. Sorry for the trouble folks 😅 Oh, and thanks for this project, it's simplifying our maintenance burden!
Awesome news! Thanks for reporting back with the details. That's very useful. Sounds like this issue can be closed. :)
Hello, I recently tried to deploy v0.2.2 to our cluster but I'm seeing repeated error logs in the apiserver pods.

Version: v0.2.2
Deployed with ArgoCD using /yamlgen/deploy/bottlerocket-update-operator.yaml.
node:
I'm a bit unsure about the intricacies of how the apiserver is working, but would appreciate any help on this.