
operator-inventory keeps restarting panic: interface conversion: runtime.Object is *v1.Status, not *v1.Pod #222

Closed

andy108369 opened this issue May 7, 2024 · 2 comments

Labels: awaiting-triage, repo/provider (Akash provider-services repo issues)

Comments

andy108369 (Contributor) commented:

Seeing the following errors on the provider.medc1.com provider:

$ kubectl -n akash-services logs operator-inventory-dfdd44d64-zxcfn  --timestamps
2024-05-07T11:01:25.772363804Z I[2024-05-07|11:01:25.772] using in cluster kube config                 cmp=provider
2024-05-07T11:01:25.888968153Z INFO	nodes.nodes	waiting for nodes to finish
2024-05-07T11:01:25.888990335Z INFO	watcher.storageclasses	started
2024-05-07T11:01:25.889143853Z INFO	rest listening on ":8080"
2024-05-07T11:01:25.889255333Z INFO	grpc listening on ":8081"
2024-05-07T11:01:25.889400085Z INFO	watcher.config	started
2024-05-07T11:01:25.895360878Z INFO	rook-ceph	   ADDED monitoring StorageClass	{"name": "local-path"}
2024-05-07T11:01:25.899218417Z INFO	nodes.node.monitor	starting	{"node": "node4"}
2024-05-07T11:01:25.899222505Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node1"}
2024-05-07T11:01:25.899228226Z INFO	nodes.node.monitor	starting	{"node": "node2"}
2024-05-07T11:01:25.899231632Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node2"}
2024-05-07T11:01:25.899274843Z INFO	nodes.node.monitor	starting	{"node": "xg-4090-002"}
2024-05-07T11:01:25.899310170Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node4"}
2024-05-07T11:01:25.899313075Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "xg-4090-002"}
2024-05-07T11:01:25.899337712Z INFO	nodes.node.monitor	starting	{"node": "node1"}
2024-05-07T11:01:25.899340256Z INFO	nodes.node.monitor	starting	{"node": "xg-4090-003"}
2024-05-07T11:01:25.899360544Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "xg-4090-003"}
2024-05-07T11:01:25.899363059Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "xg-4090-001"}
2024-05-07T11:01:25.899379300Z INFO	nodes.node.monitor	starting	{"node": "xg-4090-001"}
2024-05-07T11:01:25.899428402Z INFO	nodes.node.monitor	starting	{"node": "xg-4090-004"}
2024-05-07T11:01:25.899431077Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "xg-4090-004"}
2024-05-07T11:01:25.899504144Z INFO	nodes.node.monitor	starting	{"node": "xg-4090-005"}
2024-05-07T11:01:25.899528570Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "xg-4090-005"}
2024-05-07T11:01:25.899570008Z INFO	nodes.node.monitor	starting	{"node": "xg-4090-006"}
2024-05-07T11:01:25.899594534Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "xg-4090-006"}
2024-05-07T11:01:25.899639298Z INFO	nodes.node.monitor	starting	{"node": "xg-4090-007"}
2024-05-07T11:01:25.899672901Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "xg-4090-007"}
2024-05-07T11:01:25.899729508Z INFO	nodes.node.monitor	starting	{"node": "xg-4090-008"}
2024-05-07T11:01:25.899756659Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "xg-4090-008"}
2024-05-07T11:01:25.899869381Z INFO	rancher	   ADDED monitoring StorageClass	{"name": "local-path"}
2024-05-07T11:01:25.899892244Z INFO	nodes.node.monitor	starting	{"node": "xg-4090-009"}
2024-05-07T11:01:25.899894788Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "xg-4090-009"}
2024-05-07T11:01:25.900035313Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "xg-4090-010"}
2024-05-07T11:01:25.900041815Z INFO	nodes.node.monitor	starting	{"node": "xg-4090-010"}
2024-05-07T11:01:25.900081700Z INFO	nodes.node.monitor	starting	{"node": "xg3"}
2024-05-07T11:01:25.900097479Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "xg3"}
2024-05-07T11:01:27.831711875Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "xg-4090-010"}
2024-05-07T11:01:27.832484658Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "xg-4090-006"}
2024-05-07T11:01:27.867445812Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "xg-4090-007"}
2024-05-07T11:01:27.869880457Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "xg-4090-005"}
2024-05-07T11:01:27.986084014Z INFO	nodes.node.monitor	started	{"node": "xg-4090-005"}
2024-05-07T11:01:28.411173473Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "xg-4090-001"}
2024-05-07T11:01:28.451953112Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "xg-4090-002"}
2024-05-07T11:01:28.457704931Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "xg3"}
2024-05-07T11:01:28.492491468Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node1"}
2024-05-07T11:01:28.504803947Z INFO	nodes.node.monitor	started	{"node": "xg-4090-001"}
2024-05-07T11:01:28.559504638Z INFO	nodes.node.monitor	started	{"node": "xg-4090-002"}
2024-05-07T11:01:28.589371298Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "xg-4090-008"}
2024-05-07T11:01:28.615090122Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node4"}
2024-05-07T11:01:28.703609732Z INFO	nodes.node.monitor	started	{"node": "xg-4090-008"}
2024-05-07T11:01:28.732181427Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "xg-4090-004"}
2024-05-07T11:01:28.834987215Z INFO	nodes.node.monitor	started	{"node": "xg-4090-004"}
2024-05-07T11:01:28.883371365Z INFO	nodes.node.monitor	started	{"node": "node1"}
2024-05-07T11:01:28.893755821Z INFO	nodes.node.monitor	started	{"node": "node4"}
2024-05-07T11:01:28.950694098Z INFO	nodes.node.monitor	started	{"node": "xg-4090-006"}
2024-05-07T11:01:28.956650192Z INFO	nodes.node.monitor	started	{"node": "xg-4090-010"}
2024-05-07T11:01:28.958932581Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "xg-4090-003"}
2024-05-07T11:01:28.979685423Z INFO	nodes.node.monitor	started	{"node": "xg-4090-007"}
2024-05-07T11:01:29.083981691Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "xg-4090-009"}
2024-05-07T11:01:29.090695338Z INFO	nodes.node.monitor	started	{"node": "xg-4090-003"}
2024-05-07T11:01:29.189918924Z INFO	nodes.node.monitor	started	{"node": "xg-4090-009"}
2024-05-07T11:01:29.224653021Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node2"}
2024-05-07T11:01:29.417372818Z INFO	nodes.node.monitor	started	{"node": "node2"}
2024-05-07T11:01:29.426747266Z INFO	nodes.node.monitor	successfully applied labels and/or annotations patches for node "node2"	{"labels": {"akash.network":"true","akash.network/capabilities.gpu.vendor.nvidia.model.t4":"1","akash.network/capabilities.gpu.vendor.nvidia.model.t4.interface.pcie":"1","akash.network/capabilities.gpu.vendor.nvidia.model.t4.ram.16Gi":"1","nvidia.com/gpu.present":"true"}}
2024-05-07T11:01:29.677761599Z INFO	nodes.node.monitor	started	{"node": "xg3"}
2024-05-07T11:04:15.652488204Z INFO	nodes.node.monitor	shutting down monitor	{"node": "xg-4090-001"}
2024-05-07T11:04:15.652507019Z INFO	nodes.node.monitor	shutting down monitor	{"node": "xg-4090-010"}
2024-05-07T11:04:15.652530985Z INFO	nodes.node.monitor	shutting down monitor	{"node": "xg-4090-002"}
2024-05-07T11:04:15.652559648Z W0507 11:04:15.652437       7 reflector.go:347] k8s.io/[email protected]/tools/cache/reflector.go:169: watch of *v1.StorageClass ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2024-05-07T11:04:15.652572242Z INFO	nodes.node.monitor	shutting down monitor	{"node": "xg-4090-007"}
2024-05-07T11:04:15.652604322Z INFO	nodes.node.monitor	shutting down monitor	{"node": "node1"}
2024-05-07T11:04:15.652615724Z W0507 11:04:15.652571       7 reflector.go:347] k8s.io/[email protected]/tools/cache/reflector.go:169: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2024-05-07T11:04:15.652629319Z W0507 11:04:15.652591       7 reflector.go:347] k8s.io/[email protected]/tools/cache/reflector.go:169: watch of *v1.PersistentVolume ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2024-05-07T11:04:15.655867896Z panic: interface conversion: runtime.Object is *v1.Status, not *v1.Pod
2024-05-07T11:04:15.655874839Z 
2024-05-07T11:04:15.655879859Z goroutine 123 [running]:
2024-05-07T11:04:15.655891100Z github.com/akash-network/provider/operator/inventory.(*nodeDiscovery).monitor(0xc001c16180)
2024-05-07T11:04:15.655900267Z 	github.com/akash-network/provider/operator/inventory/node-discovery.go:524 +0x24c8
2024-05-07T11:04:15.655913863Z golang.org/x/sync/errgroup.(*Group).Go.func1()
2024-05-07T11:04:15.655929923Z 	golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56
2024-05-07T11:04:15.655941735Z created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 301
2024-05-07T11:04:15.655946494Z 	golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96
$ kubectl -n akash-services get pods -l app.kubernetes.io/name=inventory -o wide
NAME                                                READY   STATUS    RESTARTS         AGE     IP               NODE          NOMINATED NODE   READINESS GATES
operator-inventory-dfdd44d64-zxcfn                  1/1     Running   43 (7m56s ago)   5h13m   10.233.102.129   node1         <none>           <none>
operator-inventory-hardware-discovery-node1         1/1     Running   0                2m44s   10.233.102.159   node1         <none>           <none>
operator-inventory-hardware-discovery-node2         1/1     Running   0                2m43s   10.233.75.47     node2         <none>           <none>
operator-inventory-hardware-discovery-node4         1/1     Running   0                2m44s   10.233.74.67     node4         <none>           <none>
operator-inventory-hardware-discovery-xg-4090-001   1/1     Running   0                2m44s   10.233.71.193    xg-4090-001   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-002   1/1     Running   0                2m44s   10.233.120.60    xg-4090-002   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-003   1/1     Running   0                2m43s   10.233.100.127   xg-4090-003   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-004   1/1     Running   0                2m44s   10.233.104.59    xg-4090-004   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-005   1/1     Running   0                2m44s   10.233.70.51     xg-4090-005   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-006   1/1     Running   0                2m44s   10.233.70.185    xg-4090-006   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-007   1/1     Running   0                2m44s   10.233.109.233   xg-4090-007   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-008   1/1     Running   0                2m44s   10.233.77.109    xg-4090-008   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-009   1/1     Running   0                2m43s   10.233.100.242   xg-4090-009   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-010   1/1     Running   0                2m44s   10.233.105.236   xg-4090-010   <none>           <none>
operator-inventory-hardware-discovery-xg3           1/1     Running   0                2m44s   10.233.64.148    xg3           <none>           <none>

SW version

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                                IMAGE
akash-node-1-0                                      ghcr.io/akash-network/node:0.34.1
akash-provider-0                                    ghcr.io/akash-network/provider:0.6.1
operator-hostname-7b98cb78db-xdc9r                  ghcr.io/akash-network/provider:0.6.1
operator-inventory-dfdd44d64-zxcfn                  ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-node1         ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-node2         ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-node4         ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-001   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-002   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-003   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-004   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-005   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-006   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-007   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-008   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-009   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-010   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg3           ghcr.io/akash-network/provider:0.6.1

kubectl events log

medc1-kubectl-events.log

K8s nodes

$ kubectl get nodes -o wide
NAME          STATUS   ROLES           AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
node1         Ready    control-plane   160d    v1.27.7   192.168.99.41   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
node2         Ready    control-plane   17d     v1.27.7   192.168.99.42   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
node4         Ready    <none>          17d     v1.27.7   192.168.99.44   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-001   Ready    <none>          8d      v1.27.7   192.168.99.71   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-002   Ready    <none>          6d22h   v1.27.7   192.168.99.72   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-003   Ready    <none>          8d      v1.27.7   192.168.99.73   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-004   Ready    <none>          25d     v1.27.7   192.168.99.74   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-005   Ready    <none>          24d     v1.27.7   192.168.99.75   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-006   Ready    <none>          24d     v1.27.7   192.168.99.76   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-007   Ready    <none>          24d     v1.27.7   192.168.99.77   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-008   Ready    <none>          23d     v1.27.7   192.168.99.78   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-009   Ready    <none>          21d     v1.27.7   192.168.99.79   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-010   Ready    <none>          21d     v1.27.7   192.168.99.80   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg3           Ready    <none>          28d     v1.27.7   192.168.99.53   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
andy108369 added the repo/provider (Akash provider-services repo issues) and awaiting-triage labels on May 7, 2024
andy108369 (Contributor, Author) commented May 7, 2024:

This might be related to something being wrong on node1:

$ kubectl get events -A --sort-by='.lastTimestamp' | grep dial
ingress-nginx                                   22m         Warning   Unhealthy   pod/ingress-nginx-controller-6v2kb                      Readiness probe failed: Get "http://10.233.102.170:10254/healthz": dial tcp 10.233.102.170:10254: connect: invalid argument
kube-system                                     16m         Warning   Unhealthy   pod/coredns-5c469774b8-wzfkh                            Liveness probe failed: Get "http://10.233.102.151:8080/health": dial tcp 10.233.102.151:8080: connect: invalid argument
ingress-nginx                                   7m49s       Warning   Unhealthy   pod/ingress-nginx-controller-6v2kb                      Liveness probe failed: Get "http://10.233.102.170:10254/healthz": dial tcp 10.233.102.170:10254: connect: invalid argument
kube-system                                     57s         Warning   Unhealthy   pod/coredns-5c469774b8-wzfkh                            Readiness probe failed: Get "http://10.233.102.151:8181/ready": dial tcp 10.233.102.151:8181: connect: invalid argument

$ kubectl get pods -A -o wide |grep 10.233.102.170
ingress-nginx                                   ingress-nginx-controller-6v2kb                      0/1     CrashLoopBackOff   151 (11s ago)   40d     10.233.102.170   node1         <none>           <none>

$ kubectl get pods -A -o wide |grep 10.233.102.151
kube-system                                     coredns-5c469774b8-wzfkh                            1/1     Running            20 (175m ago)   12d     10.233.102.151   node1         <none>           <none>
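
The probes fail with connect: invalid argument, which typically comes from the local connect(2) call (EINVAL) rather than from the peer, and every affected pod sits on node1, so a node-local networking or CNI fault seems likely. A couple of hedged first checks (node1 runs containerd per the node listing below; the exact filters are illustrative):

$ kubectl get pods -A -o wide --field-selector spec.nodeName=node1
$ ssh node1 'journalctl -u containerd --since "1 hour ago" | grep -i cni'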

Will check with medc1.

andy108369 (Contributor, Author) commented:

They've corrected the issue on their end.

Will re-open if I see these errors again.
