provider 0.5.12: 8443/status & 8444 grpc endpoints sporadically falling off #214

Closed
andy108369 opened this issue Apr 13, 2024 · 3 comments · Fixed by akash-network/helm-charts#275
Labels: awaiting-triage, repo/provider (Akash provider-services repo issues)

Comments

@andy108369 (Contributor)

I've noticed this happening primarily on the *.mon.obl.akash.pub providers, most frequently on the h100.mon.obl.akash.pub provider.

Additional observations while requests to the :8443/status and :8444 gRPC endpoints hang (a sketch for probing both endpoints externally follows the list):

  • Restarting operator-inventory does not change anything;
  • the provider keeps producing logs.
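
For reference, a quick way to probe both endpoints from outside the cluster would look roughly like the sketch below; the grpcurl list call assumes gRPC server reflection is enabled on the provider, which may not be the case:

$ timeout 30s curl -o /dev/null -fsk https://h100.mon.obl.akash.pub:8443/status && echo "8443 OK" || echo "8443 /status check failed"
$ timeout 30s grpcurl -insecure h100.mon.obl.akash.pub:8444 list || echo "8444 gRPC check failed"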

Logs

Logs taken before the provider and operator-inventory were restarted:

h100.mon.obl.akash.pub.provider.log

h100.mon.obl.akash.pub.deployment-operator-inventory.log

andy108369 added the repo/provider (Akash provider-services repo issues) and awaiting-triage labels on Apr 13, 2024
@andy108369 (Contributor, Author)

FYI: Current workaround

The current livenessProbe checks help by restarting the provider whenever this issue is detected, so manifests can still be sent to the provider whenever users want to deploy something.

Liveness checks involved:
https://github.com/akash-network/helm-charts/blob/provider-9.2.5/charts/akash-provider/templates/statefulset.yaml#L261-L270
https://github.com/akash-network/helm-charts/blob/provider-9.2.5/charts/akash-provider/scripts/liveness_checks.sh#L12-L16

Running the liveness check manually confirms the issue:

$ kubectl -n akash-services describe pod akash-provider-0 | grep -i liveness
    Liveness:  exec [sh -c /scripts/liveness_checks.sh] delay=240s timeout=60s period=60s #success=1 #failure=3
  Normal   Killing    13m (x4 over 53m)    kubelet  Container provider failed liveness probe, will be restarted
  Warning  Unhealthy  6m7s (x14 over 55m)  kubelet  Liveness probe failed: api /status check failed
$ time kubectl -n akash-services exec -ti akash-provider-0 -- bash -x /scripts/liveness_checks.sh
Defaulted container "provider" out of: provider, init (init)
+ set -o pipefail
+ openssl x509 -in /config/provider.pem -checkend 3600 -noout
+ timeout 30s curl -o /dev/null -fsk https://127.0.0.1:8443/status
+ echo 'api /status check failed'
api /status check failed
+ exit 1
command terminated with exit code 1

real	0m31.510s
user	0m0.086s
sys	0m0.022s
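
Based on the trace above, the relevant portion of liveness_checks.sh is roughly equivalent to the following minimal sketch (the linked script is authoritative; error handling for the certificate check is assumed):

#!/usr/bin/env bash
set -o pipefail

# Provider client certificate must remain valid for at least the next 3600s.
openssl x509 -in /config/provider.pem -checkend 3600 -noout || exit 1

# The local /status endpoint must answer within 30s, otherwise fail the probe.
if ! timeout 30s curl -o /dev/null -fsk https://127.0.0.1:8443/status; then
    echo 'api /status check failed'
    exit 1
fi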

@andy108369 (Contributor, Author) commented Apr 13, 2024

Affected providers

Additionally, I've attached the complete logs for the provider pods, taken with the --previous argument to ensure the logs are from the pod instances that stopped answering on the status (8443/status & 8444 gRPC) endpoints.
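
For reference, the --previous logs attached below were collected along these lines (assuming the current kubectl context points at the respective provider cluster):

$ kubectl -n akash-services logs akash-provider-0 --previous > h100.mon.obl.akash.pub.provider.previous.log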

SW versions:

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                          IMAGE
akash-node-1-0                                ghcr.io/akash-network/node:0.32.2
akash-provider-0                              ghcr.io/akash-network/provider:0.5.12
operator-hostname-74744f497c-kn5v7            ghcr.io/akash-network/provider:0.5.12
operator-inventory-6985f746d4-jfpt2           ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node1   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node2   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node3   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node4   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node5   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node6   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node7   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node8   ghcr.io/akash-network/provider:0.5.12
  • h100.mon.obl.akash.pub
NAME               READY   STATUS    RESTARTS         AGE
akash-provider-0   1/1     Running   52 (3m27s ago)   9h

h100.mon.obl.akash.pub.provider.previous.log

  • a100.mon.obl.akash.pub
NAME               READY   STATUS    RESTARTS       AGE
akash-provider-0   1/1     Running   13 (10m ago)   9h

a100.mon.obl.akash.pub.provider.previous.log

  • sg.lnlm.akash.pub
NAME               READY   STATUS    RESTARTS        AGE
akash-provider-0   1/1     Running   5 (5h35m ago)   28h
  • provider-02.sandbox-01.aksh.pw
NAME               READY   STATUS    RESTARTS        AGE
akash-provider-0   1/1     Running   3 (4h10m ago)   28h

provider-02.sandbox-01.aksh.pw.provider.previous.log

  • pdx.nb.akash.pub
NAME               READY   STATUS    RESTARTS       AGE
akash-provider-0   1/1     Running   2 (6h2m ago)   19h

pdx.nb.akash.pub.provider.previous.log

  • ty.lneq.akash.pub
NAME               READY   STATUS    RESTARTS      AGE
akash-provider-0   1/1     Running   2 (10h ago)   28h

ty.lneq.akash.pub.provider.previous.log

  • sg.lneq.akash.pub
NAME               READY   STATUS    RESTARTS      AGE
akash-provider-0   1/1     Running   2 (18h ago)   28h

sg.lneq.akash.pub.provider.previous.log

  • au.lneq.akash.pub
NAME               READY   STATUS    RESTARTS     AGE
akash-provider-0   1/1     Running   1 (9h ago)   28h

au.lneq.akash.pub.provider.previous.log

  • akash.pro
NAME               READY   STATUS    RESTARTS      AGE
akash-provider-0   1/1     Running   1 (28h ago)   28h

akash.pro.provider.previous.log

  • hk.lneq.akash.pub
NAME               READY   STATUS    RESTARTS      AGE
akash-provider-0   1/1     Running   1 (25h ago)   28h

hk.lneq.akash.pub.provider.previous.log

no issues on these providers, yet 🤞

  • sl.lneq.akash.pub
NAME               READY   STATUS    RESTARTS   AGE
akash-provider-0   1/1     Running   0          28h
  • hurricane.akash.pub
NAME               READY   STATUS    RESTARTS   AGE
akash-provider-0   1/1     Running   0          28h
  • yvr.nb.akash.pub
NAME               READY   STATUS    RESTARTS   AGE
akash-provider-0   1/1     Running   0          28h
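
The per-provider pod status above can be reproduced by running the same kubectl query against each cluster, e.g. (the context name here is hypothetical):

$ kubectl --context h100-mon-obl -n akash-services get pod akash-provider-0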

@andy108369 (Contributor, Author) commented Apr 15, 2024

Providers running 0.5.12 can be downgraded to 0.5.11 using these commands:

cd provider
helm -n akash-services upgrade akash-hostname-operator akash/akash-hostname-operator --set image.tag=0.5.11
helm -n akash-services upgrade inventory-operator akash/akash-inventory-operator --set image.tag=0.5.11
helm -n akash-services upgrade akash-ip-operator akash/akash-ip-operator --set provider_address=<SET-your-provider-address-HERE> --set image.tag=0.5.11

helm upgrade akash-provider akash/provider -n akash-services -f provider.yaml --set bidpricescript="$(cat price_script_generic.sh | openssl base64 -A)" --set image.tag=0.5.11
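
After the downgrade, the image tags can be verified with the same command used earlier; all provider-related pods are expected to report ghcr.io/akash-network/provider:0.5.11:

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'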
