provider 0.5.12: 8443/status & 8444 grpc endpoints sporadically falling off #214

Closed
andy108369 opened this issue Apr 13, 2024 · 3 comments · Fixed by akash-network/helm-charts#275
Labels: awaiting-triage, repo/provider (Akash provider-services repo issues)

Comments

@andy108369 (Contributor)

I've noticed this happening primarily on the *.mon.obl.akash.pub providers, most frequently on the h100.mon.obl.akash.pub provider.

Additional observations while requests to the :8443/status and :8444 gRPC endpoints hang (a sketch for probing both endpoints externally follows the list):

  • Restarting operator-inventory does not change anything;
  • the provider keeps producing logs.
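
For reference, a quick way to probe both endpoints from outside the cluster would look roughly like the sketch below; the grpcurl list call assumes gRPC server reflection is enabled on the provider, which may not be the case:

$ timeout 30s curl -o /dev/null -fsk https://h100.mon.obl.akash.pub:8443/status && echo "8443 OK" || echo "8443 /status check failed"
$ timeout 30s grpcurl -insecure h100.mon.obl.akash.pub:8444 list || echo "8444 gRPC check failed"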

Logs

Logs taken before the provider and operator-inventory were restarted:

h100.mon.obl.akash.pub.provider.log

h100.mon.obl.akash.pub.deployment-operator-inventory.log

andy108369 added the repo/provider (Akash provider-services repo issues) and awaiting-triage labels on Apr 13, 2024
@andy108369 (Contributor, Author)

FYI: Current workaround

The current livenessProbe checks help by restarting the provider whenever this issue is detected, so manifests can still be sent to the provider whenever users want to deploy something.

Liveness checks involved:
https://github.com/akash-network/helm-charts/blob/provider-9.2.5/charts/akash-provider/templates/statefulset.yaml#L261-L270
https://github.com/akash-network/helm-charts/blob/provider-9.2.5/charts/akash-provider/scripts/liveness_checks.sh#L12-L16

Running the liveness check manually confirms the issue:

$ kubectl -n akash-services describe pod akash-provider-0 | grep -i liveness
    Liveness:  exec [sh -c /scripts/liveness_checks.sh] delay=240s timeout=60s period=60s #success=1 #failure=3
  Normal   Killing    13m (x4 over 53m)    kubelet  Container provider failed liveness probe, will be restarted
  Warning  Unhealthy  6m7s (x14 over 55m)  kubelet  Liveness probe failed: api /status check failed
$ time kubectl -n akash-services exec -ti akash-provider-0 -- bash -x /scripts/liveness_checks.sh
Defaulted container "provider" out of: provider, init (init)
+ set -o pipefail
+ openssl x509 -in /config/provider.pem -checkend 3600 -noout
+ timeout 30s curl -o /dev/null -fsk https://127.0.0.1:8443/status
+ echo 'api /status check failed'
api /status check failed
+ exit 1
command terminated with exit code 1

real	0m31.510s
user	0m0.086s
sys	0m0.022s
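
Based on the trace above, the relevant portion of liveness_checks.sh is roughly equivalent to the following minimal sketch (the linked script is authoritative; error handling for the certificate check is assumed):

#!/usr/bin/env bash
set -o pipefail

# Provider client certificate must remain valid for at least the next 3600s.
openssl x509 -in /config/provider.pem -checkend 3600 -noout || exit 1

# The local /status endpoint must answer within 30s, otherwise fail the probe.
if ! timeout 30s curl -o /dev/null -fsk https://127.0.0.1:8443/status; then
    echo 'api /status check failed'
    exit 1
fi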

@andy108369 (Contributor, Author) commented Apr 13, 2024

Affected providers

Additionally, I've attached the complete logs for the provider pods, taken with the --previous argument to ensure the logs are from the pod instances that stopped answering on the status (8443/status & 8444 gRPC) endpoints.
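
For reference, the --previous logs attached below were collected along these lines (assuming the current kubectl context points at the respective provider cluster):

$ kubectl -n akash-services logs akash-provider-0 --previous > h100.mon.obl.akash.pub.provider.previous.log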

SW versions:

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                          IMAGE
akash-node-1-0                                ghcr.io/akash-network/node:0.32.2
akash-provider-0                              ghcr.io/akash-network/provider:0.5.12
operator-hostname-74744f497c-kn5v7            ghcr.io/akash-network/provider:0.5.12
operator-inventory-6985f746d4-jfpt2           ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node1   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node2   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node3   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node4   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node5   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node6   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node7   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node8   ghcr.io/akash-network/provider:0.5.12
  • h100.mon.obl.akash.pub
NAME               READY   STATUS    RESTARTS         AGE
akash-provider-0   1/1     Running   52 (3m27s ago)   9h

h100.mon.obl.akash.pub.provider.previous.log

  • a100.mon.obl.akash.pub
NAME               READY   STATUS    RESTARTS       AGE
akash-provider-0   1/1     Running   13 (10m ago)   9h

a100.mon.obl.akash.pub.provider.previous.log

  • sg.lnlm.akash.pub
NAME               READY   STATUS    RESTARTS        AGE
akash-provider-0   1/1     Running   5 (5h35m ago)   28h
  • provider-02.sandbox-01.aksh.pw
NAME               READY   STATUS    RESTARTS        AGE
akash-provider-0   1/1     Running   3 (4h10m ago)   28h

provider-02.sandbox-01.aksh.pw.provider.previous.log

  • pdx.nb.akash.pub
NAME               READY   STATUS    RESTARTS       AGE
akash-provider-0   1/1     Running   2 (6h2m ago)   19h

pdx.nb.akash.pub.provider.previous.log

  • ty.lneq.akash.pub
NAME               READY   STATUS    RESTARTS      AGE
akash-provider-0   1/1     Running   2 (10h ago)   28h

ty.lneq.akash.pub.provider.previous.log

  • sg.lneq.akash.pub
NAME               READY   STATUS    RESTARTS      AGE
akash-provider-0   1/1     Running   2 (18h ago)   28h

sg.lneq.akash.pub.provider.previous.log

  • au.lneq.akash.pub
NAME               READY   STATUS    RESTARTS     AGE
akash-provider-0   1/1     Running   1 (9h ago)   28h

au.lneq.akash.pub.provider.previous.log

  • akash.pro
NAME               READY   STATUS    RESTARTS      AGE
akash-provider-0   1/1     Running   1 (28h ago)   28h

akash.pro.provider.previous.log

  • hk.lneq.akash.pub
NAME               READY   STATUS    RESTARTS      AGE
akash-provider-0   1/1     Running   1 (25h ago)   28h

hk.lneq.akash.pub.provider.previous.log

no issues on these providers, yet 🤞

  • sl.lneq.akash.pub
NAME               READY   STATUS    RESTARTS   AGE
akash-provider-0   1/1     Running   0          28h
  • hurricane.akash.pub
NAME               READY   STATUS    RESTARTS   AGE
akash-provider-0   1/1     Running   0          28h
  • yvr.nb.akash.pub
NAME               READY   STATUS    RESTARTS   AGE
akash-provider-0   1/1     Running   0          28h
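
The per-provider pod status above can be reproduced by running the same kubectl query against each cluster, e.g. (the context name here is hypothetical):

$ kubectl --context h100-mon-obl -n akash-services get pod akash-provider-0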

@andy108369 (Contributor, Author) commented Apr 15, 2024

Providers running 0.5.12 can be downgraded to 0.5.11 using these commands:

cd provider
helm -n akash-services upgrade akash-hostname-operator akash/akash-hostname-operator --set image.tag=0.5.11
helm -n akash-services upgrade inventory-operator akash/akash-inventory-operator --set image.tag=0.5.11
helm -n akash-services upgrade akash-ip-operator akash/akash-ip-operator --set provider_address=<SET-your-provider-address-HERE> --set image.tag=0.5.11

helm upgrade akash-provider akash/provider -n akash-services -f provider.yaml --set bidpricescript="$(cat price_script_generic.sh | openssl base64 -A)" --set image.tag=0.5.11
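
After the downgrade, the image tags can be verified with the same command used earlier; all provider-related pods are expected to report ghcr.io/akash-network/provider:0.5.11:

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'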
