Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a health-check endpoint /health and CLI command #3101

Open
grisha87 opened this issue Mar 7, 2024 · 6 comments · Fixed by #3334
Open

Implement a health-check endpoint /health and CLI command #3101

grisha87 opened this issue Mar 7, 2024 · 6 comments · Fixed by #3334
Assignees
Labels
sdk-request issue requested by SDK team

Comments

@grisha87
Copy link
Contributor

grisha87 commented Mar 7, 2024

Why?

There are multiple reasons why providing such an endpoint makes sense:

  • yagna running in the docker container can be health-checked using that endpoint
  • any of-the-shelf devops/monitoring tooling can utilize that endpoint to inspect the health of the instance and trigger a restart if necessary
  • yagna has known stability issues that could be checked by this endpoint and if yagna becomes unresponsive, that will be visible and will trigger a restart by the orchestrators

We could also have a yagna health CLI command which will inspect the status of the instance and provide some insights on it for the users so that it's going to be easier to troubleshoot.

What would have to be changed?

Introduce the /health API endpoint

If everything is fine, then it should return HTTP 200.
If something is wrong with that instance, then it should return HTTP 500 and the body can contain information about the error:

{ "reason": "No connection to relay" }

The list of potential reasons and checks can be filled in below:

  • no connection to relay
  • super laggy connection to relay - ping above 5min
  • ...
  • ...

Introduce the "health" command

This command can execute the same checks as the endpoint listed above.

If everything is fine, the exit code should be 0.
If anything is wrong, the exit code should be 1.

From the UI perspective, a "checklist" showing what was checked and what's the status of the check would be great for any sort of bug or issue reports done by the community.

For more context: when we were leading the DegenHack teams, they ended up in situations where over time they "didn't get any offers". Today, one has to know a series of commands to troubleshoot this particular issue. I believe that the types of checks, and especially the health command could output information that would be already providing enough guidance in troubleshooting common issues.

@cryptobench
Copy link
Member

+1 - would be useful for the stats page too.

Right now on stats I have this script that Docker is running on the container

#!/bin/sh
set -e # Exit immediately if a command exits with a non-zero status.
set -x # Print commands and their arguments as they are executed.

# Get the Yagna app key
KEY=$(yagna app-key list --json | jq -r '.[0].key')

if [ -z "$KEY" ]; then
    echo "Key is empty, exiting."
    exit 1
fi

# Use the key in the curl command
curl -H "Authorization: Bearer ${KEY}" 127.0.0.1:7465/me

The REST API might work fine but the rest of yagna doesnt for example when checking payment status... so i've needed to healthcheck multiple things. Not sure if it covers all possible scenarios even, so a proper health endpoint would be great.

@cryptobench
Copy link
Member

Issues with the stats page right now... checked the instance, healthcheck is not returning any errors as its able to fetch from the Yagna REST API

ubuntu@stats:~$ docker exec -it 1d1d5d653f78 sh
/stats-backend # yagna payment status
[2024-03-17T18:52:44.824+0000 ERROR yagna] Exiting..., error details: Error while sending payment: Unreachable, /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/erc20_payment_lib-0.3.17/src/eth.rs:78:26
Error: Error while sending payment: Unreachable, /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/erc20_payment_lib-0.3.17/src/eth.rs:78:26
/stats-backend # /check_yagna.sh
+ yagna app-key list --json
+ jq -r '.[0].key'
+ KEY=stats
+ '[' -z stats ]
+ curl -H 'Authorization: Bearer stats' 127.0.0.1:7465/me
{"identity":"0x20541b279e2550ebcab44acb2186ae7b4cdcbc94","name":"autoconfigured","role":"manager"}/stats-backend #

In this case yagna payment status fails but I can easily curl the REST API /me endpoint which doesn't indicate an issue.

The more I run yagna the more obscure ways I need to implement healthchecks because so many individuals things are failing while the rest seems fine.

@prekucki

@cryptobench
Copy link
Member

Now I re-ran the yagna payment status command twice and suddenly it returns fine, but yagna still doesn't work correctly and doesnt retrieve offers.

/stats-backend # yagna payment status

Status for account: 0x20541b279e2550ebcab44acb2186ae7b4cdcbc94
Payment Driver status: OK

┌─────────────────────┬────────────────┬────────────┬─────────────┬────────────┬────────────┬───────────────┐
│  platform           │  total amount  │  reserved  │  amount     │  incoming  │  outgoing  │  gas          │
├─────────────────────┼────────────────┼────────────┼─────────────┼────────────┼────────────┼───────────────┤
│  driver: erc20next  │  0.0000 tGLM   │  0 tGLM    │  accepted   │  0 tGLM    │  0 tGLM    │  0.0000 tETH  │
│  network: goerli    │                │            │  confirmed  │  0 tGLM    │  0 tGLM    │               │
│  token: tGLM        │                │            │  requested  │  0 tGLM    │  0 tGLM    │               │
└─────────────────────┴────────────────┴────────────┴─────────────┴────────────┴────────────┴───────────────┘
/stats-backend # yagna payment status

Status for account: 0x20541b279e2550ebcab44acb2186ae7b4cdcbc94
Payment Driver status: OK

┌─────────────────────┬────────────────┬────────────┬─────────────┬────────────┬────────────┬───────────────┐
│  platform           │  total amount  │  reserved  │  amount     │  incoming  │  outgoing  │  gas          │
├─────────────────────┼────────────────┼────────────┼─────────────┼────────────┼────────────┼───────────────┤
│  driver: erc20next  │  0.0000 tGLM   │  0 tGLM    │  accepted   │  0 tGLM    │  0 tGLM    │  0.0000 tETH  │
│  network: goerli    │                │            │  confirmed  │  0 tGLM    │  0 tGLM    │               │
│  token: tGLM        │                │            │  requested  │  0 tGLM    │  0 tGLM    │               │
└─────────────────────┴────────────────┴────────────┴─────────────┴────────────┴────────────┴───────────────

@grisha87 grisha87 added the sdk-request issue requested by SDK team label Mar 29, 2024
@kamirr
Copy link
Contributor

kamirr commented Sep 18, 2024

Decided to detect the following problems for now:

  • Payment issues (as covered by the current erc20 implementation -- missing token, missing gas etc.).
  • Ya-relay ping not working
  • Time since last market broadcast (should never exceed 2 minutes).

Problems will be reported according to RFC 9457.

@grisha87
Copy link
Contributor Author

@kamirr , we just recently encountered a situation where after 40 days of operation, the passive market scan was providing offers only for 2 providers and didn't see the rest.

@prekucki , was made aware about the issue.

My question is: will your implementation potentially detect this situation as well?

@kamirr
Copy link
Contributor

kamirr commented Sep 23, 2024

My question is: will your implementation potentially detect this situation as well?

I don't think so. I'll discuss this today with the rest of core and let you know if a case for this will be added (unless it is actually covered).

@kamirr kamirr linked a pull request Sep 27, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sdk-request issue requested by SDK team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants