4 - Collect feedback on dashboard prototype #3

Open
cfl0ws opened this issue Oct 26, 2020 · 11 comments

@cfl0ws
Contributor

cfl0ws commented Oct 26, 2020

Oasis Mission Control Call for Feedback

Chainflow and our development partner Vitwit have been awarded an Oasis grant to build the Oasis Mission Control Validator Monitoring and Alerting Dashboard. You can find more details about that here.

We are excited to share this prototype with the community. Validators, we're building this for you.

Please review the work done so far and provide feedback. We'll use this feedback to update the prototype and deliver a final, open-sourced version for validators to use.

For example -

1 - Is the dashboard missing any key metrics?

2 - Are there any additional alerts you'd like to see be made available?

3 - Is there anything we can do to organize the information in a more user-friendly way, e.g. reorganize existing dashboards and/or create new ones?

Please provide your feedback in the comments of this issue.

Here's a brief overview of the dashboards and current alerts.

Summary Dashboard

This view provides a quick-look at overall validator and system health.

Screen Shot 2020-10-27 at 9 19 24 AM

Screen Shot 2020-10-27 at 9 19 47 AM

Validator Monitoring Dashboard

This view provides a comprehensive look at validator details and performance, expanding on the summary dashboard. It will also include proposal information once Oasis implements a Governance module.

Note: The system displays the number of total peers. For those that choose to implement a sentry node configuration, we will implement a metric that shows the peer names as well.

This is useful for confirming that a validator is connected to the peers its operator expects. In this scenario, an alert will also be configured to notify the user if the number of peers drops below a specified number.

For example, if your validator is connected to two sentries, the system will alert you if the number of peers drops below two.
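The sentry scenario above can be sketched in a few lines of Python. This is only an illustration of the idea, not Mission Control's code: the function name `check_peers`, the peer names `sentry-1` and `sentry-2`, and the shape of the returned dictionary are all assumptions made for the example. Only the idea of a peer-count threshold comes from the description above.

```python
def check_peers(expected_peers, connected_peers, num_peers_threshold):
    """Compare the connected peers against the peers an operator expects.

    Hypothetical helper for illustration; names are not taken from Mission Control.
    """
    expected = set(expected_peers)
    connected = set(connected_peers)
    return {
        # Number of distinct peers currently connected.
        "peer_count": len(connected),
        # True when the count has dropped below the configured threshold.
        "below_threshold": len(connected) < num_peers_threshold,
        # Expected peers (e.g. sentries) that are not connected right now.
        "missing": sorted(expected - connected),
        # Connected peers the operator did not list as expected.
        "unexpected": sorted(connected - expected),
    }
```

With two expected sentries and only one of them connected, `check_peers(["sentry-1", "sentry-2"], ["sentry-1"], 2)` reports `below_threshold` as true and lists `sentry-2` as missing.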

Screen Shot 2020-10-27 at 9 22 07 AM

Screen Shot 2020-10-27 at 9 22 22 AM

System Monitoring Dashboard

This view provides a comprehensive look at system performance metrics, expanding on the summary dashboard. Here you'll find all the system metrics you'd expect to see in a comprehensive system monitoring tool.

Screen Shot 2020-10-27 at 9 23 59 AM

Screen Shot 2020-10-27 at 9 24 13 AM

Screen Shot 2020-10-27 at 9 24 29 AM

Screen Shot 2020-10-27 at 9 24 49 AM

Screen Shot 2020-10-27 at 9 25 01 AM

Screen Shot 2020-10-27 at 9 25 14 AM

Screen Shot 2020-10-27 at 9 25 30 AM

Screen Shot 2020-10-27 at 9 25 40 AM

Screen Shot 2020-10-27 at 9 25 51 AM

Alerting

So far, these alerts are configured -

  • Alert when the missed blocks count reaches or exceeds missed_blocks_threshold.
    • This is an emergency alert that gets sent to Telegram and email. It's easily integrated with PagerDuty via email.
  • Alert when the number of peers falls below num_peers_threshold.
  • Alert when the Oasis node is not running on the validator.
  • Alert about validator health, i.e. whether it's voting or jailed.
    • This is a sanity-check alert that lets you know your validator is voting (if it is). It can be configured to send at user-specified times during the day.
  • Alert when the voting power of your validator drops below voting_power_threshold.

This image shows some of those alerts in action.

Screen_Shot_2020-09-07_at_2 34 22_PM
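As a rough sketch of how the threshold alerts listed above relate to each other, the following Python evaluates them against a single status snapshot. The threshold names `missed_blocks_threshold`, `num_peers_threshold`, and `voting_power_threshold` come from the list above. Everything else (the `Thresholds` and `ValidatorStatus` classes, the `evaluate_alerts` and `health_message` functions, and the message wording) is hypothetical and does not describe the actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    missed_blocks_threshold: int
    num_peers_threshold: int
    voting_power_threshold: int


@dataclass
class ValidatorStatus:
    # Hypothetical snapshot of what a monitoring loop might collect.
    node_running: bool
    missed_blocks: int
    num_peers: int
    voting_power: int
    jailed: bool


def evaluate_alerts(status, thresholds):
    """Return a list of alert messages that should fire for this snapshot."""
    alerts = []
    if not status.node_running:
        alerts.append("oasis node is not running")
    # "Reaches or exceeds" means the comparison is inclusive.
    if status.missed_blocks >= thresholds.missed_blocks_threshold:
        alerts.append(f"missed blocks: {status.missed_blocks}")
    # The peer and voting power alerts fire only when strictly below the threshold.
    if status.num_peers < thresholds.num_peers_threshold:
        alerts.append(f"peer count: {status.num_peers}")
    if status.voting_power < thresholds.voting_power_threshold:
        alerts.append(f"voting power: {status.voting_power}")
    return alerts


def health_message(status):
    """Periodic sanity-check message: is the validator voting or jailed?"""
    return "validator is jailed" if status.jailed else "validator is voting"
```

Note the asymmetry in the comparisons: the missed blocks alert is inclusive of its threshold, while the peer and voting power alerts fire only once the value drops below theirs.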

@cfl0ws
Contributor Author

cfl0ws commented Nov 12, 2020

No feedback came in during this round of collection, so we are happy to do another round of updates after the community has had a chance to use the tool.

@joesixpack

FYI, the dashboards don't completely work. Example...

image

@cfl0ws
Contributor Author

cfl0ws commented Dec 18, 2020

@joesixpack apologies for the delayed response. Can you please provide additional context?

I'm assuming this is a screenshot of an implementation you attempted? If so, what steps did you follow?

cc: @PrathyushaLakkireddy

@PrathyushaLakkireddy
Collaborator

@joesixpack, some metrics are pulled from Prometheus and others come from the network URL you have configured.
To get the Prometheus metrics working, the Oasis node has to be started with these flags: --metrics.mode pull --metrics.address <listen-address>:3000
Could you also check whether the configured network URL is working?
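One way to check whether a node's metrics endpoint is reachable is to fetch it and count the metric names it exposes. The sketch below uses only the Python standard library and parses the Prometheus text exposition format. The function names are hypothetical, and the `/metrics` path in the usage note is an assumption based on the usual Prometheus convention, not something confirmed in this thread.

```python
import urllib.request


def parse_metric_names(text):
    """Extract metric names from Prometheus text-format output."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Sample lines look like `name{label="x"} 1.0` or `name 1.0`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def fetch_metric_names(url, timeout=5.0):
    """Fetch a metrics endpoint and return the set of metric names.

    Raises OSError (including urllib.error.URLError) if the endpoint
    cannot be reached or returns an HTTP error such as Bad Gateway.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return parse_metric_names(body)
```

For example, calling `fetch_metric_names("http://<listen-address>:<port>/metrics")` from the machine that runs the dashboard should either return a non-empty set or raise an error that points at the connectivity problem.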

@joesixpack

joesixpack commented Dec 18, 2020

For network URL I'm using "http://157.230.100.229:3000" which is your server.

Oasis config.toml has:

metrics:
  mode: pull
  address: 0.0.0.0:9999

Port 3000 is not available to use as that is what Grafana uses for its web dashboard.

Is the network url actually supposed to point to my own node's metric address? That is not stated in the docs. That makes some kind of sense and I tried that and port 3001 also, but the dashboard errors (Bad Gateway) didn't resolve.

Regardless, I've already run into that edge-case bug twice. Since I can't upgrade to 20.12.3 yet (I did so accidentally, and it worked fine before I reverted), I'll have to shut down Mission Control to prevent another crash.

@joesixpack

joesixpack commented Dec 18, 2020

There's also what looks like missing and/or wrongly named datasources in some of the dashboards.

@PrathyushaLakkireddy
Collaborator

Sorry about that issue, @joesixpack. If you have another network's URL, you can use that; otherwise you can keep the one we provided. I will update the Grafana dashboards to resolve the Bad Gateway errors.

@joesixpack

I'm seeing this in the log:

2021/01/05 01:26:58 Error while unmarshelling the validator set data proto: wrong wireType = 0 for field Ed25519

@cfl0ws
Contributor Author

cfl0ws commented Jan 6, 2021

@PrathyushaLakkireddy please take a look 👆

@joesixpack note we currently recommend ONLY running Oasis Mission Control with v20.12.3. This is due to a bug in the Oasis code that was fixed in v20.12.3.

See details here.

Note that the chances of the bug crashing the validator when running Mission Control are very low. We ran Chainflow's instance without a problem for a couple months, then the bug nailed us. It's for this reason we're suggesting to stay on the safe side and wait until you're running v20.12.3 on mainnet.

@PrathyushaLakkireddy
Collaborator

> I'm seeing this in the log:
>
> 2021/01/05 01:26:58 Error while unmarshelling the validator set data proto: wrong wireType = 0 for field Ed25519

Fixed.

@joesixpack

joesixpack commented Aug 23, 2021

Could you upload the dashboards to Grafana and provide the #'s to import?
