feat(#60): dashboards with CHT api express metrics #75

Merged
kennsippell merged 19 commits into main from 60-express-dashboards on Sep 21, 2023

Conversation

kennsippell
Member

@kennsippell kennsippell commented Jul 18, 2023

#60

Left Column

Screenshot from 2023-07-18 01-51-33
The dashboard starts with a replication Apdex score. I'm waiting on some UX research to set these values, but tentatively I've set the following thresholds.

User State    Threshold
Satisfied     < 90 secs
Tolerating    90 - 180 secs
Frustrated    > 180 secs

One interesting constraint: each threshold we monitor needs to match one of these hardcoded buckets in the CHT 4.3 API, so the thresholds are not particularly easy to change on the fly or per partner.
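To make the bucket constraint concrete, here is a minimal sketch of how an Apdex score can be derived from cumulative histogram bucket counts. The function name and the example counts are hypothetical, and it assumes Prometheus-style cumulative buckets keyed by their upper bound (`le`); it is not taken from the dashboard itself. Note that bucket bounds are inclusive, so "satisfied" here means at most 90 seconds.

```python
import math


def apdex_from_buckets(buckets, satisfied_le, tolerating_le):
    """Compute an Apdex score from cumulative histogram bucket counts.

    `buckets` maps each upper bound ("le", in seconds) to the cumulative
    number of requests that finished within that bound. The math.inf key
    holds the total request count.
    """
    # Both thresholds must coincide with an existing bucket boundary,
    # otherwise the counts needed for the score are simply not available.
    for threshold in (satisfied_le, tolerating_le):
        if threshold not in buckets:
            raise ValueError(f"no bucket with le={threshold}")

    total = buckets[math.inf]
    if total == 0:
        return None  # no traffic, so the score is undefined

    satisfied = buckets[satisfied_le]
    tolerating = buckets[tolerating_le] - satisfied
    return (satisfied + tolerating / 2) / total


# Hypothetical counts: 60 requests within 90s, 80 within 180s, 100 in total.
example = {30: 40, 90: 60, 180: 80, math.inf: 100}
score = apdex_from_buckets(example, 90, 180)
print(score)
```

Asking for a threshold that has no bucket, for example 120 seconds, raises an error in this sketch, which mirrors why the thresholds cannot be tuned freely.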

The error rate shows the percentage of get-ids requests that result in a status code of 400-599.
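As a rough illustration of that error-rate definition, the sketch below computes the percentage from request counts grouped by status code. The function name and the sample counts are made up for the example; the dashboard presumably computes this in a query rather than in code like this.

```python
def error_rate_percent(requests_by_status):
    """Return the percentage of requests whose status code is 400-599."""
    total = sum(requests_by_status.values())
    if total == 0:
        return 0.0  # no requests, so report no errors
    errors = sum(
        count
        for status, count in requests_by_status.items()
        if 400 <= status <= 599
    )
    return 100.0 * errors / total


# Hypothetical counts per status code.
sample = {200: 90, 404: 6, 503: 4}
rate = error_rate_percent(sample)
print(rate)
```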

Right Column

Screenshot from 2023-07-18 01-51-47

  • Total number of successful and failing replications.
  • Rate of replications per second by endpoint (optional; I'm considering removing this).
  • Replication latency, shown as the 50th percentile, 90th percentile, and max replication times.
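If the latency percentiles are estimated from histogram buckets (an assumption; the comment does not say how they are computed), the estimate works roughly like the sketch below. It follows the general approach of linear interpolation within the bucket that contains the target rank. The function name and bucket counts are illustrative only.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Nothing is known beyond the last finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for bounds of 30s, 90s, 180s and +Inf.
sample = [(30, 40), (90, 60), (180, 80), (math.inf, 100)]
p50 = estimate_quantile(0.5, sample)
p90 = estimate_quantile(0.9, sample)
print(p50, p90)
```

In this example the 90th percentile falls in the open-ended top bucket, so the estimate is capped at 180 seconds. High percentiles and maximums derived this way are only as precise as the bucket bounds allow.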

@kennsippell kennsippell changed the title Dashboards with CHT API Express Metrics feat(#60) - Dashboards with CHT API Express Metrics Jul 18, 2023
@kennsippell kennsippell changed the title feat(#60) - Dashboards with CHT API Express Metrics feat(#60): Dashboards with CHT API Express Metrics Jul 18, 2023
@kennsippell kennsippell changed the title feat(#60): Dashboards with CHT API Express Metrics feat(#60): dashboards with CHT api express metrics Jul 18, 2023
@kennsippell
Member Author

kennsippell commented Jul 19, 2023

The second dashboard is a more technical one, based on the prometheus-api-metrics shared dashboard with all widgets and settings related to Apdex removed. I don't know what the Apdex thresholds should be for general CHT traffic.

@kennsippell kennsippell marked this pull request as ready for review July 19, 2023 06:50
@jkuester
Collaborator

@kennsippell should these dashboards be getting populated with data from the fake-cht? I am trying to test them locally by pulling them in on top of 60-express-metrics, but I am not seeing anything populated in the dashboards:

image
image

Collaborator

@jkuester jkuester left a comment

What is the rationale for adding these as brand new dashboards instead of just including them as rows on the CHT Admin Details dashboard? I ask this mostly as a general question, in that I am not sure exactly how we should determine "what belongs on a dashboard". I just know that from a user's perspective, if I want to find data about how long it is taking the server to respond to requests, it is not immediately clear to me whether I would find that data on CHT Admin Details, CHT Replication, or CHT API Dashboard. (c.c. @m5r I would be glad for your feedback on this question as well!)

On one hand it is nice to have all of the panels dependent on a particular scrape config or CHT version all grouped on one dashboard since that makes it easier to document the requirements of a particular dashboard. However, it feels like dependencies (particularly on CHT versions) are bound to evolve over time and things will get more complicated.

My current thinking is that our dashboard design should prioritize the best experience for folks running the latest CHT version (while remaining aware of how changes will impact folks running older versions). Basically, we should put the panels where it makes the most sense to have them (regardless of what CHT version they are dependent on). If folks with older CHT versions see blank panels that is OK (as long as we indicate in the panel description the minimum CHT version needed for the panel).

That being said, for this particular case I have not fully made up my mind on where the best place is for the panels to go. 🤔 Seems like if we keep these panels in their own dashboards (like you have them now), then we should plan to break the CHT Admin Details up into separate dashboards (Couch data, Outbound messaging data, etc). This could be done as needed in the future. Open to other ideas though!

"templating": {
"list": [
{
"definition": "query_result(up{job=~\"cht\"})",
Collaborator

I wonder if it is worth filtering the instances here based on their version (so we only show instances >= 4.3)?

Suggested change
- "definition": "query_result(up{job=~\"cht\"})",
+ "definition": "query_result(cht_version{app!~\"^([0-3]\\.)|(4\\.[0-2]\\.).*\"})",
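One detail worth checking before adopting a version filter like this: Prometheus regex matchers are fully anchored, so the pattern must match the entire label value. The sketch below emulates that with Python's `re.fullmatch`, which is an approximation of Prometheus's RE2 engine, and assumes the `app` label holds version strings such as `4.3.0`. The helper name and the alternative pattern are illustrative, not part of the PR.

```python
import re

# The pattern from the suggestion, after JSON unescaping.
SUGGESTED = r"^([0-3]\.)|(4\.[0-2]\.).*"
# A hypothetical variant where ".*" applies to both alternatives.
GROUPED = r"([0-3]\.|4\.[0-2]\.).*"


def is_excluded(pattern, version):
    """Emulate a fully anchored `!~` matcher: True means filtered out."""
    return re.fullmatch(pattern, version) is not None


versions = ["3.17.0", "4.2.1", "4.3.0"]
suggested_result = {v: is_excluded(SUGGESTED, v) for v in versions}
grouped_result = {v: is_excluded(GROUPED, v) for v in versions}
print(suggested_result)
print(grouped_result)
```

Under full anchoring, the first alternative of the suggested pattern only matches a value that is exactly a digit followed by a dot, so a version like `3.17.0` would not be filtered out. Grouping the alternatives before the trailing `.*` excludes both 3.x and 4.0 to 4.2 while keeping 4.3.0.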

@kennsippell
Member Author

kennsippell commented Jul 21, 2023

Rationale for adding these into their own dashboards

When I've worked in monitored environments previously, there were thousands of dashboards. I didn't know what they were all for, but I could come in and learn what I needed to learn. I could get things done. I personally preferred having dashboards which were tailored to specific scenarios/tasks, rather than a few dashboards which tried to do everything. That was for a very complex service though and I'm not sure if it is applicable - but it is my lens.

I'm hoping cht_partnerships_replication will be used by partnerships-level people and not CHT admins. See this.

cht_admin_api_express I suspect this targets a core dev user type and not a CHT admin. See this.

I'd be quite open to including endpoint performance on the existing "CHT Details" dashboards - that feels like an interesting level of detail to me for CHT admins. Maybe?

@m5r
Contributor

m5r commented Jul 24, 2023

What is the rationale for adding these as brand new dashboards

I personally preferred having dashboards which were tailored to specific scenarios/tasks

@kennsippell's take makes sense. I'm not strongly opinionated on either side of the question.

  • Going all in with a single mega-dashboard will probably result in meh performance because of having many charts displayed at once.
  • Splitting dashboards by persona (i.e. a dashboard tailored for app devs, another one for SREs, another one for CHT admins and so on...) would be nice to give a personalized view of what's happening with a CHT instance but there is bound to be some duplication between dashboards. I don't know if grafana configs can share reusable dashboard "components" but that could be a solution.
  • And finally, splitting dashboards by scenario would categorize dashboards neatly with nearly 0 duplicate dashboard config but a single person might need to open many dashboards side by side to get all the information they need.

We can't go wrong with either the first or the third solution; either way, we can split dashboards or regroup them without too much overhead.
cc @jkuester

@jkuester
Collaborator

Thanks for all the great conversation here @m5r and @kennsippell! Maybe I was just behind the curve here, but I feel like things are a lot clearer in my head now regarding organizational strategies for these dashboards!

It seems to me that if we try to keep dashboards focused on specific scenarios/tasks (Mokhtar's #3) we would still get pretty much all the benefits from Mokhtar's #2 case of splitting by persona (since a given persona would be focused on one or more scenarios/tasks). So, this seems like a good guiding design principle for us to use!

And, with that being said, given the extra context @kennsippell provided, I think it makes sense to keep these two as separate dashboards!

@mrjones-plip
Contributor

mrjones-plip commented Aug 2, 2023

Deferring to the feedback from @jkuester & @m5r, so removing myself as a reviewer. Lemme know if you want me to jump back in!

@mrjones-plip mrjones-plip removed their request for review August 2, 2023 20:23
@kennsippell
Member Author

@kennsippell should these dashboards be getting populated with data from the fake-cht?

The API dashboards are expected to work, but the replication dashboards would be empty with fake-cht. It's not clear to me how to get the replication dashboards working with fake-cht. I could add the endpoint, but what would ping the endpoint to generate the data? I guess it could be Prometheus (?). If you have ideas, I can pursue them, but I tested with a live CHT.

@kennsippell kennsippell requested a review from jkuester August 11, 2023 07:53
@mrjones-plip
Contributor

@jkuester - with CHT Core 4.3.0 released, which includes the API express metrics, it'd be good to move this PR along so we can release a matching version of Watchdog. Put this in yer queue when ya get a sec!

thanks

@mrjones-plip
Contributor

Ah - I see @kennsippell is out on holiday until Sept 5th, in case any next steps are blocked until his return!

Collaborator

@jkuester jkuester left a comment

👍 Super excited to get these dashboards in!

@kennsippell
Member Author

Thanks for the review @jkuester!

Good times with fake-cht server now:
image

Collaborator

@jkuester jkuester left a comment

LGTM!

@@ -2,7 +2,7 @@
 "name": "fake-cht",
 "version": "1.0.0",
 "scripts": {
-"start": "node src/index.js"
+"start": "node --experimental-fetch src/index.js"
Collaborator

What's wrong with Node 18? 😆

@kennsippell kennsippell merged commit e57487d into main Sep 21, 2023
@kennsippell kennsippell deleted the 60-express-dashboards branch September 21, 2023 05:58
@medic-ci
Collaborator

🎉 This PR is included in version 1.11.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
