feat(#60): dashboards with CHT api express metrics #75
The 2nd dashboard is a techy one, based on the prometheus-api-metrics shared dashboard with all widgets and settings related to Apdex removed. I don't know what the Apdex thresholds should be for general CHT traffic.
@kennsippell should these dashboards be getting populated with data from the |
What is the rationale for adding these as brand new dashboards instead of just including them as rows on the CHT Admin Details dashboard? I ask this mostly as a general question, in that I am not sure exactly how we should determine "what belongs on a dashboard". I just know that, from a user's perspective, if I want to find data about how long it is taking the server to respond to requests, it is not immediately clear to me whether I would find that data on CHT Admin Details, CHT Replication, or CHT API Dashboard. (c.c. @m5r I would be glad for your feedback on this question as well!)
On one hand it is nice to have all of the panels dependent on a particular scrape config or CHT version all grouped on one dashboard since that makes it easier to document the requirements of a particular dashboard. However, it feels like dependencies (particularly on CHT versions) are bound to evolve over time and things will get more complicated.
My current thinking is that our dashboard design should prioritize the best experience for folks running the latest CHT version (while remaining aware of how changes will impact folks running older versions). Basically, we should put the panels where it makes the most sense to have them (regardless of what CHT version they are dependent on). If folks with older CHT versions see blank panels that is OK (as long as we indicate in the panel description the minimum CHT version needed for the panel).
That being said, for this particular case I have not fully made up my mind on where the best place is for the panels to go. 🤔 Seems like if we keep these panels in their own dashboards (like you have them now), then we should plan to break the CHT Admin Details up into separate dashboards (Couch data, Outbound messaging data, etc). This could be done as needed in the future. Open to other ideas though!
"templating": {
  "list": [
    {
      "definition": "query_result(up{job=~\"cht\"})",
I wonder if it is worth filtering the instances here based on their version (so we only show instances >= 4.3)?
- "definition": "query_result(up{job=~\"cht\"})",
+ "definition": "query_result(cht_version{app!~\"^([0-3]\\.)|(4\\.[0-2]\\.).*\"})",
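A quick way to sanity-check a version filter like this is to emulate it outside Prometheus. The sketch below assumes Prometheus's documented behaviour that regex label matchers are fully anchored, and uses Python's `re.fullmatch` as a stand-in for it. `is_excluded`, the `GROUPED` variant, and the sample version strings are hypothetical and not part of this PR. Under that assumption, the trailing `.*` in the pattern as written only applies to the second alternative, so a grouped variant is shown for comparison.

```python
import re

# Pattern as written in the suggestion (with JSON/PromQL string escaping removed).
UNGROUPED = r"^([0-3]\.)|(4\.[0-2]\.).*"
# Hypothetical grouped variant: the trailing .* applies to both alternatives.
GROUPED = r"([0-3]\.|4\.[0-2]\.).*"


def is_excluded(pattern: str, version: str) -> bool:
    """Return True if an `app!~pattern` matcher would drop this version.

    re.fullmatch stands in for Prometheus's fully anchored regex matching.
    """
    return re.fullmatch(pattern, version) is not None


if __name__ == "__main__":
    # Print, per version, whether each pattern would exclude it.
    for version in ["3.15.0", "4.2.1", "4.3.0", "4.10.0"]:
        print(version, is_excluded(UNGROUPED, version), is_excluded(GROUPED, version))
```

If the anchoring assumption holds, the grouped form is the one that filters out both 0.x-3.x and 4.0-4.2 instances while keeping 4.3 and later.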
Rationale for adding these into their own dashboards
When I've worked in monitored environments previously, there were thousands of dashboards. I didn't know what they were all for, but I could come in and learn what I needed to learn. I could get things done. I personally preferred having dashboards which were tailored to specific scenarios/tasks, rather than a few dashboards which tried to do everything. That was for a very complex service though and I'm not sure if it is applicable - but it is my lens.
I'd be quite open to including endpoint performance on the existing "CHT Details" dashboards - that feels like an interesting level of detail to me for CHT admins. Maybe?
@kennsippell's take makes sense. I'm not strongly opinionated on either side of the question.
We can't go wrong with either the first or the third solution. Either way, we can split dashboards or regroup them without too much overhead.
Thanks for all the great conversation here @m5r and @kennsippell! Maybe I was just behind the curve here, but I feel like things are a lot clearer in my head now regarding organizational strategies for these dashboards! It seems to me that if we try to keep dashboards focused on And, with that being said, given the extra context @kennsippell provided, I think it makes sense to keep these two as separate dashboards!
The API dashboards are expected to work, but the replication dashboards would be empty with fake-cht. It's not clear to me how to get the replication dashboards working with fake-cht. I could add the endpoint, but what would ping the endpoint to generate the data? I guess it could be prom (?). If you have ideas, I can pursue them; but I tested with live CHT.
@jkuester - with CHT Core 4.3.0 released, which includes the API express metrics, it'd be good to move this PR along so we can release a matching version of Watchdog. Put this in yer queue when ya get a sec! Thanks
Ah - I see @kennsippell is out on holiday until Sept 5th, in case any next steps are blocked until his return!
👍 Super excited to get these dashboards in!
grafana/provisioning/dashboards/CHT/cht_coredev_api_express.json
grafana/provisioning/dashboards/CHT/cht_partnerships_replication.json
Co-authored-by: Joshua Kuestersteffen <[email protected]>
…watchdog into 60-express-dashboards
Thanks for the review @jkuester!
LGTM!
development/fake-cht/package.json
@@ -2,7 +2,7 @@
   "name": "fake-cht",
   "version": "1.0.0",
   "scripts": {
-    "start": "node src/index.js"
+    "start": "node --experimental-fetch src/index.js"
What's wrong with Node 18? 😆
🎉 This PR is included in version 1.11.0 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀
#60
Left Column
The dashboard starts with a replication Apdex score. I'm waiting on some UX research to set these values, but tentatively I've set the following thresholds.
An interesting constraint is that the threshold we monitor for needs to match one of the hardcoded buckets in the CHT 4.3 API. So this does not seem particularly agile or easy to change on the fly or per partner.
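For reference, the standard Apdex formula is (satisfied + tolerating / 2) / total, where requests at or under a threshold T count as satisfied and requests between T and 4T count as tolerating. Below is a minimal sketch of computing it from cumulative histogram bucket counts, which also illustrates the constraint: both T and 4T have to be existing bucket boundaries. The bucket boundaries and counts are made-up examples, not the buckets hardcoded in CHT 4.3, and `apdex_from_buckets` is a hypothetical helper.

```python
def apdex_from_buckets(buckets: dict, total: int, threshold: float) -> float:
    """Compute Apdex from cumulative (Prometheus-style `le`) bucket counts.

    `buckets` maps an upper bound in seconds to the number of requests
    that completed at or under that bound.
    """
    tolerable = 4 * threshold
    if threshold not in buckets or tolerable not in buckets:
        raise ValueError("threshold T and 4T must match existing bucket boundaries")
    if total == 0:
        raise ValueError("no requests observed")
    satisfied = buckets[threshold]
    # Cumulative counts: requests in (T, 4T] are the difference of the two buckets.
    tolerating = buckets[tolerable] - satisfied
    return (satisfied + tolerating / 2) / total


if __name__ == "__main__":
    # Made-up cumulative counts for illustration only.
    example = {0.5: 60, 1.0: 80, 2.0: 90, 4.0: 95}
    print(apdex_from_buckets(example, total=100, threshold=1.0))
```

Picking a threshold that is not a bucket boundary (say 0.3 seconds in this example) raises an error, which mirrors why the monitored threshold cannot be changed freely without changing the buckets.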
Error rate shows % of get-ids requests resulting in status code 400-599.
Right Column
Total number of successful and failing replications.
Rate of replications per second by endpoint (optional; considering removing this).
Replication latency: 50th percentile, 90th percentile, and max replication times.
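To make the error-rate and latency panels concrete, here is a small sketch of the underlying arithmetic on raw data. It is an illustration only: the dashboards presumably derive these values from Prometheus metrics rather than raw samples, and `error_rate_percent`, `percentile`, and all the sample numbers are hypothetical.

```python
import math


def error_rate_percent(status_counts: dict) -> float:
    """Percentage of requests whose status code is in the 400-599 range."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for code, n in status_counts.items() if 400 <= code <= 599)
    return 100.0 * errors / total


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile of raw samples, with pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up request counts by status code and latencies in seconds.
    counts = {200: 90, 404: 6, 500: 4}
    latencies = [0.2, 0.4, 0.5, 0.9, 1.1, 1.3, 2.0, 2.5, 3.0, 9.0]
    print(error_rate_percent(counts))
    print(percentile(latencies, 50), percentile(latencies, 90), max(latencies))
```

Showing the max next to the 50th and 90th percentiles is useful because a single slow replication can be invisible in the percentiles but still matter to the user who experienced it.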