stats: export vttablet_tablet_type to prometheus #14303
Closed
Description
This PR is very similar in spirit to #12772.
We already have a TabletType string that is exported to Golang expvars, e.g. after running ./examples/local/101_initial_cluster.sh.
However, this is not exported to Prometheus. It's a pretty valuable thing to know for monitoring purposes, e.g. to detect and alert on conditions like "primary tablet is not serving".
This PR exposes vttablet_tablet_type via /metrics, without changing the shape of the data currently exported via /debug/vars, e.g. after running ./examples/local/101_initial_cluster.sh.
Notes
tablet_type label
A note about the label name, tablet_type. There is another metric, vttablet_tablet_type_count, that has the label type, but there is another set of more recent metrics by @rafer in #13521 that have the label tablet_type. I had to pick one of these, so I went with the more recent one, which I happen to prefer, personally.
The "right way"
The "right way" to export Prometheus metrics is to define the time series at process start up, and then only change their value.
That means that this metric in its current form is using Prometheus the "wrong way" - a ChangeTabletType or reparent will change the value of the tablet_type label, which from Prometheus' point of view means that one time series has disappeared, and a new time series has suddenly appeared.
The current way should be somewhat usable, but will prevent Prometheus users from using the metric to its fullest.
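To make the concern above concrete, here is a minimal, stdlib-only sketch that models a scrape as a map from series identity (metric name plus labels) to value. It does not use the Vitess stats package or the Prometheus client; the tablet type values ("primary", "replica", "rdonly") and the metric name example_tablet_type_state are assumptions chosen only for illustration. It contrasts a metric whose label value changes with a set of series that is fixed at startup and only changes value.

```go
package main

import (
	"fmt"
	"sort"
)

// scrapeLabelValued models the metric in its current form: one sample whose
// tablet_type label carries the current type. The label value is part of the
// series identity, so a different type yields a different series.
func scrapeLabelValued(current string) map[string]float64 {
	return map[string]float64{
		fmt.Sprintf("vttablet_tablet_type{tablet_type=%q}", current): 1,
	}
}

// scrapeFixedSeries models series defined once at startup: one series per
// known type, with value 1 for the current type and 0 for the others.
// The metric name here is hypothetical.
func scrapeFixedSeries(known []string, current string) map[string]float64 {
	out := make(map[string]float64, len(known))
	for _, t := range known {
		v := 0.0
		if t == current {
			v = 1
		}
		out[fmt.Sprintf("example_tablet_type_state{tablet_type=%q}", t)] = v
	}
	return out
}

// seriesNames returns the series identities of a scrape in sorted order.
func seriesNames(scrape map[string]float64) []string {
	names := make([]string, 0, len(scrape))
	for name := range scrape {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// sameSeries reports whether two scrapes expose exactly the same set of
// series identities, ignoring their values.
func sameSeries(a, b map[string]float64) bool {
	if len(a) != len(b) {
		return false
	}
	for name := range a {
		if _, ok := b[name]; !ok {
			return false
		}
	}
	return true
}

func main() {
	known := []string{"primary", "replica", "rdonly"} // illustrative values
	before, after := "replica", "primary"             // e.g. a reparent promotes this tablet

	lvBefore, lvAfter := scrapeLabelValued(before), scrapeLabelValued(after)
	fmt.Println("label-valued before:", seriesNames(lvBefore))
	fmt.Println("label-valued after: ", seriesNames(lvAfter))
	fmt.Println("same series set:", sameSeries(lvBefore, lvAfter))

	fxBefore, fxAfter := scrapeFixedSeries(known, before), scrapeFixedSeries(known, after)
	fmt.Println("fixed before:", fxBefore)
	fmt.Println("fixed after: ", fxAfter)
	fmt.Println("same series set:", sameSeries(fxBefore, fxAfter))
}
```

In the label-valued model the set of series differs before and after the type change, while in the fixed model the same series persist and only their values flip, at the cost of one series per tablet type.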
Perhaps a better approach would be to define a new metric like vttablet_tablet_types, which is a map of tablet types to either 0 or 1. These time series would be registered at VTTablet startup, and then change during ChangeTabletType and reparents. I opted not to do this because it seemed like a bigger change, and would also emit a lot more metrics. But happy to change it if people prefer.
Related Issue(s)
Fixes #14300
Checklist
Deployment Notes