agent: Add per-VM metric for desired CU(s) #1108
Open
This commit adds a new per-VM metric: autoscaling_vm_desired_cu. It's based on the same "desired CU" information exposed by the scaling event reporting, but it is updated continuously instead of being rate limited to avoid spamming our reporting.
The metric has the same base labels as the other per-VM metrics, with the addition of the "reason" label, which is one of:
- total - the goal CU, after taking the maximum of the individual parts and rounding up to the next unit.
- cpu - goal CU size in order to fit the current CPU usage
- mem - goal CU size in order to fit the current memory usage (including some information derived from LFC, to make sure there's room for cache too)
- lfc - goal CU size in order to fit the estimated working set size
All of these values are also multiplied by the same Compute Unit factor as with the normal scaling event reporting, so that Neon's fractional compute units are exposed as such in the metrics, even as we use integer compute units in the autoscaler-agent.
Also note that all values except "total" are NOT rounded, and instead show the fractional amounts to allow better comparison.
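To illustrate how the four "reason" values relate to each other, here is a minimal sketch in Go. It is not the code from this PR: the DesiredCU type, the ReasonValues function, and the cuFactor parameter are hypothetical names, and "rounding up to the next unit" is assumed to mean a plain ceiling.

```go
package main

import "math"

// DesiredCU is a hypothetical container for the unrounded per-component
// goals, expressed in the integer-CU terms used inside the autoscaler-agent.
type DesiredCU struct {
	CPU float64
	Mem float64
	LFC float64
}

// ReasonValues returns the value that would be exposed for each "reason"
// label. Only "total" is rounded (assumed here to be a ceiling of the
// maximum); the other values stay fractional. Every value is scaled by
// cuFactor, a stand-in for the Compute Unit factor mentioned above.
func ReasonValues(d DesiredCU, cuFactor float64) map[string]float64 {
	total := math.Ceil(math.Max(d.CPU, math.Max(d.Mem, d.LFC)))
	return map[string]float64{
		"total": total * cuFactor,
		"cpu":   d.CPU * cuFactor,
		"mem":   d.Mem * cuFactor,
		"lfc":   d.LFC * cuFactor,
	}
}
```

For example, with hypothetical goals of 1.5 (cpu), 2.25 (mem) and 0.5 (lfc) and an assumed factor of 0.25, "total" would come out as 0.75 while "mem" would stay at the unrounded 0.5625.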
KNOWN LIMITATION: If ReportDesiredScaling is disabled at runtime for a particular VM, the metrics will not be cleared, and instead will just cease to be updated. I figured this is a reasonable trade-off for simplicity.
Notes for review: Tested this locally with the following patch to vm-deploy.yaml:
AFAICT it works as intended, but metrics are sometimes tricky. I plan to test it on staging before merging.
Also note: This PR builds on #1107 and must not be merged before it.