API: Add request metrics for disaster recovery #13825

hamistao · 2024-07-26T05:34:38Z

This introduces API rates metrics for Disaster Recovery. The related spec should be published once this is merged.

Here is a raw output example of /1.0/metrics with the new metrics:
https://pastebin.canonical.com/p/wvhcnK7tq6/

github-actions · 2024-07-26T05:34:52Z

Heads up @mionaalex - the "Documentation" label was applied to this issue.

hamistao · 2024-07-26T05:38:31Z

@tomponline Please take a look at this when you can. I would appreciate some comments on the overall structure of the solution so far.

tomponline · 2024-07-26T07:29:51Z

This is incomplete. To be done:

1. Handle metric value updates on asynchronous operations
2. Fine-tune entity_type assignment
3. Fix GH tests
4. Manual tests
5. Docs

You can convert this to a checklist in GH so we can see progress as you tick them off.

lxd/api.go

lxd/metrics/api_rates.go

lxd/metrics/types.go

test/suites/metrics.sh

shared/entity/type.go

lxd/operations/operations.go

hamistao · 2024-07-30T21:09:41Z

@mseralessandri @tomponline This is ready for a full review.
The endpoints are categorized by entity type as listed below. Those marked with * are endpoints I categorized on my own, as they were not specifically mentioned during the discussions. If you think I should keep those separate or move them to another category, let me know and I will make the changes:

TypeInstance for /{version}/containers, /{version}/virtual-machines and /{version}/instances endpoints.
TypeNetwork for /{version}/network-zones, /{version}/network-allocations, /{version}/network-acls and /{version}/networks endpoints.
TypeStoragePool for /{version}/storage-pools and /{version}/storage-volumes endpoints.
TypeIdentity for /{version}/auth and /{version}/certificates* endpoints.
TypeImage for /{version}/images endpoints.
TypeNode for /{version}/cluster endpoints.
TypeProject for /{version}/projects endpoints.
TypeProfile for /{version}/profiles endpoints.
TypeWarning for /{version}/warnings endpoints.
TypeOperation for /{version}/operations endpoints.
TypeServer for /{version}/events, /{version}/metrics, /, /1.0/, /1.0/events, /1.0/internal and /{version}/resources endpoints.*
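
The categorization above can be sketched as a lookup on the first path segment after the API version. This is a minimal illustration, not the code in this PR: the function name, the table, and the label strings are all hypothetical, and it assumes that anything not listed (including `/` and unknown paths) falls back to the server type.

```go
package main

import (
	"fmt"
	"strings"
)

// entityTypeByEndpoint maps the endpoint family (the path segment that
// follows the API version) to an entity type label. The label strings are
// placeholders chosen for this sketch.
var entityTypeByEndpoint = map[string]string{
	"containers":          "instance",
	"virtual-machines":    "instance",
	"instances":           "instance",
	"network-zones":       "network",
	"network-allocations": "network",
	"network-acls":        "network",
	"networks":            "network",
	"storage-pools":       "storage_pool",
	"storage-volumes":     "storage_pool",
	"auth":                "identity",
	"certificates":        "identity",
	"images":              "image",
	"cluster":             "node",
	"projects":            "project",
	"profiles":            "profile",
	"warnings":            "warning",
	"operations":          "operation",
}

// entityTypeForPath returns the label for a request path such as
// "/1.0/instances/c1". Paths without an endpoint family, or with one that
// is not in the table, are assumed to belong to the server type.
func entityTypeForPath(path string) string {
	segments := strings.Split(strings.Trim(path, "/"), "/")
	// segments[0] is the API version, segments[1] the endpoint family.
	if len(segments) < 2 {
		return "server"
	}

	label, ok := entityTypeByEndpoint[segments[1]]
	if !ok {
		return "server"
	}

	return label
}

func main() {
	for _, p := range []string{"/1.0/instances/c1", "/1.0/network-acls", "/1.0/metrics", "/"} {
		fmt.Printf("%s -> %s\n", p, entityTypeForPath(p))
	}
}
```

A table keeps the categorization in one place, so moving an endpoint family such as `certificates` to another type is a one-line change.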

@tomponline There is one caveat worth mentioning. I am using the 400 status on operations to determine whether the request resulted in a server error. But the 400 status may be too broad, since it can also cover some client errors (e.g. trying to add a block device to a container). I am not sure whether I should handle that differently, perhaps by analyzing the operation in more detail instead of just checking the status code.

Signed-off-by: hamistao <[email protected]>

This is useful to mark the request that spawned that operation as completed when the operation is done. Signed-off-by: hamistao <[email protected]>

Uses the callback function when the operation finishes to mark the request that spawned the operation as completed. Signed-off-by: hamistao <[email protected]>
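
The idea in the commit message above, a request staying "ongoing" until the operation it spawned finishes, can be sketched as follows. This is only an illustration under assumptions: `requestTracker`, `operation`, and the `onDone` field are hypothetical names for this sketch and do not reflect the types in `lxd/operations` or `lxd/metrics`.

```go
package main

import (
	"fmt"
	"sync"
)

// requestTracker counts ongoing and completed requests (hypothetical type).
type requestTracker struct {
	mu        sync.Mutex
	ongoing   int
	completed int
}

// begin records that a request has started.
func (t *requestTracker) begin() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.ongoing++
}

// finish moves one request from ongoing to completed.
func (t *requestTracker) finish() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.ongoing--
	t.completed++
}

// snapshot returns the current counter values.
func (t *requestTracker) snapshot() (ongoing int, completed int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.ongoing, t.completed
}

// operation is a stand-in for an asynchronous operation with a completion
// callback (hypothetical type).
type operation struct {
	run    func() error
	onDone func(err error)
}

// start runs the operation in the background and calls onDone once it has
// finished. The returned channel is closed after the callback has run.
func (o *operation) start() <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		err := o.run()
		if o.onDone != nil {
			o.onDone(err)
		}
	}()

	return done
}

func main() {
	tracker := &requestTracker{}

	// The request arrives and spawns an operation.
	tracker.begin()
	op := &operation{
		run:    func() error { return nil },
		onDone: func(error) { tracker.finish() },
	}
	done := op.start()

	// The HTTP handler would return here; the request is only marked as
	// completed when the operation's callback fires.
	<-done

	ongoing, completed := tracker.snapshot()
	fmt.Println("ongoing:", ongoing, "completed:", completed)
}
```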

Signed-off-by: hamistao <[email protected]>

As a consequence of the introduction of the parameter on Render, those fields have become obsolete and should be replaced and removed. Signed-off-by: hamistao <[email protected]>

Signed-off-by: hamistao <[email protected]>

This ensures that: 1. Internal metrics are not cached and are always updated. Before this change, the new values were computed but only the older cached values were included in the endpoint output. 2. Internal metrics are included when there are no instances in the default project. Previously, if no instances were present, the metric set for the default project was not initialized, so the internal metrics had no set to be included in. This is included in this PR so that the tests won't fail due to the metrics' values not being updated quickly enough. Signed-off-by: hamistao <[email protected]>

Signed-off-by: hamistao <[email protected]>

hamistao · 2024-09-06T07:02:27Z

@tomponline The requested changes were made.

tomponline · 2024-09-06T07:12:34Z

doc/reference/provided_metrics.md

+
+The API rates metrics include `lxd_api_requests_completed_total` and `lxd_api_requests_ongoing`. These metrics can be consumed by an observability tool deployed externally (for example, the [Canonical Observability Stack](https://charmhub.io/topics/canonical-observability-stack) or another third-party tool) to help identify failures or overload on a LXD server. You can set thresholds on the observability tools for these metrics' values to trigger alarms and take programmatic actions.
+
+These metrics consider all endpoints in the [LXD REST API](../api), with the exception of the `/` endpoint. Requests using an invalid URL are also counted. Requests against the metrics server are also counted. Both introduced metrics include a label `entity_type` based on the main entity type that the endpoint is operating on.


> Requests against the metrics server are also counted.

What entity type is used for this?

Since the type is determined by the endpoint, and the metrics server only handles the /1.0 and metrics endpoints, it would be TypeServer, the same as the regular REST server.

tomponline

Thanks!

changes addressed

Starting from canonical/lxd#13825, `Render()` must also be passed the request, in addition to the response writer.

github-actions bot added the "Documentation" (Documentation needs updating) and "API" (Changes to the REST API) labels on Jul 26, 2024