[Feature Request] Latency metrics in ClusterManager [Appliers, Listeners, Reroute..] #12332

Open
gargharsh3134 opened this issue Feb 15, 2024 · 1 comment
Labels
Cluster Manager enhancement Enhancement or improvement to existing feature or request Telemetry:Metrics PRs or issues specific to telemetry metrics framework

Comments

@gargharsh3134
Contributor

gargharsh3134 commented Feb 15, 2024

Is your feature request related to a problem? Please describe

With the introduction of the Request Tracing Framework (RTF) using OpenTelemetry (OTel), histogram and counter metrics can now be published and used to track high-latency operations. This issue tracks the instrumentation needed to add latency metrics to ClusterManager, which can help identify scaling bottlenecks.

The following metrics can be added to start with:

  1. Committing any change in ClusterState involves running Appliers and Listeners, which are supposed to be very lightweight operations. Tracking latency for these operations will help identify bottlenecks that slow down ClusterManager's processing of the pending tasks queue.

  2. Latency of the reroute operation.

  3. Latency of computing the new cluster state after any change, and the time taken to successfully publish that state to other nodes.
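
The measurements listed above share one pattern: time a unit of work and hand the elapsed duration to a recorder. Below is a minimal sketch of that pattern, not the actual ClusterManager code. `LatencyRecorder`, `timed`, and the operation names are hypothetical and only stand in for whatever the telemetry framework provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class ClusterManagerLatencySketch {

    /** Hypothetical sink for latency samples; stands in for a real histogram metric. */
    interface LatencyRecorder {
        void record(String operation, double millis);
    }

    /** Runs the task, records its elapsed time under the given name, and returns its result. */
    static <T> T timed(String operation, LatencyRecorder recorder, Supplier<T> task) {
        long start = System.nanoTime();
        try {
            return task.get();
        } finally {
            // Recorded in the finally block so that failed runs are measured too.
            double millis = (System.nanoTime() - start) / 1_000_000.0;
            recorder.record(operation, millis);
        }
    }

    public static void main(String[] args) {
        List<String> recordedOperations = new ArrayList<>();
        LatencyRecorder recorder = (operation, millis) -> recordedOperations.add(operation);

        // Illustrative operation names only; the lambdas are placeholders for real work.
        String newState = timed("cluster_state.compute", recorder, () -> "state-v2");
        timed("cluster_state.publish", recorder, () -> newState.length());
        timed("cluster_state.appliers", recorder, () -> true);
        timed("cluster_state.listeners", recorder, () -> true);
        timed("reroute", recorder, () -> true);

        System.out.println(recordedOperations);
    }
}
```

Passing the operation name as a dimension keeps a single recorder usable for appliers, listeners, reroute, and state computation or publication. Whether these should be one metric with tags or several separate metrics is a design choice this sketch leaves open.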

Describe the solution you'd like

OTel Histogram Metrics: Support for histogram-type metrics, added as part of #12062, can be used to publish the latency metrics for each use case.
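
To illustrate why a histogram suits latency data, here is a self-contained sketch of a fixed-bucket histogram. It is not the OpenSearch or OpenTelemetry API: the class name, method names, and bucket boundaries are assumptions chosen for illustration, and a real implementation would obtain its histogram from the telemetry framework instead.

```java
import java.util.Arrays;

public class LatencyHistogramSketch {
    // Inclusive upper bounds in milliseconds, assumed to be sorted ascending.
    private final double[] upperBoundsMillis;
    // One counter per bound, plus a final overflow bucket for larger values.
    private final long[] counts;

    LatencyHistogramSketch(double... upperBoundsMillis) {
        this.upperBoundsMillis = upperBoundsMillis.clone();
        this.counts = new long[upperBoundsMillis.length + 1];
    }

    /** Increments the first bucket whose upper bound is >= the sample. Not thread-safe. */
    void record(double millis) {
        int bucket = 0;
        while (bucket < upperBoundsMillis.length && millis > upperBoundsMillis[bucket]) {
            bucket++;
        }
        counts[bucket]++;
    }

    long[] counts() {
        return counts.clone();
    }

    public static void main(String[] args) {
        LatencyHistogramSketch histogram = new LatencyHistogramSketch(1, 10, 100);
        for (double millis : new double[] {0.4, 0.9, 3, 250}) {
            histogram.record(millis);
        }
        // Buckets: <=1 ms, <=10 ms, <=100 ms, >100 ms
        System.out.println(Arrays.toString(histogram.counts())); // [2, 1, 0, 1]
    }
}
```

The single 250 ms sample stays visible in the overflow bucket, whereas an average over the four samples would hide how fast the other three were. That tail visibility is what makes a histogram useful for spotting an occasional slow applier or listener.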

Related component

Cluster Manager

Describe alternatives you've considered

No response

Additional context

No response

@gargharsh3134 gargharsh3134 added enhancement Enhancement or improvement to existing feature or request untriaged labels Feb 15, 2024
@peternied peternied added the Telemetry:Metrics PRs or issues specific to telemetry metrics framework label Feb 21, 2024
@peternied
Member

[Triage - attendees 1 2 3 4 5 6]
@gargharsh3134 Thanks for filing, looking forward to this improvement

@gargharsh3134 gargharsh3134 changed the title [Feature Request] Latency metrics in ClusterApplierService [Feature Request] Latency metrics in ClusterManager [Appliers, Listeners, Reroute..] Mar 15, 2024
@rwali-aws rwali-aws moved this from 🆕 New to Now(This Quarter) in Cluster Manager Project Board Apr 22, 2024