Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add solver metrics #435

Merged
merged 13 commits into from
Nov 21, 2024
Merged

feat: Add solver metrics #435

merged 13 commits into from
Nov 21, 2024

Conversation

bgins
Copy link
Contributor

@bgins bgins commented Nov 13, 2024

Summary

This pull request makes the following changes:

  • Add solver metrics options
  • Add metrics config and provider
  • Configure solver with metrics
  • Add meter to solver controller
  • Add solver system metrics
  • Add solver deal state metrics
  • Add solver matcher metrics

This pull request also includes a new solver store method and two minor refactors to support the work above:

  • Add GetDealsAll solver store method
  • Initialize tracer provider with noop provider
  • Lift resource instantiation to SetupOTelSDK

We would like improved metrics observability on the solver to monitor system and process performance, deal matching stats, and deal status stats.

Task/Issue reference

Closes: #434

Test plan

Start the observability server on the bgins/feat-add-solver-dashboards branch. Open localhost:3000 in a browser and select "Dashboards" from the hamburger menu on the left.

Two dashboards have been added:

  • Solver System Metrics (system and process metrics)
  • Solver Metrics (deal state metrics, matcher metrics)

Start the stack. Run the solver with explicit metrics configuration:

export ENABLE_METRICS=true
export METRICS_URL="http://localhost:8500"
export METRICS_TOKEN="eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJhdXRob3JpemVkIjp0cnVlLCJ1c2VyIjoicmVzb3VyY2UtcHJvdmlkZXIifQ.n36M_ngwC4XPQ_pEkkWAnPiOinnx6-0VO1v_WgCTUEERD7b_p9KHCU6SY5bUdFh5UXRZHAhc1gfyc7rjAnmeDQ"
./stack solver

Run a job. Metrics are collected and sent in batches once per minute, so it may take a moment before metrics are reported.

Once reported, the Solver System Metrics dashboard should display standard metrics for system and process performance.

CleanShot 2024-11-13 at 16 21 13@2x

The Solver Metrics dashboard displays deal state metrics. For example, in a local run we can see deal states over a couple jobs:

deal-states

The Solver Metrics dashboard also displays job offer, resource offer, and deal counts.

CleanShot 2024-11-13 at 16 20 17@2x

Details

This pull request adds metrics to the solver. We may want to add the system metrics for the resource provider in the future, but we start with the solver to test the implementation before making it broadly available. It should be easy to add to resource providers in the future, though we may want to revisit some of the configuration decisions made here.

Configuration

This pull request configures metrics with MetricsOptions and MetricsConfig. We have kept configuration separate from TelemetryOptions and TelemetryConfig for a couple of reasons:

  • We only want metrics config for the solver
  • We may send metrics and logs to a local collector like FluentBit, while still sending traces directly to the observability instance

We may want to revisit configuration in the future, but this approach is a first iteration while we consider the bigger picture of where and how metrics will be used.

Metrics must be configured explicitly with environment variables or command line options. My thinking here is that these metrics are mostly interesting in production, and it's easy enough to configure them in our deployments.

System Metrics

This pull request includes an initial set of system metrics including:

  • System uptime
  • System load average (1, 5, 15)
  • System CPU used percent and logical CPU count
  • System memory used percent, used, available, and total
  • System network connections count, bytes transmitted, bytes received
  • System filesystem used percent, used, free, total, inodes used percent
  • Process uptime, memory used percent, CPU used percent, connections count, num threads used

Solver matching metrics

This pull request records the following matcher metrics:

  • Resource offers available to match
  • Job offers available to match
  • Deals made during matching

These metrics are recorded once per solver control loop iteration.

Solver deal state metrics

This pull request records deal state metrics once per solver control loop iteration. This should be considered a form of sampling because deal states are updated outside of control loop iteration.

Eventually, we should implement UpDownCounters for each deal state when we have a better handle on the job lifecycle and its state transitions.

Computing deal states once per control loop iteration may have performance implications. On my local runs with a single job, the traces report ~12ms of time spent computing deal state metrics. Likely to be more in production, we can keep an eye on it.

Related issues or PRs

Epic: https://github.com/Lilypad-Tech/internal/issues/345
Solver metrics dashboards: https://github.com/Lilypad-Tech/observability/pull/22
Future work: #439

@cla-bot cla-bot bot added the cla-signed label Nov 13, 2024
@bgins bgins force-pushed the bgins/feat-solver-otel-metrics branch 2 times, most recently from 51e2ae7 to 80a08ad Compare November 14, 2024 18:41
@bgins bgins marked this pull request as ready for review November 14, 2024 19:04
@bgins bgins requested a review from a team as a code owner November 14, 2024 19:04
@bgins bgins self-assigned this Nov 14, 2024
Comment on lines 86 to 88
if options.Enable {
if len(options.URL) == 0 {
return fmt.Errorf("No metrics endpoint specified - please use METRICS_URL or --metrics-url")
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Is there anything else we might want to do if options.Enable is true? If not maybe we can combine the two if statements together

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good call! Simplified this check. 👌

return err
}
reportDealMetrics(ctx, controller.meter, storedDeals)
span.AddEvent("report_deal_metrics.done")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More of a question here but in the case of the err block e=being entered on line 309, would otel be smart enough to still add the report_deal_metrics.done span or is there another event upstream that will highlight the error state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh good catch! It would not record the report_deal_metrics.done span event. Also realized I wasn't handling the reportDealMetrics error.

Made some updates and now it looks like:

span.AddEvent("report_deal_metrics.start")
storedDeals, err := controller.store.GetDealsAll()
if err != nil {
span.SetStatus(codes.Error, "get all deals failed")
span.RecordError(err)
return err
}
err = reportDealMetrics(ctx, controller.meter, storedDeals)
if err != nil {
span.SetStatus(codes.Error, "report deal metrics failed")
span.RecordError(err)
}
span.AddEvent("report_deal_metrics.done")

Note that we don't return an error when reportDealMetrics fails, we just record it in the trace. It's also logged in the reportDealMetrics function body.

Copy link
Collaborator

@walkah walkah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, tested locally and ... got metrics! 🙌 🚀

Screenshot 2024-11-18 at 3 32 31 PM Screenshot 2024-11-18 at 3 32 22 PM

@bgins bgins force-pushed the bgins/feat-solver-otel-metrics branch 3 times, most recently from 360f3d2 to ee55054 Compare November 19, 2024 21:18
@bgins bgins force-pushed the bgins/feat-solver-otel-metrics branch from ee55054 to e36bd3a Compare November 21, 2024 21:16
@bgins bgins merged commit bff3def into main Nov 21, 2024
3 checks passed
@bgins bgins deleted the bgins/feat-solver-otel-metrics branch November 21, 2024 22:14
walkah pushed a commit that referenced this pull request Dec 10, 2024
* refactor: Initialize tracer provider with noop provider

* refactor: Lift resource instantiation to SetupOTelSDK

We will also want to use this resource for meter and log providers.

* feat: Add solver metrics options

* feat: Add metrics config and provider

* chore: Configure services with metrics (or not)

We only configure metrics on the solver for now.

* feat: Add meter to solver controller

* feat: Add solver matcher metrics

* feat: Add solver system metrics

* feat: Add GetDealsAll store method

* feat: Add solver deal state metrics

* fix: Acquire lock in GetDealsAll

* refactor: Simplify metrics options check

* chore: Add report deal metrics trace errors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add solver system and deal making metrics
3 participants