feat: Add solver metrics #435
Conversation
Force-pushed from 51e2ae7 to 80a08ad
pkg/options/otel.go (outdated)

```go
if options.Enable {
	if len(options.URL) == 0 {
		return fmt.Errorf("No metrics endpoint specified - please use METRICS_URL or --metrics-url")
	}
}
```
[nit] Is there anything else we might want to do if `options.Enable` is true? If not, maybe we can combine the two if statements.
Yeah good call! Simplified this check. 👌
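The simplified check presumably collapses the nested conditions into one. A minimal sketch of that shape (the `MetricsOptions` struct here is a stand-in with only the two fields visible in the snippet above):

```go
package main

import (
	"errors"
	"fmt"
)

// MetricsOptions is a stand-in mirroring the fields used in the snippet above.
type MetricsOptions struct {
	Enable bool
	URL    string
}

// checkMetricsOptions combines the two nested if statements into a single
// condition: an endpoint is only required when metrics are enabled.
func checkMetricsOptions(options MetricsOptions) error {
	if options.Enable && len(options.URL) == 0 {
		return errors.New("no metrics endpoint specified - please use METRICS_URL or --metrics-url")
	}
	return nil
}

func main() {
	fmt.Println(checkMetricsOptions(MetricsOptions{Enable: true}) != nil)                          // enabled, no URL: error
	fmt.Println(checkMetricsOptions(MetricsOptions{Enable: true, URL: "localhost:4318"}) == nil)   // enabled with URL: ok
	fmt.Println(checkMetricsOptions(MetricsOptions{}) == nil)                                      // disabled: ok
}
```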
```go
	return err
}
reportDealMetrics(ctx, controller.meter, storedDeals)
span.AddEvent("report_deal_metrics.done")
```
More of a question here, but if the err block on line 309 is entered, would otel be smart enough to still add the `report_deal_metrics.done` span event, or is there another event upstream that will highlight the error state?
Ooh good catch! It would not record the `report_deal_metrics.done` span event. Also realized I wasn't handling the `reportDealMetrics` error.

Made some updates and now it looks like:
lilypad/pkg/solver/controller.go, lines 307 to 319 in 5e522c5:
```go
span.AddEvent("report_deal_metrics.start")
storedDeals, err := controller.store.GetDealsAll()
if err != nil {
	span.SetStatus(codes.Error, "get all deals failed")
	span.RecordError(err)
	return err
}
err = reportDealMetrics(ctx, controller.meter, storedDeals)
if err != nil {
	span.SetStatus(codes.Error, "report deal metrics failed")
	span.RecordError(err)
}
span.AddEvent("report_deal_metrics.done")
```
Note that we don't return an error when `reportDealMetrics` fails; we just record it in the trace. It's also logged in the `reportDealMetrics` function body.
Force-pushed from 360f3d2 to ee55054
We will also want to use this resource for meter and log providers.
We only configure metrics on the solver for now.
Force-pushed from ee55054 to e36bd3a
* refactor: Initialize tracer provider with noop provider
* refactor: Lift resource instantiation to SetupOTelSDK (we will also want to use this resource for meter and log providers)
* feat: Add solver metrics options
* feat: Add metrics config and provider
* chore: Configure services with metrics (or not) (we only configure metrics on the solver for now)
* feat: Add meter to solver controller
* feat: Add solver matcher metrics
* feat: Add solver system metrics
* feat: Add GetDealsAll store method
* feat: Add solver deal state metrics
* fix: Acquire lock in GetDealsAll
* refactor: Simplify metrics options check
* chore: Add report deal metrics trace errors
Summary
This pull request makes the following changes:

* Metrics configuration via `MetricsOptions` and `MetricsConfig`
* Solver system metrics
* Solver matching metrics
* Solver deal state metrics

This pull request also includes a new solver store method and two minor refactors to support the work above:

* `GetDealsAll` solver store method
* Initialize tracer provider with noop provider
* Lift resource instantiation to `SetupOTelSDK`

We would like improved metrics observability on the solver to monitor system and process performance, deal matching stats, and deal status stats.
Task/Issue reference
Closes: #434
Test plan
Start the observability server on the `bgins/feat-add-solver-dashboards` branch. Open `localhost:3000` in a browser and select "Dashboards" from the hamburger menu on the left. Two dashboards have been added:
Start the stack. Run the solver with explicit metrics configuration:
Run a job. Metrics are collected and sent in batches once per minute, so it may take a moment before metrics are reported.
Once reported, the Solver System Metrics dashboard should display standard metrics for system and process performance.
The Solver Metrics dashboard displays deal state metrics. For example, in a local run we can see deal states over a couple jobs:
The Solver Metrics dashboard also displays job offer, resource offer, and deal counts.
Details
This pull request adds metrics to the solver. We may want to add the system metrics for the resource provider in the future, but we start with the solver to test the implementation before making it broadly available. It should be easy to add to resource providers in the future, though we may want to revisit some of the configuration decisions made here.
Configuration
This pull request configures metrics with `MetricsOptions` and `MetricsConfig`. We have kept configuration separate from `TelemetryOptions` and `TelemetryConfig` for a couple of reasons. We may want to revisit configuration in the future, but this approach is a first iteration while we consider the bigger picture of where and how metrics will be used.
Metrics must be configured explicitly with environment variables or command line options. My thinking here is that these metrics are mostly interesting in production, and it's easy enough to configure them in our deployments.
System Metrics
This pull request includes an initial set of system metrics including:
Solver matching metrics
This pull request records the following matcher metrics:
These metrics are recorded once per solver control loop iteration.
Solver deal state metrics
This pull request records deal state metrics once per solver control loop iteration. This should be considered a form of sampling because deal states are updated outside of control loop iteration.
Eventually, we should implement UpDownCounters for each deal state when we have a better handle on the job lifecycle and its state transitions.
Computing deal states once per control loop iteration may have performance implications. On my local runs with a single job, the traces report ~12ms of time spent computing deal state metrics. This will likely be higher in production, so we can keep an eye on it.
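The per-iteration sampling described above amounts to tallying the stored deals by state and reporting each tally as a measurement. A minimal sketch of the counting step, assuming a simplified `Deal` record and hypothetical state names (the real solver types and state set are not shown in this PR):

```go
package main

import "fmt"

// Deal is a stand-in for the solver's stored deal record; only the
// State field matters for this sketch.
type Deal struct {
	ID    string
	State string
}

// countDealStates tallies deals by state, the core of the per-iteration
// deal state sampling. Because it runs once per control loop iteration
// while states change outside the loop, the result is a snapshot, not an
// exact event count.
func countDealStates(deals []Deal) map[string]int {
	counts := make(map[string]int)
	for _, d := range deals {
		counts[d.State]++
	}
	return counts
}

func main() {
	// Hypothetical state names for illustration only.
	deals := []Deal{
		{ID: "a", State: "DealNegotiating"},
		{ID: "b", State: "DealAgreed"},
		{ID: "c", State: "DealAgreed"},
	}
	fmt.Println(countDealStates(deals))
}
```

Moving to UpDownCounters, as noted above, would replace this snapshotting with increments and decrements recorded at each state transition, which is exact but requires instrumenting every transition point.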
Related issues or PRs
Epic: https://github.com/Lilypad-Tech/internal/issues/345
Solver metrics dashboards: https://github.com/Lilypad-Tech/observability/pull/22
Future work: #439