Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add beholder metrics on workflow engine #15238

Merged
merged 14 commits into from
Nov 27, 2024

Conversation

vyzaldysanchez
Copy link
Contributor

@vyzaldysanchez vyzaldysanchez commented Nov 14, 2024

Summary

Adds further instrumentation to the engine. Adding labels. Putting workflows running gauge in the heartbeat to remove misleading label association.

Fixes known potential data race observed here: https://github.com/smartcontractkit/chainlink/actions/runs/11983046999/job/33412020087?pr=15238

The histograms are eventually meant to replace the workflows running + workflow execution latency gauges, but require the follow-ups below before the old ones are removed.

Follow-ups:

KS-580, KS-581

Copy link
Contributor

github-actions bot commented Nov 23, 2024

AER Report: CI Core ran successfully ✅

aer_workflow , commit

AER Report: Operator UI CI ran successfully ✅

aer_workflow , commit

patrickhuie19
patrickhuie19 previously approved these changes Nov 26, 2024
core/services/workflows/engine.go Outdated Show resolved Hide resolved
@@ -1263,7 +1291,7 @@ func NewEngine(ctx context.Context, cfg Config) (engine *Engine, err error) {
engine = &Engine{
cma: cma,
logger: cfg.Lggr.Named("WorkflowEngine").With("workflowID", cfg.WorkflowID),
metrics: workflowsMetricLabeler{metrics.NewLabeler().With(platform.KeyWorkflowID, cfg.WorkflowID, platform.KeyWorkflowOwner, cfg.WorkflowOwner, platform.KeyWorkflowName, cfg.WorkflowName)},
metrics: workflowsMetricLabeler{metrics.NewLabeler().With(platform.KeyWorkflowID, cfg.WorkflowID, platform.KeyWorkflowOwner, cfg.WorkflowOwner, platform.KeyWorkflowName, cfg.WorkflowName), *em},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe an extra nil check for extra safety?

core/services/workflows/engine.go Show resolved Hide resolved
@patrickhuie19 patrickhuie19 requested review from bolekk and a team November 27, 2024 20:44
@vyzaldysanchez vyzaldysanchez added this pull request to the merge queue Nov 27, 2024
github-merge-queue bot pushed a commit that referenced this pull request Nov 27, 2024
* Adds metrics on workflow engine

* Adds trigger event metric

* Removes comment

* metrics: execution duration histograms by status and removing now redundant instrumentation

* adding step execution time histogram

* fixing data race for global instruments

* cleanup + fixing tests

* renaming vars somehow fixes broken test

* removing short circuit in workferForStepRequest if Vertex call fails

* nil guard if Vertex errs

* updating workflow.name to workflow.hexName and fixing err log

---------

Co-authored-by: patrickhuie19 <[email protected]>
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 27, 2024
@patrickhuie19 patrickhuie19 added this pull request to the merge queue Nov 27, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 27, 2024
@patrickhuie19 patrickhuie19 added this pull request to the merge queue Nov 27, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 27, 2024
@bolekk bolekk added this pull request to the merge queue Nov 27, 2024
Merged via the queue into develop with commit 539674a Nov 27, 2024
146 of 147 checks passed
@bolekk bolekk deleted the task/CAPPL-283/engine-beholder-metrics branch November 27, 2024 23:32
bolekk pushed a commit that referenced this pull request Dec 11, 2024
* Adds metrics on workflow engine

* Adds trigger event metric

* Removes comment

* metrics: execution duration histograms by status and removing now redundant instrumentation

* adding step execution time histogram

* fixing data race for global instruments

* cleanup + fixing tests

* renaming vars somehow fixes broken test

* removing short circuit in workferForStepRequest if Vertex call fails

* nil guard if Vertex errs

* updating workflow.name to workflow.hexName and fixing err log

---------

Co-authored-by: patrickhuie19 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants