fix potential panic due to race condition and add slow consumer handling in TriggerSubscriber #15772
base: develop
Conversation
AER Report: CI Core (aer_workflow, commit, Detect Changes, Scheduled Run Frequency, Clean Go Tidy & Generate, Flakeguard Root Project / Get Tests To Run, Core Tests (go_core_tests), GolangCI Lint, Core Tests (go_core_tests_integration), Core Tests (go_core_ccip_deployment_tests), Core Tests (go_core_fuzz), Core Tests (go_core_race_tests), test-scripts, Flakeguard Deployment Project, Flakeguard Root Project / Run Tests (github.com/smartcontractkit/chainlink/v2/core/capabilities/remote, ubuntu-latest), lint, Flakeguard Root Project / Report, SonarQube Scan, Flakey Test Detection)
1. GolangCI Lint errors (Golang Lint):
Why: The first error indicates that the GolangCI Lint tool cannot find the main module or its dependencies in the specified directory. The second error is a permission issue when trying to create the output file.
Suggested fix: Ensure that the working directory is correctly set to the location of the main module and its dependencies. Additionally, verify that the directory has the necessary write permissions to create the output file.
	// Registrations will quickly expire on all remote nodes.
	// Alternatively, we could send UnregisterTrigger messages right away.
	return nil
}

func (s *triggerSubscriber) closeSubscription(workflowID string) {
Add a comment that this must be called under s.mu lock.
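For illustration, a minimal self-contained sketch of how that precondition comment could look; the types and field names below are placeholders, not the real triggerSubscriber:

package remote

import "sync"

type sketchRegistration struct {
	callback chan []byte // payload type is a placeholder
}

type triggerSubscriberSketch struct {
	mu                  sync.Mutex
	registeredWorkflows map[string]*sketchRegistration
}

// closeSubscription removes the registration for workflowID and closes its
// callback channel.
//
// NOTE: must be called with s.mu held; it does not take the lock itself.
func (s *triggerSubscriberSketch) closeSubscription(workflowID string) {
	r, ok := s.registeredWorkflows[workflowID]
	if !ok {
		return
	}
	close(r.callback)
	delete(s.registeredWorkflows, workflowID)
}

// A typical caller takes the lock around the call:
func (s *triggerSubscriberSketch) unregister(workflowID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.closeSubscription(workflowID)
}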
	case registration.callback <- response:
	default:
		s.lggr.Warn("slow consumer detected, closing subscription", "capabilityId", s.capInfo.ID, "workflowId", workflowID)
		s.closeSubscription(workflowID)
I don't think we should close it. We're not giving the engine a chance to recover. I admit that with a channel size of 1000 it's pretty unlikely for it to recover. But still, maybe some large buffering happens in the network somewhere and we'll get a flood of past requests. I'd rather not get into a state that is not recoverable, apart from a node restart. How about we just log an error here?
You mean log an error and drop the response? Thinking purely in terms of current use cases, logging and dropping the response is the better solution, so that may be the way to go. My concern was that you may see some hard-to-diagnose unexpected behaviours further downstream if you intermittently drop responses in different non-price-feed contexts, and in that scenario it might be cleaner to break the subscription. Ideally, if the code were aware of the type of data, it could conflate in the case of price feeds, but at this level it is data-type agnostic.
We should probably report the buffer's % filled as a metric and monitor it; it would be a good indicator that the system is running too hot or that there's a bottleneck in a given workflow. I'll add that in.
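As an aside on the conflation idea mentioned above, a minimal self-contained sketch of what conflating price-feed-like data could look like (keep only the latest value per feed); all names are placeholders and this is not part of the PR:

package remote

import "sync"

// latestValueConflater keeps only the most recent payload per feed instead of
// queueing every update, so a slow consumer sees the freshest data on catch-up.
type latestValueConflater struct {
	mu     sync.Mutex
	latest map[string][]byte // feed ID -> most recent payload
}

func newLatestValueConflater() *latestValueConflater {
	return &latestValueConflater{latest: make(map[string][]byte)}
}

// Offer records the newest payload for a feed, overwriting any older one the
// consumer has not picked up yet.
func (c *latestValueConflater) Offer(feedID string, payload []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.latest[feedID] = payload
}

// Take drains everything the consumer has not seen yet.
func (c *latestValueConflater) Take() map[string][]byte {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.latest
	c.latest = make(map[string][]byte)
	return out
}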
Yes, definitely, let's add a metric for % filled and a loud error when dropping.
I don't think it makes sense to end up in a state that requires a node restart. That's why I don't want to cancel a sub. Firing on a subset of trigger events doesn't seem worse than not firing at all. Especially in cases where this problem affects a single node, not all of them at once.
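For illustration, a minimal self-contained sketch of the direction agreed above (log loudly and drop on a full buffer, plus a buffer-fill metric), not the actual PR change; the types, the metrics interface, and the logging call are placeholders:

package remote

import "log"

type dispatchRegistration struct {
	callback chan []byte // buffered channel; size is a placeholder
}

type slowConsumerMetrics interface {
	RecordCallbackBufferFill(workflowID string, pct float64)
}

type dispatchSketch struct {
	registrations map[string]*dispatchRegistration
	metrics       slowConsumerMetrics
}

// dispatch delivers a trigger event without blocking. On a full buffer it logs
// loudly and drops the event instead of closing the subscription, so a
// temporarily slow engine can still recover without a node restart.
func (s *dispatchSketch) dispatch(workflowID string, payload []byte) {
	reg, ok := s.registrations[workflowID]
	if !ok {
		return
	}
	select {
	case reg.callback <- payload:
	default:
		log.Printf("ERROR: slow consumer detected, dropping trigger event for workflow %s", workflowID)
	}
	// Report how full the callback buffer is (0-100) so operators can spot a
	// workflow running hot before events start being dropped.
	if c := cap(reg.callback); c > 0 {
		s.metrics.RecordCallbackBufferFill(workflowID, float64(len(reg.callback))/float64(c)*100)
	}
}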
Resolves: https://smartcontract-it.atlassian.net/browse/CAPPL-415