fix potential panic due to race condition and add slow consumer handling in TriggerSubscriber #15772
base: develop
Conversation
AER Report: CI Core (aer_workflow, commit, Detect Changes, Scheduled Run Frequency, Clean Go Tidy & Generate, Flakeguard Root Project / Get Tests To Run, Core Tests (go_core_tests), GolangCI Lint, Core Tests (go_core_tests_integration), Core Tests (go_core_ccip_deployment_tests), Core Tests (go_core_fuzz), Core Tests (go_core_race_tests), test-scripts, Flakeguard Deployment Project, Flakeguard Root Project / Run Tests (github.com/smartcontractkit/chainlink/v2/core/capabilities/remote, ubuntu-latest), lint, Flakeguard Root Project / Report, SonarQube Scan, Flakey Test Detection)
1. GolangCI Lint errors (Golang Lint):
Why: The first error indicates that the GolangCI Lint tool cannot find the main module or its dependencies in the specified directory. The second error is a permission issue when trying to create the output file.
Suggested fix: Ensure that the working directory is correctly set to the location of the main module and its dependencies. Additionally, verify that the directory has the necessary write permissions to create the output file.
	// Registrations will quickly expire on all remote nodes.
	// Alternatively, we could send UnregisterTrigger messages right away.
	return nil
}

func (s *triggerSubscriber) closeSubscription(workflowID string) {
Add a comment that this must be called under s.mu lock.
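For illustration, a minimal self-contained sketch of how that precondition comment could look; the types and field names below are placeholders, not the real triggerSubscriber:

package remote

import "sync"

type sketchRegistration struct {
	callback chan []byte // payload type is a placeholder
}

type triggerSubscriberSketch struct {
	mu                  sync.Mutex
	registeredWorkflows map[string]*sketchRegistration
}

// closeSubscription removes the registration for workflowID and closes its
// callback channel.
//
// NOTE: must be called with s.mu held; it does not take the lock itself.
func (s *triggerSubscriberSketch) closeSubscription(workflowID string) {
	r, ok := s.registeredWorkflows[workflowID]
	if !ok {
		return
	}
	close(r.callback)
	delete(s.registeredWorkflows, workflowID)
}

// A typical caller takes the lock around the call:
func (s *triggerSubscriberSketch) unregister(workflowID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.closeSubscription(workflowID)
}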
	case registration.callback <- response:
	default:
		s.lggr.Warn("slow consumer detected, closing subscription", "capabilityId", s.capInfo.ID, "workflowId", workflowID)
		s.closeSubscription(workflowID)
I don't think we should close it. We're not giving the engine a chance to recover. I admit that with a channel size of 1000 it's pretty unlikely for it to recover. But still, maybe some large buffering happens in the network somewhere and we'll get a flood of past requests. I'd rather not get into a state that is not recoverable, apart from a node restart. How about we just log an error here?
You mean log an error and drop the response? Thinking purely in terms of current use cases, logging and dropping the response is the better solution, so that may be the way to go. My concern was that you may see some hard-to-diagnose unexpected behaviours further downstream if you intermittently drop responses in different non-price-feed contexts, and in that scenario it might be cleaner to break the subscription. Ideally, if the code were aware of the type of data, it could conflate in the case of price feeds, but at this level it is data-type agnostic.
We should probably report the buffer's % filled as a metric and monitor it; it would be a good indicator that the system is running too hot or that there's a bottleneck in a given workflow. I'll add that in.
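As an aside on the conflation idea mentioned above, a minimal self-contained sketch of what conflating price-feed-like data could look like (keep only the latest value per feed); all names are placeholders and this is not part of the PR:

package remote

import "sync"

// latestValueConflater keeps only the most recent payload per feed instead of
// queueing every update, so a slow consumer sees the freshest data on catch-up.
type latestValueConflater struct {
	mu     sync.Mutex
	latest map[string][]byte // feed ID -> most recent payload
}

func newLatestValueConflater() *latestValueConflater {
	return &latestValueConflater{latest: make(map[string][]byte)}
}

// Offer records the newest payload for a feed, overwriting any older one the
// consumer has not picked up yet.
func (c *latestValueConflater) Offer(feedID string, payload []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.latest[feedID] = payload
}

// Take drains everything the consumer has not seen yet.
func (c *latestValueConflater) Take() map[string][]byte {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.latest
	c.latest = make(map[string][]byte)
	return out
}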
Yes, definitely, let's add a metric for % filled and a loud error when dropping.
I don't think it makes sense to end up in a state that requires a node restart. That's why I don't want to cancel a sub. Firing on a subset of trigger events doesn't seem worse than not firing at all. Especially in cases where this problem affects a single node, not all of them at once.
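For illustration, a minimal self-contained sketch of the direction agreed above (log loudly and drop on a full buffer, plus a buffer-fill metric), not the actual PR change; the types, the metrics interface, and the logging call are placeholders:

package remote

import "log"

type dispatchRegistration struct {
	callback chan []byte // buffered channel; size is a placeholder
}

type slowConsumerMetrics interface {
	RecordCallbackBufferFill(workflowID string, pct float64)
}

type dispatchSketch struct {
	registrations map[string]*dispatchRegistration
	metrics       slowConsumerMetrics
}

// dispatch delivers a trigger event without blocking. On a full buffer it logs
// loudly and drops the event instead of closing the subscription, so a
// temporarily slow engine can still recover without a node restart.
func (s *dispatchSketch) dispatch(workflowID string, payload []byte) {
	reg, ok := s.registrations[workflowID]
	if !ok {
		return
	}
	select {
	case reg.callback <- payload:
	default:
		log.Printf("ERROR: slow consumer detected, dropping trigger event for workflow %s", workflowID)
	}
	// Report how full the callback buffer is (0-100) so operators can spot a
	// workflow running hot before events start being dropped.
	if c := cap(reg.callback); c > 0 {
		s.metrics.RecordCallbackBufferFill(workflowID, float64(len(reg.callback))/float64(c)*100)
	}
}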
Resolves: https://smartcontract-it.atlassian.net/browse/CAPPL-415