-
Notifications
You must be signed in to change notification settings - Fork 125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make circuit breaker more modular #774
Conversation
Refactors CircuitBreaker to avoid singleton pattern and use injection. This will make testing easier as the package scales. Additionally, group circuit breaker related functionality into single service. Will allow components one centralized place to interact with the circuit breaker. Along with this, it removes some of the utility functions from the KNNSettings class to avoid confusion. Lastly, rename the circuit breaker from KNNCircuitBreaker to NativeMemoryCircuit breaker to be more descriptive. Related to this, move the NativeMemoryCircuitBreaker into the memory package. Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Codecov Report
📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more @@ Coverage Diff @@
## main #774 +/- ##
============================================
+ Coverage 85.14% 85.50% +0.35%
- Complexity 1079 1089 +10
============================================
Files 151 152 +1
Lines 4370 4401 +31
Branches 390 391 +1
============================================
+ Hits 3721 3763 +42
+ Misses 472 461 -11
Partials 177 177
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
qa/rolling-upgrade/src/test/java/org/opensearch/knn/bwc/StatsIT.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/memory/breaker/NativeMemoryCircuitBreakerService.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/memory/breaker/NativeMemoryCircuitBreakerService.java
Outdated
Show resolved
Hide resolved
* circuit breaking logic is enabled, it will trip the circuit. Elsewhere in the code, the circuit breaker's value can | ||
* be queried to prevent actions that should not happen during high memory pressure. | ||
*/ | ||
public class NativeMemoryCircuitBreakerService extends AbstractLifecycleComponent { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the benefit of making this as a LifeCycleComponent? As per some high level look at other LifeCycleComponents I can see that they maintain some components lifecycle and generally they have listeners when the state gets changed like start, stop etc. What is the exact motivation of using this interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I wanted the ability to stop the Scheduler.Cancellable task on close.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you say that you want to stop the task on close, this is generally done when have try with resource. The close function gets called automatically. It's more like calling close function in final block.
So, what you can do is there is another interface in java called as Closeable Interface which provides a function called as close. Or if you want something related to OpenSearch you can use Releasable interface which provides this close function and implements Closeable interface.
AbstractLifecycleComponent doesn't seem correct here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to use Closeable
src/main/java/org/opensearch/knn/index/memory/breaker/NativeMemoryCircuitBreakerService.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/memory/breaker/NativeMemoryCircuitBreakerService.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/memory/breaker/NativeMemoryCircuitBreakerService.java
Outdated
Show resolved
Hide resolved
"[KNN] Cache capacity below 75% of the circuit breaker limit for all nodes." | ||
+ " Unsetting knn.circuit_breaker.triggered flag." | ||
); | ||
nativeMemoryCircuitBreakerService.setCircuitBreaker(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Some thought for better debugging and better failure handling] This is a async call, don't you think we should wait for the call to get succeeded, which will tell that settings are updated or not, so that in case of error we can log and take actions properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by async? This particular line will execute the setting update request and block till it returns. I am not sure, however, if when the settings update request returns, it will be reflected across the cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you go inside these functions you will reach in class src/main/java/org/opensearch/knn/index/KNNSettings.java function updateBooleanSetting:
client.admin().cluster().updateSettings(clusterUpdateSettingsRequest, new ActionListener<>() {
@Override
public void onResponse(ClusterUpdateSettingsResponse clusterUpdateSettingsResponse) {
logger.debug(
"Cluster setting {}, acknowledged: {} ",
clusterUpdateSettingsRequest.persistentSettings(),
clusterUpdateSettingsResponse.isAcknowledged()
);
}
@Override
public void onFailure(Exception e) {
logger.info(
"Exception while updating circuit breaker setting {} to {}",
clusterUpdateSettingsRequest.persistentSettings(),
e.getMessage()
);
logger.info("Exception while updating setting {} to {}", clusterUpdateSettingsRequest.persistentSettings(), e.getMessage());
which is a async call.
nativeMemoryCircuitBreakerService = new NativeMemoryCircuitBreakerService(KNNSettings.state(), threadPool, clusterService, client); | ||
NativeMemoryCacheManager.initialize(nativeMemoryCircuitBreakerService); | ||
// Called after NativeMemoryCacheManager initialization. Start method requires access to an instance of the | ||
// NativeMemoryCacheManager that needs to be initialized. | ||
nativeMemoryCircuitBreakerService.start(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a circular dependency between NativeMemoryCircuitBreakerService and NativeMemoryCacheManager, and the way you have handled this is very error prone, as we need to understand that we cannot start nativeMemoryCircuitBreakerService, before initiating the NativeMemoryCacheManager.
I would recommend breaking this dependency for better code readability and maintainability standpoint. Open for ideas here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a good point. I wasnt sure how to break this, but here is one idea:
NativeMemoryCacheManager relies on NativeMemoryCircuitBreakerService for 2 reasons:
- Get dynamic configuration values (cb limit, unset percentage, time based eviction enabled, etc.)
- Setting the circuit breaker to true if there is an eviction from the cache caused by size (ref)
NativeMemoryCircuitBreakerService relies on NativeMemoryCacheManager for 2 reasons:
- Untrip CacheManager cache capacity reached if size is less than the threshold (ref)
- When run on leader node, the NativeMemoryCircuitBreakerService calls stats API to collect list of nodes with CacheCapacityReached. Internally, the stats API will have a dependency on NativeMemoryCacheManager to get this information (ref). This is used to unset the circuit breaker at cluster level.
Because NativeMemoryCircuitBreakerService needs the stat's API to execute one of its primary functions on untripping circuit breaker, NativeMemoryCircuitBreakerService must have the dependency on NativeMemoryCacheManager.
One option to resolve this would be to separate out the following CB functions into a separate class, NativeMemoryCircuitBreakerUtil or NativeMemoryCircuitBreaker (open to other naming suggestions):
- isCircuitBreakerTriggered
- setCircuitBreaker
- getCircuitBreakerLimit
- isCircuitBreakerEnabled
Then, NativeMemoryCircuitBreakerService and NativeMemoryCacheManager could take a dependency on this class. And then NativeMemoryCircuitBreakerService could also take a dependency on NativeMemoryCacheManager. With this, NativeMemoryCircuitBreakerService could be renamed NativeMemoryCircuitBreakerMonitor, NativeMemoryCircuitBreakerUntripper, NativeMemoryCircuitBreakerUntrippingService (also open to naming suggestions).
Lastly, one more thing we could do is remove NativeMemoryCircuitBreakerService's resonsibility of unsetting CacheCapacityReached and create a separate scheduled thread that does this in a different class. This would remove NativeMemoryCircuitBreakerService's explicit dependency on NativeMemoryCacheManager. The downside to this would be adding another job that needs to be continually run, but the overhead of such a job should be pretty small.
ClusterUpdateSettingsRequest clusterUpdateSettingsRequest = new ClusterUpdateSettingsRequest(); | ||
Settings circuitBreakerSettings = Settings.builder().put(KNNSettings.KNN_CIRCUIT_BREAKER_TRIGGERED, flag).build(); | ||
Settings circuitBreakerSettings = Settings.builder().put(settingName, value).build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what will happen if the setting name is not found? We should handle that case also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case will cause a failure here, which will result in the onFailure handler called:
» [2023-03-28T10:53:04,356][INFO ][o.o.k.i.KNNSettings ] [integTest-0] Exception while updating circuit breaker setting {"invalid-setting-lets-see-what-happens":"true"} to persistent setting [invalid-setting-lets-see-what-happens], not recognized
Updating exception should give a better stack trace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With updated log message:
» java.lang.IllegalArgumentException: persistent setting [invalid-setting-lets-see-what-happens], not recognized
» at org.opensearch.common.settings.AbstractScopedSettings.updateSettings(AbstractScopedSettings.java:863) ~[opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
» at org.opensearch.common.settings.AbstractScopedSettings.updateDynamicSettings(AbstractScopedSettings.java:810) ~[opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
» at org.opensearch.action.admin.cluster.settings.SettingsUpdater.updateSettings(SettingsUpdater.java:111) ~[opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
» at org.opensearch.action.admin.cluster.settings.TransportClusterUpdateSettingsAction$1.execute(TransportClusterUpdateSettingsAction.java:254) ~[opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
» at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:65) ~[opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
» at org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:867) ~[opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
» at org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:424) ~[opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
» at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:295) [opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
» at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:206) [opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
» at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:190) [opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
» at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:228) [opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
» at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
» at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282) [opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
» at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245) [opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
» at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
» at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
» at java.lang.Thread.run(Thread.java:834) [?:?]
public ByteSizeValue getCircuitBreakerLimit() { | ||
return knnSettings.getSettingValue(KNNSettings.KNN_MEMORY_CIRCUIT_BREAKER_LIMIT); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack. will monitor based on discussion above.
public void setCircuitBreaker(boolean circuitBreaker) { | ||
knnSettings.updateBooleanSetting(KNNSettings.KNN_CIRCUIT_BREAKER_TRIGGERED, circuitBreaker); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack. will monitor based on discussion above.
public boolean isCircuitBreakerTriggered() { | ||
return knnSettings.getSettingValue(KNNSettings.KNN_CIRCUIT_BREAKER_TRIGGERED); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a good function to be kept here? what if we move this in KNNSetting class or some other class or a utility class that can provide this info.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem with the KNNSettings class is that it ends up collecting settings from many unrelated components. If a component wants to know if the circuit breaker is triggered, I think it should check from the circuit breaker service as opposed to looking up in the KNNSettings class. In the future, we may use something other than a setting to determine if the circuit breaker is triggered. Encapsulating this function in this service as opposed to KNNSettings would prevent breaking changes to dependent components.
As an aside, in OpenSearch, settings are generally defined in the component they are related to (ref). I think ideally we would move to a pattern like this and change KNNSettings into a general utility class used to look up the value of a particular setting.
public boolean isCircuitBreakerEnabled() { | ||
return knnSettings.getSettingValue(KNNSettings.KNN_MEMORY_CIRCUIT_BREAKER_ENABLED); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack. will monitor based on discussion above.
when(NEVER_TRIGGERED_CB_SERVICE.isCircuitBreakerTriggered()).thenReturn(false); | ||
when(NEVER_TRIGGERED_CB_SERVICE.getCircuitBreakerLimit()).thenReturn(new ByteSizeValue(100, ByteSizeUnit.KB)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need these static mocking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanted to provide an easy to use NativeMemoryCircuitBreakerService that is never tripped.
...est/java/org/opensearch/knn/index/memory/breaker/NativeMemoryCircuitBreakerServiceTests.java
Outdated
Show resolved
Hide resolved
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
791b489
to
e9fa5e4
Compare
Splits the CBService into 2 separate classes. One contains the logic for querying information about the circuit breaker (i.e. whether or not it is tripped). The other contains the logic for running the periodic monitoring workflow. Signed-off-by: John Mazanec <[email protected]>
@navneet1v I refactored the NativeMemoryCircuitBreakerService into 2 classes: NativeMemoryCircuitBreaker and NativeMemoryCircuitBreakerMonitor. This resolves the circular dependency issue. I left out 2 changes that I think we will want to do eventually:
I decided to leave these 2 out to keep this PR simple. I will take them up in a future PR. |
Closing for now. Might come back to this later. |
Description
Overall goal of this PR is to make our Circuit Breaker more modular. With that, it refactors the CircuitBreaker to avoid singleton pattern and use injection. This will make testing easier as the project scales.
Additionally, it groups circuit breaker related functionality into single service. This will give components one centralized place to interact with the circuit breaker. Along with this, it migrates some of the utility functions away from the KNNSettings.
Lastly, rename the circuit breaker from KNNCircuitBreaker to NativeMemoryCircuit breaker to be more descriptive. Related to this, move the NativeMemoryCircuitBreaker into the memory package in its own sub package.
Additionally, this PR adds unit test tests for the NativeMemoryCircuitBreakerService.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.