Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make circuit breaker more modular #774

Closed
wants to merge 8 commits into from

Conversation

jmazanec15
Copy link
Member

Description

Overall goal of this PR is to make our Circuit Breaker more modular. With that, it refactors the CircuitBreaker to avoid singleton pattern and use injection. This will make testing easier as the project scales.

Additionally, it groups circuit breaker related functionality into single service. This will give components one centralized place to interact with the circuit breaker. Along with this, it migrates some of the utility functions away from the KNNSettings.

Lastly, rename the circuit breaker from KNNCircuitBreaker to NativeMemoryCircuit breaker to be more descriptive. Related to this, move the NativeMemoryCircuitBreaker into the memory package in its own sub package.

Additionally, this PR adds unit test tests for the NativeMemoryCircuitBreakerService.

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Refactors CircuitBreaker to avoid singleton pattern and use injection.
This will make testing easier as the package scales.

Additionally, group circuit breaker related functionality into single service.
Will allow components one centralized place to interact with the circuit
breaker. Along with this, it removes some of the utility functions from
the KNNSettings class to avoid confusion.

Lastly, rename the circuit breaker from KNNCircuitBreaker to
NativeMemoryCircuit breaker to be more descriptive. Related to this,
move the NativeMemoryCircuitBreaker into the memory package.

Signed-off-by: John Mazanec <[email protected]>
@jmazanec15 jmazanec15 added the Refactoring Improve the design, structure, and implementation while preserving its functionality label Feb 21, 2023
@jmazanec15 jmazanec15 requested a review from a team February 21, 2023 06:57
Signed-off-by: John Mazanec <[email protected]>
@codecov-commenter
Copy link

codecov-commenter commented Feb 21, 2023

Codecov Report

Merging #774 (1665c3a) into main (f6d3d40) will increase coverage by 0.35%.
The diff coverage is 88.34%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@             Coverage Diff              @@
##               main     #774      +/-   ##
============================================
+ Coverage     85.14%   85.50%   +0.35%     
- Complexity     1079     1089      +10     
============================================
  Files           151      152       +1     
  Lines          4370     4401      +31     
  Branches        390      391       +1     
============================================
+ Hits           3721     3763      +42     
+ Misses          472      461      -11     
  Partials        177      177              
Impacted Files Coverage Δ
...ain/java/org/opensearch/knn/index/KNNSettings.java 87.80% <66.66%> (+3.42%) ⬆️
...ory/breaker/NativeMemoryCircuitBreakerMonitor.java 76.08% <76.08%> (ø)
...nsearch/knn/index/mapper/KNNVectorFieldMapper.java 84.18% <100.00%> (+0.37%) ⬆️
...opensearch/knn/index/mapper/LegacyFieldMapper.java 100.00% <100.00%> (ø)
...opensearch/knn/index/mapper/LuceneFieldMapper.java 75.00% <100.00%> (+0.58%) ⬆️
...opensearch/knn/index/mapper/MethodFieldMapper.java 85.71% <100.00%> (ø)
.../opensearch/knn/index/mapper/ModelFieldMapper.java 71.42% <100.00%> (ø)
...rch/knn/index/memory/NativeMemoryCacheManager.java 96.06% <100.00%> (+0.86%) ⬆️
...dex/memory/breaker/NativeMemoryCircuitBreaker.java 100.00% <100.00%> (ø)
...main/java/org/opensearch/knn/plugin/KNNPlugin.java 100.00% <100.00%> (ø)
... and 2 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

src/main/java/org/opensearch/knn/index/KNNSettings.java Outdated Show resolved Hide resolved
* circuit breaking logic is enabled, it will trip the circuit. Elsewhere in the code, the circuit breaker's value can
* be queried to prevent actions that should not happen during high memory pressure.
*/
public class NativeMemoryCircuitBreakerService extends AbstractLifecycleComponent {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the benefit of making this as a LifeCycleComponent? As per some high level look at other LifeCycleComponents I can see that they maintain some components lifecycle and generally they have listeners when the state gets changed like start, stop etc. What is the exact motivation of using this interface?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I wanted the ability to stop the Scheduler.Cancellable task on close.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you say that you want to stop the task on close, this is generally done when have try with resource. The close function gets called automatically. It's more like calling close function in final block.

So, what you can do is there is another interface in java called as Closeable Interface which provides a function called as close. Or if you want something related to OpenSearch you can use Releasable interface which provides this close function and implements Closeable interface.

AbstractLifecycleComponent doesn't seem correct here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to use Closeable

"[KNN] Cache capacity below 75% of the circuit breaker limit for all nodes."
+ " Unsetting knn.circuit_breaker.triggered flag."
);
nativeMemoryCircuitBreakerService.setCircuitBreaker(false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Some thought for better debugging and better failure handling] This is a async call, don't you think we should wait for the call to get succeeded, which will tell that settings are updated or not, so that in case of error we can log and take actions properly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by async? This particular line will execute the setting update request and block till it returns. I am not sure, however, if when the settings update request returns, it will be reflected across the cluster.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you go inside these functions you will reach in class src/main/java/org/opensearch/knn/index/KNNSettings.java function updateBooleanSetting:

client.admin().cluster().updateSettings(clusterUpdateSettingsRequest, new ActionListener<>() {
            @Override
            public void onResponse(ClusterUpdateSettingsResponse clusterUpdateSettingsResponse) {
                logger.debug(
                    "Cluster setting {}, acknowledged: {} ",
                    clusterUpdateSettingsRequest.persistentSettings(),
                    clusterUpdateSettingsResponse.isAcknowledged()
                );
            }

            @Override
            public void onFailure(Exception e) {
                logger.info(
                    "Exception while updating circuit breaker setting {} to {}",
                    clusterUpdateSettingsRequest.persistentSettings(),
                    e.getMessage()
                );
                logger.info("Exception while updating setting {} to {}", clusterUpdateSettingsRequest.persistentSettings(), e.getMessage());

which is a async call.

Comment on lines 205 to 209
nativeMemoryCircuitBreakerService = new NativeMemoryCircuitBreakerService(KNNSettings.state(), threadPool, clusterService, client);
NativeMemoryCacheManager.initialize(nativeMemoryCircuitBreakerService);
// Called after NativeMemoryCacheManager initialization. Start method requires access to an instance of the
// NativeMemoryCacheManager that needs to be initialized.
nativeMemoryCircuitBreakerService.start();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a circular dependency between NativeMemoryCircuitBreakerService and NativeMemoryCacheManager, and the way you have handled this is very error prone, as we need to understand that we cannot start nativeMemoryCircuitBreakerService, before initiating the NativeMemoryCacheManager.

I would recommend breaking this dependency for better code readability and maintainability standpoint. Open for ideas here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a good point. I wasnt sure how to break this, but here is one idea:

NativeMemoryCacheManager relies on NativeMemoryCircuitBreakerService for 2 reasons:

  1. Get dynamic configuration values (cb limit, unset percentage, time based eviction enabled, etc.)
  2. Setting the circuit breaker to true if there is an eviction from the cache caused by size (ref)

NativeMemoryCircuitBreakerService relies on NativeMemoryCacheManager for 2 reasons:

  1. Untrip CacheManager cache capacity reached if size is less than the threshold (ref)
  2. When run on leader node, the NativeMemoryCircuitBreakerService calls stats API to collect list of nodes with CacheCapacityReached. Internally, the stats API will have a dependency on NativeMemoryCacheManager to get this information (ref). This is used to unset the circuit breaker at cluster level.

Because NativeMemoryCircuitBreakerService needs the stat's API to execute one of its primary functions on untripping circuit breaker, NativeMemoryCircuitBreakerService must have the dependency on NativeMemoryCacheManager.

One option to resolve this would be to separate out the following CB functions into a separate class, NativeMemoryCircuitBreakerUtil or NativeMemoryCircuitBreaker (open to other naming suggestions):

  1. isCircuitBreakerTriggered
  2. setCircuitBreaker
  3. getCircuitBreakerLimit
  4. isCircuitBreakerEnabled

Then, NativeMemoryCircuitBreakerService and NativeMemoryCacheManager could take a dependency on this class. And then NativeMemoryCircuitBreakerService could also take a dependency on NativeMemoryCacheManager. With this, NativeMemoryCircuitBreakerService could be renamed NativeMemoryCircuitBreakerMonitor, NativeMemoryCircuitBreakerUntripper, NativeMemoryCircuitBreakerUntrippingService (also open to naming suggestions).

Lastly, one more thing we could do is remove NativeMemoryCircuitBreakerService's resonsibility of unsetting CacheCapacityReached and create a separate scheduled thread that does this in a different class. This would remove NativeMemoryCircuitBreakerService's explicit dependency on NativeMemoryCacheManager. The downside to this would be adding another job that needs to be continually run, but the overhead of such a job should be pretty small.

ClusterUpdateSettingsRequest clusterUpdateSettingsRequest = new ClusterUpdateSettingsRequest();
Settings circuitBreakerSettings = Settings.builder().put(KNNSettings.KNN_CIRCUIT_BREAKER_TRIGGERED, flag).build();
Settings circuitBreakerSettings = Settings.builder().put(settingName, value).build();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what will happen if the setting name is not found? We should handle that case also.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case will cause a failure here, which will result in the onFailure handler called:

» [2023-03-28T10:53:04,356][INFO ][o.o.k.i.KNNSettings      ] [integTest-0] Exception while updating circuit breaker setting {"invalid-setting-lets-see-what-happens":"true"} to persistent setting [invalid-setting-lets-see-what-happens], not recognized

Updating exception should give a better stack trace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With updated log message:

»  java.lang.IllegalArgumentException: persistent setting [invalid-setting-lets-see-what-happens], not recognized
»       at org.opensearch.common.settings.AbstractScopedSettings.updateSettings(AbstractScopedSettings.java:863) ~[opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
»       at org.opensearch.common.settings.AbstractScopedSettings.updateDynamicSettings(AbstractScopedSettings.java:810) ~[opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
»       at org.opensearch.action.admin.cluster.settings.SettingsUpdater.updateSettings(SettingsUpdater.java:111) ~[opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
»       at org.opensearch.action.admin.cluster.settings.TransportClusterUpdateSettingsAction$1.execute(TransportClusterUpdateSettingsAction.java:254) ~[opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
»       at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:65) ~[opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
»       at org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:867) ~[opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
»       at org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:424) ~[opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
»       at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:295) [opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
»       at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:206) [opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
»       at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:190) [opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
»       at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:228) [opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
»       at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
»       at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282) [opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
»       at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245) [opensearch-2.5.1-SNAPSHOT.jar:2.5.1-SNAPSHOT]
»       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
»       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
»       at java.lang.Thread.run(Thread.java:834) [?:?]

Comment on lines 90 to 92
public ByteSizeValue getCircuitBreakerLimit() {
return knnSettings.getSettingValue(KNNSettings.KNN_MEMORY_CIRCUIT_BREAKER_LIMIT);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack. will monitor based on discussion above.

Comment on lines 81 to 83
public void setCircuitBreaker(boolean circuitBreaker) {
knnSettings.updateBooleanSetting(KNNSettings.KNN_CIRCUIT_BREAKER_TRIGGERED, circuitBreaker);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack. will monitor based on discussion above.

Comment on lines 72 to 74
public boolean isCircuitBreakerTriggered() {
return knnSettings.getSettingValue(KNNSettings.KNN_CIRCUIT_BREAKER_TRIGGERED);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a good function to be kept here? what if we move this in KNNSetting class or some other class or a utility class that can provide this info.?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem with the KNNSettings class is that it ends up collecting settings from many unrelated components. If a component wants to know if the circuit breaker is triggered, I think it should check from the circuit breaker service as opposed to looking up in the KNNSettings class. In the future, we may use something other than a setting to determine if the circuit breaker is triggered. Encapsulating this function in this service as opposed to KNNSettings would prevent breaking changes to dependent components.

As an aside, in OpenSearch, settings are generally defined in the component they are related to (ref). I think ideally we would move to a pattern like this and change KNNSettings into a general utility class used to look up the value of a particular setting.

Comment on lines 99 to 101
public boolean isCircuitBreakerEnabled() {
return knnSettings.getSettingValue(KNNSettings.KNN_MEMORY_CIRCUIT_BREAKER_ENABLED);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack. will monitor based on discussion above.

Comment on lines 45 to 46
when(NEVER_TRIGGERED_CB_SERVICE.isCircuitBreakerTriggered()).thenReturn(false);
when(NEVER_TRIGGERED_CB_SERVICE.getCircuitBreakerLimit()).thenReturn(new ByteSizeValue(100, ByteSizeUnit.KB));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need these static mocking?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wanted to provide an easy to use NativeMemoryCircuitBreakerService that is never tripped.

@jmazanec15 jmazanec15 force-pushed the cb-refactor branch 3 times, most recently from 791b489 to e9fa5e4 Compare March 28, 2023 23:37
Splits the CBService into 2 separate classes. One contains the logic for
querying information about the circuit breaker (i.e. whether or not it
is tripped). The other contains the logic for running the periodic
monitoring workflow.

Signed-off-by: John Mazanec <[email protected]>
@jmazanec15
Copy link
Member Author

@navneet1v I refactored the NativeMemoryCircuitBreakerService into 2 classes: NativeMemoryCircuitBreaker and NativeMemoryCircuitBreakerMonitor. This resolves the circular dependency issue.

I left out 2 changes that I think we will want to do eventually:

  1. Make settings update call synchronous
  2. Move the Circuit breaker setting definition out of KNNSettings and into NativeMemoryCircuitBreaker

I decided to leave these 2 out to keep this PR simple. I will take them up in a future PR.

@jmazanec15 jmazanec15 requested a review from navneet1v March 28, 2023 23:39
@jmazanec15
Copy link
Member Author

Closing for now. Might come back to this later.

@jmazanec15 jmazanec15 closed this Sep 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Refactoring Improve the design, structure, and implementation while preserving its functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants