Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/QuerySandbox(ResourceLimitGroup) Add sandbox schema and tracking framework #13311

Conversation

kaushalmahi12
Copy link
Contributor

Description

As part of limiting the resource usage to the group of queries(Query Sandboxing/resourceLimitGroups). The feature is broken down into 3 independent parts.

  • CRUD APIs (Will go into the Plugin)
  • Request Tagging/Classification (Will go into the Plugin)
  • Tracking and Monitoring

This change sets up the tracking and monitoring framework which consists of

  • Schema for defining these groups/sandboxes
  • Service which will run on a 1 second interval. This class is not functional as it needs to be added in the Node.java
  • Service level settings
  • Logic to propagate the sandbox/resourceLimitGroup from coordinator to data nodes.

Remaining Changes

  • Will add the Tests for Service once the CRUD APIs are raised and merged
  • ResourceTracking tests once CRUD APIs are raised and merged
  • Enable the service

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@kaushalmahi12 kaushalmahi12 changed the title Feature/resource limit group sandbox schema Feature/QuerySandbox(ResourceLimitGroup) Add sandbox schema and tracking framework Apr 19, 2024
Copy link
Contributor

✅ Gradle check result for 5ba7535: SUCCESS

Copy link

codecov bot commented Apr 19, 2024

Codecov Report

Attention: Patch coverage is 50.50167% with 148 lines in your changes missing coverage. Please review.

Project coverage is 71.28%. Comparing base (b15cb0c) to head (5ba7535).
Report is 373 commits behind head on main.

Current head 5ba7535 differs from pull request most recent head 3afd9db

Please upload reports for the commit 3afd9db to get more accurate results.

Files Patch % Lines
...esourceLimitsGroupResourceUsageTrackerService.java 0.00% 41 Missing ⚠️
...h/cluster/metadata/ResourceLimitGroupMetadata.java 38.29% 29 Missing ⚠️
...esource_limit_group/ResourceLimitGroupService.java 0.00% 23 Missing ⚠️
...java/org/opensearch/cluster/metadata/Metadata.java 0.00% 18 Missing ⚠️
...limit_group/ResourceLimitGroupServiceSettings.java 62.79% 14 Missing and 2 partials ⚠️
...pensearch/cluster/metadata/ResourceLimitGroup.java 92.39% 2 Missing and 5 partials ⚠️
...e_limit_group/module/ResourceLimitGroupModule.java 0.00% 5 Missing ⚠️
...va/org/opensearch/action/search/SearchRequest.java 55.55% 2 Missing and 2 partials ⚠️
...pensearch/rest/action/search/RestSearchAction.java 0.00% 1 Missing and 1 partial ⚠️
...opensearch/search/internal/ShardSearchRequest.java 75.00% 0 Missing and 2 partials ⚠️
... and 1 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13311      +/-   ##
============================================
- Coverage     71.42%   71.28%   -0.14%     
- Complexity    59978    60609     +631     
============================================
  Files          4985     5045      +60     
  Lines        282275   285730    +3455     
  Branches      40946    41358     +412     
============================================
+ Hits         201603   203685    +2082     
- Misses        63999    65152    +1153     
- Partials      16673    16893     +220     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Kaushal Kumar <[email protected]>
Copy link
Contributor

❌ Gradle check result for d2a19ee: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Collaborator

@jainankitk jainankitk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have the logic for request rejection in this PR, right?

Comment on lines +89 to +96

public String getResourceLimitGroupName() {
return resourceLimitGroupId;
}

public void setResourceLimitGroupName(String resourceLimitGroupId) {
this.resourceLimitGroupId = resourceLimitGroupId;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class does not have any other setters. Why not set the resourceLimitGroupId within the constructor itself?

Copy link
Contributor Author

@kaushalmahi12 kaushalmahi12 May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't do that because of following reasons

but we will change the information is coming in the searchRequest so this will likely not be present in final change.

Comment on lines +112 to +118
public String getResourceLimitGroupName() {
return resourceLimitGroupId;
}

public void setResourceLimitGroupName(String resourceLimitGroupId) {
this.resourceLimitGroupId = resourceLimitGroupId;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as SearchShardTask

Copy link
Contributor Author

@kaushalmahi12 kaushalmahi12 May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above,
This is where shard level task are created : https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/transport/RequestHandlerRegistry.java#L91

so basically the tasks are created by TaskManager using TaskAwareRequest.

Comment on lines 196 to 212
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
ResourceLimitGroupMetadata that = (ResourceLimitGroupMetadata) o;
return Objects.equals(resourceLimitGroups, that.resourceLimitGroups);
}

@Override
public int hashCode() {
return Objects.hash(resourceLimitGroups);
}

@Override
public String toString() {
return Strings.toString(MediaTypeRegistry.JSON, this);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Move these methods before inner class definitions for more clarity

* ResourceLimitGroupMetadataDiff
*/
static class ResourceLimitGroupMetadataDiff implements NamedDiff<Metadata.Custom> {
final Diff<Map<String, ResourceLimitGroup>> dataStreanDiff;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo in dataStreanDiff

*/
@Override
public Metadata.Custom apply(Metadata.Custom part) {
return new ResourceLimitGroupMetadata(dataStreanDiff.apply(((ResourceLimitGroupMetadata) part).resourceLimitGroups));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we apply the difference in resourceLimitGroup to existing one for more efficient internode cluster state publication?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the diff we are only including the diff of ResourceLimitGroups so it should be fine.

* ResourceLimitGroup ids which are marked for deletion in between the
* {@link org.opensearch.search.resource_limit_group.ResourceLimitGroupService} runs
*/
private List<String> toDeleteResourceLimitGroups;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can call this deleteRLG since we don't track the groups after hard deletion

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, we can keep the reference to original RLG instead of just id?

Comment on lines +107 to +112
Map<String, List<Task>> newResourceLimitGroupTasks = new HashMap<>();
for (Map.Entry<String, List<ResourceLimitGroupTask>> entry : activeTasks.stream()
.collect(Collectors.groupingBy(ResourceLimitGroupTask::getResourceLimitGroupName))
.entrySet()) {
newResourceLimitGroupTasks.put(entry.getKey(), entry.getValue().stream().map(task -> (Task) task).collect(Collectors.toList()));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to stream the task list twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can merge the .map logic of these together.

public Map<String, Double> toMap() {
Map<String, Double> map = new HashMap<>();
// We can put the additional resources into this map in the future
map.put(JVM, getAbsoluteJvmAllocationsUsageInPercent());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can write generic method for streaming the list of ALLOWED_RESOURCES, and get corresponding value based on the resource type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Makes sense

}

/**
* filter out the deleted sandboxes which still has unfi
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: incomplete documentation?

Comment on lines +200 to +214
public void pruneResourceLimitGroup() {
toDeleteResourceLimitGroups = toDeleteResourceLimitGroups.stream().filter(this::hasUnfinishedTasks).collect(Collectors.toList());
}

private boolean hasUnfinishedTasks(String sandboxId) {
return false;
}

/**
* method to handle the completed tasks
* @param task represents completed task on the node
*/
@Override
public void onTaskCompleted(Task task) {}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If task completion is overridden, we don't need to stream the list of deleted RLGs? Although, should not matter given RLG deletion should be rare

@@ -698,6 +707,15 @@ public String pipeline() {
return pipeline;
}

public SearchRequest resourceLimitGroupId(String resourceLimitGroupId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public SearchRequest resourceLimitGroupId(String resourceLimitGroupId) {
public SearchRequest setResourceLimitGroupId(String resourceLimitGroupId) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had kept this method aligned with other methods in the class.

return this;
}

public String resourceLimitGroupId() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public String resourceLimitGroupId() {
public String getResourceLimitGroupId() {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had kept this method aligned with other methods in the class.

@@ -1329,6 +1336,36 @@ public Builder removeDataStream(String name) {
return this;
}

public Builder resourceLimitGroups(final Map<String, ResourceLimitGroup> resourceLimitGroups) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
public Builder resourceLimitGroups(final Map<String, ResourceLimitGroup> resourceLimitGroups) {
public Builder putResourceLimitGroups(final Map<String, ResourceLimitGroup> resourceLimitGroups) {

Copy link
Contributor Author

@kaushalmahi12 kaushalmahi12 May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have kept these methods inline with rest of the methods in class. These are Builder methods so I think it makes sense

* }
*/
@ExperimentalApi
public class ResourceLimitGroup extends AbstractDiffable<ResourceLimitGroup> implements ToXContentObject {
Copy link
Contributor

@kiranprakash154 kiranprakash154 Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should have an id field to refer a sandbox internally with the id. The name can be a user-facing entity, according to the published RFC - https://docs.google.com/document/d/1Bp7JnDLectuv3PviyXO0FWV6Tg3wKQOTnzC0PUYXZ6I/edit#heading=h.jrflzibacpc7

Copy link
Contributor

@kiranprakash154 kiranprakash154 Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have an isDeleted field or something that denotes it has been deleted and in the drain mode only, i.e only the already running searches are left to run. Without this field, we cannot drop an incoming request when a ResourceLimitGroup is marked for deletion and in that drain phase.

Copy link
Contributor

@kiranprakash154 kiranprakash154 May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I dont think we need isDeleted as I'm guessing the delete API will remove it from the clusterState, so it wont be ever returned as a value here activeResourceLimitGroups = new ArrayList<>(clusterService.state().metadata().resourceLimitGroups().values());

In that case, where are we verifying the user provided ResourceLimitGroup is a valid ResourceLimitGroup or not ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will handle that in the tracking part, We will simply track these invalidly tagged tasks under default catch all sandbox.

@ExperimentalApi
public static class ResourceLimit implements Writeable, ToXContentObject {
private final String resourceName;
private final Double value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think limit or threshold are better than value ?
value seems a little vague to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is true in the context.

*/
@ExperimentalApi
public static class ResourceLimit implements Writeable, ToXContentObject {
private final String resourceName;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just name ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to make an abstract class of Resource.
Which CPU/JVM extends, that way we can accomodate in the long term to be able to provide JVM as bytes or or add a new resource etc.

Using Factory pattern to create a valid Resource like

public class ResourceFactory {
    public Resource createResource(String type) {
        if (type.equalsIgnoreCase("CPU")) {
            return new CPU();
        } else if (type.equalsIgnoreCase("JVM")) {
            return new JVM();
        }
        throw new IllegalArgumentException("Invalid resource type");
    }
}

Copy link
Contributor Author

@kaushalmahi12 kaushalmahi12 May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can encapsulate the Resource into a class. But all that would do is rename the class

Objects.requireNonNull(resourceName, "resourceName can't be null");
Objects.requireNonNull(value, "resource value can't be null");

if (Double.compare(value, 1.0) > 0) {
Copy link
Contributor

@kiranprakash154 kiranprakash154 Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (Double.compare(value, 1.0) > 0) {
if (value > 1.0) {

this is much simpler ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this just for the accuracy purposes as We can only hold doubles/floats precise upto a digit after decimal

* Main class to declare the query ResourceLimitGrouping feature related settings
*/
public class ResourceLimitGroupServiceSettings {
private static final Long DEFAULT_RUN_INTERVAL_MILLIS = 1000l;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private static final Long DEFAULT_RUN_INTERVAL_MILLIS = 1000l;
private static final Long DEFAULT_RUN_INTERVAL_MILLIS = 1_000L;

PARSER.declareDouble(ConstructingObjectParser.constructorArg(), RESOURCE_VALUE_FIELD);
}

public ResourceLimit(String resourceName, Double value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe ResourceLimit is called by an API that creates a new ResourceLimitGroup ?
If this is user facing then we should let them use a string ending with % like 10% or a double value of 0.1.
We can follow something similar to this - https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/indices/IndicesRequestCache.java#L837C1-L857C6


@Override
public void updateResourceLimitGroupsResourceUsage() {
activeResourceLimitGroups = new ArrayList<>(clusterService.state().metadata().resourceLimitGroups().values());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we avoid storing these values at the instance level to avoid synchronization issues and always fetch the activeResourceLimitGroups from the clusterService ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is calculated every second so this is always upto date. Am I missing something here ?

* We are keeping this as instance member as we will need this to identify the contended resource limit groups
* resourceLimitGroupName -> List&lt;ResourceLimitGroupTask&gt;
*/
private Map<String, List<Task>> resourceLimitGroupTasks;
Copy link
Contributor

@kiranprakash154 kiranprakash154 May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be hardcoded to CancellableTask ? should/would we ever tag a non CancellableTask type to a ResourceLimitGroup ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No

@Inject
public ResourceLimitGroupService(
ResourceLimitGroupResourceUsageTracker requestTrackerService,
ResourceLimitGroupTaskCanceller requestCanceller,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are both requestCanceller and taskCanceller required ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Catch! This is a miss from my end. No, only one is required.

/**
* This class tracks requests per resourceLimitGroups
*/
public class ResourceLimitsGroupResourceUsageTrackerService extends ResourceLimitGroupTaskCanceller
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is overburdened with multiple responsibilities - Tracking resource usage, Managing task cancellations and Handling resource pruning and deletion.
I think we need to break them into their own classes.

Something like

public class ResourcePruningService implements ResourcePruner {}
public class TaskCancellationService implements TaskCanceller {}
public class ResourceUsageService implements ResourceUsageTracker {}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, We will be creating these concrete implementations. This PR is going to serve as the Parent PR for upcoming small PRs which are breakdown of this.

Copy link
Contributor

github-actions bot commented May 8, 2024

❌ Gradle check result for 3afd9db: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Jun 8, 2024
@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Jul 14, 2024
@jainankitk
Copy link
Collaborator

I guess we can close this PR. @kaushalmahi12 - Please reopen if needed

@jainankitk jainankitk closed this Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stalled Issues that have stalled
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants