
[META] QueryGroup level stats structure and api #15120

Closed · kaushalmahi12 opened this issue Aug 5, 2024 · 8 comments · Fixed by #15527
Labels: Meta (Meta issue, not directly linked to a PR), Search (Search query, autocomplete ...etc)

Comments

kaushalmahi12 (Contributor) commented Aug 5, 2024

Please describe the end goal of this project

We want to come to a conclusion on all the fields that should be present in query group stats, since the _nodes/stats output is already quite large and a cluster can have [0-100] query groups.

Although the number of query groups is limited, each query group accumulates node-level metrics, so the actual number of such stat objects in the cluster will be #dataNodes * #queryGroups.

If the cluster is large, e.g. dataNodes = 200 and queryGroups = 50, that is 10,000 objects, which is a lot and can make the stats output hard to fathom.

The schema for a single query group stat that I am proposing is something like the following:

{
    "queryGroupId": {
        "completions": Long,
        "rejections": Long,
        "CPU": { "currentUsage": Double, "cancellations": Long },
        "Memory": { "currentUsage": Double, "cancellations": Long }
    }
}

Apart from the resource usage fields, all the metric values are cumulative counters since process start time.
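
For illustration, two successive reads of the same query group might look like the following (values are made up): the completions, rejections, and cancellations counters only ever grow until the process restarts, while currentUsage is a point-in-time reading that can move in either direction.

{ "completions": 120, "rejections": 4, "CPU": { "currentUsage": 0.60, "cancellations": 2 }, "Memory": { "currentUsage": 0.30, "cancellations": 1 } }
{ "completions": 185, "rejections": 9, "CPU": { "currentUsage": 0.35, "cancellations": 5 }, "Memory": { "currentUsage": 0.55, "cancellations": 1 } }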

Keeping these things in mind, I am more inclined towards keeping the stats API for this separate.
I think feature-related stats should only be provided when explicitly asked for, either using one of the following (see the sketch below):

  • a queryParam
  • a PathParam

Currently, if a feature is enabled and has stats, they are returned by default. If a client is consuming _nodes/stats and then upgrades to an OpenSearch version that has an additional stats object in the response, it can break the client code.
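
For illustration only, opt-in could look something like the following; the parameter and metric names here are hypothetical placeholders, not a committed design:

GET _nodes/stats?query_group=true     (hypothetical query parameter)
GET _nodes/stats/query_groups         (hypothetical path-param metric name)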

Supporting References

#12342

Issues

#12342

Related component

Search:Resiliency

@kaushalmahi12 kaushalmahi12 added Meta Meta issue, not directly linked to a PR untriaged labels Aug 5, 2024
@jainankitk jainankitk added the Search Search query, autocomplete ...etc label Aug 7, 2024
jainankitk (Collaborator)

Thanks @kaushalmahi12 for initiating this issue. A few considerations:

  1. Should we also have failures in the queryGroup stats, since we don't track usage for those requests?
  2. Also, should we distinguish between the stats for SearchTask and SearchShardTask?
  3. I am slightly unclear on whether you're suggesting that the stats be part of node stats and only returned using a queryParam/pathParam, or that these stats live in a separate stats API.

kaushalmahi12 (Contributor, Author)

@jainankitk Thanks for reviewing it.

I am slightly unclear on whether you're suggesting that the stats be part of node stats and only returned using a queryParam/pathParam, or that these stats live in a separate stats API.

Yes, I am suggesting a separate API.

Should we also have failures in the queryGroup stats, since we don't track usage for those requests?

What do we mean by failures here?

Also, should we distinguish between the stats for SearchTask and SearchShardTask?

IMO these would be unnecessary and redundant fields in the API. Why do we think they would be useful w.r.t. a QueryGroup?

@backslasht Can you provide your inputs on this?

kaushalmahi12 (Contributor, Author) commented Aug 15, 2024

@jainankitk If you mean request/task failures, then I have the following reasons not to include those in the stats.

  • Rejections and cancellations are due to query group resource usage, while failures are transient or unrelated to the query group limits themselves, so they should not come under stats for the queryGroup.
  • Failure hooks are not built into the task resource tracking framework. Building them just for the sake of adding failures to the stats would be overkill given the value added.

backslasht (Contributor)

Thanks @kaushalmahi12 for creating this issue.

+1 for creating a new API instead of adding it to node stats. I'm interested in knowing what filters you are thinking of providing as part of this new API. Request/response structures will be helpful to understand it.

Rejections and cancellations are due to query group resource usage, while failures are transient or unrelated to the query group limits themselves, so they should not come under stats for the queryGroup.

While I agree that failures may not be due to resource consumption, it does provide clarity for the user to understand which query groups are seeing failures, especially if those query groups belong to multiple tenants.

kaushalmahi12 (Contributor, Author) commented Aug 16, 2024

Thanks @backslasht for going through this and providing your useful insights.

I'm interested in knowing what filters you are thinking of providing as part of this new API. Request/response structures will be helpful to understand it.

Regarding this, we can keep the API aligned with the _nodes/stats API and provide path-param based filtering,
e.g.:

GET query_group/stats
GET query_group/<node_id>/stats
GET query_group/stats/<metrics>

Here <metrics> would be a comma-separated list of fields that we want to get as part of the response.
Some concrete examples could be as follows:

GET query_group/stats

Response

{
    "F-ByTQzVQ3GQeYzQJArJGQ": {
        "P7D7VE4fe7": {
            "completions": 12343134,
            "rejections": 343,
            "CPU": {
                "current_usage": 0.4,
                "cancellations": 10
            },
            "MEMORY": {
                "current_usage": 0.2,
                "cancellations": 101
            }
        }
        .....
    }

    .....
}
GET query_group/F-ByTQzVQ3GQeYzQJArJGQ/stats

Response

{
    "F-ByTQzVQ3GQeYzQJArJGQ": {
        "P7D7VE4fe7": {
            "completions": 12343134,
            "rejections": 343,
            "CPU": {
                "current_usage": 0.4,
                "cancellations": 10
            },
            "MEMORY": {
                "current_usage": 0.2,
                "cancellations": 101
            }
        }
        .....
    }
}
GET query_group/stats/completions,rejections

Response

{
    "F-ByTQzVQ3GQeYzQJArJGQ": {
        "P7D7VE4fe7": {
            "completions": 12343134,
            "rejections": 343
        },
        "RPqvjXFyiP": {
            "completions": 98424211,
            "rejections": 352
        }
        .....
    }
    ....
}
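
For illustration, assuming a local cluster listening on the default REST port 9200, the proposed endpoints could be exercised like this (the paths are still a proposal, not an existing API):

curl -s "localhost:9200/query_group/stats?pretty"
curl -s "localhost:9200/query_group/F-ByTQzVQ3GQeYzQJArJGQ/stats?pretty"
curl -s "localhost:9200/query_group/stats/completions,rejections?pretty"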

it does provide clarity for the user to understand which query groups are seeing failures, especially if those query groups belong to multiple tenants.

I think failures do make sense to incorporate from a multi-tenancy perspective. And since we are extending this idea to do task scheduling using a threadpool per query group, in the future it might also make sense to expose the currently running tasks and queue size for each tenant.

Keeping these things in mind, I think it shouldn't hurt to add failures to the stats.

jainankitk (Collaborator)

@kaushalmahi12 - Thanks for providing the request/response structures. A few comments:

  • Might be good to add whether the QueryGroup is breaching a threshold or not. Not sure if we should add a specific resource-level breach indicator as well.
  • Do we need a cumulative cancellations metric in addition to the per-resource ones? I am assuming cumulative rejections is for tracking requests we did not even start executing.
  • Maybe for simplicity, we need not support query_group/stats/completions,rejections. Although, we should take some filtering parameters, like breach=true for looking at just the QueryGroups breaching one of the resource thresholds. We can refine that further by taking breach=memory to get the QueryGroups breaching the memory threshold (a rough sketch of this follows below).
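
For illustration only, the breach filter could look something like the following; the parameter name, the breached flag, and the values are hypothetical and not a committed design:

GET query_group/stats?breach=memory

{
    "F-ByTQzVQ3GQeYzQJArJGQ": {
        "P7D7VE4fe7": {
            "MEMORY": {
                "current_usage": 0.92,
                "cancellations": 101,
                "breached": true
            }
        }
    }
}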

I think failures do make sense to incorporate from a multi-tenancy perspective. And since we are extending this idea to do task scheduling using a threadpool per query group, in the future it might also make sense to expose the currently running tasks and queue size for each tenant.

I also agree with including failures as one of the metrics, and with adding running tasks/queue size and other relevant metrics once we introduce QueryScheduling using QueryGroups.

backslasht (Contributor)

Thanks @kaushalmahi12 for the request and response structures. It will be good to add filters using the query group ID/name as well.

kaushalmahi12 (Contributor, Author)

Thanks @jainankitk for the follow-up suggestions.

Might be good to add whether the QueryGroup is breaching a threshold or not. Not sure if we should add a specific resource-level breach indicator as well.

I think filtering based on a breach filter seems like a good idea and a good user experience. But the breach parameter is a little ambiguous: since we perform both rejections and cancellations, should we declare a breach at the cancellation threshold or at the rejection threshold?

Do we need a cumulative cancellations metric in addition to the per-resource ones? I am assuming cumulative rejections is for tracking requests we did not even start executing.

IMO this would be redundant information, as the structure is simple enough and does not have too many nested subfields.

Maybe for simplicity, we need not support query_group/stats/completions,rejections. Although, we should take some filtering parameters, like breach=true for looking at just the QueryGroups breaching one of the resource thresholds. We can refine that further by taking breach=memory to get the QueryGroups breaching the memory threshold.

I like this idea and we should support it.
