
Add limit for max range query splits by interval #6458

Draft · wants to merge 4 commits into master
Conversation

afhassan (Contributor)

What this PR does:
Cortex currently only supports a static interval for splitting range queries. This PR adds a new limit, split_queries_by_interval_max_splits, which dynamically changes the split interval to a multiple of split_queries_by_interval so that the total number of splits stays below the configured maximum.

Example:
  split_queries_by_interval = 24h
  split_queries_by_interval_max_splits = 30
  A 30-day range query is split into 30 queries using a 24h interval.
  A 40-day range query is split into 20 queries using a 48h interval.
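
For illustration, the interval calculation this implies might look roughly like the sketch below; dynamicInterval is a hypothetical helper written for this description, not the PR's actual code.

package main

import (
	"fmt"
	"math"
	"time"
)

// dynamicInterval returns the smallest multiple of baseInterval that keeps
// the number of splits at or below maxSplits.
func dynamicInterval(baseInterval time.Duration, maxSplits int, queryRange time.Duration) time.Duration {
	if maxSplits <= 0 {
		return baseInterval // limit disabled
	}
	multiple := math.Ceil(float64(queryRange) / (float64(baseInterval) * float64(maxSplits)))
	if multiple < 1 {
		multiple = 1
	}
	return time.Duration(multiple) * baseInterval
}

func main() {
	fmt.Println(dynamicInterval(24*time.Hour, 30, 30*24*time.Hour)) // 24h0m0s -> 30 splits
	fmt.Println(dynamicInterval(24*time.Hour, 30, 40*24*time.Hour)) // 48h0m0s -> 20 splits
}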

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

- staticIntervalFn := func(_ tripperware.Request) time.Duration { return cfg.SplitQueriesByInterval }
- queryRangeMiddleware = append(queryRangeMiddleware, tripperware.InstrumentMiddleware("split_by_interval", metrics), SplitByIntervalMiddleware(staticIntervalFn, limits, prometheusCodec, registerer))
+ intervalFn := func(_ tripperware.Request) time.Duration { return cfg.SplitQueriesByInterval }
+ if cfg.SplitQueriesByIntervalMaxSplits != 0 {
Contributor:

Shouldn't the limit be applied to both range splits and vertical splits?

func (s shardBy) Do(ctx context.Context, r Request) (Response, error) {

afhassan (Contributor, Author) commented Dec 30, 2024

Technically this sets a limit on the combined range and vertical splits for a given query. The number of vertical shards is static, so the max number of splits for a given query becomes split_queries_by_interval_max_splits x query_vertical_shard_size (for example, with a max of 30 splits and 3 vertical shards, at most 90 sub-queries). Because of this, adding a separate limit for vertical sharding would be redundant while the number of vertical shards is a static config, since it is already bounded.

yeya24 (Contributor) commented Dec 31, 2024

Instead of changing the split interval using a max number of split queries, can we try to combine it with the estimated amount of data to fetch?

For example, a query up[30d] is very expensive to split into 30 splits, as each split query still fetches 30 days of data, so 30 splits end up fetching 900 days of data.

Instead of having a limit on the total number of splits, should we use the total days of data to fetch?

afhassan (Contributor, Author) replied
> Instead of changing the split interval using a max number of split queries, can we try to combine it with the estimated amount of data to fetch?
>
> For example, a query up[30d] is very expensive to split into 30 splits, as each split query still fetches 30 days of data, so 30 splits end up fetching 900 days of data.
>
> Instead of having a limit on the total number of splits, should we use the total days of data to fetch?

That's a good idea - I can add a new limit on the total hours of data fetched and adjust the interval so it is not exceeded.

We can still keep the max number of splits, since it gives more flexibility to limit the number of shards for queries with a long time range even if they don't fetch many days of data, like the example you mentioned.
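
As a rough sketch of how the two limits could compose (illustrative names and logic, assuming each split re-fetches the selector's range, as in the up[30d] example above):

package main

import (
	"fmt"
	"math"
	"time"
)

// pickInterval grows the split interval in multiples of base until both
// hypothetical limits hold: a cap on the number of splits and a cap on the
// estimated total days of data fetched.
func pickInterval(base, queryRange, matcherRange time.Duration, maxSplits, maxDaysFetched int) time.Duration {
	for n := 1; ; n++ {
		interval := time.Duration(n) * base
		splits := int(math.Ceil(float64(queryRange) / float64(interval)))
		// Each split re-fetches the matcher range (the 30d in up[30d]), so
		// the estimated total grows with the number of splits.
		daysFetched := float64(splits) * (interval + matcherRange).Hours() / 24
		splitsOK := maxSplits == 0 || splits <= maxSplits
		fetchedOK := maxDaysFetched == 0 || daysFetched <= float64(maxDaysFetched)
		if (splitsOK && fetchedOK) || splits == 1 {
			return interval
		}
	}
}

func main() {
	// up[30d] over a 30-day range, capped at 30 splits and 200 days fetched:
	fmt.Println(pickInterval(24*time.Hour, 30*24*time.Hour, 30*24*time.Hour, 30, 200)) // 144h0m0s (5 splits, ~180 days)
}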

# If vertical sharding is enabled for a query, the combined total number of
# vertical and interval shards is kept below this limit.
# CLI flag: -querier.split-queries-by-interval-max-splits
[split_queries_by_interval_max_splits: <int> | default = 0]
Contributor:

Should run: make doc?

@@ -62,6 +62,9 @@ type Config struct {
// Limit of number of steps allowed for every subquery expression in a query.
MaxSubQuerySteps int64 `yaml:"max_subquery_steps"`

+ // Max number of days of data fetched for a query, used to calculate appropriate interval and vertical shard size.
+ MaxDaysOfDataFetched int `yaml:"max_days_of_data_fetched"`
Contributor:

Does MaxDurationOfDataFetchedFromStoragePerQuery sound better?
Should this be part of QueryRange configuration?

@@ -131,6 +134,7 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
f.Int64Var(&cfg.MaxSubQuerySteps, "querier.max-subquery-steps", 0, "Max number of steps allowed for every subquery expression in query. Number of steps is calculated using subquery range / step. A value > 0 enables it.")
f.BoolVar(&cfg.IgnoreMaxQueryLength, "querier.ignore-max-query-length", false, "If enabled, ignore max query length check at Querier select method. Users can choose to ignore it since the validation can be done before Querier evaluation like at Query Frontend or Ruler.")
f.BoolVar(&cfg.EnablePromQLExperimentalFunctions, "querier.enable-promql-experimental-functions", false, "[Experimental] If true, experimental promQL functions are enabled.")
+ f.IntVar(&cfg.MaxDaysOfDataFetched, "querier.max-days-of-data-fetched", 0, "Max number of days of data fetched for a query. This can be used to calculate appropriate interval and vertical shard size dynamically.")
Contributor:

Could more details be added to the explanation? Also add "0 to disable".

CacheResults bool `yaml:"cache_results"`
MaxRetries int `yaml:"max_retries"`
SplitQueriesByInterval time.Duration `yaml:"split_queries_by_interval"`
SplitQueriesByIntervalMaxSplits int `yaml:"split_queries_by_interval_max_splits"`
Contributor:

Maybe these both could be nested inside another config called DynamicQuerySplits?
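
Purely as an illustration of that suggestion, the nested shape might look something like this (type and field names hypothetical, not from the PR):

package queryrange

import "time"

// DynamicQuerySplits is a hypothetical nested config grouping the two
// dynamic-splitting knobs, as the comment above suggests.
type DynamicQuerySplits struct {
	MaxSplitsPerQuery      int           `yaml:"max_splits_per_query"`
	MaxFetchedDataDuration time.Duration `yaml:"max_fetched_data_duration"`
}

type Config struct {
	SplitQueriesByInterval time.Duration      `yaml:"split_queries_by_interval"`
	DynamicQuerySplits     DynamicQuerySplits `yaml:"dynamic_query_splits"`
}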

)

type IntervalFn func(r tripperware.Request) time.Duration
// dayMillis is the L4 block range in milliseconds.
Contributor:

The L4 block range is configurable in Cortex. Do we have to tie it to the L4 block range? Could the configuration itself be of type time.Duration?

- reqs, err := splitQuery(r, s.interval(r))
+ interval, err := s.interval(ctx, r)
+ if err != nil {
+ 	return nil, httpgrpc.Errorf(http.StatusBadRequest, err.Error())
Contributor:

nit: This should be an InternalServerError
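
Presumably something like:

return nil, httpgrpc.Errorf(http.StatusInternalServerError, err.Error())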

if err != nil {
return nil, httpgrpc.Errorf(http.StatusBadRequest, err.Error())
}
s.splitByCounter.Add(float64(len(reqs)))

stats := querier_stats.FromContext(ctx)
Contributor:

What are the stats used for? Are they only used for logging in the query-frontend?

}
}

func dynamicIntervalFn(cfg Config, limits tripperware.Limits, queryAnalyzer querysharding.Analyzer, queryStoreAfter time.Duration, lookbackDelta time.Duration, maxDaysOfDataFetched int) func(ctx context.Context, r tripperware.Request) (time.Duration, error) {
Contributor:

Could all of these be passed through the cfg?

return cfg.SplitQueriesByInterval, err
}

queryDayRange := int((r.GetEnd() / dayMillis) - (r.GetStart() / dayMillis) + 1)
Contributor:

Could we avoid using day here? Other users of Cortex might choose to split by multiple days, or by less than a day.
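
One possible generalization, deriving the bucket count from the configured split interval rather than a hard-coded day (illustrative only, not the PR's code):

package main

import (
	"fmt"
	"time"
)

// queryIntervalRange counts how many interval-aligned buckets the query's
// [start, end] millisecond range touches, generalizing queryDayRange above.
func queryIntervalRange(start, end int64, interval time.Duration) int {
	ms := interval.Milliseconds()
	return int(end/ms-start/ms) + 1
}

func main() {
	start := time.Date(2025, 1, 1, 6, 0, 0, 0, time.UTC).UnixMilli()
	end := time.Date(2025, 1, 3, 6, 0, 0, 0, time.UTC).UnixMilli()
	fmt.Println(queryIntervalRange(start, end, 24*time.Hour)) // 3: touches Jan 1, 2 and 3
}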

return int64(d / (time.Millisecond / time.Nanosecond))
}

func GetTimeRangesForSelector(start, end int64, lookbackDelta time.Duration, n *parser.VectorSelector, path []parser.Node, evalRange time.Duration) (int64, int64) {
Contributor:

Could you add some tests for these util methods?

yeya24 (Contributor) commented Jan 20, 2025

I get the idea. But my main concern with such a dynamic split interval + max splits combination is that the results cache will have a very bad hit ratio, as our current results cache key is tied to the split interval.

> A 30-day range query is split into 30 queries using a 24h interval.
> A 40-day range query is split into 20 queries using a 48h interval.

The first 30-day range query uses a 24h interval, so 24h will be part of the results cache key.
If you then run another 40-day range query with a 48h interval, the cached results of the first 30 days will be missed, because 48h is now part of your results cache key.

Making the vertical shard size dynamic seems more friendly to the results cache, because the vertical shard size is not part of the results cache key. However, not all queries can be vertically sharded.
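
For context, a loose sketch of why the interval shows up in the key; this is modeled on Cortex's constSplitter-style key generation, and the exact format here is an assumption:

package main

import (
	"fmt"
	"time"
)

// generateCacheKey derives a results cache key from the split interval. The
// same start time lands in a different "current interval" bucket once the
// interval changes, so previously cached extents are no longer found.
func generateCacheKey(userID, query string, step, start int64, interval time.Duration) string {
	currentInterval := start / interval.Milliseconds()
	return fmt.Sprintf("%s:%s:%d:%d", userID, query, step, currentInterval)
}

func main() {
	start := time.Date(2025, 1, 15, 0, 0, 0, 0, time.UTC).UnixMilli()
	fmt.Println(generateCacheKey("tenant", "up", 900, start, 24*time.Hour))
	fmt.Println(generateCacheKey("tenant", "up", 900, start, 48*time.Hour)) // different key for the same data
}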

harry671003 (Contributor) commented Jan 20, 2025

> The first 30-day range query uses a 24h interval, so 24h will be part of the results cache key.
> If you then run another 40-day range query with a 48h interval, the cached results of the first 30 days will be missed, because 48h is now part of your results cache key.

Isn't this already true today, with Grafana modifying the step interval? For example, a 30d query will have a step of 900s, while a 40d query will have a step of 1200s. Since the step is also in the cache key, this already invalidates the cache.

I agree with you on changing the vertical shard size first. Could we mark this feature experimental and iterate on it?
