Cache metrics (values) in Metric Server and honor pollingInterval
#2282
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
In my use-case, not being able to fully control …
What if you specify this in the Prometheus query?
This looks like a related issue:
So the idea is the following:
💩 I am not sure how to name this new option; honestly, the best would be to name it …
Sorry for the late response. We should think about how to push/pull the metrics between different Metrics Server instances. As a starting point I guess we could assume the topology is 1-1, but from a reliability PoV it is a good idea to have 2 or more instances. Maybe we could create a mesh between Metrics Servers to share the information, but maybe the best idea is just creating a slice of gRPC connections and pushing the metrics over all of them. In the background, we could watch (k8s) Endpoints for the service and use the changes to establish and add new connections to the slice. WDYT?
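A minimal sketch of that endpoint-watching idea, assuming a hypothetical `connPool` type and an insecure dial for brevity (the real implementation would use TLS and proper error handling; none of these names are from the actual KEDA code):

```go
// connwatch keeps a slice/map of gRPC connections in sync with the Endpoints
// of a Service, so metrics could be pushed over all current connections.
package connwatch

import (
	"fmt"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"

	"context"
)

type connPool struct {
	mu    sync.Mutex
	conns map[string]*grpc.ClientConn // keyed by "ip:port"
}

// sync adds connections for new endpoint addresses and drops vanished ones.
func (p *connPool) sync(addrs []string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	seen := map[string]bool{}
	for _, addr := range addrs {
		seen[addr] = true
		if _, ok := p.conns[addr]; ok {
			continue
		}
		cc, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			continue
		}
		p.conns[addr] = cc
	}
	for addr, cc := range p.conns {
		if !seen[addr] {
			cc.Close()
			delete(p.conns, addr)
		}
	}
}

// watchEndpoints reacts to Endpoints changes of the given Service and
// re-syncs the connection pool on every event.
func watchEndpoints(ctx context.Context, cs kubernetes.Interface, ns, svc string, pool *connPool) error {
	w, err := cs.CoreV1().Endpoints(ns).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + svc,
	})
	if err != nil {
		return err
	}
	for ev := range w.ResultChan() {
		ep, ok := ev.Object.(*corev1.Endpoints)
		if !ok {
			continue
		}
		var addrs []string
		for _, subset := range ep.Subsets {
			for _, a := range subset.Addresses {
				for _, port := range subset.Ports {
					addrs = append(addrs, fmt.Sprintf("%s:%d", a.IP, port.Port))
				}
			}
		}
		pool.sync(addrs)
	}
	return nil
}
```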
I do understand the point you have and I agree. Though in the first iteration I'd try to make this feature as simple as possible, creating the unary RPC …

An alternative approach to support multi-namespace installations would be to open a client connection from the Metrics Server for each installation and send the request to the respective Operator. The filtering would be easy, as the namespace is part of the request from the k8s API when asking for metrics, so we would know (in the Metrics Server) which Operator to ask for the metric. This topology also supports multiple Metrics Server replicas, as those could be independent because all the state is stored in the Operator.
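To illustrate the per-namespace routing, here is a hedged sketch: the `MetricsServiceClient` interface and `Router` type below are placeholders, not the real generated KEDA gRPC client.

```go
// routing forwards each external-metrics request to the Operator instance that
// runs in the namespace the request refers to.
package routing

import (
	"context"
	"fmt"
)

// MetricsServiceClient is a stand-in for a generated gRPC client.
type MetricsServiceClient interface {
	GetMetrics(ctx context.Context, scaledObjectName, metricName string) (float64, error)
}

// Router keeps one client connection per KEDA installation (namespace).
type Router struct {
	clientsByNamespace map[string]MetricsServiceClient
}

func (r *Router) GetExternalMetric(ctx context.Context, namespace, scaledObjectName, metricName string) (float64, error) {
	client, ok := r.clientsByNamespace[namespace]
	if !ok {
		return 0, fmt.Errorf("no KEDA Operator registered for namespace %q", namespace)
	}
	// The namespace is already part of the incoming k8s API request,
	// so routing needs no extra lookup.
	return client.GetMetrics(ctx, scaledObjectName, metricName)
}
```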
Okay, let's go with your design and, once we check it works, we can modify how we establish the connections. For the moment we could go 1-1 and then think about how to solve the connection management. I think that watching KEDA namespaced Endpoints to detect new instances and establishing new connections based on that should be a small change.
I'm thinking about the approach and I have one question: will we enforce a single Operator instance? The docs explain that multiple instances are allowed for HA, even though a single instance will work. So we could break the behaviour for them if we establish the connection against the wrong instance.
The gRPC server start will be hooked on the leader election -> so it will run only on the leader instance.
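A sketch of how the gRPC server start could be gated on leader election with controller-runtime: a `Runnable` that reports `NeedLeaderElection() == true` is only started by the manager after the replica wins the election. The `grpcMetricsServer` type and its address are illustrative, not the actual KEDA code.

```go
package leaderelection

import (
	"context"
	"net"

	"google.golang.org/grpc"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

type grpcMetricsServer struct {
	addr string // illustrative listen address
}

// Start implements manager.Runnable; it blocks until ctx is cancelled.
func (s *grpcMetricsServer) Start(ctx context.Context) error {
	lis, err := net.Listen("tcp", s.addr)
	if err != nil {
		return err
	}
	srv := grpc.NewServer()
	// The metrics service would be registered on srv here in real code.
	go func() {
		<-ctx.Done()
		srv.GracefulStop()
	}()
	return srv.Serve(lis)
}

// NeedLeaderElection implements manager.LeaderElectionRunnable, so the manager
// starts this runnable only on the elected leader.
func (s *grpcMetricsServer) NeedLeaderElection() bool { return true }

var _ manager.LeaderElectionRunnable = &grpcMetricsServer{}

func setup(mgr ctrl.Manager) error {
	return mgr.Add(&grpcMetricsServer{addr: ":9666"})
}
```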
Nice! I have to think about this slowly to discover potential gaps/issues, but this is probably a better solution than mine. Congrats! 👏
See this: 19cd301
@zroubalik I think we can close this?
@tomkerkhove there is one outstanding sub-issue TBD; I'd keep this open till we fix it.
Proposal
Currently the KEDA Metrics Server scrapes metric values from external services (triggers) on each request from the k8s API server. This could overload the external service.
keda/pkg/provider/provider.go, line 72 in 2446b8e
We might want to cache the metric values in the Metrics Server to reduce the load on the external service and respond to the k8s API server much faster.
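A hedged sketch of such a cache, using `pollingInterval` as the freshness window. The key format and the `scalerQuery` callback are illustrative only; the design discussed later in this issue stores the cached values in the Operator rather than in a standalone type like this.

```go
// metricscache caches external metric values so the external service is
// queried at most once per pollingInterval.
package metricscache

import (
	"sync"
	"time"

	"k8s.io/metrics/pkg/apis/external_metrics"
)

type cachedValue struct {
	value     external_metrics.ExternalMetricValue
	fetchedAt time.Time
}

type MetricsCache struct {
	mu      sync.Mutex
	entries map[string]cachedValue // key: "<namespace>/<scaledObject>/<metricName>"
}

// GetOrFetch returns the cached value if it is younger than pollingInterval;
// otherwise it calls scalerQuery (which hits the external service) and caches the result.
func (c *MetricsCache) GetOrFetch(key string, pollingInterval time.Duration,
	scalerQuery func() (external_metrics.ExternalMetricValue, error)) (external_metrics.ExternalMetricValue, error) {

	c.mu.Lock()
	defer c.mu.Unlock()
	if c.entries == nil {
		c.entries = map[string]cachedValue{}
	}
	if e, ok := c.entries[key]; ok && time.Since(e.fetchedAt) < pollingInterval {
		return e.value, nil // fresh enough: no call to the external service
	}
	v, err := scalerQuery()
	if err != nil {
		return external_metrics.ExternalMetricValue{}, err
	}
	c.entries[key] = cachedValue{value: v, fetchedAt: time.Now()}
	return v, nil
}
```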
We have a `pollingInterval` property in ScaledObject which is honored only by the KEDA Operator; we could extend this behavior to the KEDA Metrics Server as well. There could be a new property in `ScaledObject.spec.advanced` to configure this, in case users would like to not cache metric values and keep the original behavior (if we make the new behavior the default).

The current design
When the `useCachedMetrics` option is set to `true`, metrics are queried only during the polling interval and stored in the Operator. The KEDA Metrics Server asks the KEDA Operator for metrics through gRPC:

- `useCachedMetrics: true` -> we will hit the cache
- `useCachedMetrics: false` or not specified -> the Operator will do the request to the scaler and reply back to the KEDA Metrics Server (see the sketch after the task list below)

Tasks to do:
- Merge `GetMetrics()` and `IsActive()` to `GetMetricsAndActivity()` #4015
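To make the branching above concrete, here is a hedged sketch of an Operator-side handler: `useCachedMetrics: true` serves the value refreshed by the polling loop, otherwise the scaler is queried on demand. All types here (`MetricsRequest`, `metricStore`, `scalerQuerier`) are illustrative placeholders, not the actual KEDA gRPC API.

```go
// operator sketches the Operator side of the gRPC metrics exchange.
package operator

import "context"

type MetricsRequest struct {
	Namespace        string
	ScaledObjectName string
	MetricName       string
	UseCachedMetrics bool
}

// metricStore holds the values written by the Operator's polling loop.
type metricStore interface {
	Get(req MetricsRequest) (float64, bool)
}

// scalerQuerier performs an on-demand query against the external service.
type scalerQuerier interface {
	Query(ctx context.Context, req MetricsRequest) (float64, error)
}

type MetricsService struct {
	store   metricStore
	scalers scalerQuerier
}

func (s *MetricsService) GetMetrics(ctx context.Context, req MetricsRequest) (float64, error) {
	if req.UseCachedMetrics {
		// useCachedMetrics: true -> hit the cache filled on every pollingInterval tick.
		if v, ok := s.store.Get(req); ok {
			return v, nil
		}
		// Fall through on a cache miss (e.g. first request after startup).
	}
	// useCachedMetrics: false or not specified -> query the scaler and reply
	// back to the Metrics Server in the gRPC response.
	return s.scalers.Query(ctx, req)
}
```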