double close in `refreshScalers()` #6273

latk · 2024-10-25T15:38:21Z

Report

I am seeing spurious panics where Keda tries to close an already-closed scaler. This points towards slightly incorrect handling of failing scalers in refreshScaler().

Expected Behavior

Scalers have a clear lifecycle and are always closed exactly once. If scalers fail, they are restarted safely.

Actual Behavior

Scalers can be double-closed, causing a panic.

I stumbled across this problem while investigating a slightly different error, that some scalers seem to occasionally stop producing new metrics. This seems to be correlated with transient scaler errors. I believe – but don't have evidence for this – that this relates to the same refreshScaler() code path, just with a single-close instead of a double-close?

Steps to Reproduce the Problem

Use the Kafka scaler
Be unlucky and experience transient errors with the Kafka broker, causing Keda to go into the refreshScaler() code path multiple times for the same scaler instance.

Logs from KEDA operator

 panic: close of closed channel

goroutine 388 [running]:
github.com/Shopify/sarama.(*client).Close(0xc0009050e0)
	/workspace/vendor/github.com/Shopify/sarama/client.go:271 +0x66
github.com/Shopify/sarama.(*clusterAdmin).Close(0xc00146c1d0?)
	/workspace/vendor/github.com/Shopify/sarama/admin.go:188 +0x22
github.com/kedacore/keda/v2/pkg/scalers.(*kafkaScaler).Close(0xc000c20018?, {0x0?, 0xc002ee6cc0?})
	/workspace/pkg/scalers/kafka_scaler.go:561 +0x2d
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).refreshScaler(0xc000415780, {0x474cfc8?, 0xc002114690?}, 0x0)
	/workspace/pkg/scaling/cache/scalers_cache.go:227 +0x324
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler(0xc000415780, {0x474cfc8, 0xc002114690}, 0x0, {0xc0028e7600, 0x20})
	/workspace/pkg/scaling/cache/scalers_cache.go:135 +0x1bf
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState(0x0?, {0x474cfc8, 0xc002114690}, 0xc00127ce00)
	/workspace/pkg/scaling/scale_handler.go:572 +0xb4a
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers(0xc0006e61c0, {0x474cfc8, 0xc002114690}, {0x3e7de80?, 0xc00127ce00?}, {0x4736b50, 0xc00293a510})
	/workspace/pkg/scaling/scale_handler.go:236 +0x271
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop(0x0?, {0x474cfc8, 0xc002114690}, 0xc001df9b80, {0x3e7de80, 0xc00127ce00}, {0x4736b50, 0xc00293a510}, 0x0?)
	/workspace/pkg/scaling/scale_handler.go:175 +0x405

Unfortunately the available log seems to be misconfigured, and I can't identify the error messages that led up to this failure.

KEDA Version

2.11.1

Kubernetes Version

None

Platform

None

Scaler Details

Kafka

Anything else?

While the error has been observed in Keda 2.11.1, the relevant code hasn't changed since then. Here is the relevant part of the refreshScaler() function, last changed with PR #4093:

keda/pkg/scaling/cache/scalers_cache.go

Lines 149 to 163 in b2ce95d

    
           sb := c.Scalers[id] 
        
           defer sb.Scaler.Close(ctx) 
        
           ns, sConfig, err := sb.Factory() 
        
           if err != nil { 
        
           	return nil, err 
        
           } 
        
           if id < 0 || id >= len(c.Scalers) { 
        
           	return nil, fmt.Errorf("scaler with id %d not found, len = %d, cache has been probably already invalidated", id, len(c.Scalers)) 
        
           } 
        
           c.Scalers[id] = ScalerBuilder{ 
        
           	Scaler:       ns, 
        
           	ScalerConfig: *sConfig, 
        
           	Factory:      sb.Factory, 
        
           }

One interpretation: resource handling problem.

There are two relevant resources:

the old sb.Scaler, which is always closed via the defer
the new scaler ns, which is never closed

This is fine along the happy path: the scaler-builder is replaced with the new scaler, and the old scaler is closed.

However, things break if sb.Factory() returns an error, or if the bounds check fails:

old scaler is not replaced and still closed, leaving the closed scaler in the cache
new scaler fails to be closed

The next time metrics are fetched from the (now closed) scaler, there will be errors, and we will enter the refreshScaler() function again, and will close the old scaler for the second time.

I'm not experienced enough in Go to know how to fix this. I assume that the sb.Scaler.Close() should be moved until after registering the new ScalerBuilder, essentially reverting that part of #4093. And there should probably be a ns.Close() when returning an error from the bounds check. I am also confused by the existence of the bounds check, as a similar check is done a few lines before the quoted snippet. The second bounds check – and with it the diverging control flow – would probably be unnecessary if mutating the sb object in-place. Sketch:

sb := c.Scalers[id]
oldScaler := sb.Scaler

ns, sConfig, err := sb.Factory() 
if err != nil { 
    return nil, err 
}

sb.Scaler = oldScaler
sb.ScalerConfig = sConfig

oldScaler.Close()

Another interpretation: concurrency problem.

Reading through the Sarama client.Close() code, it is only possible for this channel-close to panic if the Close() method is invoked multiple times concurrently. The remainder of the Close() method is protected by a lock. This in turn could happen if Keda can invoke the same Scaler.Close() concurrently.

The bounds check that was added in #4093 seems to try to defend against concurrent modification of the ScalersCache, but it's not entirely safe because the check + update is not atomic. The bounds check just makes it less likely that concurrent modification can crash the program.

An actual fix would likely require introducing a lock for ScalersCache modifications, in particular for the ScalersCache.refreshScaler() and ScalersCache.Close() operations. It could also help to make the ScalersCache.Scalers field private.

The text was updated successfully, but these errors were encountered:

latk added the bug Something isn't working label Oct 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

double close in `refreshScalers()` #6273

double close in `refreshScalers()` #6273

latk commented Oct 25, 2024

double close in refreshScalers() #6273

double close in refreshScalers() #6273

Comments

latk commented Oct 25, 2024

Report

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Logs from KEDA operator

KEDA Version

Kubernetes Version

Platform

Scaler Details

Anything else?

double close in `refreshScalers()` #6273

double close in `refreshScalers()` #6273