
Not replacing pruned images #843

Open
ppapp92 opened this issue Nov 19, 2024 · 9 comments · May be fixed by #847

Comments

@ppapp92

ppapp92 commented Nov 19, 2024

We're currently using k8s-image-swapper to proxy all the images in our cluster to ECR. I have set up a lifecycle policy that deletes images 30 days after push.

However, once the image gets pruned by ECR, k8s-image-swapper won't replace it when the image is requested again, causing an image pull error because it no longer exists in ECR. The only fix I've found is to restart the k8s-image-swapper pods; after that, the image gets pulled and pushed to ECR correctly.

@estahn
Owner

estahn commented Nov 19, 2024

@ppapp92 The reason for this behaviour is that k8s-image-swapper caches the existence of that image for a period of time (24h 🤔) to reduce strain on ECR:

func (e *ECRClient) ImageExists(ctx context.Context, imageRef ctypes.ImageReference) bool {
    ref := imageRef.DockerReference().String()
    if _, found := e.cache.Get(ref); found {
        log.Ctx(ctx).Trace().Str("ref", ref).Msg("found in cache")
        return true
    }
    // ... on a cache miss, ECR is queried and the result is cached
}

Restarting k8s-image-swapper purges the in-memory cache, so the next request discovers the image no longer exists.

ECR doesn't signal when its lifecycle policy runs, so as far as I know, polling is the only way to get that information.

Possible option to resolve your issue:

  1. Change the ECR lifecycle policy to retain the last image, and have a separate process check whether an image is still in use and clean it up if not.
  2. Disable the in-memory cache - this feature would need to be added.

@ahatzz11

ahatzz11 commented Nov 26, 2024

@estahn Hello!

Today we spent a couple of hours debugging an issue where one of the images in our Google Artifact Registry was removed per our cleanup policy (which we have now increased). We had GKE nodes that were unable to start because of this exact same issue (the coincidence of this being the most recent issue when I looked gave us a good laugh). Having the ability to either disable the in-memory cache or tune its duration (maybe 1-2 hours?) would be awesome. Add us to the list of people wanting more options around this!

edit: I've given this a shot in #847, let me know what you think!

@migueleliasweb

migueleliasweb commented Nov 26, 2024

Just a perspective from someone that just started using this project:

It seems like the underlying challenge is that there's no bridge between the external purge management (ECR policies and whatnot) and the internal cache that is kept in memory.

I had a conversation with one of my teammates about this and, having built a few Kubernetes Operators in my lifetime, it seems that two things are missing to make this issue go away completely:

  • An internal Go routine, periodically triggered, that warms up the cache from scratch from the external registry
  • A way to determine which swapped images are no longer present in the remote registry by listening to pod events (more specifically, the "ImagePullBackOff" pod state)
    • By listening to these events, you could even evict a pod that hits that issue to ensure no long-term disruption occurs

With these two processes in place, there's very little need for any type of TTL that attempts to keep the two sources (cache and external OCI registry) in sync. On the other hand, this would bring the project surprisingly close to a fully fledged operator, as it would have to use libraries such as https://github.com/kubernetes/client-go/tree/master/informers and possibly https://github.com/kubernetes-sigs/controller-runtime.

This is a very interesting project, so I gotta say congrats for the initiative 😃 .

@estahn
Owner

estahn commented Nov 27, 2024

@ahatzz11 I do appreciate the initiative and time you've taken to contribute a PR. The PR itself looks great (code, docs, tests, etc.) and is something we could possibly merge. However, I think this is going in the wrong direction: whatever TTL the cache is set to, the aforementioned failure will eventually occur, because the "cleaning" process runs at an unpredictable time (at least for ECR).

The diagram below hopefully shows that deleting an image from the registry will always cause a failure due to the internal cache, regardless of the TTL.

@migueleliasweb Awesome! I really like the idea of using ImagePullBackOff as a signal to invalidate the cache for a particular image. This would also avoid any further dependency on the image registry or its cleaning policies and keep everything within the "Kubernetes world".

sequenceDiagram
    Kubernetes API->>+k8s-image-swapper: 1. Webhook request
    alt Image exists according to cache?
    else Image exists according to image registry?
        k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists    
    else
        k8s-image-swapper->>Image Registry: Push image to registry
        k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists    
    end
    k8s-image-swapper->k8s-image-swapper: Replace image in webhook response
    k8s-image-swapper->>-Kubernetes API: 1. Webhook response
    Image Registry "Cleaner"->>Image Registry: Delete images
    Kubernetes API->>+k8s-image-swapper: 2. Webhook request
    alt Image exists according to cache?
    else Image exists according to image registry?
        k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists    
    else
        k8s-image-swapper->>Image Registry: Push image to registry
        k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists    
    end
    k8s-image-swapper->k8s-image-swapper: Replace image in webhook response
    k8s-image-swapper->>-Kubernetes API: 2. Webhook response

@migueleliasweb

migueleliasweb commented Nov 27, 2024

@estahn I love those flowcharts from MermaidJS 🤩

However, I think this is going in the wrong direction: whatever TTL the cache is set to, the aforementioned failure will eventually occur, because the "cleaning" process runs at an unpredictable time (at least for ECR).

The diagram below hopefully shows that deleting an image from the registry will always cause a failure due to the internal cache, regardless of the TTL.

That's precisely my point. Mitigations can be made, but there will always be some delay/disruption if we go in this direction.

Ps: Do you think by listening to pod events and periodically performing a full-sync from the ECR registry this problem would vanish? I personally think so.

Ps2: Is there a reason not to cache the whole ECR repo's tags (maybe with an option for a filter)? Even with many thousands of image tags as strings, something like a sync.Map could handle it in under a hundred megs, I reckon. Maybe that's an option too?

@estahn
Owner

estahn commented Nov 27, 2024

Ps: Do you think by listening to pod events and periodically performing a full-sync from the ECR registry this problem would vanish? I personally think so.

I think listening to the pod event as you described and either purging the entire cache or a specific cache entry solves the issue.

I would argue it's unlikely you need all the images in the registry. The usual scenario is moving from one image version to the next, so I don't see much benefit in holding it all in cache. If there's a good argument to be made, I'm not against it.

Ps2: Is there a reason why not to cache the whole ECR repo tags (maybe with an option for a filter)? Even with many thousands of images/tags as strings, something like a sync.Map would be able to handle it with under a hundred megs, I reckon. Maybe that's an option too?

There wasn't a need to, as far as I'm concerned. It also keeps things lean and efficient :)

@ahatzz11

ahatzz11 commented Dec 3, 2024

@estahn

However I think this is going in the wrong direction.

I agree 😄 Having the cache at all introduces the opportunity for it to be wrong. I figured that being able to set it to something much shorter than 24 hours, like 30 minutes, would at least shrink the window where the cache is wrong, maybe to the point of being unnoticeable; in our situation it took us a couple of hours to figure out what was even wrong. Listening to pod events would be spectacular, but it's much more of a lift than I have time to put in. If this is something you do, we would love to test it out!

@ppapp92
Author

ppapp92 commented Dec 5, 2024

TTL is probably the easiest fix, but a more comprehensive solution makes sense long term. For our use case right now, even if it had to check ECR every time a new pod was deployed, the number of API calls to ECR would be pretty low.

@ahatzz11

ahatzz11 commented Dec 9, 2024

@estahn What are your thoughts on offering this caching configuration as a short-to-medium-term improvement, with the long-term plan being more tied to k8s events and truly dynamic? It would definitely help out our organization.
