
Not replacing pruned images #843

Open
ppapp92 opened this issue Nov 19, 2024 · 9 comments · May be fixed by #847

Comments

@ppapp92

ppapp92 commented Nov 19, 2024

We're currently using k8s-image-swapper to proxy all the images in our cluster to ECR. I have set up a lifecycle policy that deletes images 30 days after push.

However, once the image gets pruned by ECR, k8s-image-swapper won't replace it when the image is requested again, causing an image pull error because it no longer exists in ECR. The only fix I've found is to restart the k8s-image-swapper pods; after that, the image gets pulled and pushed to ECR correctly.

@estahn
Owner

estahn commented Nov 19, 2024

@ppapp92 The reason for this behaviour is that k8s-image-swapper caches the existence of that image for a period of time (24h 🤔) to reduce strain on ECR:

func (e *ECRClient) ImageExists(ctx context.Context, imageRef ctypes.ImageReference) bool {
    ref := imageRef.DockerReference().String()
    if _, found := e.cache.Get(ref); found {
        log.Ctx(ctx).Trace().Str("ref", ref).Msg("found in cache")
        return true
    }
    // ... on a cache miss, ECR is queried and the result is cached
}

Restarting k8s-image-swapper purges the in-memory cache, so the next request discovers the image no longer exists.

ECR doesn't signal when its lifecycle policy runs, so as far as I know, polling is the only way to get that information.

Possible option to resolve your issue:

  1. Change the ECR lifecycle policy to retain the last image, and have a separate process check whether an image is still in use and clean it up if not.
  2. Disable the in-memory cache - this feature would need to be added.

@ahatzz11

ahatzz11 commented Nov 26, 2024

@estahn Hello!

Today we spent a couple of hours debugging an issue where one of the images in our Google Artifact Registry was removed per our cleanup policy (which we have now increased). We had GKE nodes that were unable to start because of this exact same issue (the coincidence of this being the most recent issue when I looked gave us a good laugh). Having the ability to either disable the in-memory cache or tune its duration (maybe 1-2 hours?) would be awesome. Add us to the list of people wanting more options around this!

edit: I've given this a shot in #847, let me know what you think!

@migueleliasweb

migueleliasweb commented Nov 26, 2024

Just a perspective from someone that just started using this project:

It seems like the underlying challenge is that there's no bridge between the external purge management (ECR policies and whatnot) and the internal cache that is kept in memory.

I had a conversation with one of my teammates about this and, having built a few Kubernetes Operators in my lifetime, it seems that two things are missing to make this issue go away completely:

  • An internal Go routine, periodically triggered, that warms up the cache from scratch from the external registry
  • A way to determine which swapped images are no longer present in the remote registry by listening to pod events (more specifically, the "ImagePullBackOff" pod state)
    • By listening to these events, you could even evict a pod that hits that issue to ensure no long-term disruption occurs

With these two processes in place, there's very little need for any type of TTL that attempts to keep the two sources (cache and external OCI registry) in sync. On the other hand, this would bring the project surprisingly close to a fully fledged operator, as it would have to use libraries such as https://github.com/kubernetes/client-go/tree/master/informers and possibly https://github.com/kubernetes-sigs/controller-runtime.

This is a very interesting project, so I gotta say congrats for the initiative 😃 .

@estahn
Owner

estahn commented Nov 27, 2024

@ahatzz11 I do appreciate the initiative and time you've taken to contribute a PR. The PR itself looks great (code, docs, tests, etc.) and is something we could possibly merge. However, I think this is going in the wrong direction: whatever TTL the cache is set to, the aforementioned failure will eventually occur, because the "cleaning" process runs at an unpredictable time (at least for ECR).

The diagram below hopefully shows that deleting an image from the registry will always cause a failure due to the internal cache, regardless of the TTL.

@migueleliasweb Awesome! I really like the idea of using ImagePullBackOff as a signal to invalidate the cache for a particular image. This would also avoid any further dependency on the image registry or its cleaning policies and keep everything within the "Kubernetes world".

sequenceDiagram
    Kubernetes API->>+k8s-image-swapper: 1. Webhook request
    alt Image exists according to cache?
    else Image exists according to image registry?
        k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists    
    else
        k8s-image-swapper->>Image Registry: Push image to registry
        k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists    
    end
    k8s-image-swapper->k8s-image-swapper: Replace image in webhook response
    k8s-image-swapper->>-Kubernetes API: 1. Webhook response
    Image Registry "Cleaner"->>Image Registry: Delete images
    Kubernetes API->>+k8s-image-swapper: 2. Webhook request
    alt Image exists according to cache?
    else Image exists according to image registry?
        k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists    
    else
        k8s-image-swapper->>Image Registry: Push image to registry
        k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists    
    end
    k8s-image-swapper->k8s-image-swapper: Replace image in webhook response
    k8s-image-swapper->>-Kubernetes API: 2. Webhook response

@migueleliasweb

migueleliasweb commented Nov 27, 2024

@estahn I love those flowcharts from MermaidJS 🤩

However, I think this is going in the wrong direction: whatever TTL the cache is set to, the aforementioned failure will eventually occur, because the "cleaning" process runs at an unpredictable time (at least for ECR).

The diagram below hopefully shows that deleting an image from the registry will always cause a failure due to the internal cache, regardless of the TTL.

That's precisely my point. Mitigations can be made, but there will always be some delay/disruption if we go in this direction.

Ps: Do you think by listening to pod events and periodically performing a full-sync from the ECR registry this problem would vanish? I personally think so.

Ps2: Is there a reason not to cache the whole ECR repo's tags (maybe with an option for a filter)? Even with many thousands of image tags as strings, something like a sync.Map could handle it in under a hundred megs, I reckon. Maybe that's an option too?

@estahn
Owner

estahn commented Nov 27, 2024

Ps: Do you think by listening to pod events and periodically performing a full-sync from the ECR registry this problem would vanish? I personally think so.

I think listening to the pod event as you described and either purging the entire cache or a specific cache entry solves the issue.

I would argue it's unlikely you need all the images in the registry. The usual scenario is moving from one image version to the next, so I don't see much benefit in holding it all in cache. If there's a good argument to be made, I'm not against it.

Ps2: Is there a reason why not to cache the whole ECR repo tags (maybe with an option for a filter)? Even with many thousands of images/tags as strings, something like a sync.Map would be able to handle it with under a hundred megs, I reckon. Maybe that's an option too?

There wasn't a need to, as far as I'm concerned. It also keeps things lean and efficient :)

@ahatzz11

ahatzz11 commented Dec 3, 2024

@estahn

However I think this is going in the wrong direction.

I agree 😄 Having the cache at all introduces the opportunity for it to be wrong. I figured that being able to set it to something much shorter than 24 hours, like 30 minutes, would at least shrink the window where the cache is wrong, maybe to the point of being unnoticeable; in our situation it took us a couple of hours to figure out what was even wrong. Listening to pod events would be spectacular, but it's much more of a lift than I have time to put in. If this is something you do, we would love to test it out!

@ppapp92
Author

ppapp92 commented Dec 5, 2024

TTL is probably the easiest fix, but a more comprehensive solution makes sense long term. For our use case right now, even if it had to check ECR every time a new pod was deployed, the number of API calls to ECR would be pretty low.

@ahatzz11

ahatzz11 commented Dec 9, 2024

@estahn What are your thoughts on offering this caching configuration as a short-to-medium-term improvement, with the long-term plan being more tied to k8s events and truly dynamic? It would definitely help out our organization.
