Hi,

We have run into an issue when draining a Kubernetes node group where `k8s-image-swapper` pods are running alongside a bunch of other pods: it is not able to swap the image paths for all pods in time.
So I can see that some images have been missed:
```sh
kubectl get pods -A -o jsonpath="{.items[*].spec['initContainers', 'containers'][*].image}" | tr -s '[[:space:]]' '\n' | sort | uniq -c | egrep -v '(dkr.ecr|public.ecr)'
```
Of course, a PDB has been enabled and set to `3`, `podAntiAffinity` in `preferred` mode has been configured as well, and even `priorityClassName: system-cluster-critical` has been added (we did not expect it to help much, though).
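For context, this is roughly what that looks like on our side. It is only a sketch: the labels, the `minAvailable` interpretation of "set to 3", and the topology key are illustrative, not copied from the chart:

```yaml
# Sketch of our disruption/scheduling settings (illustrative names and labels).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: k8s-image-swapper
spec:
  minAvailable: 3                      # "pdb set to 3" -- assuming minAvailable here
  selector:
    matchLabels:
      app.kubernetes.io/name: k8s-image-swapper
---
# Relevant fragment of the pod spec
spec:
  priorityClassName: system-cluster-critical
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: k8s-image-swapper
```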
The total number of running pods is 4. I suppose we could increase the replica count and the PDB accordingly even further, and that might help, but I see no point in such inefficient scaling. I am fairly sure the reason is in the `livenessProbe` and `readinessProbe`: they return success too fast, while the service is not yet able to handle requests.
https://github.com/estahn/charts/blob/main/charts/k8s-image-swapper/templates/deployment.yaml#L76
Could you please check these probes? Do they represent the service health correctly?
Making this block configurable in the Helm chart might be worth considering too: I guess increasing `successThreshold` from the default `1` to `2` could have some impact, since it would give the service more time to initialize.
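Roughly what I have in mind, to illustrate (the path and port below are placeholders; the actual probe spec is in the deployment template linked above):

```yaml
# Illustrative probe block only -- placeholder path/port, not copied from the chart.
readinessProbe:
  httpGet:
    path: /healthz           # placeholder endpoint
    port: 8443               # placeholder port
  initialDelaySeconds: 5
  periodSeconds: 5
  successThreshold: 2        # proposed change (default is 1)
livenessProbe:
  httpGet:
    path: /healthz           # placeholder endpoint
    port: 8443               # placeholder port
  periodSeconds: 10
  successThreshold: 1        # Kubernetes requires 1 for liveness probes
```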
Almost forgot to mention: we were able to mitigate the issue by running `kubectl cordon` and `kubectl rollout restart deployment k8s-image-swapper`, and only after that running `kubectl drain`. This buys the service some time to start, but we would be very happy to drop this additional logic.
Many thanks for your work!