Add support to reconcile allocated Pod IPs. #3113
base: master
Conversation
This is done periodically. That's how the datastore keeps track of the IPs that are used and available, and makes an ENI call if an additional ENI is needed.
I guess I meant reconcile the allocated IPs against running pods. It doesn't check, in steady state, that a pod is still around while its IP is allocated. The assumption is that if the pod was deleted, IPAMD would have been called by the CNI plugin. But I think there are some cases where this assumption breaks down, like if the CNI->ipamd gRPC call fails for any reason. There are probably other edge cases, but I couldn't fully reason out what could cause this to happen.
The CNI today only reconciles its datastore with existing pods at startup, but never again. Sometimes it's possible that IPAMD goes out of sync with the kubelet's view of the pods running on the node, if IPAMD fails or is temporarily unreachable by the CNI plugin handling the DelNetwork call from the container runtime. In such cases the CNI continues to consider the pod's IP allocated and will never free it, since it will never see another DelNetwork. This results in the CNI failing to assign IPs to new pods. This change adds a reconcile loop which periodically (once a minute) reconciles the allocated IPs against the existence of the pods' veth devices. If a veth device is not found, it frees the corresponding allocation, making the IP available for reuse. Fixes aws#3109
force-pushed from cc5834c to 9d4cc1d
ping for a review!
@@ -77,6 +77,9 @@ func _main() int {
 	// Pool manager
 	go ipamContext.StartNodeIPPoolManager()
 
+	// Pod IP allocation reconcile loop to clear up dangling pod IPs.
+	go ipamContext.PodIPAllocationReconcileLoop()
As an initial review, I am hesitant to add this additional 1-minute delay for the IP sync in the reconcile loop. I am not sure of the need for this. I will try to understand the problem you encountered and see if there is any other approach to resolve it.
I am not sure what you mean by a 1-minute delay. This does not add to the reconcile loop that reconciles pod IP allocations against IPs assigned to the node; instead, it just cleans up pod IPs that may no longer be in use because the corresponding pod is gone. It is a separate goroutine running its own loop that does only that, and it should not add any delay to the regular IP reconcile loop.
Got it.
I think the need for this separate routine itself introduces some concerns here. We will look into whether this is strictly needed.
Any update on this?
What type of PR is this?
improvement
Which issue does this PR fix?:
Fixes #3109
What does this PR do / Why do we need it?:
The CNI today only reconciles its datastore with existing pods at startup, but never again. Sometimes it's possible that IPAMD goes out of sync with the kubelet's view of the pods running on the node, if IPAMD fails or is temporarily unreachable by the CNI plugin handling the DelNetwork call from the container runtime.
In such cases the CNI continues to consider the pod's IP allocated and will never free it, since it will never see another DelNetwork. This results in the CNI failing to assign IPs to new pods.
This change adds a reconcile loop which periodically (once a minute) reconciles the allocated IPs against the existence of the pods' veth devices. If a veth device is not found, it frees the corresponding allocation, making the IP available for reuse.
Fixes #3109
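A single pass of the veth-based cleanup described above could be sketched roughly like this. The names (`reconcileOnce`, the allocation map, the `free` callback) are hypothetical stand-ins for the daemon's datastore, not the PR's actual code:

```go
package main

import (
	"fmt"
	"net"
)

// hostVethExists reports whether a network device with the given name is
// present on the host. The real CNI records the host-side veth name for
// each pod it has set up.
func hostVethExists(name string) bool {
	_, err := net.InterfaceByName(name)
	return err == nil
}

// reconcileOnce walks every allocation (host veth name -> pod IP); if the
// veth device no longer exists, the pod is gone, so the IP is freed and
// the dangling allocation forgotten.
func reconcileOnce(allocations map[string]string, exists func(string) bool, free func(ip string)) {
	for veth, ip := range allocations {
		if !exists(veth) {
			free(ip)
			delete(allocations, veth) // safe: Go permits delete during range
		}
	}
}

func main() {
	// "lo" exists on any Linux host; "enideadbeef" almost certainly does
	// not, so its IP should be freed on this pass.
	allocs := map[string]string{
		"lo":          "10.0.0.5",
		"enideadbeef": "10.0.0.6",
	}
	reconcileOnce(allocs, hostVethExists, func(ip string) {
		fmt.Printf("freed dangling IP %s\n", ip)
	})
}
```

Injecting the `exists` check as a function also keeps the pass unit-testable without touching real network devices, which matches the "Added unit tests" note below.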
Testing done on this change:
Added unit tests.
Will this PR introduce any new dependencies?:
No
Will this break upgrades or downgrades? Has updating a running cluster been tested?:
No
Does this change require updates to the CNI daemonset config files to work?:
No
Does this PR introduce any user-facing change?:
No
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.