Thick plugin graceful termination #1338

Open: dougbtv wants to merge 1 commit into k8snetworkplumbingwg:master from dougbtv:thickplugin_graceful_term2

@dougbtv dougbtv commented Sep 19, 2024

This PR introduces graceful shutdown functionality to the Multus daemon by adding a /readyz endpoint alongside the existing /healthz. The /readyz endpoint starts returning 500 once a SIGTERM is received, indicating the daemon is in shutdown mode. During this time, CNI requests can still be processed for a short window. The daemonset configs have been updated to increase terminationGracePeriodSeconds from 10 to 30 seconds, ensuring we have a bit more time for these clean shutdowns.

This addresses a race condition during pod transitions where the readiness check might return true, but a subsequent CNI request could fail if the daemon shuts down too quickly. By introducing the /readyz endpoint and delaying the shutdown, we can handle ongoing CNI requests more gracefully, reducing the risk of disruptions during critical transitions.

Major thanks to @deads2k for the find, identification, fix, and of course, the explanations. Appreciate it.

@dougbtv dougbtv force-pushed the thickplugin_graceful_term2 branch 2 times, most recently from c187bed to dcbd737 on September 19, 2024 19:04
@coveralls coveralls commented Sep 19, 2024

Coverage Status

coverage: 63.822% (-0.04%) from 63.857%
when pulling 531dec1 on dougbtv:thickplugin_graceful_term2
into f1e887e on k8snetworkplumbingwg:master.
