config-daemon: Restart all instances of device-plugin #783

Open: zeeke wants to merge 1 commit into master

zeeke (Member) commented Oct 4, 2024

When the operator changes the device-plugin Spec (e.g. .Spec.NodeSelector), two device plugin pods may exist for a given node at the same time: one terminating and one initializing.
If the config-daemon executes restartDevicePluginPod() at that moment, it may kill the terminating pod while the initializing one keeps running with the old device plugin configuration. As a result, one or more resources may not be advertised until the device plugin is restarted manually.

Make the config-daemon restart all the device-plugin instances it finds for its own node.

github-actions bot commented Oct 4, 2024

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.

Best regards.

Signed-off-by: Andrea Panattoni <[email protected]>
zeeke force-pushed the us/multiple-device-plugins branch from f5f470b to f2fdee5 on October 4, 2024 17:13
@coveralls

Pull Request Test Coverage Report for Build 11184102549

Details

  • 11 of 25 (44.0%) changed or added relevant lines in 1 file are covered.
  • 12 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.02%) to 45.132%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/daemon/daemon.go | 11 | 25 | 44.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/daemon/daemon.go | 1 | 44.5% |
| controllers/generic_network_controller.go | 5 | 74.38% |
| controllers/sriovnetworkpoolconfig_controller.go | 6 | 63.7% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 11139435662 | -0.02% |
| Covered Lines | 6657 |
| Relevant Lines | 14750 |

💛 - Coveralls
