[gfd] Add option to disable automatic cleanup features file on gpu-feature-discovery exit #796

Open
belo4ya opened this issue Jul 1, 2024 · 2 comments

belo4ya (Contributor) commented Jul 1, 2024

Issue description

We use the labels exported by node-feature-discovery and gpu-feature-discovery to monitor GPU issues, including cases where the number of available GPUs on a node unexpectedly decreases. The check we rely on is: Target Number == nvidia.com/gpu.count == Node Allocatable.

We have noticed that sometimes after restarting gpu-feature-discovery, all the features (labels nvidia.com/*) exported by gpu-feature-discovery disappear from the node for a period roughly equal to the nfd-worker sleepInterval (in our case, 1 minute). This causes false positives in our monitoring system.
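
To illustrate why the missing labels show up as false positives, here is a minimal sketch of the kind of check described above. The function name `gpuCountMatches` and the way the inputs are passed in are assumptions made for this example, not our actual monitoring code; only the label name `nvidia.com/gpu.count` comes from gpu-feature-discovery.

```go
package main

import (
	"fmt"
	"strconv"
)

// gpuCountLabel is the node label that carries the GPU count.
const gpuCountLabel = "nvidia.com/gpu.count"

// gpuCountMatches is a hypothetical helper that reports whether
// Target Number == nvidia.com/gpu.count == Node Allocatable holds.
// A missing or unparsable label is treated as a mismatch.
func gpuCountMatches(target int, labels map[string]string, allocatable int) bool {
	raw, ok := labels[gpuCountLabel]
	if !ok {
		return false
	}
	count, err := strconv.Atoi(raw)
	if err != nil {
		return false
	}
	return target == count && count == allocatable
}

func main() {
	// Normal operation: the label is present and all three values agree.
	healthy := map[string]string{gpuCountLabel: "8"}
	fmt.Println(gpuCountMatches(8, healthy, 8)) // true

	// After a gpu-feature-discovery restart the nvidia.com/* labels are gone,
	// so the check fails even though the GPUs themselves are fine.
	afterRestart := map[string]string{}
	fmt.Println(gpuCountMatches(8, afterRestart, 8)) // false
}
```

With a check like this, the alert fires for as long as the labels are absent, which matches the roughly one-minute window we observe.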

We found that this occurs because gpu-feature-discovery deletes the features.d/gfd file before terminating, unless it is running in one-shot mode. The deletion is done by the removeOutputFile function.

This behavior is very inconvenient (and undesirable) for us, especially when updating the gpu-feature-discovery version in the cluster.

Feature request

This behavior was added in commit NVIDIA/gpu-feature-discovery@bc91c4a. However, I could not find an associated issue explaining why it is needed.

Could you please consider:

  1. Adding an option to disable the automatic cleanup before gpu-feature-discovery terminates, using a flag and/or environment variable (e.g., --no-cleanup-on-exit).
  2. Or removing the automatic cleanup on gpu-feature-discovery shutdown altogether.

An argument for option 2 is that node-feature-discovery does not clean up on exit either. Instead, it uses a prune-job.
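
For option 1, a rough sketch of how the exit path could be gated is shown below. This is not the actual gpu-feature-discovery code: the flag name `--no-cleanup-on-exit` is only the suggestion from this issue, the `shouldRemoveOutputFile` helper is hypothetical, and `removeOutputFile` here is a simplified stand-in that merely shares its name with the real function. The temporary directory stands in for the real features.d directory.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// shouldRemoveOutputFile is a hypothetical helper: cleanup happens only when
// not in one-shot mode (as today) and the proposed opt-out flag is not set.
func shouldRemoveOutputFile(oneshot, noCleanupOnExit bool) bool {
	return !oneshot && !noCleanupOnExit
}

// removeOutputFile is a simplified stand-in for the real function of the
// same name. A file that is already gone is not treated as an error.
func removeOutputFile(path string) error {
	err := os.Remove(path)
	if err != nil && !errors.Is(err, fs.ErrNotExist) {
		return fmt.Errorf("removing %s: %w", path, err)
	}
	return nil
}

func main() {
	oneshot := flag.Bool("oneshot", false, "label once and exit")
	// Proposed flag from this issue; the name is only a suggestion.
	noCleanup := flag.Bool("no-cleanup-on-exit", false, "keep the features file on exit")
	flag.Parse()

	// Use a temporary directory in place of the real features.d directory.
	dir, err := os.MkdirTemp("", "features.d")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer os.RemoveAll(dir)

	out := filepath.Join(dir, "gfd")
	if err := os.WriteFile(out, []byte("nvidia.com/gpu.count=8\n"), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}

	// Exit path: delete the features file only if cleanup is still enabled.
	if shouldRemoveOutputFile(*oneshot, *noCleanup) {
		if err := removeOutputFile(out); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}

	_, statErr := os.Stat(out)
	fmt.Println("features file still present:", statErr == nil)
}
```

Keeping the decision in a small pure function leaves the default behavior unchanged and makes the opt-out easy to test. An environment variable could feed the same boolean.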

belo4ya (Contributor, Author) commented Jul 10, 2024

@elezar, @klueska, @ArangoGutierrez, please take a look at this

elezar (Member) commented Aug 12, 2024

Thanks @belo4ya, I have created #899 to add this option and we can continue this discussion there.

@ArangoGutierrez, one thing I noticed is that we don't do any cleanup when the NodeFeatureAPI is used. How are labels removed in that case?
