Issue description

We use node-feature-discovery and gpu-feature-discovery to monitor GPU issues, including cases where the number of available GPUs on a node unexpectedly decreases: Target Number == nvidia.com/gpu.count == Node Allocatable.
We have noticed that sometimes, after restarting gpu-feature-discovery, all the features (nvidia.com/* labels) it exports disappear from the node for a period roughly equal to the nfd-worker sleepInterval (1 minute in our case). This causes false positives in our monitoring system.
We found that this occurs because gpu-feature-discovery deletes the features.d/gfd file before terminating when it is not running in one-shot mode (via the removeOutputFile function).
This behavior is very inconvenient (and undesirable) for us, especially when updating the gpu-feature-discovery version in the cluster.
Feature request
I found that this behavior was introduced in commit NVIDIA/gpu-feature-discovery@bc91c4a, but I could not find an associated issue justifying the need for it.
Could you please consider:

1. Adding a flag (and/or environment variable) to disable the automatic cleanup before gpu-feature-discovery terminates (e.g., --no-cleanup-on-exit).
2. Dropping the automatic cleanup on shutdown altogether.

An argument for option 2 is that node-feature-discovery does not clean up on exit; instead, it relies on a prune job.
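To make option 1 concrete, here is a minimal sketch (not the actual gpu-feature-discovery source) of how the cleanup decision could be guarded. The function name and the no-cleanup-on-exit option are hypothetical; the only behavior taken from the current code is that the output file is removed whenever GFD is not in one-shot mode.

```go
package main

import "fmt"

// shouldRemoveOutputFile is a hypothetical helper deciding whether the
// features.d/gfd file should be deleted on shutdown. Today the file is
// removed whenever GFD runs as a long-lived daemon (i.e., not one-shot);
// the proposed noCleanupOnExit option would opt out of that.
func shouldRemoveOutputFile(oneShot, noCleanupOnExit bool) bool {
	// One-shot mode already leaves the file in place; the new option
	// would preserve it in daemon mode as well.
	return !oneShot && !noCleanupOnExit
}

func main() {
	fmt.Println(shouldRemoveOutputFile(false, false)) // current daemon behavior: cleanup runs
	fmt.Println(shouldRemoveOutputFile(false, true))  // proposed opt-out: file is kept
	fmt.Println(shouldRemoveOutputFile(true, false))  // one-shot mode: file is kept
}
```

With such a guard, a restart (or version upgrade) would leave the labels file in place, so nfd-worker would keep re-exporting the nvidia.com/* labels across the sleepInterval gap.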