Is your feature request related to a problem? Please describe.
We sometimes get "no space left on device" errors when running high-throughput jobs on our FSx for Lustre filesystem. This happens most often when the filesystem is also low on space.
There is an AWS documentation page about this error, see here. It suggests fixing it by setting the following on the host: sudo lctl set_param osc.*.max_dirty_mb=64.
Describe the solution you'd like in detail
Our idea was to fix this by running an init container, similar to this example, on the node DaemonSet. This way we can be sure the setting is applied before our actual workloads start; see the sketch below.
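For concreteness, a minimal sketch of what I have in mind. The container name and image are placeholders, and this assumes the node DaemonSet can carry a privileged init container whose image ships the Lustre client tools (so lctl is available):

```yaml
# Illustrative only: names and image are placeholders, not the driver's actual manifest.
initContainers:
  - name: set-lustre-max-dirty-mb           # hypothetical name
    image: <image-with-lustre-client-tools> # must provide the lctl binary
    securityContext:
      privileged: true                      # lctl writes host-level Lustre kernel module parameters
    command:
      - /bin/sh
      - -c
      # Apply the workaround from the AWS docs before the node plugin starts.
      - lctl set_param osc.*.max_dirty_mb=64
```

Since osc.*.max_dirty_mb is a host-wide Lustre client setting, running this once per node from the DaemonSet would cover every pod scheduled on that node.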
Describe alternatives you've considered
Running an initContainer on each of our workload pods. But since we run hundreds of pods, many of them sharing the same node, this is not practical.
Would this be a reasonable approach to fixing the problem? If so, I would raise a PR to add support for an initContainer on the node DaemonSet.
Hi @jon-rei, sorry for taking so long to respond. I think this approach makes sense, since Lustre functionality should be maintained in the minimal base image, and it saves every workload pod from having to run this init container itself.