Is your feature request related to a problem? Please describe.
We sometimes get "no space left on device" errors when running high-throughput jobs on our FSx for Lustre filesystem. This happens most often when the filesystem is also low on space.
There is an AWS documentation page about this error, see here. It suggests fixing it by setting the following on the host: sudo lctl set_param osc.*.max_dirty_mb=64.
Describe the solution you'd like in detail
Our idea was to fix this by running an init container, similar to this example, on the node DaemonSet. This way we can be sure the setting is applied before our actual workloads start; see the sketch below.
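For concreteness, a minimal sketch of what I have in mind. The container name and image are placeholders, and this assumes the node DaemonSet can carry a privileged init container whose image ships the Lustre client tools (so lctl is available):

```yaml
# Illustrative only: names and image are placeholders, not the driver's actual manifest.
initContainers:
  - name: set-lustre-max-dirty-mb           # hypothetical name
    image: <image-with-lustre-client-tools> # must provide the lctl binary
    securityContext:
      privileged: true                      # lctl writes host-level Lustre kernel module parameters
    command:
      - /bin/sh
      - -c
      # Apply the workaround from the AWS docs before the node plugin starts.
      - lctl set_param osc.*.max_dirty_mb=64
```

Since osc.*.max_dirty_mb is a host-wide Lustre client setting, running this once per node from the DaemonSet would cover every pod scheduled on that node.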
Describe alternatives you've considered
Running an initContainer on each of our workload pods. But since we run hundreds of pods, many of them sharing the same node, this is not practical.
Would this be a reasonable approach to fixing the problem? If so, I would raise a PR to add support for an initContainer on the node DaemonSet.
Hi @jon-rei, sorry for taking so long to respond. I think this approach makes sense, since Lustre functionality should be maintained in the minimal base image, and it saves every workload pod from having to run this init container itself.