dfilemaker triggering OOM while writing content #611

Open
ofaaland opened this issue Nov 26, 2024 · 0 comments

ofaaland (Collaborator) commented Nov 26, 2024

When run on a single node allocated by slurm, dfilemaker can trigger an OOM while writing content to files, and it is then killed. I'm not sure whether this is a flaw in dfilemaker or a flaw in something else (e.g. the slurm config).

The run below was on mutt, on a node allocated with "salloc -N1"; the target file system was Lustre.

The mpifileutils version was:

* eb57445 (HEAD -> b-bad-option, olagit/b-bad-option) dfilemaker: remove duplicate longopts struct
* 999ecff dfilemaker: fail and stop execution on unrecognized option

and the command run was:

bash-4.4$ srun -n32 ~/projects/mfu-install/bin/dfilemaker --fill=alternate --depth=1-30 -nitems=10000-$((10*1000*1000)) --verbose
[2024-11-25T17:04:49] Creating 1429103 directories
[2024-11-25T17:04:59] Created 144290 directories (10%) in 10.056 secs (14348.591 dirs/sec) 90 secs left ...
[2024-11-25T17:05:09] Created 293797 directories (21%) in 20.114 secs (14606.766 dirs/sec) 78 secs left ...
...
[2024-11-25T17:07:41] Created 1278791 items (89%) in 70.013 secs (18264.982 items/sec) 8 secs left ...
[2024-11-25T17:07:51] Created 1425534 items (100%) in 80.010 secs (17817.046 items/sec) 0 secs left ...
[2024-11-25T17:07:52] Created 1429953 items (100%) in 81.207 secs (17608.759 items/sec) done
[2024-11-25T17:07:52] Writing content to files.
slurmstepd: error: Detected 1 oom_kill event in StepId=60053.4. Some of the step tasks have been OOM Killed.
srun: error: mutt11: task 10: Out Of Memory
srun: First task exited 30s ago
srun: StepId=60053.4 tasks 0-9,11-24,26-31: running
srun: StepId=60053.4 tasks 10,25: exited abnormally
srun: Terminating StepId=60053.4
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** STEP 60053.4 ON mutt11 CANCELLED AT 2024-11-25T17:12:22 ***
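
One possible way to tell a dfilemaker flaw from a slurm config problem is to compare the memory the step was allowed against what its tasks actually used. The following is a minimal sketch, assuming Slurm accounting is enabled on the cluster; step_mem_report is a hypothetical helper name, not part of mpifileutils or Slurm, and the sacct fields are just one reasonable selection.

```shell
#!/bin/sh
# Hypothetical helper (not part of mpifileutils): print memory accounting
# for one Slurm job step. Assumes sacct accounting data is available.
step_mem_report() {
    step_id="$1"
    # ReqMem is the requested memory; MaxRSS is the largest sampled
    # resident set size of any task in the step, with MaxRSSTask and
    # MaxRSSNode identifying where that peak was seen.
    sacct -j "$step_id" \
        --format=JobID,State,ReqMem,MaxRSS,MaxRSSTask,MaxRSSNode
}

# Example for the step in the log above:
#   step_mem_report 60053.4
```

MaxRSS is sampled periodically, so a short spike just before the kill may be under-reported. If MaxRSS is far below ReqMem, the limit enforcement would be worth a closer look; if it is close to ReqMem, dfilemaker's memory use during the write phase is the more likely suspect.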
@ofaaland ofaaland self-assigned this Nov 26, 2024
@ofaaland ofaaland changed the title dfilemaker triggering OOM dfilemaker triggering OOM while writing content Nov 26, 2024