Slurm oom-kill due to memory is ignored. #5332

Open
oliverdrechsel opened this issue Sep 25, 2024 · 2 comments

Comments

@oliverdrechsel

Bug report

Expected behavior and actual behavior

Slurm jobs that run out of memory are oom-killed, and in nearly all cases this works as expected. In one awk process I run, however, excessive RAM usage leads to an oom-kill that is only logged in .command.log and otherwise ignored by the Nextflow process. The awk process is terminated prematurely, which leads to corrupted output, yet the task is still treated as successful.

Steps to reproduce the problem

The following process reproduces the issue with fastq.gz files containing 20 million reads or more.

process count_reads {

    label "count_reads"

    publishDir path: "${params.analysesdir}/${stage}/${sample}/", pattern: "*.csv", mode: "copy"

    // SLURM cluster options
    cpus 1
    memory "5 GB"
    time "1h"

    tag "readcount_${sample}"

    input:
        tuple val(sample), path(reads)
        val(stage)
        
    output:
        tuple val(sample), path("${sample}_read_count.csv"), emit: read_count
        
    script:
        """
            zless ${reads[0]} | awk 'END {printf "%s", "${sample},"; printf "%.0f", NR/4; print ""}' > ${sample}_read_count.csv

        """

    stub:
        """
            mkdir -p ${sample}
            touch ${sample}_read_count.csv
        """
}
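
One detail that may matter here (an assumption on my part, not something confirmed in the logs): the script is a zless | awk pipeline, and Bash takes the exit status of a pipeline from its last command. As far as I know, Nextflow's default task shell is /bin/bash -ue, without pipefail, so if the oom-killer terminates zless while awk still finishes normally, .command.sh exits 0 and the task is reported as successful. A minimal sketch of the effect, with "false" standing in for the killed zless:

set -ue                          # Nextflow's default task shell flags: no pipefail

# 'false' stands in for zless being terminated by the oom-killer mid-stream
false | awk 'END { printf "%.0f\n", NR/4 }' > /dev/null
echo "pipeline exit status: $?"  # prints 0, so the task looks successful

set -o pipefail
false | awk 'END { printf "%.0f\n", NR/4 }' > /dev/null \
    || echo "with pipefail the failure propagates"

This would be consistent with the oom-kill showing up only in .command.log while the task itself is still marked as successful.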

Program output

In the .nextflow.log the jobs look as if they were successful.

$ cat .command.log
slurmstepd-hpc-...: error: Detected 1 oom_kill event in StepId=71xxxxx.batch. Some of the step tasks have been OOM Killed.
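
One way to see the mismatch directly (assuming the usual Nextflow task work directory layout) is to compare the exit status Nextflow recorded for the task with what Slurm recorded for the same job:

$ cat .exitcode                                     # exit status Nextflow acts on
$ sacct -j 71xxxxx --format=JobID,State,ExitCode    # Slurm's view of the same job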

Environment

  • Nextflow version: 24.04.2
  • Java version: openjdk 21-internal 2023-09-19
  • Operating system: Linux
  • Bash version: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)


@bentsherman
Member

Some questions:

  • Are you using the scratch directive for local scratch storage?
  • Can you share the output/error log for the slurm job?

I'm wondering if the failure happened with the process script or with the copying of task outputs. Possibly related to #3711
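
For context, the scratch directive stages each task into node-local storage and copies the declared outputs back to the shared work directory when the task finishes, so a problem at that copy step is one way an otherwise successful script could go wrong. A minimal, illustrative way to enable it (not part of the reported pipeline):

// nextflow.config (illustrative) — run tasks in node-local scratch storage;
// declared outputs are copied back to the work directory on completion
process.scratch = true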

@oliverdrechsel
Author

Hi @bentsherman

  • No, I'm not using the scratch directive.
  • The .command.log is given in the bug report above.

Do you mean this output?

$ sjob 716xxxx

JobID                  : 716xxxx
Job name               : nf-ASSEMBLY_FILTER_count_trimmed_reads_(readcount_CK_47_030432_T_S)
State                  : OUT_OF_MEMORY
Reason | ExitCode      : None | 0:125
Priority               : 1721

SubmitLine             : sbatch .command.run
WorkDir                : /scratch/xxxx/188

Start Time             : 2024-09-23 06:16:07
End Time               : 2024-09-23 06:17:57

UserID                 : dxxxx
Account                : xxx
Partition              : main

Requested TRES         : billing=1,cpu=1,gres/local=10,mem=5G,node=1
Nodelist               : hpc-node03

TIME requested         :            01:00:00
TIME elapsed           :            00:01:50
TIME request efficiency:                   3%    [ 00:01:50 / 01:00:00 ]
TIME overbook          :                  31x    [ 01:00:00 / 00:01:50 - 1 ]

MEM requested          :                   5G
MEM max RSS            :                   5G
MEM request efficiency :                  91%    [ 4,575,784K / 5G ]
MEM overbook           :                   9%    [ 5G / 4,575,784K - 1 ]

CPUs requested         :                   1
CPUs allocated         :                   2     [ number of threads filled up to complete cores ]
CPU total time usage   :            00:02:17
CPU load average       :                   1.245 [ 00:02:17 / 00:01:50 ]
CPU request efficiency :                 124%    [ 00:02:17 / 00:01:50 / 1 ]
CPU alloc.  efficiency :                  62%    [ 00:02:17 / 00:01:50 / 2 ]

Disk Read Max          :            9,771.89M
Disk Write Max         :            8,834.29M

TRES Usage IN max      : cpu=00:02:18,energy=0,fs/disk=10246572489,mem=4575784K,pages=0,vmem=4649776K
TRES Usage OUT max     : energy=0,fs/disk=9263426962

[ locale settings LC_NUMERIC="en_US.UTF-8": decimal_point="." | thousands_sep="," ]

I doubt that this failure is linked to the output copying step, because it happens well before that. As far as I can tell, the job is killed by Slurm while it is still running, yet the output is generated as if the kill had never happened.
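
If the pipeline exit status is indeed what hides the kill (still an assumption, as noted in the bug report), one possible mitigation would be to enable pipefail for all task scripts via the shell setting, e.g. in nextflow.config:

// nextflow.config (illustrative) — a producer killed mid-pipeline, such as an
// oom-killed zless, then propagates a non-zero exit status to .command.sh
process.shell = ['/bin/bash', '-euo', 'pipefail']

That would at least make the task fail instead of silently writing a truncated count, even if it does not address why the OUT_OF_MEMORY state itself is not picked up.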
