I am on an HPC cluster where I am trying to resume a pre-existing dorado run. The system can pre-empt my job if a higher-priority one gets queued. The run was going fine for 8 hours, then got pre-empted. On requeue, my HPC script launches dorado again with the "--resume-from" option, but dorado fails to start up again on resume; the expected behavior is that it simply continues. I've used this framework successfully for other runs that do not use --mm2-opts.
The HPC log:
(base) [billylau@sh02-ln02 login /scratch/groups/hanleeji/20241221_PRM_placeholderRNA]$ cat slurm-57553328.out
basecalls do not exist
[2024-12-30 01:25:39.891] [info] Running: "basecaller" "sup" "Seq_Output/20241221_1125_3H_PBA28444_cc23ecf0/" "--mm2-opts" "-x splice --junc-bed /home/groups/hanleeji/Resources/gencode.v47.annotation.junc.bed --secondary=no" "--recursive" "--min-qscore" "7" "--no-trim" "--kit-name" "EXP-NBD196" "--reference" "/home/groups/hanleeji/Resources/hs38_naa.fna"
[2024-12-30 01:25:41.189] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-12-30 01:25:41.192] [info] - downloading [email protected] with httplib
[2024-12-30 01:25:45.965] [info] > Creating basecall pipeline
[2024-12-30 01:25:48.802] [info] Calculating optimized batch size for GPU "NVIDIA A100-SXM4-40GB" and model /scratch/groups/hanleeji/20241221_PRM_placeholderRNA/.temp_dorado_model-4f25f59ad8abf8a7/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-12-30 01:25:57.276] [info] cuda:0 using chunk size 11520, batch size 512
[2024-12-30 01:25:57.710] [info] cuda:0 using chunk size 5760, batch size 576
slurmstepd: error: *** JOB 57553328 ON sh03-14n09 CANCELLED AT 2024-12-30T09:18:44 DUE TO PREEMPTION ***
basecalls exists - resuming
[2024-12-30 09:37:14.183] [info] Running: "basecaller" "sup" "Seq_Output/20241221_1125_3H_PBA28444_cc23ecf0/" "--mm2-opts" "-x splice --junc-bed /home/groups/hanleeji/Resources/gencode.v47.annotation.junc.bed --secondary=no" "--recursive" "--min-qscore" "7" "--no-trim" "--resume-from" "dorado/20241221_1125_3H_PBA28444_cc23ecf0/old.bam" "--kit-name" "EXP-NBD196" "--reference" "/home/groups/hanleeji/Resources/hs38_naa.fna"
[2024-12-30 09:37:15.346] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-12-30 09:37:15.350] [info] - downloading [email protected] with httplib
[2024-12-30 09:37:20.278] [info] > Creating basecall pipeline
[2024-12-30 09:37:23.217] [info] Calculating optimized batch size for GPU "NVIDIA A100-SXM4-80GB" and model /scratch/groups/hanleeji/20241221_PRM_placeholderRNA/.temp_dorado_model-11523696db1f8c87/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-12-30 09:37:30.946] [info] cuda:0 using chunk size 11520, batch size 864
[2024-12-30 09:37:31.626] [info] cuda:0 using chunk size 5760, batch size 864
[2024-12-30 09:38:41.222] [info] > Inspecting resume file...
[2024-12-30 09:38:42.285] [error] finalise() not called on a HtsFile.
[2024-12-30 09:38:42.313] [error] Too few arguments for '--mm2-opts'.
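For context, the "basecalls do not exist" / "basecalls exists - resuming" lines above come from the wrapper script around dorado, which is not reproduced in this report. The sketch below is only an illustrative reconstruction of that resume logic; file names such as calls.bam and the #SBATCH directive are assumptions, not the actual script:

#!/bin/bash
#SBATCH --requeue                         # let Slurm requeue the job after preemption

POD5_DIR=Seq_Output/20241221_1125_3H_PBA28444_cc23ecf0/
OUT_DIR=dorado/20241221_1125_3H_PBA28444_cc23ecf0
OUT_BAM=$OUT_DIR/calls.bam                # illustrative name; only old.bam appears in the log

DORADO_ARGS=(sup "$POD5_DIR" --recursive --min-qscore 7 --no-trim
             --kit-name EXP-NBD196
             --mm2-opts "-x splice --junc-bed /home/groups/hanleeji/Resources/gencode.v47.annotation.junc.bed --secondary=no"
             --reference /home/groups/hanleeji/Resources/hs38_naa.fna)

if [ ! -f "$OUT_BAM" ]; then
    echo "basecalls do not exist"
    dorado basecaller "${DORADO_ARGS[@]}" > "$OUT_BAM"
else
    echo "basecalls exists - resuming"
    mv "$OUT_BAM" "$OUT_DIR/old.bam"      # keep the partial output from the pre-empted run
    dorado basecaller "${DORADO_ARGS[@]}" --resume-from "$OUT_DIR/old.bam" > "$OUT_BAM"
fi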
Hi @billytcl, we are aware of this issue, but unfortunately we didn't manage to get a fix into the 0.9.0 release. We'll hopefully get a fix in soon.
The workaround you reference from the original ticket should work.
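For anyone hitting this before a fix lands: the original ticket isn't quoted here, so the exact workaround isn't reproduced. However, the failure appears only after "Inspecting resume file...", which suggests the quoted --mm2-opts string recorded with the resume data is being re-split on resume. One approach consistent with that is to keep alignment out of basecalling altogether, so --mm2-opts never ends up in the recorded command line, and align in a separate step afterwards. A rough sketch, assuming your installed dorado's aligner subcommand accepts --mm2-opts (check dorado aligner --help):

# Basecall without --reference/--mm2-opts so the recorded command line never
# contains the quoted minimap2 option string.
# (On a pre-empted restart, add: --resume-from dorado/20241221_1125_3H_PBA28444_cc23ecf0/old.bam)
dorado basecaller sup Seq_Output/20241221_1125_3H_PBA28444_cc23ecf0/ \
    --recursive --min-qscore 7 --no-trim --kit-name EXP-NBD196 \
    > dorado/20241221_1125_3H_PBA28444_cc23ecf0/calls.unaligned.bam

# Align the finished calls in a separate step
dorado aligner \
    --mm2-opts "-x splice --junc-bed /home/groups/hanleeji/Resources/gencode.v47.annotation.junc.bed --secondary=no" \
    /home/groups/hanleeji/Resources/hs38_naa.fna \
    dorado/20241221_1125_3H_PBA28444_cc23ecf0/calls.unaligned.bam \
    > dorado/20241221_1125_3H_PBA28444_cc23ecf0/calls.bam

Note this only helps for runs started without --mm2-opts; an old.bam that already records --mm2-opts in its header will presumably still trip the same parse error on resume.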