Too few arguments for '--mm2-opts' when resuming dorado #1204

Open
billytcl opened this issue Dec 30, 2024 · 2 comments
Labels: bug (Something isn't working), duplicate (This issue or pull request already exists)

Comments

@billytcl

Issue Report

Please describe the issue:

I am on an HPC cluster, trying to resume an interrupted dorado run. The scheduler can pre-empt my job when a higher-priority one is queued. The job ran fine for 8 hours and was then pre-empted. On requeue, my job script starts dorado again with the --resume-from option, but dorado fails during startup with "Too few arguments for '--mm2-opts'". The expected behavior is that basecalling simply continues. The same script works for other runs that do not use --mm2-opts.

The HPC log:

(base) [billylau@sh02-ln02 login /scratch/groups/hanleeji/20241221_PRM_placeholderRNA]$ cat slurm-57553328.out
basecalls do not exist
[2024-12-30 01:25:39.891] [info] Running: "basecaller" "sup" "Seq_Output/20241221_1125_3H_PBA28444_cc23ecf0/" "--mm2-opts" "-x splice --junc-bed /home/groups/hanleeji/Resources/gencode.v47.annotation.junc.bed --secondary=no" "--recursive" "--min-qscore" "7" "--no-trim" "--kit-name" "EXP-NBD196" "--reference" "/home/groups/hanleeji/Resources/hs38_naa.fna"
[2024-12-30 01:25:41.189] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-12-30 01:25:41.192] [info]  - downloading [email protected] with httplib
[2024-12-30 01:25:45.965] [info] > Creating basecall pipeline
[2024-12-30 01:25:48.802] [info] Calculating optimized batch size for GPU "NVIDIA A100-SXM4-40GB" and model /scratch/groups/hanleeji/20241221_PRM_placeholderRNA/.temp_dorado_model-4f25f59ad8abf8a7/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-12-30 01:25:57.276] [info] cuda:0 using chunk size 11520, batch size 512
[2024-12-30 01:25:57.710] [info] cuda:0 using chunk size 5760, batch size 576
slurmstepd: error: *** JOB 57553328 ON sh03-14n09 CANCELLED AT 2024-12-30T09:18:44 DUE TO PREEMPTION ***
basecalls exists - resuming
[2024-12-30 09:37:14.183] [info] Running: "basecaller" "sup" "Seq_Output/20241221_1125_3H_PBA28444_cc23ecf0/" "--mm2-opts" "-x splice --junc-bed /home/groups/hanleeji/Resources/gencode.v47.annotation.junc.bed --secondary=no" "--recursive" "--min-qscore" "7" "--no-trim" "--resume-from" "dorado/20241221_1125_3H_PBA28444_cc23ecf0/old.bam" "--kit-name" "EXP-NBD196" "--reference" "/home/groups/hanleeji/Resources/hs38_naa.fna"
[2024-12-30 09:37:15.346] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-12-30 09:37:15.350] [info]  - downloading [email protected] with httplib
[2024-12-30 09:37:20.278] [info] > Creating basecall pipeline
[2024-12-30 09:37:23.217] [info] Calculating optimized batch size for GPU "NVIDIA A100-SXM4-80GB" and model /scratch/groups/hanleeji/20241221_PRM_placeholderRNA/.temp_dorado_model-11523696db1f8c87/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-12-30 09:37:30.946] [info] cuda:0 using chunk size 11520, batch size 864
[2024-12-30 09:37:31.626] [info] cuda:0 using chunk size 5760, batch size 864
[2024-12-30 09:38:41.222] [info] > Inspecting resume file...
[2024-12-30 09:38:42.285] [error] finalise() not called on a HtsFile.
[2024-12-30 09:38:42.313] [error] Too few arguments for '--mm2-opts'.

Steps to reproduce the issue:

This is the HPC script:

#!/bin/bash
#
#SBATCH --time=2-00:00
#SBATCH --mem=100GB
#SBATCH -p owners,gpu
#SBATCH -c 10
#SBATCH -G 1
#SBATCH -C "GPU_SKU:A100_PCIE|GPU_SKU:A100_SXM4|GPU_SKU:H100_SXM5"
#SBATCH --open-mode=append
#SBATCH --mail-type=ALL
#SBATCH --mail-user=billylau

mkdir -p dorado/20241221_1125_3H_PBA28444_cc23ecf0/

if test -f dorado/20241221_1125_3H_PBA28444_cc23ecf0/calls.bam; then
  echo "basecalls exists - resuming";
  
  mv dorado/20241221_1125_3H_PBA28444_cc23ecf0/calls.bam dorado/20241221_1125_3H_PBA28444_cc23ecf0/old.bam

  timeout 1.97d /home/groups/hanleeji/Tools/dorado-0.9.0-linux-x64/bin/dorado basecaller sup Seq_Output/20241221_1125_3H_PBA28444_cc23ecf0/ --mm2-opts "-x splice --junc-bed /home/groups/hanleeji/Resources/gencode.v47.annotation.junc.bed --secondary=no" --recursive --min-qscore 7 --no-trim --resume-from dorado/20241221_1125_3H_PBA28444_cc23ecf0/old.bam --kit-name EXP-NBD196 --reference /home/groups/hanleeji/Resources/hs38_naa.fna > dorado/20241221_1125_3H_PBA28444_cc23ecf0/calls.bam
  
  if [[ $? == 124 || $? == 125 ]]; then
    scontrol requeue $SLURM_JOBID
  fi

else
  echo "basecalls do not exist";

  timeout 1.97d /home/groups/hanleeji/Tools/dorado-0.9.0-linux-x64/bin/dorado basecaller sup Seq_Output/20241221_1125_3H_PBA28444_cc23ecf0/ --mm2-opts "-x splice --junc-bed /home/groups/hanleeji/Resources/gencode.v47.annotation.junc.bed --secondary=no" --recursive --min-qscore 7 --no-trim --kit-name EXP-NBD196 --reference /home/groups/hanleeji/Resources/hs38_naa.fna > dorado/20241221_1125_3H_PBA28444_cc23ecf0/calls.bam

  if [[ $? == 124 || $? == 125 ]]; then
    scontrol requeue $SLURM_JOBID
  fi

fi
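
The two branches of the script above differ only in the --resume-from argument, so the command line can be assembled once in a bash array. The following is a minimal sketch of that structure, not a fix: it passes the same --mm2-opts string and would presumably hit the same error on resume with dorado 0.9.0. RUN_ID, OUT_DIR, ARGS, build_args, should_requeue and run_job are illustrative names that do not appear in the original script.

```shell
#!/bin/bash
# Sketch only: same flags as the script above, assembled in one place.
# All variable and function names here are illustrative assumptions.

RUN_ID="20241221_1125_3H_PBA28444_cc23ecf0"
OUT_DIR="dorado/${RUN_ID}"

# Fill the global array ARGS with the dorado arguments.
# $1: path of a partial BAM to resume from, or empty for a fresh run.
build_args() {
  local resume_bam="$1"
  ARGS=(
    basecaller sup "Seq_Output/${RUN_ID}/"
    --mm2-opts "-x splice --junc-bed /home/groups/hanleeji/Resources/gencode.v47.annotation.junc.bed --secondary=no"
    --recursive --min-qscore 7 --no-trim
  )
  if [[ -n "$resume_bam" ]]; then
    ARGS+=(--resume-from "$resume_bam")
  fi
  ARGS+=(
    --kit-name EXP-NBD196
    --reference /home/groups/hanleeji/Resources/hs38_naa.fna
  )
}

# Succeed when the exit status of `timeout` is one the original script
# requeues on: 124 (command timed out) or 125 (timeout itself failed).
should_requeue() {
  [[ "$1" == 124 || "$1" == 125 ]]
}

# Full workflow: pick fresh run or resume, run dorado, requeue if needed.
run_job() {
  local dorado_bin="/home/groups/hanleeji/Tools/dorado-0.9.0-linux-x64/bin/dorado"
  local resume_bam=""
  local status

  mkdir -p "$OUT_DIR"
  if [[ -f "$OUT_DIR/calls.bam" ]]; then
    echo "basecalls exists - resuming"
    mv "$OUT_DIR/calls.bam" "$OUT_DIR/old.bam"
    resume_bam="$OUT_DIR/old.bam"
  else
    echo "basecalls do not exist"
  fi

  build_args "$resume_bam"
  timeout 1.97d "$dorado_bin" "${ARGS[@]}" > "$OUT_DIR/calls.bam"
  status=$?   # stored once so later commands cannot overwrite it

  if should_requeue "$status"; then
    scontrol requeue "$SLURM_JOBID"
  fi
}
```

Nothing runs until run_job is called, for example as the last line of the job script after the #SBATCH header. Quoting "${ARGS[@]}" keeps the whole --mm2-opts string as a single argument, which matches how the original command line appears in the log.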

Run environment:

  • Dorado version: v0.9.0
  • Dorado command: see script
  • Operating system: Linux
  • Hardware (CPUs, Memory, GPUs): A100/H100
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
  • Source data location (on device or networked drive - NFS, etc.):
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):
@billytcl
Author

I just noticed that this is possibly a duplicate of #1048. Is there a fix for this, or do I have to use the workaround described in that thread?

@HalfPhoton
Collaborator

Hi @billytcl, we are aware of this issue, but unfortunately we didn't get a fix into the 0.9.0 release. We hope to have one in soon.

The workaround you reference from the original ticket should work.

Best regards,
Rich

@HalfPhoton added the bug and duplicate labels on Jan 2, 2025