I am on an HPC cluster where I am trying to resume a pre-existing dorado run. The system can pre-empt my job if a higher-priority one gets queued. The run was going fine for 8 hours, then got pre-empted. On requeue, my HPC script launches dorado again with the "--resume-from" option, but dorado fails to start up again on resume; the expected behavior is that it simply continues. I've used this framework successfully for other runs that do not use --mm2-opts.
The HPC log:
(base) [billylau@sh02-ln02 login /scratch/groups/hanleeji/20241221_PRM_placeholderRNA]$ cat slurm-57553328.out
basecalls do not exist
[2024-12-30 01:25:39.891] [info] Running: "basecaller" "sup" "Seq_Output/20241221_1125_3H_PBA28444_cc23ecf0/" "--mm2-opts" "-x splice --junc-bed /home/groups/hanleeji/Resources/gencode.v47.annotation.junc.bed --secondary=no" "--recursive" "--min-qscore" "7" "--no-trim" "--kit-name" "EXP-NBD196" "--reference" "/home/groups/hanleeji/Resources/hs38_naa.fna"
[2024-12-30 01:25:41.189] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-12-30 01:25:41.192] [info] - downloading [email protected] with httplib
[2024-12-30 01:25:45.965] [info] > Creating basecall pipeline
[2024-12-30 01:25:48.802] [info] Calculating optimized batch size for GPU "NVIDIA A100-SXM4-40GB" and model /scratch/groups/hanleeji/20241221_PRM_placeholderRNA/.temp_dorado_model-4f25f59ad8abf8a7/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-12-30 01:25:57.276] [info] cuda:0 using chunk size 11520, batch size 512
[2024-12-30 01:25:57.710] [info] cuda:0 using chunk size 5760, batch size 576
slurmstepd: error: *** JOB 57553328 ON sh03-14n09 CANCELLED AT 2024-12-30T09:18:44 DUE TO PREEMPTION ***
basecalls exists - resuming
[2024-12-30 09:37:14.183] [info] Running: "basecaller" "sup" "Seq_Output/20241221_1125_3H_PBA28444_cc23ecf0/" "--mm2-opts" "-x splice --junc-bed /home/groups/hanleeji/Resources/gencode.v47.annotation.junc.bed --secondary=no" "--recursive" "--min-qscore" "7" "--no-trim" "--resume-from" "dorado/20241221_1125_3H_PBA28444_cc23ecf0/old.bam" "--kit-name" "EXP-NBD196" "--reference" "/home/groups/hanleeji/Resources/hs38_naa.fna"
[2024-12-30 09:37:15.346] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-12-30 09:37:15.350] [info] - downloading [email protected] with httplib
[2024-12-30 09:37:20.278] [info] > Creating basecall pipeline
[2024-12-30 09:37:23.217] [info] Calculating optimized batch size for GPU "NVIDIA A100-SXM4-80GB" and model /scratch/groups/hanleeji/20241221_PRM_placeholderRNA/.temp_dorado_model-11523696db1f8c87/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-12-30 09:37:30.946] [info] cuda:0 using chunk size 11520, batch size 864
[2024-12-30 09:37:31.626] [info] cuda:0 using chunk size 5760, batch size 864
[2024-12-30 09:38:41.222] [info] > Inspecting resume file...
[2024-12-30 09:38:42.285] [error] finalise() not called on a HtsFile.
[2024-12-30 09:38:42.313] [error] Too few arguments for '--mm2-opts'.
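For context, the "basecalls do not exist" / "basecalls exists - resuming" lines above come from the wrapper script around dorado, which is not reproduced in this report. The sketch below is only an illustrative reconstruction of that resume logic; file names such as calls.bam and the #SBATCH directive are assumptions, not the actual script:

#!/bin/bash
#SBATCH --requeue                         # let Slurm requeue the job after preemption

POD5_DIR=Seq_Output/20241221_1125_3H_PBA28444_cc23ecf0/
OUT_DIR=dorado/20241221_1125_3H_PBA28444_cc23ecf0
OUT_BAM=$OUT_DIR/calls.bam                # illustrative name; only old.bam appears in the log

DORADO_ARGS=(sup "$POD5_DIR" --recursive --min-qscore 7 --no-trim
             --kit-name EXP-NBD196
             --mm2-opts "-x splice --junc-bed /home/groups/hanleeji/Resources/gencode.v47.annotation.junc.bed --secondary=no"
             --reference /home/groups/hanleeji/Resources/hs38_naa.fna)

if [ ! -f "$OUT_BAM" ]; then
    echo "basecalls do not exist"
    dorado basecaller "${DORADO_ARGS[@]}" > "$OUT_BAM"
else
    echo "basecalls exists - resuming"
    mv "$OUT_BAM" "$OUT_DIR/old.bam"      # keep the partial output from the pre-empted run
    dorado basecaller "${DORADO_ARGS[@]}" --resume-from "$OUT_DIR/old.bam" > "$OUT_BAM"
fi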
Hi @billytcl, we are aware of this issue, but unfortunately we didn't manage to get a fix into the 0.9.0 release. We'll hopefully get a fix in soon.
The workaround you reference from the original ticket should work.
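For anyone hitting this before a fix lands: the original ticket isn't quoted here, so the exact workaround isn't reproduced. However, the failure appears only after "Inspecting resume file...", which suggests the quoted --mm2-opts string recorded with the resume data is being re-split on resume. One approach consistent with that is to keep alignment out of basecalling altogether, so --mm2-opts never ends up in the recorded command line, and align in a separate step afterwards. A rough sketch, assuming your installed dorado's aligner subcommand accepts --mm2-opts (check dorado aligner --help):

# Basecall without --reference/--mm2-opts so the recorded command line never
# contains the quoted minimap2 option string.
# (On a pre-empted restart, add: --resume-from dorado/20241221_1125_3H_PBA28444_cc23ecf0/old.bam)
dorado basecaller sup Seq_Output/20241221_1125_3H_PBA28444_cc23ecf0/ \
    --recursive --min-qscore 7 --no-trim --kit-name EXP-NBD196 \
    > dorado/20241221_1125_3H_PBA28444_cc23ecf0/calls.unaligned.bam

# Align the finished calls in a separate step
dorado aligner \
    --mm2-opts "-x splice --junc-bed /home/groups/hanleeji/Resources/gencode.v47.annotation.junc.bed --secondary=no" \
    /home/groups/hanleeji/Resources/hs38_naa.fna \
    dorado/20241221_1125_3H_PBA28444_cc23ecf0/calls.unaligned.bam \
    > dorado/20241221_1125_3H_PBA28444_cc23ecf0/calls.bam

Note this only helps for runs started without --mm2-opts; an old.bam that already records --mm2-opts in its header will presumably still trip the same parse error on resume.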