Azure Batch executor does not always respect error strategy #5350

Open
raylim opened this issue Oct 2, 2024 · 4 comments

@raylim
raylim commented Oct 2, 2024

Bug report

Expected behavior and actual behavior

Error strategy should be respected, but when a process fails due to running out of walltime, the whole pipeline is terminated.

Steps to reproduce the problem

process TEST {
    executor 'azurebatch'
    container 'ubuntu'
    errorStrategy 'ignore'
    time "10s"
    script:
    """
    sleep 100
    """
}

workflow {
    TEST()
}

Program output

$ nextflow run main.nf -w az://nextflow/test

 N E X T F L O W   ~  version 24.08.0-edge

Launching `main.nf` [grave_hopper] DSL2 - revision: 8398d93773

executor >  azurebatch (1)
[6b/6fa6f5] TEST [100%] 1 of 1, failed: 1 ✘
ERROR ~ Error executing process > 'TEST'

Caused by:
  The task was ended by user request


Command executed:

  sleep 100

Command exit status:
  -

Command output:
  (empty)

Work dir:
  az://nextflow/test/6b/6fa6f56c51b72a15dde5893e4ac028

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

Environment

  • Nextflow version: 24.08.0-edge build 5922
  • Java version: openjdk 17.0.3-internal 2022-04-19
  • Operating system: Linux
  • Bash version: GNU bash, version 4.4.19(1)-release (x86_64-redhat-linux-gnu)

raylim changed the title from "Azure Batch executor does not respect error strategy" to "Azure Batch executor does not always respect error strategy" on Oct 2, 2024

@bentsherman
Member

How did the task fail? What does this mean?

Caused by:
  The task was ended by user request

@raylim
Author

raylim commented Oct 9, 2024

I've also seen this issue with the Slurm executor. On the latest stable release, the error strategy is respected as long as it isn't specified as a closure.
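
For reference, a closure-based (dynamic) error strategy generally takes the form below. This is only a sketch built on the reproduction above: the process name `TEST_DYNAMIC`, the exit-code range, and the `maxRetries` value are illustrative assumptions, not the configuration from the failing pipeline.

```groovy
process TEST_DYNAMIC {
    executor 'azurebatch'
    container 'ubuntu'
    // Dynamic error strategy: the closure is evaluated when a task fails.
    // The exit-code range is an illustrative assumption.
    errorStrategy { task.exitStatus in 137..140 ? 'retry' : 'ignore' }
    maxRetries 2
    time '10s'

    script:
    """
    sleep 100
    """
}

workflow {
    TEST_DYNAMIC()
}
```

With the static form (`errorStrategy 'ignore'`) the strategy is a fixed string; with the closure form it is computed per failed task, which is the variant reported as not being respected.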

@bentsherman
Member

It looks like certain errors cannot be retried or ignored:

// -- when is a task level error and the user has chosen to ignore error,
// just report and error message and DO NOT stop the execution
if( task && error instanceof ProcessException ) {

message << "Error executing process > '${safeTaskName(task)}'"
switch( error ) {
    case ProcessException:
        formatTaskError( message, error, task )
        break
    case ProcessEvalException:
        formatCommandError( message, error, task )
        break
    case FailedGuardException:
        formatGuardError( message, error as FailedGuardException, task )
        break;
    default:
        message << formatErrorCause(error)
        dumpStackTrace = true
}

Need to figure out which case is being reached here.
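
One thing that may help narrow this down: in Groovy, a `switch` over class literals tests each `case` in order with an instance-of check, so the first matching class wins. The sketch below shows that behaviour with hypothetical stand-in classes (`BaseTaskError`, `GuardError`, and `whichCase` are made up for illustration and are not Nextflow's actual exception hierarchy, which this thread does not show).

```groovy
// Hypothetical stand-ins; these are NOT Nextflow's exception classes.
class BaseTaskError extends RuntimeException {}
class GuardError extends BaseTaskError {}

// Returns a label for the first case that matches the given error.
String whichCase(Throwable error) {
    switch( error ) {
        case BaseTaskError:     // matches BaseTaskError and any subclass
            return 'base'
        case GuardError:        // unreachable here: already matched above
            return 'guard'
        default:
            return 'default'
    }
}

// A subclass instance is caught by the superclass case listed first.
assert whichCase(new GuardError()) == 'base'
// An unrelated exception falls through to the default branch.
assert whichCase(new IllegalStateException()) == 'default'
```

If the real exception types are related by inheritance in a similar way, the order of the cases would decide which formatter runs. That would need to be checked against the Nextflow source.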

@raylim
Author

raylim commented Oct 24, 2024

An error executing a process will bring everything to a halt (on version 24.04.4):


Caused by:
  Status code 409, {
    "odata.metadata":"https://ocra.eastus2.batch.azure.com/$metadata#Microsoft.Azure.Batch.Protocol.Entities.Container.errors/@Element","code":"TaskExists","message":{
      "lang":"en-US","value":"The specified task already exists.\nRequestId:d1437934-d0cd-4924-9fb3-1b500406ec0b\nTime:2024-10-24T20:26:59.8426275Z"
    }
  }



 -- Check '.nextflow.log' file for details
WARN: Killing running tasks (11100)
