Our initial plan was that when a workflow is cancelled, all tasks would simply keep running on the task manager side. This requires much less work on our part, but it can cause problems for users. For example, a long-running task might eat up the time they have available on a partition, and a workflow with many small independent tasks makes it tedious to stop each one individually. Additionally, a job can fail without the user being aware of it (e.g. one of the tasks fails), and it could take the user quite a while to notice and manually cancel the Slurm jobs.
I propose we modify the workflow manager to add a cancel-workflow endpoint that removes all jobs currently in the submit queue for the specified workflow and calls the Slurm worker's cancel_task function on each task in the submit queue that belongs to the cancelled workflow.
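To make the proposal concrete, here is a minimal sketch of the queue-filtering step. Everything except the name cancel_task is hypothetical: the Task record, the list-based submit queue, the RecordingWorker stand-in, and the assumption that cancel_task takes a task ID are illustrations only, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task record; the real task object may differ."""
    task_id: int
    workflow_id: str


class RecordingWorker:
    """Stand-in for the Slurm worker that only records cancel_task calls."""

    def __init__(self):
        self.cancelled = []

    def cancel_task(self, task_id):
        # Assumed signature: the real cancel_task may take other arguments.
        self.cancelled.append(task_id)


def cancel_workflow(workflow_id, submit_queue, worker):
    """Drop the workflow's tasks from the submit queue and cancel each one.

    Returns the IDs of the tasks that were cancelled.
    """
    to_cancel = [t for t in submit_queue if t.workflow_id == workflow_id]
    # Mutate the queue in place so other holders of the list see the change.
    submit_queue[:] = [t for t in submit_queue if t.workflow_id != workflow_id]
    for task in to_cancel:
        worker.cancel_task(task.task_id)
    return [t.task_id for t in to_cancel]
```

An endpoint handler would then only need to look up the workflow ID from the request and call cancel_workflow with the shared queue and worker.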
We need a definite plan for cancelling workflows. Right now the jobs continue after the workflow is cancelled.
Right now, if a workflow is cancelled, jobs continue to run, but their states stay at whatever point they were at when the workflow was cancelled. If we allow jobs to continue, we need a workflow "Cancelling" state until they complete, and we should probably archive the workflow, since some of it may have completed.
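One way to picture the "Cancelling" state is as an intermediate step between running and archived. The sketch below is only an illustration: the WorkflowState enum, the task-state strings, and both function names are hypothetical and not taken from the codebase.

```python
from enum import Enum


class WorkflowState(Enum):
    """Hypothetical workflow states for illustration."""
    RUNNING = "Running"
    CANCELLING = "Cancelling"
    ARCHIVED = "Archived"


# Assumed set of task states that count as finished.
TERMINAL_TASK_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def request_cancel(state):
    """Move a running workflow into the Cancelling state."""
    if state is WorkflowState.RUNNING:
        return WorkflowState.CANCELLING
    return state


def advance(state, task_states):
    """Archive a cancelling workflow once every task has finished."""
    if state is WorkflowState.CANCELLING and all(
        s in TERMINAL_TASK_STATES for s in task_states
    ):
        return WorkflowState.ARCHIVED
    return state
```

With this shape, a cancelled workflow stays in Cancelling while any task is still running and is archived only after the last one finishes, so the archive reflects whatever actually ran.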
#960 fixed the issues with cancelling a workflow and archives the results (i.e. there is an archive of whatever ran).
We need an option that the user can invoke to cancel all running jobs. Something like
"beeflow cancel --all"