Add --fail-fast #38

Merged
merged 7 commits into from
Nov 27, 2024
1 change: 1 addition & 0 deletions CHANGES.md
@@ -22,6 +22,7 @@ https://github.com/flatironinstitute/disBatch/pull/32
- Refreshed the readme
- Added `disbatch --version` and `disbatch.__version__`
- Added MacOS test
- Added `--fail-fast` option [https://github.com/flatironinstitute/disBatch/pull/38]

### Changes
- `kvsstcp` submodule is now vendored
71 changes: 28 additions & 43 deletions Readme.md
@@ -289,14 +289,11 @@ disBatch refers to a collection of execution resources as a *context* and the re

## Invocation
```
usage: disbatch [-h] [-e] [--force-resume] [--kvsserver [HOST:PORT]]
[--logfile FILE]
[--loglevel {CRITICAL,ERROR,WARNING,INFO,DEBUG}] [--mailFreq N]
[--mailTo ADDR] [-p PATH] [-r STATUSFILE] [-R] [-S]
[--status-header] [--use-address HOST:PORT] [-w]
[--taskcommand COMMAND] [--taskserver [HOST:PORT]]
[-C TASK_LIMIT] [-c N] [--fill] [-g] [--no-retire] [-l COMMAND]
[--retire-cmd COMMAND] [-s HOST:CORECOUNT] [-t N]
usage: disbatch [-h] [-e] [--force-resume] [--kvsserver [HOST:PORT]] [--logfile FILE]
[--loglevel {CRITICAL,ERROR,WARNING,INFO,DEBUG}] [--mailFreq N] [--mailTo ADDR] [-p PATH]
[-r STATUSFILE] [-R] [-S] [--status-header] [--use-address HOST:PORT] [-w] [-f]
[--taskcommand COMMAND] [--taskserver [HOST:PORT]] [--version] [-C TASK_LIMIT] [-c N] [--fill]
[--no-retire] [-l COMMAND] [--retire-cmd COMMAND] [-s HOST:CORECOUNT] [-t N]
[taskfile]

Use batch resources to process a file of tasks, one task per line.
@@ -306,63 +303,51 @@ positional arguments:

options:
-h, --help show this help message and exit
-e, --exit-code When any task fails, exit with non-zero status (default:
only if disBatch itself fails)
--force-resume With -r, proceed even if task commands/lines are
different.
-e, --exit-code When any task fails, exit with non-zero status (default: only if disBatch itself fails)
--force-resume With -r, proceed even if task commands/lines are different.
--kvsserver [HOST:PORT]
Use a running KVS server.
--logfile FILE Log file.
--loglevel {CRITICAL,ERROR,WARNING,INFO,DEBUG}
Logging level (default: INFO).
--mailFreq N Send email every N task completions (default: 1). "--
mailTo" must be given.
--mailFreq N Send email every N task completions (default: 1). "--mailTo" must be given.
--mailTo ADDR Mail address for task completion notification(s).
-p PATH, --prefix PATH
Path for log, dbUtil, and status files (default: ".").
If ends with non-directory component, use as prefix for
these files names (default:
<Taskfile>_disBatch_<YYYYMMDDhhmmss>_<Random>).
Path for log, dbUtil, and status files (default: "."). If ends with non-directory component,
use as prefix for these files names (default: <Taskfile>_disBatch_<YYYYMMDDhhmmss>_<Random>).
-r STATUSFILE, --resume-from STATUSFILE
Read the status file from a previous run and skip any
completed tasks (may be specified multiple times).
-R, --retry With -r, also retry any tasks which failed in previous
runs (non-zero return).
-S, --startup-only Startup only the disBatch server (and KVS server if
appropriate). Use "dbUtil..." script to add execution
contexts. Incompatible with "--ssh-node".
Read the status file from a previous run and skip any completed tasks (may be specified
multiple times).
-R, --retry With -r, also retry any tasks which failed in previous runs (non-zero return).
-S, --startup-only Startup only the disBatch server (and KVS server if appropriate). Use "dbUtil..." script to
add execution contexts. Incompatible with "--ssh-node".
--status-header Add header line to status file.
--use-address HOST:PORT
Specify hostname and port to use for this run.
-w, --web Enable web interface.
-f, --fail-fast Exit on first task failure. Running tasks will be interrupted and disBatch will exit with a
non-zero exit code.
--taskcommand COMMAND
Tasks will come from the command specified via the KVS
server (passed in the environment).
Tasks will come from the command specified via the KVS server (passed in the environment).
--taskserver [HOST:PORT]
Tasks will come from the KVS server.
--version Print the version and exit
-C TASK_LIMIT, --context-task-limit TASK_LIMIT
Shutdown after running COUNT tasks (0 => no limit).
-c N, --cpusPerTask N
Number of cores used per task; may be fractional
(default: 1).
--fill Try to use extra cores if allocated cores exceeds
requested cores.
-g, --gpu Use assigned GPU resources [DEPRECATED]
--no-retire Don't retire nodes from the batch system (e.g., if
running as part of a larger job).
Number of cores used per task; may be fractional (default: 1).
--fill Try to use extra cores if allocated cores exceeds requested cores.
--no-retire Don't retire nodes from the batch system (e.g., if running as part of a larger job).
-l COMMAND, --label COMMAND
Label for this context. Should be unique.
--retire-cmd COMMAND Shell command to run to retire a node (environment
includes $NODE being retired, remaining $ACTIVE node
list, $RETIRED node list; default based on batch
system). Incompatible with "--ssh-node".
--retire-cmd COMMAND Shell command to run to retire a node (environment includes $NODE being retired, remaining
$ACTIVE node list, $RETIRED node list; default based on batch system). Incompatible with "--
ssh-node".
-s HOST:CORECOUNT, --ssh-node HOST:CORECOUNT
Run tasks over SSH on the given nodes (can be specified
multiple times for additional hosts; equivalent to
setting DISBATCH_SSH_NODELIST)
Run tasks over SSH on the given nodes (can be specified multiple times for additional hosts;
equivalent to setting DISBATCH_SSH_NODELIST)
-t N, --tasksPerNode N
Maximum concurrently executing tasks per node (up to
cores/cpusPerTask).
Maximum concurrently executing tasks per node (up to cores/cpusPerTask).
```

The options for mail will only work if your computing environment permits processes to access mail via SMTP.
15 changes: 15 additions & 0 deletions disbatch/disBatch.py
@@ -1217,6 +1217,7 @@ def __init__(self, kvs, db_info, tasks, trackResults=None):
self.statusLastOffset = self.statusFile.tell()
self.noMoreTasks = False
self.tasksDone = False
self.failFast = db_info.args.fail_fast

self.daemon = True
self.start()
@@ -1525,6 +1526,13 @@ def run(self):
# Remember the first failure. Somewhat arbitrary.
self.currentReturnCode = rc

if self.failed and self.failFast:
    logger.info(f'Failing fast, task exited with code: {self.currentReturnCode}')
    print('Quitting early due to task failure with --fail-fast', file=sys.stderr)
    self.ageQ.put('CheckFailExit')
    # Break out of the main driver control loop and drop into the exit code
    break

# Maybe we want to track results by streamIndex instead of taskId? But then there could be more than
# one per key
if self.trackResults:
@@ -1571,6 +1579,7 @@ def run(self):
# A "check" barrier fails if any tasks before it do (since the start or the last barrier).
logger.info('Barrier check failed: %d.', self.currentReturnCode)
self.ageQ.put('CheckFailExit')
# Break out of the main driver control loop and drop into the exit code
break
# Let the feeder know.
self.ageQ.put(bTinfo.taskId)
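
Both hunks above follow the same pattern: remember the first non-zero return code, post `'CheckFailExit'` to the feeder queue, and break out of the driver loop. A minimal standalone sketch of that control flow follows. `drive` and its arguments are hypothetical names used only for illustration; the sketch leaves out everything else the real driver handles (status file, barriers, result tracking).

```python
import queue


def drive(return_codes, fail_fast=False):
    """Hypothetical stand-in for the driver loop: consume task return codes in order."""
    age_q = queue.Queue()  # stands in for self.ageQ
    current_return_code = 0
    processed = 0
    for rc in return_codes:
        processed += 1
        if rc != 0 and current_return_code == 0:
            # Remember the first failure only.
            current_return_code = rc
        if current_return_code != 0 and fail_fast:
            # Tell the feeder to stop, then leave the loop early.
            age_q.put('CheckFailExit')
            break
    return processed, current_return_code, list(age_q.queue)


print(drive([0, 3, 0, 5], fail_fast=True))   # stops after the second task
print(drive([0, 3, 0, 5], fail_fast=False))  # processes all four tasks
```

Without `fail_fast`, the first failure is still recorded, but the loop keeps consuming results, which corresponds to the behavior before this change.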
@@ -2234,6 +2243,12 @@ def shutdown(s=None, f=None):
'--use-address', default=None, metavar='HOST:PORT', help='Specify hostname and port to use for this run.'
)
argp.add_argument('-w', '--web', action='store_true', help='Enable web interface.')
argp.add_argument(
    '-f',
    '--fail-fast',
    action='store_true',
    help='Exit on first task failure. Running tasks will be interrupted and disBatch will exit with a non-zero exit code.',
)
source = argp.add_mutually_exclusive_group(required=True)
source.add_argument(
'--taskcommand',
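
The driver reads the new flag as `db_info.args.fail_fast` because argparse derives the attribute name from the long option, replacing the dash with an underscore. A small sketch with a throwaway parser (not disBatch's real one, which defines many more options) shows the mapping:

```python
import argparse

# Throwaway parser for illustration; only the --fail-fast definition mirrors the diff.
argp = argparse.ArgumentParser(prog='demo')
argp.add_argument('-f', '--fail-fast', action='store_true', help='Exit on first task failure.')

print(argp.parse_args([]).fail_fast)               # False: store_true defaults to False
print(argp.parse_args(['-f']).fail_fast)           # True
print(argp.parse_args(['--fail-fast']).fail_fast)  # True
```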
30 changes: 19 additions & 11 deletions tests/test_slurm/run.sh
@@ -1,7 +1,15 @@
#!/bin/bash

workdir=$(mktemp -d -p ./ disbatch-test.XXXX)
cp Tasks $workdir
exit_fail() {
    err=$?
    echo "Slurm test failed! Output is in $workdir"
    exit $err
}

trap exit_fail ERR

workdir=$(mktemp -d -p $PWD disbatch-test.XXXX)
cp Tasks Tasks_failfast $workdir
cd $workdir

# Run the test
@@ -10,15 +18,15 @@ salloc -n 2 disBatch Tasks
# Check that all 3 tasks ran,
# which means A.txt, B.txt, and C.txt exist
[[ -f A.txt && -f B.txt && -f C.txt ]]
success=$?

cd - > /dev/null
rm -f A.txt B.txt C.txt

if [[ $success -eq 0 ]]; then
    echo "Slurm test passed."
    rm -rf $workdir
else
    echo "Slurm test failed! Output is in $workdir"
fi
# disBatch is expected to exit with a non-zero exit code here
salloc -n 2 disbatch --fail-fast Tasks_failfast || true

# check that we failed fast and didn't run any more tasks
[[ ! -f A.txt ]]

exit $success
trap - ERR
echo "Slurm test passed."
rm -rf $workdir
3 changes: 3 additions & 0 deletions tests/test_ssh/Tasks_failfast
@@ -0,0 +1,3 @@
sleep 1000
exit 1
touch A.txt
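
With two task slots, `sleep 1000` and `exit 1` start together, `exit 1` fails almost immediately, and `touch A.txt` should never start; the `[[ ! -f A.txt ]]` check in the test scripts relies on this. The sketch below reproduces that scheduling with plain subprocesses on a POSIX system. `run_fail_fast` is a hypothetical helper written for this illustration and does not reflect how disBatch itself dispatches or interrupts tasks.

```python
import subprocess
import time


def run_fail_fast(commands, slots=2, cwd=None):
    """Run shell commands with at most `slots` at a time; stop at the first failure."""
    pending = list(commands)
    running = []
    while pending or running:
        # Fill free slots in task-file order.
        while pending and len(running) < slots:
            running.append(subprocess.Popen(
                pending.pop(0), shell=True, cwd=cwd,
                stdin=subprocess.DEVNULL,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            ))
        for proc in list(running):
            rc = proc.poll()
            if rc is None:
                continue  # still running
            running.remove(proc)
            if rc != 0:
                # Interrupt whatever is still running and report the failure.
                # terminate() signals only the shell; a fuller version would
                # also have to clean up any children the shell spawned.
                for other in running:
                    other.terminate()
                for other in running:
                    other.wait()
                return rc
        time.sleep(0.05)
    return 0
```

Calling `run_fail_fast(['sleep 1000', 'exit 1', 'touch A.txt'])` returns the failing task's exit code without ever launching the third command, so no `A.txt` appears.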
30 changes: 19 additions & 11 deletions tests/test_ssh/run.sh
@@ -1,7 +1,15 @@
#!/bin/bash

workdir=$(mktemp -d -p ./ disbatch-test.XXXX)
cp Tasks $workdir
exit_fail() {
    err=$?
    echo "SSH test failed! Output is in $workdir"
    exit $err
}

trap exit_fail ERR

workdir=$(mktemp -d -p $PWD disbatch-test.XXXX)
cp Tasks Tasks_failfast $workdir
cd $workdir

# Run the test
@@ -10,15 +18,15 @@ disBatch -s localhost:2 Tasks
# Check that all 3 tasks ran,
# which means A.txt, B.txt, and C.txt exist
[[ -f A.txt && -f B.txt && -f C.txt ]]
success=$?

cd - > /dev/null
rm -f A.txt B.txt C.txt

if [[ $success -eq 0 ]]; then
    echo "SSH test passed."
    rm -rf $workdir
else
    echo "SSH test failed! Output is in $workdir"
fi
# disBatch is expected to exit with a non-zero exit code here
disbatch -s localhost:2 --fail-fast Tasks_failfast || true

# check that we failed fast and didn't run any more tasks
[[ ! -f A.txt ]]

exit $success
trap - ERR
echo "SSH test passed."
rm -rf $workdir