CHANGELOG

A recent issue (17982) revealed a problem with cleaning up swarm directories, the gist of which is that swarm_cleanup.pl can't always determine if a swarm has actually finished.

To make the problem easier to deal with, the way in which swarm creates and handles batch and command scripts has been changed.  The temporary subdirectory that holds the command scripts now begins with the prefix 'tmp', while swarm runs in debug or devel mode begin with 'dev'.  Further, once the swarm has been submitted successfully, a symlink is created to associate the jobid with the temporary subdirectory.  All files (command scripts, the main batch script, and any temporary .o and .e files for --singleout) are written into the temporary subdirectory.

Thus, all swarms that have been successfully placed in the slurm queue have a symlink pointing to the temporary directories, positively identifying the jobs to which the swarm belongs.

This change should be essentially invisible to the user, but make it easier to deal with cleaning up.  It will also simplify retroactive recovery of temporary .o and .e files that remain after --singleout swarms either timeout or are cancelled, if attempted.

--------

2016-03-11

David G. pointed out that there is a race condition between creation of the symlink to the temporary script directory and swarm.batch script execution.  Because
the swarm.batch script uses $SLURM_ARRAY_JOB_ID to locate the command scripts, if the job array begins running before the symlink is in place, the command
scripts would not be found.

To prevent this, the swarm.batch script now hard-codes the path to the temporary directory, rather than relying on the symlink.

This change will also make it possible to rerun the swarm, because rerunning would not depend on the symlink.  However, rerunning is not without risk, as the
swarm rerun would need to complete before swarm_cleanup.pl removes the temporary directory and the symlink when the original job was finished more than 
7 days earlier.

--------

2016-04-19

Created parse_batch_logs.pl that will parse through the archived sbatch logfiles and create a Perl store file that associates the randomly generated string used for the 
temporary directories in /swarm with a jobid, or an aborted submission.  This store file is now used by swarm_cleanup.pl to speed up the search for jobids associated with
orphan directories that don't have a symlink.

Additionally, fixed a bug.  Swarms of a single job were being ignored by the parsing expression, and were not being deleted.

--------

2016-11-14

Added support for maximum simultaneously running subjobs (the module function for Slurm jobarrays).  --maxrunning will limit the number of running
subjobs within a swarm.

--------

2017-03-23

swarm now writes two index files, .tempdir.idx and .swarm.idx, to simplify the process of clean up and lay the foundation for rerunning swarms.

--------

2017-04-03

Removed swarmdel.

--------

2017-04-12

swarm_manager uses the index files created by swarm and the jobs table for the dashboard to keep track of swarms in a more comprehensive and systematic way than simply running find and sacct a lot.

--------

2020-01-24

swarm_cleanup.pl is now the script for cleaning.  Removed swarm_manager.

--------

2020-08-02

Began versioning (21.08.0)

--------

2020-12-22 (version 21.12.0) 

Added file directives and made adding options easier by creating a single options object.

Also, -f and --f are deprecated.  swarm now accepts 1 argument, the swarm file.

--------