Tools for efficiently using UGER
This tool set comprises three tools:
- submitlog: a wrapper for qsub that takes care of most of the nitty-gritty details of using qsub and simultaneously logs the command you ran, the date, the qsub command, and the job ID in a file.
- newGEArrayJob: simply sets up a new batch file to run an array job.
- csj: watches the status of the jobs submitted from the current directory, refreshing every 2 seconds (using qstat).
To download, simply git clone <thisURL>, and then create a shortcut (link) to each script in your bin (~/bin/): ln -s <pathToEachScriptFile> <pathToYourBin>. You can also copy or move the scripts to your bin instead. Only the files newGEArrayJob, newGEHead, csj, checkSubbedJobs, and submitlog are needed.
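For example, assuming the repository was cloned to ~/UGERtools (a hypothetical path), the links could be created like so:
cd ~/UGERtools
for f in newGEArrayJob newGEHead csj checkSubbedJobs submitlog; do ln -s $PWD/$f ~/bin/$f; done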
#submitlog
submitlog uses python3, and you may need to install some of the script's dependencies, including argparse. On the cluster, you must install all packages locally (i.e. using python3 setup.py install --user). You then also need to include this install location in your PYTHONPATH (i.e. put export PYTHONPATH=$HOME/.local/lib in your ~/.my.bashrc file and re-login).
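For example, installing a dependency locally and wiring up PYTHONPATH might look like this (the dependency's source directory name is hypothetical):
cd someDependency/   # hypothetical package source directory
python3 setup.py install --user
echo 'export PYTHONPATH=$HOME/.local/lib' >> ~/.my.bashrc
# log out and back in for the change to take effect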
On the Broad cluster, you can simply type use Python-3.4 to enable python3 and run submitlog. If you also use python < 3 (e.g. 2.7), you should instead include both use .anaconda3-2.5.0 and use Python-2.7, in that order, in your login script (~/.my.bashrc). The reason is that if use .anaconda3-2.5.0 is run after use Python-2.7, the anaconda python (which is actually python 3) gets higher priority in your $PATH. submitlog assumes that the order of use commands doesn't matter, which in this case is not true, so these two belong in the login script, which runs before any of the use commands made by submitlog.
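Concretely, the relevant lines in ~/.my.bashrc would be:
use .anaconda3-2.5.0
use Python-2.7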
If you have dedicated resources that you can access and you want to make using them the default, customize submitlog by modifying the line:
parser.add_argument('-P',dest='project', metavar='<project>',help='project [default=""]', required=False, default = "");
to
parser.add_argument('-P',dest='project', metavar='<project>',help='project [default="myDedicatedQueue"]', required=False, default = "myDedicatedQueue");
where myDedicatedQueue
is the name of the dedicated resources.
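Alternatively, if you only need the dedicated resources occasionally, you can pass the project per job with the -P option instead of changing the default (a sketch; the job itself is arbitrary):
submitlog -P myDedicatedQueue -q short -m 1 -o test.olog 'echo "using dedicated resources"'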
submitlog is a general-use script and comes with the following options:
$ submitlog -h
usage: submitlog [-h] [-q <queue>] [-e <errFile>] -o <outFile> [-m <memory>]
[-t <tasks>] [-op <otherParam>] [-P <project>]
[-p <priority>] [-N <name>] [-hosts <hosts>] [-n <threads>]
[-v]
<command> [<command> ...]
submit a job and log it.
positional arguments:
<command> The command to be run
optional arguments:
-h, --help show this help message and exit
-q <queue> run queue {long|short}
-e <errFile> Where to output stderr [default=stdout]
-o <outFile> Where to output stdout
-m <memory> RAM required (in GB)
-t <tasks> tasks for GE - either a file with one task per line or
1-<nTasks>
-op <otherParam> other qsub parameters, in quotes
-P <project> project [default="test"]
-p <priority> priority from -1023 to 0, with - being lower priority
(default 0)
-N <name> the job name
-hosts <hosts> server hosts, comma separated
-n <threads> CPU threads
-v Verbose output?
Some words of warning:
- -n <threads>: the current grid engine uses something called parallel environments, and I haven't quite figured out what the differences between them are, so this defaults to using only one.
- -P <project> doesn't seem to matter any more, so it should generally be left unused.
- -o <outFile> should not be re-used for different jobs. What is actually run is a file <outFile>.sh, so if two different jobs try to run different commands with the same output file, the second will overwrite the first in <outFile>.sh, and both jobs could potentially run the second job's commands (see the sketch after this list).
- -hosts <hosts>: I haven't used this since LSF; don't count on it to work.
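For instance (a sketch; the job contents are arbitrary), giving each simultaneous job its own -o avoids the <outFile>.sh collision:
submitlog -q short -m 1 -o jobA.olog 'echo "job A"'
submitlog -q short -m 1 -o jobB.olog 'echo "job B"'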
Here are a couple of examples of submitlog.
$ submitlog -m 1 -q short -o test_echo.olog 'echo "this is how you use submitlog"' # a job with 1GB RAM and using the short queue
Your job 907994 ("echo "this is how you use submitlog"") has been submitted
$ cat submitlog.log # see what was written to the log file
Fri Jan 15 13:56:24 2016
~/testDir
submitlog -m 1 -q short -o test_echo.olog 'echo "this is how you use submitlog"'
qsub -b y -cwd -q short -o test_echo.olog -l m_mem_free=1g -e test_echo.olog -N 'echo "this is how you use submitlog"' ./test_echo.olog.sh
Your job 907994 ("echo "this is how you use submitlog"") has been submitted
$ cat test_echo.olog.sh # this file is created and run behind the scenes and is what is actually run, which is why you should not run two simultaneous jobs with the same output file
#!/bin/bash -l
use Python-2.7; use .python-2.7.1-sqlite3-rtrees; use .zlib-1.2.6; use .hdf5-1.8.9; use .graphviz-2.28.0; use .db-4.7.25; use .tcltk8.5.9; use GCC-5.1; use .gcc-5.1.0; use reuse; use UGER; use default++; use .aliases++; use default; use .local; use .broad; use .hostname; use .lang;
echo Job $JOB_ID started on $HOST: `date`
echo "this is how you use submitlog"
echo Job finished: `date`
$ cat test_echo.olog # See what is contained in the output file
/broad/software/dotkit/bash/use: line 35: unalias: ish: not found
Job 907994 started on hw-uger-1079: Fri Jan 15 13:56:35 EST 2016
this is how you use submitlog
Job finished: Fri Jan 15 13:56:35 EST 2016
#newGEArrayJob
newGEArrayJob simply copies the contents of newGEHead to a new file of your choosing (specified as the only parameter to the script) and opens it in vim for editing.
Here, I will show two common uses with trivial examples.
$ cat tasksNames.txt # I have a set of things I want to perform some task on
sample1
sample2
sample3
sample4
$ newGEArrayJob echoNameInFile.sh
At this point, vim opens (you can change this by modifying the script). Initially the script contains the following:
#!/bin/bash -l
#insert use commands here
# $1 is the task file
# input the $SGE_TASK_IDth line of this file into $id
export id=`awk "NR==$SGE_TASK_ID" $1`
#split id on whitespace into array that can be accessed like ${splitID[0]}
#splitID=($id)
#export id=${splitID[0]}
echo $SGE_TASK_ID : $id
#redirect stdout and stderr to file
export logDir=""
exec 1>>$logDir/$id.olog
exec 2>&1
set -e
if [ ! -e $logDir/$id.done ]
then
echo Job $JOB_ID:$SGE_TASK_ID started on $HOST: `date`
#PLACE COMMANDS HERE
touch $logDir/$id.done
echo Job finished: `date`
else
echo Job already done. To redo, rm $logDir/$id.done
fi
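To see what the awk line does: with the tasksNames.txt above and a task ID of 3, it pulls out the third line of the task file:
$ awk "NR==3" tasksNames.txt
sample3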
I edit it to be the following:
#!/bin/bash -l
#insert use commands here
# $1 is the task file
# input the $SGE_TASK_IDth line of this file into $id
export id=`awk "NR==$SGE_TASK_ID" $1`
#split id on whitespace into array that can be accessed like ${splitID[0]}
#splitID=($id)
#export id=${splitID[0]}
echo $SGE_TASK_ID : $id
#redirect stdout and stderr to file
export logDir="example1" #output logs to current directory
mkdir -p $logDir # make the directory if it doesn't exist already
exec 1>>$logDir/$id.olog ##this is where the output of this script will go
exec 2>&1
set -e
if [ ! -e $logDir/$id.done ] #the commands below only run if $logDir/$id.done does not exist, so you can re-run the job to retry any tasks that failed
then
echo Job $JOB_ID:$SGE_TASK_ID started on $HOST: `date`
echo $id >$logDir/$id.example1.txt
touch $logDir/$id.done
echo Job finished: `date`
else
echo Job already done. To redo, rm $logDir/$id.done
fi
Now I run it:
$ submitlog -q short -m 1 -o doExample1.olog -t tasksNames.txt ./echoNameInFile.sh tasksNames.txt
Your job-array 908367.1-4:1 ("..echoNameInFile.sh tasksNames.txt") has been submitted
$ cat doExample1.olog
/broad/software/dotkit/bash/use: line 35: unalias: ish: not found
/broad/software/dotkit/bash/use: line 35: unalias: ish: not found
/broad/software/dotkit/bash/use: line 35: unalias: ish: not found
/broad/software/dotkit/bash/use: line 35: unalias: ish: not found
4 : sample4
3 : sample3
1 : sample1
2 : sample2
$ ls example1/
sample1.done sample1.olog sample2.example1.txt sample3.done sample3.olog sample4.example1.txt
sample1.example1.txt sample2.done sample2.olog sample3.example1.txt sample4.done sample4.olog
$ cat example1/sample2.example1.txt # it echoed the sample $id into the file I specified
sample2
$ cat example1/sample2.olog # this is what the output looks like
Job 908367:2 started on hw-uger-1084: Fri Jan 15 14:18:47 EST 2016
Job finished: Fri Jan 15 14:18:47 EST 2016
Now for a more complex example. Say you have your different tasks, but you want to do slightly different things for each task, described by a tab-delimited file:
$ cat tasksNamesOutsTimes.txt
sample1 outFileA 6
sample2 outFileB 7
sample3 outFileC 8
sample4 outFileD 9
So this file tells me the sample name, the output file, and how many times the sample name should be printed in the output file. My new array job, echoNTimes.sh, starts from the same template; the key change is this little snippet, which takes the current line of the task list and splits it on whitespace into an array, taking the first element as the job $id:
splitID=($id)
export id=${splitID[0]}
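The full edited script isn't shown above, so here is a minimal sketch of what echoNTimes.sh could look like, assuming the template from example 1; the loop that writes ${splitID[1]}.example2.txt is my reconstruction from the output below, not the author's exact code:
#!/bin/bash -l
# $1 is the task file; each line is: <sample> <outFile> <nTimes>
export id=`awk "NR==$SGE_TASK_ID" $1`
splitID=($id)            #split the line on whitespace into an array
export id=${splitID[0]}  #first field (the sample name) is the job $id
echo $SGE_TASK_ID : $id
export logDir="example2"
mkdir -p $logDir
exec 1>>$logDir/$id.olog
exec 2>&1
set -e
if [ ! -e $logDir/$id.done ]
then
echo Job $JOB_ID:$SGE_TASK_ID started on $HOST: `date`
#print the sample name ${splitID[2]} times into the requested output file
for i in `seq ${splitID[2]}`; do echo ${splitID[0]} >>$logDir/${splitID[1]}.example2.txt; done
touch $logDir/$id.done
echo Job finished: `date`
else
echo Job already done. To redo, rm $logDir/$id.done
fi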
Now, I run it:
$ submitlog -q short -m 1 -o doExample2.olog -t tasksNamesOutsTimes.txt ./echoNTimes.sh tasksNamesOutsTimes.txt
Your job-array 908493.1-4:1 ("..echoNTimes.sh tasksNamesOutsTimes.txt") has been submitted
$ ls example2/ # check the output files
outFileA.example2.txt outFileC.example2.txt sample1.done sample2.done sample3.done sample4.done
outFileB.example2.txt outFileD.example2.txt sample1.olog sample2.olog sample3.olog sample4.olog
$ cat example2/outFileB.example2.txt # seven instances of the sample name!
sample2
sample2
sample2
sample2
sample2
sample2
sample2
$ submitlog -q short -m 1 -o doExample2.olog -t tasksNamesOutsTimes.txt ./echoNTimes.sh tasksNamesOutsTimes.txt ## rerun the script
Your job-array 908516.1-4:1 ("..echoNTimes.sh tasksNamesOutsTimes.txt") has been submitted
$ cat example2/sample2.olog # the first time I ran the script, it worked, so the second time it skipped execution
Job 908493:2 started on hw-uger-1083: Fri Jan 15 14:32:04 EST 2016
Job finished: Fri Jan 15 14:32:04 EST 2016
Job already done. To redo, rm example2/sample2.done
Finally, check out what was logged by submitlog:
$ cat submitlog.log
Fri Jan 15 13:56:24 2016
~/testDir
submitlog -m 1 -q short -o test_echo.olog 'echo "this is how you use submitlog"'
qsub -b y -cwd -q short -o test_echo.olog -l m_mem_free=1g -e test_echo.olog -N 'echo "this is how you use submitlog"' ./test_echo.olog.sh
Your job 907994 ("echo "this is how you use submitlog"") has been submitted
Fri Jan 15 14:18:44 2016
~/testDir
submitlog -q short -m 1 -o doExample1.olog -t tasksNames.txt ./echoNameInFile.sh tasksNames.txt
qsub -b y -cwd -q short -o doExample1.olog -l m_mem_free=1g -e doExample1.olog -N '..echoNameInFile.sh tasksNames.txt' -t 1-4 './echoNameInFile.sh tasksNames.txt'
Your job-array 908367.1-4:1 ("..echoNameInFile.sh tasksNames.txt") has been submitted
Fri Jan 15 14:31:58 2016
~/testDir
submitlog -q short -m 1 -o doExample2.olog -t tasksNamesOutsTimes.txt ./echoNTimes.sh tasksNamesOutsTimes.txt
qsub -b y -cwd -q short -o doExample2.olog -l m_mem_free=1g -e doExample2.olog -N '..echoNTimes.sh tasksNamesOutsTimes.txt' -t 1-4 './echoNTimes.sh tasksNamesOutsTimes.txt'
Your job-array 908493.1-4:1 ("..echoNTimes.sh tasksNamesOutsTimes.txt") has been submitted
Fri Jan 15 14:34:25 2016
~/testDir
submitlog -q short -m 1 -o doExample2.olog -t tasksNamesOutsTimes.txt ./echoNTimes.sh tasksNamesOutsTimes.txt
qsub -b y -cwd -q short -o doExample2.olog -l m_mem_free=1g -e doExample2.olog -N '..echoNTimes.sh tasksNamesOutsTimes.txt' -t 1-4 './echoNTimes.sh tasksNamesOutsTimes.txt'
Your job-array 908516.1-4:1 ("..echoNTimes.sh tasksNamesOutsTimes.txt") has been submitted
#csj and checkSubbedJobs
checkSubbedJobs shows the currently submitted jobs (running/queued) that were submitted from the current directory (it looks for them in ./submitlog.log). csj uses checkSubbedJobs to show the current running/queued jobs, refreshing every 2 seconds. I almost never use checkSubbedJobs directly.
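In effect, csj behaves like polling checkSubbedJobs on a two-second loop; a rough equivalent (not necessarily the actual implementation) is:
watch -n 2 checkSubbedJobs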