-
Notifications
You must be signed in to change notification settings - Fork 4
Batch Compute Jobs
A batch job runs a program or series of programs without manual intervention. This is useful when launching complex or time-consuming jobs. Unlike interactive qlogin sessions batch jobs continue to run even if you log out. All of their output is written to a file (or files) instead of to the screen.
You do not need to specify which compute nodes to run the job. The qstat command will automatically choose an available compute node or nodes. If no suitable nodes are available, your job will be put into a wait state and will launch as soon as compute nodes are available.
Batch jobs are defined in a job submission script and launched via the qsub command.
The file /share/apps/examples/template.qsub (included below) as a starting point for creating your own job scripts
# template.qsub: Example job submission script
# This is a comment.
# Lines beginning with the # symbol are comments and are not interpreted by
# the Job Scheduler.
# Lines beginning with #$ are special commands to configure the job.
### Job Configuration Starts Here #############################################
# Run the job in the current working directory (Don't change this)
#$ -cwd
# Run this job in the normal Bash shell environment (Don't change this)
#$ -S /bin/bash
# Export all current environment variables to the job (Don't change this)
#$ -V
# Parallel environment and number of CPU cores
# Syntax -pe <parallel_environment> <cpu_cores>
# Possible values include:
# threaded - for shared-memory jobs running on a single compute node
# orte - for multi-node jobs using OpenMPI
# CPU cores: 1-32 for parallel jobs this should match what you pass to
# mpirun below
#
# In this example we request 1 CPU core on a single compute node
#$ -pe threaded 1
# Job name - this will show up in the output of the qstat command
# Don't use spaces or special characters
#$ -N myjob
# Output file name - all regular output goes to this file instead of the screen
#$ -o myjob-out.txt
# Error output file name - all Job Errors are written to this file
#$ -e myjob-err.txt
# Send email upon job end/abort/suspend to the specified address
# Possible values for -m are:
# 'b' Mail is sent at the beginning of the job.
# 'e' Mail is sent at the end of the job.
# 'a' Mail is sent when the job is aborted or rescheduled.
# 's' Mail is sent when the job is suspended.
# 'n' No mail is sent.
# Fill in your email address and uncomment the line below to enable email
##$ -M [email protected] -m eas
### Commands to run your program start here ####################################
# Set up environment
# Examples:
# source /share/apps/nwchem/env.sh
# source /share/apps/qiime/env.sh
# source /share/apps/openmpi/env.sh
# Run your program(s)
# NOTE: CPU count passed to "mpirun -np" must match the number erquestred via
# the -pe parameter above.
# Examples:
# mpirun -np 4 nwchem myfile
# mpirun -np 16 /share/apps/mrbayes/mb-multi myfile.nex
# python myscript.py
echo "Hello, World!"
##Monitoring Batch Jobs All jobs are assigned a job ID and can be monitored with the qstat command which lists all of your running and scheduled jobs. Here is an example of user jsmith monitoring a job.
[jsmith@login-0-0 best]$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
---------------------------------------------------------------------------------------
5842 0.00000 best-jsmi jsmith qw 12/14/2009 12:18:07 4
The output of the qstat command shows a list of your jobs and what state they’re in. In this case there is only one job (job 5842) running and it is in the qw or "queued, waiting" state. The qstat command will show an 'r' for jobs that are running, and an 'e' for jobs in an error state.
To see a listing of all running and scheduled jobs for all users type
qstat -u '*'
Once a job completes, it is no longer listed in the qstat output.
##Canceling a Job To cancel a running or waiting job use the qdel command and pass it the job id of the job you’d like to terminate. You can only cancel jobs for which you are the owner. Here’s an example:
[jsmith@login-0-0 ~]$ qdel 5842
jsmith has registered the job 5842 for deletion
The directory /share/apps/examples contains example batch jobs.
djennewe@login-0-0 my_examples $ cp -r /share/apps/examples/elephant .
djennewe@login-0-0 my_examples $ ls
elephant
djennewe@login-0-0 my_examples $ cd elephant
djennewe@login-0-0 elephant $ ls
elephant.py elephant.qsub README
djennewe@login-0-0 elephant $ qsub elephant.qsub
Your job 510127 ("elephant-job") has been submitted
djennewe@login-0-0 elephant $ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
510127 0.00000 elephant-j djennewe qw 01/19/2017 11:37:27 1
djennewe@login-0-0 elephant $ qstat
djennewe@login-0-0 elephant $ ls
elephant-err.txt elephant-out.txt elephant.png elephant.py elephant.qsub README
djennewe@login-0-0 elephant $
You can view the output file elephant.png by copying it to your computer and opening it.
djennewe@login-0-0 my_examples $ cp -r /share/apps/examples/nwchem .
djennewe@login-0-0 my_examples $ ls
nwchem
djennewe@login-0-0 my_examples $ cd nwchem
djennewe@login-0-0 nwchem $ ls
nwchem-water.qsub README water_molecule.nw
djennewe@login-0-0 nwchem $ qsub nwchem-water.qsub
Your job 510128 ("nwchem-water-job") has been submitted
djennewe@login-0-0 nwchem $ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
510128 0.00000 nwchem-wat djennewe qw 01/19/2017 11:41:36 4
djennewe@login-0-0 nwchem $ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
510128 0.51968 nwchem-wat djennewe r 01/19/2017 11:41:46 [email protected] 4
djennewe@login-0-0 nwchem $ qstat
djennewe@login-0-0 nwchem $ ls
h2o_freq.b h2o_freq.db h2o_freq.hess h2o_freq.nmode nwchem-water-err.txt README
h2o_freq.b^-1 h2o_freq.drv.hess h2o_freq.movecs h2o_freq.p nwchem-water-out.txt water_molecule.nw
h2o_freq.c h2o_freq.fd_ddipole h2o_freq.mp2nos h2o_freq.zmat nwchem-water.qsub
djennewe@login-0-0 nwchem $