Merge pull request #665 from EwDa291/jobcli
Torque frontend for jobcli page
boegel authored Aug 27, 2024
2 parents 95741e1 + b291d40 commit 33833e4
Showing 2 changed files with 154 additions and 0 deletions.
1 change: 1 addition & 0 deletions config/templates/hpc.template
@@ -23,6 +23,7 @@ nav:
- Troubleshooting: troubleshooting.md
- HPC Policies: sites/hpc_policies.md
- Advanced topics:
- Torque frontend via jobcli: torque_frontend_via_jobcli.md
- Fine-tuning Job Specifications: fine_tuning_job_specifications.md
- Multi-job submission: multi_job_submission.md
- Compiling and testing your software on the HPC: compiling_your_software.md
153 changes: 153 additions & 0 deletions mkdocs/docs/HPC/torque_frontend_via_jobcli.md
@@ -0,0 +1,153 @@
# Torque frontend via jobcli

## What is Torque

[Torque](https://en.wikipedia.org/wiki/TORQUE) is a resource manager for submitting and managing jobs on an HPC cluster. It is an implementation of [PBS (Portable Batch System)](https://en.wikipedia.org/wiki/Portable_Batch_System).
Torque is no longer widely used, which is why the {{hpcinfra}} stopped using it in the backend in 2021 in favor of Slurm.
The Torque user interface, which consists of commands like `qsub` and `qstat`, was kept, however, so that researchers would not have to learn new commands to submit and manage jobs.
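
For example, submitting a job script and inspecting the queue still works with the familiar Torque commands (a minimal illustration; `job.sh` is a placeholder name for your own job script):

```shell
# submit a job script to the cluster
$ qsub job.sh

# list your queued and running jobs
$ qstat
```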

## Slurm backend

[Slurm](https://en.wikipedia.org/wiki/Slurm_Workload_Manager) is a resource manager for submitting and managing jobs on an HPC cluster, similar to Torque (but more advanced/modern in some ways). Currently, Slurm is the most popular workload manager on HPC systems worldwide, but it has a user interface that is different and in some sense less user friendly than Torque/PBS.

## jobcli

Jobcli is a Python library that was developed by {{hpcteam}} to make it possible for the {{hpcinfra}} to combine a Torque frontend with a Slurm backend. It also adds some extra options to the Torque commands. Put simply, jobcli can be thought of as a Python script that "translates" Torque commands into equivalent Slurm commands, and, in the case of `qsub`, also makes some changes to the provided job script to make it compatible with Slurm.
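
As a rough sketch of this translation (illustrative only; the exact translation for a given job script is shown by the `--dryrun` option below), a Torque resource request is mapped onto the corresponding `sbatch` options:

```shell
# Torque frontend command:
$ qsub -l nodes=1:ppn=8 -l walltime=2:30:00 job.sh

# roughly corresponds to this Slurm backend command:
$ sbatch --nodes=1 --ntasks-per-node=8 --time=02:30:00 job.sh
```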

### Additional options for Torque commands supported by jobcli

#### help option

Adding `--help` to a Torque command when using it on the {{hpcinfra}} will output an extensive overview of all supported options for that command (both the original Torque options and the ones added by jobcli), with a short description of each one.

For example:
```shell
$ qsub --help
usage: qsub [--version] [--debug] [--dryrun] [--pass OPTIONS] [--dump PATH]...

Submit job script

positional arguments:
script_file_path Path to job script to be submitted (default: read job
script from stdin)

optional arguments:
-A ACCOUNT Charge resources used by this job to specified account
...
```

#### dryrun option

Adding `--dryrun` to a Torque command when using it on the {{hpcinfra}} will show the user which Slurm command jobcli generates for that Torque command. Using `--dryrun` will not actually execute the Slurm backend command.

See also [the examples](./#examples) below.

#### debug option

Similarly to `--dryrun`, adding `--debug` to a Torque command when using it on the {{hpcinfra}} will show the user which Slurm command jobcli generates for that Torque command. However, in contrast to `--dryrun`, using `--debug` will actually run the Slurm backend command.

See also [the examples](./#examples) below.

#### Examples

The following examples illustrate how the `--dryrun` and `--debug` options work with an example job script.

`example.sh`:

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=8
#PBS -l walltime=2:30:00

module load SciPy-bundle/2023.11-gfbf-2023b

python script.py > script.out.${PBS_JOBID}
```

##### Example of the dryrun option

Running the following command:

```shell
$ qsub --dryrun example.sh -N example
```

will generate this output:

```shell

Command that would have been run:
---------------------------------

/usr/bin/sbatch

Job script that would have been submitted:
------------------------------------------

#!/bin/bash
#SBATCH --chdir="/user/gent/400/{{userid}}"
#SBATCH --error="/kyukon/home/gent/400/{{userid}}/examples/%x.e%A"
#SBATCH --export="NONE"
#SBATCH --get-user-env="60L"
#SBATCH --job-name="example"
#SBATCH --mail-type="NONE"
#SBATCH --nodes="1"
#SBATCH --ntasks-per-node="8"
#SBATCH --ntasks="8"
#SBATCH --output="/kyukon/home/gent/400/{{userid}}/examples/%x.o%A"
#SBATCH --time="02:30:00"

### (start of lines that were added automatically by jobcli)
#
# original submission command:
# qsub --dryrun example.sh -N example
#
# directory where submission command was executed:
# /kyukon/home/gent/400/{{userid}}/examples
#
# original script header:
# #PBS -l nodes=1:ppn=8
# #PBS -l walltime=2:30:00
#
### (end of lines that were added automatically by jobcli)

#!/bin/bash

module load SciPy-bundle/2023.11-gfbf-2023b

python script.py > script.out.${PBS_JOBID}
```
This output consists of a few components. For our example, the most important lines are the ones that start with `#SBATCH`, since these contain the translation of the Torque options into Slurm options. For example, the job name is the one we specified with the `-N` option in the command.

With this dry run, you can see that changes were only made to the header; the job script itself is not changed at all. Any PBS-related constructs the job script uses, like `$PBS_JOBID`, are retained. Slurm is configured on the {{hpcinfra}} such that common `PBS_*` environment variables are defined in the job environment, alongside their Slurm equivalents.
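
For instance, a job script can keep using the familiar PBS environment variables next to their Slurm counterparts (a small sketch, assuming the `PBS_*` variables are defined as described above):

```shell
# both variables refer to the same job ID on the {{hpcinfra}}
echo "PBS job ID:   ${PBS_JOBID}"
echo "Slurm job ID: ${SLURM_JOB_ID}"
```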

##### Example of the debug option

Similarly to the `--dryrun` example, we start by running the following command:

```shell
$ qsub --debug example.sh -N example
```

which generates this output:

```shell
DEBUG: Submitting job script location at example.sh
DEBUG: Generated script header
#SBATCH --chdir="/user/gent/400/{{userid}}"
#SBATCH --error="/kyukon/home/gent/400/{{userid}}/examples/%x.e%A"
#SBATCH --export="NONE"
#SBATCH --get-user-env="60L"
#SBATCH --job-name="example"
#SBATCH --mail-type="NONE"
#SBATCH --nodes="1"
#SBATCH --ntasks-per-node="8"
#SBATCH --ntasks="8"
#SBATCH --output="/kyukon/home/gent/400/{{userid}}/examples/%x.o%A"
#SBATCH --time="02:30:00"
DEBUG: HOOKS: Looking for hooks in directory '/etc/jobcli/hooks'
DEBUG: HOOKS: Directory '/etc/jobcli/hooks' does not exist, so no hooks there
DEBUG: Running command '/usr/bin/sbatch'
64842138
```
The output once again contains the translated `#SBATCH` header lines, along with some additional debug information and the job ID of the job that was submitted.
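
The job ID printed on the last line (`64842138` here) can then be used with the usual Torque commands, for example (a brief sketch, assuming `qdel` is also provided by the jobcli frontend):

```shell
# check the status of the submitted job
$ qstat 64842138

# cancel the job if needed
$ qdel 64842138
```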
