Note: before reading this I would recommend reading this HEC guide, as it explains array jobs a lot better. This is an extended NLP example of that guide on creating an array job. In essence, an array job is a single job that contains multiple tasks, whereby each task is run on a different node in the HEC.
In this example we are going to tag all of the files/books in the directory ./files with Named Entities using the Stanza Named Entity Recognizer (NER).
- ./files/files_1 directory contains 2 books authored by Jane Austen, downloaded from Project Gutenberg
- ./files/files_2 directory contains 2 books authored by Arthur Conan Doyle, downloaded from Project Gutenberg
We are going to use a similar tagging Python script to the one in the single job example, ./tagging.py, which takes four arguments (a hypothetical sketch of such a script is shown after this list):
- A file which will be split into paragraphs of text; the paragraphs will then be batched and tagged using the Stanza Named Entity Recognizer (NER).
- A file to store the Named Entities found by tagging the file given as the first argument. This file will be TSV formatted with the following fields: paragraph_number, entity text, entity label, start character offset, end character offset.
- The batch size. This states the number of paragraphs that the NER model will process at once. The larger the batch size, the more RAM is required, but the faster the model will process the whole text.
- A file path to a directory that will store the Stanza pre-trained models that the script downloads.
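To make these arguments concrete, below is a minimal, hypothetical sketch of what a tagging script like ./tagging.py could look like (the actual script in this example may differ, e.g. in how it splits paragraphs). It assumes paragraphs are separated by blank lines and uses Stanza's bulk-processing interface to tag one batch of paragraphs at a time:

```python
# Hypothetical sketch of a tagging script like ./tagging.py
# (the actual script in this example may differ).
import sys

import stanza


def main(input_file: str, output_file: str, batch_size: int, model_dir: str) -> None:
    # Load the English tokenizer + NER pipeline from the pre-downloaded model directory.
    nlp = stanza.Pipeline(lang="en", processors="tokenize,ner", dir=model_dir)

    # Assumption: paragraphs in the input file are separated by blank lines.
    with open(input_file, encoding="utf-8") as input_fp:
        paragraphs = [p.strip() for p in input_fp.read().split("\n\n") if p.strip()]

    with open(output_file, "w", encoding="utf-8") as output_fp:
        # Tag the paragraphs one batch at a time using Stanza's bulk-processing interface.
        for batch_start in range(0, len(paragraphs), batch_size):
            batch = paragraphs[batch_start:batch_start + batch_size]
            docs = nlp([stanza.Document([], text=paragraph) for paragraph in batch])
            for offset, doc in enumerate(docs):
                paragraph_number = batch_start + offset
                for entity in doc.ents:
                    output_fp.write(f"{paragraph_number}\t{entity.text}\t{entity.type}\t"
                                    f"{entity.start_char}\t{entity.end_char}\n")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2], int(sys.argv[3]), sys.argv[4])
```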
There are two ways we could create a submission script for the HEC (in all of these examples we assume the Stanza pre-trained models are stored at $global_storage/stanza_models):
- We use one node on the HEC and get that one node to process all of the files in the file folder (./files); in essence this is the approach used in the single job example. The code below comes from the ./single_tagging.com submission script.
```bash
source /etc/profile
module add anaconda3/wmlce
source activate $global_storage/conda_environments/py3.8-multiple-similar-job

mkdir -p ./output_files
for folder_number in `seq 1 2`; do
    mkdir -p ./output_files/files_$folder_number
    for file_number in `seq 0 1`; do
        python tagging.py ./files/files_$folder_number/$file_number.txt ./output_files/files_$folder_number/$file_number.tsv 50 $global_storage/stanza_models
    done
done
```
- We submit each file folder to a different node on the HEC, meaning that while one node is tagging one folder of files, another is processing the other folder. This makes the processing, in this case, twice as fast, as we have two nodes processing the files rather than one. If the HEC does not have enough free nodes it will queue the second folder of files until a node is free. The code below comes from the ./tagging.com submission script.
```bash
#$ -t 1-2:1

source /etc/profile
module add anaconda3/wmlce
source activate $global_storage/conda_environments/py3.8-multiple-similar-job

mkdir -p ./output_files
mkdir -p ./output_files/files_$SGE_TASK_ID
for file_number in `seq 0 1`; do
    python tagging.py ./files/files_$SGE_TASK_ID/$file_number.txt ./output_files/files_$SGE_TASK_ID/$file_number.tsv 50 $global_storage/stanza_models
done
```
In this bash script we are using 2 nodes, or more correctly we are submitting two tasks to the HEC in the hope of having two nodes running at the same time. These two tasks are declared through the #$ -t 1-2:1 directive, which states that the tasks are represented as task 1 and task 2 (1-2:1 means create a sequence of tasks starting at 1, ending at 2, with an increment of 1). These task indexes/ids are made available through the environment variable $SGE_TASK_ID. We then create an output_files folder, and within that folder each task will have its own output folder, ./output_files/files_$SGE_TASK_ID. Lastly, each task will tag the files in the folder matching its task index and output to its own output folder. This is the reason why the file directories were named ./files/files_1 and ./files/files_2, as it makes setting up this way of processing easier. It is also the reason why the books/files are named 0.txt and 1.txt, as it makes processing them easier: we can set up the processing as a for loop in the script above.
In this example we are going to use the second approach, as it is the quicker way of tagging the data.
You might be wondering why we do not use a node/task per file/book. The reason is that tagging one book/file is a relatively quick task (less than 5 minutes), and the HEC guide states that quick/short jobs should be avoided as they are an inefficient use of the HEC. It is therefore better to batch the processing of these books/files, which is why each node/task processes 2 books rather than just one. In reality, depending on how many books/files you need to process, you would have a batch of more than 2 books, but I wanted to keep this example as short in processing time as possible for environmental and learning reasons. For more details on this see the managing short jobs section of the HEC guide.
Note that the #$ -t 1-2:1 task index sequence has to start at an index equal to or greater than 1; it cannot start at index 0. For example, #$ -t 0-4:2 is not valid as the index sequence would start at 0.
The rest of this tutorial is laid out as follows:
- Explain any differences to the standard installation process.
- How to run the script on the HEC.
Before running this script we will need to create a custom Conda environment so that we have a Python environment that has Stanza installed. For details on creating your own custom Conda/Python environment see the installation tutorial. For this task we also need the Stanza English pre-trained NER model, which we download in the installation submission script, ./install.com, on line 20:
python ./download_stanza.py $global_storage/stanza_models
This line runs the ./download_stanza.py Python script, whereby the first argument states the directory to save the Stanza models to. In this case we save them to the $global_storage/stanza_models directory.
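For reference, such a download script can be very small. The sketch below is a hypothetical version (the actual ./download_stanza.py in this example may differ); it only relies on Stanza's stanza.download function:

```python
# Hypothetical sketch of a download script like ./download_stanza.py
# (the actual script in this example may differ).
import sys

import stanza

if __name__ == "__main__":
    # First command line argument: the directory to store the pre-trained Stanza models in.
    model_dir = sys.argv[1]
    # Download the English models, which include the pre-trained NER model,
    # into the given directory.
    stanza.download("en", model_dir=model_dir)
```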
- Transfer this directory to your home directory on the HEC:
scp -r ../multiple_similar_jobs/ [email protected]:./
- Login to the HEC with ssh [email protected] and go to the multiple similar jobs directory: cd multiple_similar_jobs
- Create the Conda environment with the relevant Python dependencies and download the Stanza English NER model. This can be done by submitting the ./install.com job, e.g. qsub install.com. This will create the Conda environment at $global_storage/conda_environments/py3.8-multiple-similar-job. This may take some time, e.g. 5 minutes; to monitor the progress of the job use qstat (see the HEC monitoring page for more details on the command).
- We can now run the ./tagging.py script by submitting the following job: qsub tagging.com. The main points of the ./tagging.com submission script were explained above.
- If you run the qstat command you should see that you have two tasks submitted, e.g.:
```
job-ID   prior    name        user    state  submit/start at      queue                   slots  ja-task-ID
-------------------------------------------------------------------------------------------------------------
6841142  0.50000  multiple-s  moorea  r      01/20/2021 08:02:48  [email protected]  1      1
6841142  0.50000  multiple-s  moorea  r      01/20/2021 08:02:48  [email protected]  1      2
```
At the start of the job you may see the following from qstat:
```
job-ID   prior    name        user    state  submit/start at      queue  slots  ja-task-ID
--------------------------------------------------------------------------------------------
6841142  0.00000  multiple-s  moorea  qw     01/20/2021 08:02:44         1      1,2
```
But the HEC knows it is an array job, as it shows the task indexes in the ja-task-ID column, and eventually the job will be split into two tasks.
- After the tagging script has finished running (the longest task takes around 28 minutes to run; see the run time statistics step below), the Named Entities found in each book/file will be output into the ./output_files folder. Additionally, as we are running multiple tasks, each task has its own standard error and standard output file; e.g. in my case the standard output file for task 1 was multiple-similar-job.o6841142.1 and for task 2 was multiple-similar-job.o6841142.2. The files are named as job name.job ID.task ID.
- If you would like to know some run time statistics, like the total time each task took, you can use the qacct command (see the HEC pages for more details on this command), e.g. to get these statistics for the first task I would run qacct -j 6841278 -t 1 and for the second task qacct -j 6841278 -t 2. The longest run time (ru_wallclock) was task 1, with a total run time of 1657 seconds, which is around 28 minutes. You may notice that it states the max memory/RAM used (maxvmem in the report qacct creates) to be close to 1 GB; from my experience the job still requires a memory allocation of 3 GB, as it would appear that parts of the Stanza model require large amounts of memory for a very short period of time, which the memory monitor appears not to pick up on.
- Optional: As you have limited space in your storage directory you may want to delete the Conda environment created by this job. To do so:
rm -r $global_storage/conda_environments/py3.8-multiple-similar-job
- Optional: As you have 100 GB of space in your $global_storage directory you may want to delete the downloaded Stanza models. To do so, cd $global_storage and then rm -r stanza_models
- Optional: You may want to transfer the results of extracting the Named Entities to your home/local computer. To do so, open a terminal on your home computer, change to the directory you want the Named Entities folder to be saved to, and run:
scp -r [email protected]:./multiple_similar_jobs/output_files .
Just as a comparison, I ran the ./single_tagging.com script on the HEC, which only uses one node, and it takes 49 minutes compared to the 28 minutes that ./tagging.com takes. This shows the benefit of making use of multiple nodes on the HEC compared to just using one node.