This example shows how to submit a job to the HEC that uses a single CPU on one node. The details are all in the job submission script, ./tagging.com (submission scripts typically use the .com file extension). The parameters of that script are best explained in the HEC documentation, in the batch jobs section under the sub-section example of a batch job script. The tagging script uses about 120 MB of memory, which is why we do not need to specify the #$ -l h_vmem flag/parameter in ./tagging.com.
This example tags the Alice in Wonderland text, which can be found at ./alice-in-wonderland.txt, with Named Entities using the SpaCy Named Entity Recognizer (NER). To do so we use the ./tagging.py Python script, which takes 3 arguments:
- A file which will be split into paragraphs of text, whereby the paragraphs will be batched and tagged using a SpaCy Named Entity Recognizer (NER).
- A file to store the Named Entities found through tagging the file found in the first argument. This file will be TSV formatted with the following fields:

| paragraph_number | entity text | entity label | start character offset | end character offset |
|---|---|---|---|---|
- The batch size. This states the number of paragraphs that the NER model will process at once. The larger the batch size the more RAM required but the faster the model will process the whole text.
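The three arguments describe a small pipeline: read a file, split it into paragraphs, tag the paragraphs in batches, and write one TSV row per entity. The sketch below is one way such a script could be structured, not the actual contents of ./tagging.py. The names `split_paragraphs`, `tag_to_rows` and `main` are hypothetical, paragraphs are assumed to be separated by blank lines, and paragraph numbering is assumed to start at 0.

```python
import csv
import re
from typing import Iterator, List, Tuple

# paragraph_number, entity text, entity label, start offset, end offset
Row = Tuple[int, str, str, int, int]


def split_paragraphs(text: str) -> List[str]:
    """Split text into paragraphs, assuming blank lines separate them."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tag_to_rows(nlp, paragraphs: List[str], batch_size: int) -> Iterator[Row]:
    """Yield one TSV row per entity found in the paragraphs.

    `nlp` is any object with a spaCy-style `pipe` method, for example the
    pipeline returned by spacy.load("en_core_web_sm").
    """
    # nlp.pipe processes the texts in batches of `batch_size` and yields
    # one Doc per paragraph, in input order.
    docs = nlp.pipe(paragraphs, batch_size=batch_size)
    # Paragraph numbers start at 0 here; that choice is an assumption.
    for paragraph_number, doc in enumerate(docs):
        for ent in doc.ents:
            yield (paragraph_number, ent.text, ent.label_,
                   ent.start_char, ent.end_char)


def main(argv: List[str]) -> None:
    """Hypothetical entry point: main([text_file, output_tsv, batch_size])."""
    import spacy  # imported lazily so the helpers above work without spaCy

    text_path, output_path, batch_size = argv[0], argv[1], int(argv[2])
    with open(text_path, encoding="utf-8") as handle:
        paragraphs = split_paragraphs(handle.read())
    nlp = spacy.load("en_core_web_sm")
    with open(output_path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerows(tag_to_rows(nlp, paragraphs, batch_size))
```

Passing the batch size straight to `nlp.pipe` leaves the batching to SpaCy, which is why a larger value trades more RAM for faster processing.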
Given this script we can process the Alice in Wonderland text and extract all Named Entities by simply running the Python script as follows:
python tagging.py ./alice-in-wonderland.txt ./output.tsv 50
The Named Entities will be saved to ./output.tsv. To run this script on the HEC we will have to install the relevant Python dependencies, which is explained next.
The rest of this tutorial is laid out as follows:
- Explain any differences to the standard installation process.
- How to run the script on the HEC.
Before running this script we need to create a custom Conda environment so that we have a Python environment with SpaCy installed. For details on creating your own custom Conda/Python environment see the installation tutorial. This task also needs the SpaCy English pre-trained NER model, which is downloaded in the installation submission script, ./install.com, on line 19:
python -m spacy download en_core_web_sm
These SpaCy models are saved by default within the Conda environment, as they are Python pip packages. Deleting the Conda environment therefore also deletes these downloaded English pre-trained models.
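Because the model is installed as an ordinary Python package, one quick way to confirm it is present in the active environment is to ask Python whether the package can be found. The helper below is a minimal sketch; the function name `model_is_installed` is hypothetical and is not part of SpaCy.

```python
import importlib.util


def model_is_installed(package_name: str = "en_core_web_sm") -> bool:
    """Return True if the named top-level package is importable
    from the currently active Python environment."""
    return importlib.util.find_spec(package_name) is not None
```

Run inside the activated Conda environment, `model_is_installed()` should return True once the download step has completed, and False in an environment where the model was never installed.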
- Transfer this directory to your home directory on the HEC:
scp -r ../single_job/ [email protected]:./
- Log in to the HEC:
ssh [email protected]
and go to the single job directory:
cd single_job
- Create the Conda environment with the relevant Python dependencies and download the SpaCy English model. This can be done by submitting the ./install.com job, e.g.
qsub install.com
This will create the Conda environment at $global_storage/conda_environments/py3.8-single-job. This may take some time, e.g. 5 minutes. To monitor the progress of the job use qstat; see the HEC monitoring page for more details on the command.
- We can now run the ./tagging.py script by submitting the following job:
qsub tagging.com
The ./tagging.com submission script first adds the anaconda3/wmlce module so that we have Conda installed on the compute node we are using, then activates our custom Conda/Python environment (source activate $global_storage/conda_environments/py3.8-single-job), and lastly runs the ./tagging.py script (python tagging.py ./alice-in-wonderland.txt ./output.tsv 50):
source /etc/profile
module add anaconda3/wmlce
source activate $global_storage/conda_environments/py3.8-single-job
python tagging.py ./alice-in-wonderland.txt ./output.tsv 50
- After the tagging script has finished running, the Named Entities found in the text will be written to the output.tsv file.
- Optional: As you have limited space in your home directory you may want to delete the Conda environment created for this job. To do so:
rm -r $global_storage/conda_environments/py3.8-single-job
Further, as we cached the Conda packages you may want to clean the Conda cache directory; to do so follow the directions at ../../install_packages/README.md#Conda and Pip cache management.
- Optional: You may want to transfer the results of extracting the Named Entities to your home/local computer. To do so open a terminal on your home computer, change to the directory you want the Named Entities file to be saved to, and run:
scp [email protected]:./single_job/output.tsv .
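Once output.tsv is on your local computer, a few lines of Python are enough to get an overview of what was tagged. The sketch below counts how often each entity label occurs. It assumes the column order listed earlier (the label is the third field) and skips any row whose first field is not a number, so it works whether or not the file has a header row. The function name `count_labels` is hypothetical.

```python
import csv
from collections import Counter


def count_labels(tsv_path: str) -> Counter:
    """Count entity labels in a TSV file of tagged Named Entities."""
    counts: Counter = Counter()
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines, short rows, and a header row if present.
            if len(row) < 5 or not row[0].strip().isdigit():
                continue
            counts[row[2]] += 1
    return counts
```

For example, `count_labels("output.tsv").most_common(5)` lists the five most frequent entity labels with their counts.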