
Single job example

This example shows how to submit a job to the HEC that uses a single CPU on one node. The job is defined in the submission script ./tagging.com (submission scripts typically use the .com file extension). The parameters of that script are explained in the HEC documentation, in the batch jobs section under the sub-section "example of a batch job script". The tagging script uses about 120 MB of memory, which is why we do not need to specify the #$ -l h_vmem flag/parameter in ./tagging.com.
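If you adapt this example to a different input or batch size, it can be useful to check how much memory the script actually uses before deciding whether a memory flag is needed. The following is a minimal sketch of one way to do that, assuming a Unix-like system; the helper name peak_memory_mb is hypothetical and is not part of ./tagging.py.

```python
import resource
import sys


def peak_memory_mb() -> float:
    """Return the peak resident memory of this process in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    # Call this at the end of a script to report its peak memory use.
    print(f"Peak memory: {peak_memory_mb():.1f} MB")
```

Calling the helper as the last step of a script reports the peak for the whole run, including model loading.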

This example will show how to tag the Alice in Wonderland text, which can be found at ./alice-in-wonderland.txt, with Named Entities using the SpaCy Named Entity Recognizer (NER). To do so we can use the ./tagging.py python script which takes 3 arguments:

1. A file which will be split into paragraphs of text, whereby the paragraphs will be batched and tagged using a SpaCy Named Entity Recognizer (NER).
2. A file to store the Named Entities found through tagging the file found in the first argument. This file will be TSV formatted with the following fields:

paragraph_number	entity text	entity label	start character offset	end character offset

3. The batch size. This states the number of paragraphs that the NER model will process at once. The larger the batch size the more RAM required but the faster the model will process the whole text.
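To make the three arguments concrete, here is a rough sketch of the paragraph splitting, batching, and TSV writing described above. It is not the code in ./tagging.py: the function names are hypothetical, paragraphs are assumed to be separated by blank lines, and the sample row is written by hand rather than produced by the NER model.

```python
import csv
import io
import re
from typing import Iterable, Iterator, List, Tuple

# (paragraph_number, entity text, entity label, start offset, end offset)
EntityRow = Tuple[int, str, str, int, int]


def split_paragraphs(text: str) -> List[str]:
    """Split text on blank lines and drop empty paragraphs (an assumption)."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        cleaned = " ".join(block.split())  # collapse line breaks and extra spaces
        if cleaned:
            paragraphs.append(cleaned)
    return paragraphs


def batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size paragraphs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def write_rows(rows: Iterable[EntityRow], handle) -> None:
    """Write entity rows as tab-separated values, one entity per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


if __name__ == "__main__":
    sample = "Alice was tired.\n\nThe Rabbit ran past.\n\nShe followed it."
    for batch in batches(split_paragraphs(sample), 2):
        print(len(batch), "paragraph(s) in this batch")
    out = io.StringIO()
    # Hand-written example row, not real model output.
    write_rows([(0, "Alice", "PERSON", 0, 5)], out)
    print(out.getvalue(), end="")
```

A larger batch size means fewer, bigger groups are passed to the model at once, which is where the trade-off between speed and RAM comes from.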

Given this script we can process the Alice in Wonderland text and extract all Named Entities by simply running the Python script as follows:

python tagging.py ./alice-in-wonderland.txt ./output.tsv 50

The Named Entities will be saved to ./output.tsv. To run this script on the HEC we first have to install the relevant Python dependencies, which is explained next.

The rest of this tutorial is laid out as follows:

Installation: any differences from the standard installation process.
Run on the HEC: how to run the script on the HEC.

Installation

Before running this script we need to create a custom Conda environment so that we have a Python environment with SpaCy installed. For details on creating your own custom Conda/Python environment see the installation tutorial. This task also needs the SpaCy English pre-trained NER model, which is downloaded in the installation submission script, ./install.com, on line 19:

python -m spacy download en_core_web_sm

These SpaCy models are saved by default within the Conda environment because they are installed as Python pip packages, so deleting the Conda environment also deletes the downloaded English pre-trained model.
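Because the model is an ordinary Python package, you can check whether it is present in the active environment without loading it. The sketch below uses only the standard library; the helper name package_installed is hypothetical.

```python
import importlib.util


def package_installed(name: str) -> bool:
    """Return True if a top-level package `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    model = "en_core_web_sm"
    if package_installed(model):
        print(f"{model} is available in this environment")
    else:
        print(f"{model} is missing; run: python -m spacy download {model}")
```

Running this inside the activated Conda environment is a quick way to confirm that the install job finished the download step.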

Run on the HEC

Transfer this directory to your home directory on the HEC: scp -r ../single_job/ [email protected]:./
Log in to the HEC with ssh [email protected] and go to the single job directory: cd single_job
Create the Conda environment with the relevant Python dependencies and download the SpaCy English model. This can be done by submitting the ./install.com job, e.g. qsub install.com. This will create the Conda environment at $global_storage/conda_environments/py3.8-single-job. This may take some time, e.g. 5 minutes. To monitor the progress of the job use qstat; see the HEC monitoring page for more details on the command.
We can now run the ./tagging.py script by submitting the following job: qsub tagging.com

The ./tagging.com submission script first adds the anaconda3/wmlce module so that Conda is available on the compute node we are using, then activates our custom Conda/Python environment (source activate $global_storage/conda_environments/py3.8-single-job), and lastly runs the ./tagging.py script (python tagging.py ./alice-in-wonderland.txt ./output.tsv 50):

source /etc/profile
module add anaconda3/wmlce
source activate $global_storage/conda_environments/py3.8-single-job

python tagging.py ./alice-in-wonderland.txt ./output.tsv 50

After the tagging script has finished running, the Named Entities found in the text will be written to the output.tsv file.
Optional: As you have limited space in your home directory you may want to delete the Conda environment created for this job. To do so, run rm -r $global_storage/conda_environments/py3.8-single-job. Further, as we cached the Conda packages, you may want to clean the Conda cache directory; to do so follow the directions at ../../install_packages/README.md#Conda and Pip cache management.
Optional: You may want to transfer the extracted Named Entities to your home/local computer. To do so, open a terminal on your home computer, change to the directory you want the Named Entities file to be saved to, and run scp [email protected]:./single_job/output.tsv .
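Once output.tsv is available, a quick way to sanity-check the run is to count how many entities were found per label. The sketch below assumes the five-column layout described earlier, no header row, and default CSV quoting; none of these are confirmed by ./tagging.py, and the sample rows are hand-written rather than real output.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_labels(handle: TextIO) -> Counter:
    """Count rows per entity label (third column) in the TSV output."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 5:
            continue  # skip blank or malformed lines
        counts[row[2]] += 1
    return counts


if __name__ == "__main__":
    # For a real run, use: open("output.tsv", newline="", encoding="utf-8")
    sample = io.StringIO(
        "0\tAlice\tPERSON\t0\t5\n"
        "3\tLondon\tGPE\t10\t16\n"
        "4\tAlice\tPERSON\t2\t7\n"
    )
    for label, n in count_labels(sample).most_common():
        print(label, n)
```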
