The GOMAP-singularity Boilerplate repo has been distributed for the use with GOMAP-singularity to annotate GO terms to plant genes on SLURM high-performance computing environments.
git clone https://github.com/Dill-PICL/GOMAP-boilerplate.git /path/for/annotation
GOMAP-singularity has seven steps for annotating the proteins. Some of these steps are parallelized and some of them are not parallelized.
Number | Step | Single | MPI | Concurrent |
---|---|---|---|---|
1 | seqsim | Y | N | Y |
2 | domain | Y | Y | Y |
3 | fanngo | Y | N | Y |
4 | mixmeth-blast | Y | Y | Y |
5 | mixmeth-preproc | Y | N | N |
6 | mixmeth | Y | N | N |
7 | aggregate | Y | N | N |
The boilerplate has steps to run the annotation on Sorghum reference genome, and the input file is Sbicolor_454_v3.1.1.protein_primaryTranscriptOnly.fa
. Files have to be customized for individual clusters
input:
#input fasta file name
fasta: your_input.fasta
# output file basename
basename: desired_output_basename
#input NCBI taxonomy id
taxon: "XXXXX"
# Name of the species
species: "You species"
# Email is mandatory
email: [email protected]
#Number of CPUs used for tools
cpus: Find this out from the cluster help page
#Whether mpi should be used (mpich-3.2.1 is default)
mpi: True
#what the name of the
tmpdir: "/tmpdir"
# set the tmpdir to a variable that will be local to the specific node
export tmpdir="$LOCAL" # or other location based on cluster
# Loading necessary modules and this would be specific to each cluster
# Loading the mpich module. This is based on how it is called on the cluster and contact the cluster help to get the correct module to load
module load mpich
# Load the singulairy module. GOMAP-singulrity currently only works with Singulairty 2.6.
x and you have to use the correct version
module load singularity/3.1.x
#Set the location of GOMAP-singularity base directory. This is essential to correctly run the pipeline
export GOMAP_LOC="/path/to/GOMAP-singularity/"
Each cluster has specific parameters that need to be optimized for the sbatch files before they can be successfully run. A list of such parameters are given below.
-
--ntasks-per-node=XX This parameter is used by some clusters to assign entire nodes so assign the number based on the number of cores available on each compute node
-
--nodes=XX This parameter is used for assigning number of nodes for each job. The domain and mixmeth-blast jobs need customization based on the number of nodes you can use for parallelized jobs
-
--partition=XX Some of the clusters have specific partitions that have to be used for big or parallelized jobs so find out which partition can be used for that
-
-mail-user=[email protected] Assign a correct value to get job status changes to see which jobs are done.