I describe the procedure for the PSSM generation from the FASTA sequences. It is asynchronous parallel processing that can process up to n-sequence at a time. I spend a considerable amount of time on the PSSM generation purpose. It is definitely a hard and tedious procedure, but I make it easy so that other researchers can use it efficiently. People can use it for PSSM generation; unfortunately, I did not check the benchmark yet.
Please find the latest version of BLAST tool from the given website (https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). Then download one of the files as per your operating system (OS) requirement. As I am a Linux OS user, that is why I downloaded "ncbi-blast-...-linux.tar.gz". Please don't worry about the version; it usually changes over time.
user@machine:~$ wget 'https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.10.1+-x64-linux.tar.gz' ### Fetch from website
user@machine:~$ tar -xvzf ncbi-blast-2.10.1+-x64-linux.tar.gz ### Extract the tool after the download
We can download the nr
database from official website (https://ftp.ncbi.nlm.nih.gov/blast/db/), and the downloading processes are given below.
user@machine:~$ wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*.tar.gz'
user@machine:~$ wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*.tar.gz.md5'
user@machine:~$ /home/user/ncbi-blast-2.10.1+/bin/update_blastdb.pl --decompress nr [*]
- The database had 39 segments, and the initial file size was approximately 100GB; we would get around 450GB after the extraction (Until 2020).
- The database was 54 segments (Last Update: August 1, 2021).
- The database is now 78 segments; the initial file size was approximately 193GB, and we got around xxGB after the extraction; (Last Update: July 12, 2023).
- The segments change frequently.
- We can get the
update_blastdb.pl
file from BLAST tool.
When the nr
database will be old, no need to download (or upgrade) rather than update the previous one.
user@machine:~$ /home/user/ncbi-blast-2.10.1+/bin/update_blastdb.pl --decompress nr [*]
- We can get the
update_blastdb.pl
file from BLAST tool. - Plese run the script from the
nr
directory (or folder), otherwise it won't work.
n=38 ### If the number of the segment is n, then we will use n-1.
for i in $(seq 0 1 $n); do
if [ $i -lt 10 ]; then
tar -xvzf nr.0$i.tar.gz
else
tar -xvzf nr.$i.tar.gz
fi
done
- A question can arise why I used 38 in the loop? The answer is, I got the 39 segments in the
nr
directory (or folder). - Please make sure that how many segments you have, then update the value of
n
. We can find it fromnr.pal
innr
directory (or folder). - Plese run the script from the
nr
directory (or folder), otherwise it won't work. - We will find the update decompress procedure from given URL.
File = '/home/user/Bioinformatics/multiSequences.fa'
from Bio import SeqIO # Install (If you don't have it.): pip install biopython
C= 1
for record in SeqIO.parse(File, 'fasta'):
openFile = open(str(C) + '.fasta', 'w')
SeqIO.write(record, openFile, 'fasta')
C += 1
#end-for
- I renamed the origial name of FASTA sequence as it is helpful for tracking the implementation.
- I used sequential numerical order rather than the original sequence name.
- Renaming the sequence is optional.
- We will find the updated FASTA splitting procedure from given URL.
- We can also use Colab for the splitting Multiple FASTA sequence into single sequences [Update Implementation].
File = '/home/rafsanjani/Downloads/TS88.fa'
from Bio import SeqIO
C= 1
print('Original-Sequence-Name, Renamed-Sequence, Corresponding-PSSM')
for record in SeqIO.parse(File, 'fasta'):
print('{}, {}.fasta, {}.fasta.pssm'.format(record.id, C, C))
C += 1
#end-for
- As I rename the sequences, you can track the original sequences.
- You will find the update procedure from given URL.
###
database = '/home/learning/mrzResearchArena/NR/nr' # Please, set path where "nr" database directory is located.
PSSM = '/home/learning/mrzResearchArena/PSSM' # Please, set path where PSSM directory is located.
core = 8 # multiprocessing.cpu_count()
###
###
import multiprocessing
import time
import glob
import os
os.chdir(PSSM)
###
###
def runPSIBLAST(file):
try:
os.system('/home/learning/ncbi-blast-2.10.1+/bin/psiblast -query {} -db {} -out {}.out -num_iterations 3 -out_ascii_pssm {}.pssm -inclusion_ethresh 0.001 -comp_based_stats 0 -num_threads 1'.format(file, database, file, file))
except:
print('PSI-BLAST is error for the sequence {}!'.format(file))
return '{}, is error.'.format(file)
return '{}, is done.'.format(file)
#end-def
###
###
begin = time.time()
pool = multiprocessing.Pool(processes=core)
results = [ pool.apply_async(runPSIBLAST, args=(file,)) for file in glob.glob('*.fasta') ] # for x in range(1, 10)
###
###
outputs = [result.get() for result in results]
end = time.time()
###
###
print(sorted(outputs))
print()
print('Time elapsed: {} seconds.'.format(end - begin))
###
- You will find the update procedure from given URL.
Acknowledgement: I would like to thank you to Professor Iman Dehzangi, who helped me initially for the PSSM generation.