This program is for detecting recombination from biological sequences that cannot be aligned. This framework develops on the basis of the jumping hidden markov model (JHMM, Zilversmit et al., 2013). The JHMM searches each query sequence (called target sequence) against a database (called source sequences), the mosaic representation for each target sequence is then constructed. We improved the JHMM by constraining recombination to occur at homologous locations between sequences. Our model shows similar accuracy with the JHMM but much higher effiency than the JHMM.
See more details about this model in Chapter 4 ("An improved JHMM for recombination detection") of my PhD thesis.
- Python >=3.10
This program was written by Python. The python modules you may want to install are listed in requirements.txt
pip install -r requirements.txt
- Input fasta format biological sequences. For the format of sequence identifier, maximum length for identifiers length is 15 (consist with the JHMM in Zilversmit et al., 2013). Although the blank space is not allowed in the identifier, punctuations (e.g., "|", ";", "_", "-") generally work well with our scripts.
There are two mode in this program.
- If you want to search each sequence against all others, you may run our all_vs_all mode.
- If you specify target sequences and source sequences separately, you may run our target_vs_db mode.
Every mode consists of three steps:
- Estimate non-jump parameters (gap opening probability delta and gap extension probability epsilon) with the Baum-Welch algorithm.
- Estimate the jump parameter (the probability of recombination rho) with the forward algorithm, golden section search method is used to speed up the estimation.
- Compute the Viterbi paths of all target sequences.
Each step corresponds to a python script. You would need to run each step in order.
Our example input file is example1.fasta, which stores at test_files/data folder.
Step 1:
python scripts/all_vs_all/BW_improve_all_vs_all.py test_files/data/example1.fasta test_files/results/all_vs_all/we_20.txt 20 > test_files/results/all_vs_all/we_20_middle.txt
This script requires the following arguments:
- "<filename>": name of the file which contains your input fasta.
- "<filename>": name of your output file.
- "<filename>": radius.
- "<filename>": If you need the middle files generated by the Baum-Welch algorithm, use "> filename" in the end.
The output file is we_20.txt in test_files/results/all_vs_all folder. The first two rows are estimated delta and epsilon. Other rows are the iteration index, running time (second), average running time (second) per iteration.
Step 2:
python scripts/all_vs_all/rho_improve_all_vs_all_gss.py test_files/data/example1.fasta test_files/results/all_vs_all/we_20_rho_gss.txt 20 0.005491850534938501 0.9555200033995157
This script requires the following arguments:
- "<filename>": name of the file which contains your input fasta.
- "<filename>": name of your output file.
- "<filename>": radius.
- "<filename>": delta from step 1
- "<filename>": epsilon from step 1
The output file is we_20_rho_gss.txt in test_files/results/all_vs_all folder. Each row is estimated delta, epsilon, rho and running time (second).
Step 3:
python scripts/all_vs_all/viterbi_path_improve_all_vs_all.py test_files/data/example1.fasta test_files/results/all_vs_all/we_20_viterbi.txt 20 0.005491850534938501 0.9555200033995157 0.02070998910340327
This script requires the following arguments:
- "<filename>": name of the file which contains your input fasta.
- "<filename>": name of your output file.
- "<filename>": radius.
- "<filename>": delta from step 1
- "<filename>": epsilon from step 1
- "<filename>": rho from step 2
The output viterbi path for each sequence is in test_files/results/all_vs_all/we_20_viterbi.txt. The end of this file contains the running time of this step.
Our example input file is example2.fasta, which stores at test_files/data folder. The identifier for target sequences should start with "target_", while identifier for source sequences start with "db_"
Step 1:
python scripts/target_vs_db/BW_improve_target_vs_db.py test_files/data/example2.fasta test_files/results/target_vs_db/we_20.txt test_files/data/ref.fasta 20 > test_files/results/target_vs_db/we_20_middle.txt
This script requires the following arguments:
- "<filename>": name of the file which contains your input fasta.
- "<filename>": name of your output file.
- "<filename>": name of your reference fasta. It should contain target sequences with identifier starting with "target_". These target sequences provide emission probabilities for insert hidden state. We recommend it to be a larger dataset.
- "<filename>": radius.
- "<filename>": If you need the middle files generated by the Baum-Welch algorithm, use "> filename" in the end.
The output file is we_20.txt in test_files/results/target_vs_db folder. The first two rows are estimated delta and epsilon. Other rows are the iteration index, running time (second), average running time (second) per iteration.
Step 2:
python scripts/target_vs_db/rho_improve_target_vs_db_gss.py test_files/data/example2.fasta test_files/results/target_vs_db/we_20_rho_gss.txt test_files/data/ref.fasta 20 0.02131052651155812 0.8409128461304629
This script requires the following arguments:
- "<filename>": name of the file which contains your input fasta.
- "<filename>": name of your output file.
- "<filename>": name of your reference fasta. Format requirement same as above.
- "<filename>": radius.
- "<filename>": delta from step 1
- "<filename>": epsilon from step 1
The output file is we_20_rho_gss.txt in test_files/results/target_vs_db folder. Each row is estimated delta, epsilon, rho and running time (second).
Step 3:
python scripts/target_vs_db/viterbi_path_improve_target_vs_db.py test_files/data/example2.fasta test_files/results/target_vs_db/we_20_viterbi.txt test_files/data/ref.fasta 20 0.02131052651155812 0.8409128461304629 0.082453832618475
This script requires the following arguments:
- "<filename>": name of the file which contains your input fasta.
- "<filename>": name of your output file.
- "<filename>": name of your reference fasta. Format requirement same as above.
- "<filename>": radius.
- "<filename>": delta from step 1
- "<filename>": epsilon from step 1
- "<filename>": rho from step 2
The output viterbi path for each sequence is in test_files/results/target_vs_db/we_20_viterbi.txt. The end of this file contains the running time of this step.
- Our program currently only accepts the protein sequences as input. If your data is DNA, please translate it.
Our improved JHMM can apply to DNA sequences as well, incorporating this option into our program is the subject of current work.
This program was written by Qian Feng, supervised by Heejung Shim and Yao-ban Chan from the University of Melbourne. For any problems, please report an issue in Github or send an email to Qian Feng.
- Zilversmit, M. M., Chase, E. K., Chen, D. S., Awadalla, P., Day, K. P., & McVean, G. (2013). Hypervariable antigen genes in malaria have ancient roots. BMC evolutionary biology, 13(1), 110.