Skip to content

An Improved JHMM to Detect Recombination From Unaligned Sequences

License

Notifications You must be signed in to change notification settings

qianfeng2/impJHMM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

An Improved JHMM to Detect Recombination From Unaligned Sequences

Python 3.6 License: GPL-3.0 DOI

About

This program is for detecting recombination from biological sequences that cannot be aligned. This framework develops on the basis of the jumping hidden markov model (JHMM, Zilversmit et al., 2013). The JHMM searches each query sequence (called target sequence) against a database (called source sequences), the mosaic representation for each target sequence is then constructed. We improved the JHMM by constraining recombination to occur at homologous locations between sequences. Our model shows similar accuracy with the JHMM but much higher effiency than the JHMM.

See more details about this model in Chapter 4 ("An improved JHMM for recombination detection") of my PhD thesis.

Required software

  • Python >=3.10

This program was written by Python. The python modules you may want to install are listed in requirements.txt

pip install -r requirements.txt

Required Input File

  • Input fasta format biological sequences. For the format of sequence identifier, maximum length for identifiers length is 15 (consist with the JHMM in Zilversmit et al., 2013). Although the blank space is not allowed in the identifier, punctuations (e.g., "|", ";", "_", "-") generally work well with our scripts.

Running Mode

There are two mode in this program.

  • If you want to search each sequence against all others, you may run our all_vs_all mode.
  • If you specify target sequences and source sequences separately, you may run our target_vs_db mode.

Running Program

Every mode consists of three steps:

  1. Estimate non-jump parameters (gap opening probability delta and gap extension probability epsilon) with the Baum-Welch algorithm.
  2. Estimate the jump parameter (the probability of recombination rho) with the forward algorithm, golden section search method is used to speed up the estimation.
  3. Compute the Viterbi paths of all target sequences.

Each step corresponds to a python script. You would need to run each step in order.

Run Example for all_vs_all Mode

Our example input file is example1.fasta, which stores at test_files/data folder.

Step 1:

python scripts/all_vs_all/BW_improve_all_vs_all.py test_files/data/example1.fasta test_files/results/all_vs_all/we_20.txt 20 > test_files/results/all_vs_all/we_20_middle.txt 

This script requires the following arguments:

  • "<filename>": name of the file which contains your input fasta.
  • "<filename>": name of your output file.
  • "<filename>": radius.
  • "<filename>": If you need the middle files generated by the Baum-Welch algorithm, use "> filename" in the end.

The output file is we_20.txt in test_files/results/all_vs_all folder. The first two rows are estimated delta and epsilon. Other rows are the iteration index, running time (second), average running time (second) per iteration.

Step 2:

python scripts/all_vs_all/rho_improve_all_vs_all_gss.py test_files/data/example1.fasta test_files/results/all_vs_all/we_20_rho_gss.txt 20 0.005491850534938501 0.9555200033995157  

This script requires the following arguments:

  • "<filename>": name of the file which contains your input fasta.
  • "<filename>": name of your output file.
  • "<filename>": radius.
  • "<filename>": delta from step 1
  • "<filename>": epsilon from step 1

The output file is we_20_rho_gss.txt in test_files/results/all_vs_all folder. Each row is estimated delta, epsilon, rho and running time (second).

Step 3:

python scripts/all_vs_all/viterbi_path_improve_all_vs_all.py test_files/data/example1.fasta test_files/results/all_vs_all/we_20_viterbi.txt 20 0.005491850534938501 0.9555200033995157 0.02070998910340327 

This script requires the following arguments:

  • "<filename>": name of the file which contains your input fasta.
  • "<filename>": name of your output file.
  • "<filename>": radius.
  • "<filename>": delta from step 1
  • "<filename>": epsilon from step 1
  • "<filename>": rho from step 2

The output viterbi path for each sequence is in test_files/results/all_vs_all/we_20_viterbi.txt. The end of this file contains the running time of this step.

Run Example for target_vs_db Mode

Our example input file is example2.fasta, which stores at test_files/data folder. The identifier for target sequences should start with "target_", while identifier for source sequences start with "db_"

Step 1:

python scripts/target_vs_db/BW_improve_target_vs_db.py test_files/data/example2.fasta test_files/results/target_vs_db/we_20.txt test_files/data/ref.fasta 20 > test_files/results/target_vs_db/we_20_middle.txt

This script requires the following arguments:

  • "<filename>": name of the file which contains your input fasta.
  • "<filename>": name of your output file.
  • "<filename>": name of your reference fasta. It should contain target sequences with identifier starting with "target_". These target sequences provide emission probabilities for insert hidden state. We recommend it to be a larger dataset.
  • "<filename>": radius.
  • "<filename>": If you need the middle files generated by the Baum-Welch algorithm, use "> filename" in the end.

The output file is we_20.txt in test_files/results/target_vs_db folder. The first two rows are estimated delta and epsilon. Other rows are the iteration index, running time (second), average running time (second) per iteration.

Step 2:

python scripts/target_vs_db/rho_improve_target_vs_db_gss.py test_files/data/example2.fasta test_files/results/target_vs_db/we_20_rho_gss.txt test_files/data/ref.fasta 20 0.02131052651155812 0.8409128461304629  

This script requires the following arguments:

  • "<filename>": name of the file which contains your input fasta.
  • "<filename>": name of your output file.
  • "<filename>": name of your reference fasta. Format requirement same as above.
  • "<filename>": radius.
  • "<filename>": delta from step 1
  • "<filename>": epsilon from step 1

The output file is we_20_rho_gss.txt in test_files/results/target_vs_db folder. Each row is estimated delta, epsilon, rho and running time (second).

Step 3:

python scripts/target_vs_db/viterbi_path_improve_target_vs_db.py test_files/data/example2.fasta test_files/results/target_vs_db/we_20_viterbi.txt test_files/data/ref.fasta 20 0.02131052651155812 0.8409128461304629 0.082453832618475

This script requires the following arguments:

  • "<filename>": name of the file which contains your input fasta.
  • "<filename>": name of your output file.
  • "<filename>": name of your reference fasta. Format requirement same as above.
  • "<filename>": radius.
  • "<filename>": delta from step 1
  • "<filename>": epsilon from step 1
  • "<filename>": rho from step 2

The output viterbi path for each sequence is in test_files/results/target_vs_db/we_20_viterbi.txt. The end of this file contains the running time of this step.

Note

  • Our program currently only accepts the protein sequences as input. If your data is DNA, please translate it.

Our improved JHMM can apply to DNA sequences as well, incorporating this option into our program is the subject of current work.

Credits

This program was written by Qian Feng, supervised by Heejung Shim and Yao-ban Chan from the University of Melbourne. For any problems, please report an issue in Github or send an email to Qian Feng.

Reference

  • Zilversmit, M. M., Chase, E. K., Chen, D. S., Awadalla, P., Day, K. P., & McVean, G. (2013). Hypervariable antigen genes in malaria have ancient roots. BMC evolutionary biology, 13(1), 110.

About

An Improved JHMM to Detect Recombination From Unaligned Sequences

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages