Skip to content

⚛️📈Protein threading implementation using double dynamic programming👥💻

License

Notifications You must be signed in to change notification settings

zhukovanadezhda/protein-threading

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

77 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Protein Structure Prediction Using Threading

This project focuses on protein structure prediction using the threading (fold recognition) method. Threading is a computational approach that aligns a protein sequence with known template structures to predict its 3D structure, even when sequence similarity is low. This method identifies candidate templates by considering structural similarities, such as predicted secondary structures and solvent accessibility. By mapping the sequence onto structurally similar templates, threading can accurately predict protein folds, making it a valuable tool in bioinformatics, medicine, and biotechnology. The algorithm is inspired by David Jones's THREADER (Computational Methods in Molecular Biology, Chapter 13, 1998).

protein

Fig. 1: Protein fold recognition summary.

🔄Installation

Clone the repository

git clone [email protected]:zhukovanadezhda/protein-threading.git
cd protein-threading

Setup the conda environment

Install miniconda. Create the protein-threading conda environment:

conda env create -f environment.yml

Load the environment

conda activate protein-threading

💡Note: To deactivate an active environment, use:

conda deactivate

🧑‍💻️ Usage

💡Note: Before running the program, ensure the src/config.py file is properly configured to set up your working directories.

  • To run an example: Select one of the provided example directories (e.g., data/example1).
  • To use custom data: Create your own directory (e.g., data/your_dir) with subdirectories for sequences and structures. Place your data in these folders and update the paths in src/config.py.

Example src/config.py modification:

# Directory paths
TEMPLATES_DIR = 'data/your_dir/structures/'
SEQUENCES_DIR = 'data/your_dir/sequences/'

To run the program, use the following command:

python src/main.py [-h] [--sequences SEQUENCES] [--templates TEMPLATES] [--output_file OUTPUT_FILE] \
                   [--jobs JOBS] [--dry_run] [--verbose]

⚙️Arguments

Argument Description Default
-h Show a help message and exit.
--sequences Comma-separated list of sequence filenames (.fasta format). All files from SEQUENCES_DIR from src/config.py.
--templates Comma-separated list of template filenames (.pdb format). All files from TEMPLATES_DIR from src/config.py.
--gap_score The gap penalty. 0
--output_file Name of the output CSV file. results/energy_scores.csv
--jobs Number of parallel jobs to run. All cores
--dry_run If set, only log actions without processing. False (not set)
--verbose If set, verbose output is enabled. False (not set)
--print_alignments If set, the alignments are printed. False (not set)

Table 1: Program parameters.

🎁Examples

☝️Example 1: Small proteins (<50 amino acids)

Estimated execution time: Less than 35 minutes

Input

src/config.py :

# Directory paths
TEMPLATES_DIR = 'data/example1/structures/'
SEQUENCES_DIR = 'data/example1/sequences/'
python src/main.py --sequences 1CRN.fasta,1L2Y.fasta,1VII.fasta,5AWL.fasta --gap_score 0.2 \
                   --output_file results/example1_result.csv --print_alignments

Output

2024-09-17 08:10:16,722 - INFO - Loading DOPE score data...
2024-09-17 08:11:47,306 - INFO - Processing sequences and templates...
2024-09-17 08:11:47,306 - INFO - Processing sequence 5AWL.fasta, length: 10
2024-09-17 08:11:50,621 - INFO - Processing template 5awl.pdb with 10 residues.
2024-09-17 08:11:50,444 - INFO - Processing template 1le0.pdb with 12 residues.
  0  1  2  3  4  5  6  7  8  9
  |  |  |  |  |  |  |  |  |  |
  Y  Y  Y  D  P  E  T  G  T  W
2024-09-17 08:11:57,940 - INFO - Processed template 5awl.pdb. Energy score: -32.68
  0  1  2  3  4  5  6  7  8  9 10 11
  |  |  |  |  |  |  |  |     |  |   
  Y  Y  Y  D  P  E  T  G  -  T  W  -
2024-09-17 08:12:00,712 - INFO - Processed template 1le0.pdb. Energy score: -28.12
...

Results

1crn.pdb
1crn
1l2y.pdb
1l2y
1le0.pdb
1le0
1le1.pdb
1le1
1vii.pdb
1vii
5awl.pdb
5awl

46 aa

20 aa

12 aa

12 aa

36 aa

10 aa

1CRN.fasta

1CRN

-283.99

-35.37

-10.5

-13.05

-162.04

-9.02

1L2Y.fasta

1L2Y

-14.77

-67.4

-19.18

-20.23

-41.71

-17.53

1VII.fasta

1VII

-157.37

-24.16

7.61

7.23

-169.07

6.95

5AWL.fasta

5AWL

6.73

-19.18

-28.12

-28.32

3.62

-32.68

Table 2: Summary of results from the first example.
"aa" stands for amino acids; bold indicates the best score, italics indicate the correct match.

✌️Example 2: Medium-sized proteins (70-120 amino acids)

Estimated execution time: Over 16 hours

Input

src/config.py :

# Directory paths
TEMPLATES_DIR = 'data/example2/structures/'
SEQUENCES_DIR = 'data/example2/sequences/'
python src/main.py --sequences 1E68.fasta,3E8V.fasta --gap_score 0.1 --output_file results/example2_result.csv

Results

1e68.pdb
1e68
1ubq.pdb
1ubq
3zow.pdb
3zow
3e8v.pdb
3e8v
1tit.pdb
1tit
1tvd.pdb
1tvd
3zbv.pdb
3zbv

70 aa

76 aa

81 aa

82 aa

89 aa

116 aa

118 aa

1E68.fasta
1e68

-490.53

-466.0

-472.0

-480.22

-412.13

-316.45

-312.76

3E8V.fasta
3e8v

-426.15

-479.47

-527.57

-558.23

-493.82

-415.08

-407.47

Table 3: Summary of results from the second example.
"aa" stands for amino acids; bold indicates the best score, italics indicate the correct match.

📊 Evaluate results significance

To determine whether the results obtained are significantly different from those expected by chance, a statistical approach based on the calculation of z-scores is used. The energy scores observed for each alignment can be compared to a distribution generated by a random shuffle of the sequences. The z-score quantifies the deviation from the mean of the random scores: a z-score lower than -1.96 indicates, with 95% confidence, a significant structural match, while a z-score higher than 1.96 signals an alignment significantly worse than chance. A z-score close to zero suggests the absence of a significant deviation.

Usage:

💡Note: Before running the program, ensure the src/config.py file is properly configured with the correct paths for your working directories.
The working directories should correspond to those where the sequence and template files referenced in the energy scores file are stored.

python src/evaluate_significance.py --input_csv <path_to_input_csv> --output_file <path_to_output_csv> \
                                    --gap_score <gap_score> --n_shuffles <number_of_shuffles>

Requirements:

input_csv - the input csv file with calculated energy scores, the file should be in the format produced by the main script described above.

Example:

Input

src/config.py :

# Directory paths
TEMPLATES_DIR = 'data/example1/structures/'
SEQUENCES_DIR = 'data/example1/sequences/'
python src/evaluate_significance.py --input_csv results/example1_result.csv \
                                    --output_file results/shuffle_example1.csv \
                                    --n_shuffle 10 --gap_score 0.2

Output

WARNING - The number of shuffles is less than 30. Shapiro-Wilk test will be performed to check for the normality of the shuffled scores.
WARNING - The distribution of shuffled energy scores for sequence 5AWL.fasta is not normally distributed (p-value = 0.0000).
...

Result


1crn.pdb


1l2y.pdb


1le0.pdb


1le1.pdb


1vii.pdb


5awl.pdb


1CRN.fasta


-2.13

0.68

0.28

-0.91

1.22

-0.39


1L2Y.fasta

1.12


-2.49

2.96

2.62

1.69

1.66


1VII.fasta

-0.16

0.18

-1.05

-1.40

-1.48


-2.71


5AWL.fasta

-0.97

-0.16

-1.32

-1.53

-1.19

-1.56

Table 4: Summary of z-scores for alignments between different structures in example 1.

⚠️ Warning: The z-scores may not be interpretable as the shuffled energy scores are not normally distributed. This is merely an example. To perform a meaningful significance evaluation, increase the number of shuffles.

🔧 Test Different Gap Penalties

To explore the impact of gap penalties on sequence-structure matching performance, you can use the test_gaps.py script. This script evaluates several gap penalty values to determine the optimal setting for the algorithm. Performance is assessed based on the following scoring system:

  • Correct Structure Prediction: If the algorithm accurately predicts the structure of the sequence, it earns 2 points.
  • Similar Structure Bonus: If a similar structure appears among the top two predictions, the algorithm earns an additional 1 point.
  • Thus, the maximum score for each sequence is 3 points, and the final score is normalized by dividing it by the maximum possible score.

Usage Instructions

💡 Note: Before running the program, ensure that src/config.py is properly configured. Set the paths for your working directories, the gap penalties to test, and similar structures to evaluate performance.

Example modification to src/config.py:

# Gap penalties to test 
gap_scores = [0, 0.1, 0.2] 

# Proteins with similar structures 
homolog_pairs = { 
    'A.fasta': ['b.pdb'], 
    'B.fasta': ['a.pdb', 'c.pdb'], 
    'C.fasta': ['b.pdb'] 
} 

Usage:

python src/test_gaps.py [--program_path PROGRAM_PATH] [--output_dir OUTPUT_DIR] [--result_file RESULT_FILE]

Example:

Input

src/config.py :

# Gap scores to test
gap_scores = [0, 0.1, 0.2, 0.3, 0.5, 1, 2, 5]

# Proteins with similar structures
homolog_pairs = {
    '5AWL.fasta': ['1l2y.pdb', '1vii.pdb', '1crn.pdb'],
    '1VII.fasta': ['1l2y.pdb', '1crn.pdb'],
    '1L2Y.fasta': ['1vii.pdb', '1crn.pdb'],
    '1CRN.fasta': ['1l2y.pdb','1vii.pdb', '1crn.pdb', '1le0.pdb', '1le1.pdb']
}
python src/test_gaps.py --output_dir results/gaps_test --result_file results/performance.csv

Result


Gap Score


Performance


Correctly Guessed


Similar Structure

0.0

0.5

1

4

0.1

0.58

2

3

0.2

0.92

4

3

0.3

0.92

4

3

0.5

0.83

4

2

1.0

0.75

4

1

2.0

0.75

4

1

5.0

0.75

4

1

Table 5: Summary of algorithm performance based on gap penalty.

🔗References

JONES, D. Threader: protein sequence threading by double dynamic programming. Computational Methods in Molecular Biology. Elsevier, 1996. v. 32, cap. 13, p. 312–338

✉️Contact

For questions or issues, please open an issue on GitHub or contact [email protected].