Protein Structure Prediction Using Threading

This project focuses on protein structure prediction using the threading (fold recognition) method. Threading is a computational approach that aligns a protein sequence with known template structures to predict its 3D structure, even when sequence similarity is low. This method identifies candidate templates by considering structural similarities, such as predicted secondary structures and solvent accessibility. By mapping the sequence onto structurally similar templates, threading can accurately predict protein folds, making it a valuable tool in bioinformatics, medicine, and biotechnology. The algorithm is inspired by David Jones's THREADER (Computational Methods in Molecular Biology, Chapter 13, 1998).

Fig. 1: Protein fold recognition summary.

🔄Installation

Clone the repository

git clone [email protected]:zhukovanadezhda/protein-threading.git
cd protein-threading

Setup the conda environment

Install miniconda. Create the protein-threading conda environment:

conda env create -f environment.yml

Load the environment

conda activate protein-threading

💡Note: To deactivate an active environment, use:
conda deactivate

🧑‍💻️ Usage

💡Note: Before running the program, ensure the src/config.py file is properly configured to set up your working directories.

To run an example: Select one of the provided example directories (e.g., data/example1).

To use custom data: Create your own directory (e.g., data/your_dir) with subdirectories for sequences and structures. Place your data in these folders and update the paths in src/config.py.

Example src/config.py modification:
# Directory paths
TEMPLATES_DIR = 'data/your_dir/structures/'
SEQUENCES_DIR = 'data/your_dir/sequences/'

To run the program, use the following command:

python src/main.py [-h] [--sequences SEQUENCES] [--templates TEMPLATES] [--output_file OUTPUT_FILE] \
                   [--jobs JOBS] [--dry_run] [--verbose]

⚙️Arguments

Argument	Description	Default
`-h`	Show a help message and exit.
`--sequences`	Comma-separated list of sequence filenames (`.fasta` format).	All files from `SEQUENCES_DIR` from `src/config.py`.
`--templates`	Comma-separated list of template filenames (`.pdb` format).	All files from `TEMPLATES_DIR` from `src/config.py`.
`--gap_score`	The gap penalty.	`0`
`--output_file`	Name of the output CSV file.	`results/energy_scores.csv`
`--jobs`	Number of parallel jobs to run.	All cores
`--dry_run`	If set, only log actions without processing.	`False` (not set)
`--verbose`	If set, verbose output is enabled.	`False` (not set)
`--print_alignments`	If set, the alignments are printed.	`False` (not set)

Table 1: Program parameters.

🎁Examples

☝️Example 1: Small proteins (<50 amino acids)

Estimated execution time: Less than 35 minutes

Input

src/config.py :

# Directory paths
TEMPLATES_DIR = 'data/example1/structures/'
SEQUENCES_DIR = 'data/example1/sequences/'

python src/main.py --sequences 1CRN.fasta,1L2Y.fasta,1VII.fasta,5AWL.fasta --gap_score 0.2 \
                   --output_file results/example1_result.csv --print_alignments

Output

2024-09-17 08:10:16,722 - INFO - Loading DOPE score data...
2024-09-17 08:11:47,306 - INFO - Processing sequences and templates...
2024-09-17 08:11:47,306 - INFO - Processing sequence 5AWL.fasta, length: 10
2024-09-17 08:11:50,621 - INFO - Processing template 5awl.pdb with 10 residues.
2024-09-17 08:11:50,444 - INFO - Processing template 1le0.pdb with 12 residues.
  0  1  2  3  4  5  6  7  8  9
  |  |  |  |  |  |  |  |  |  |
  Y  Y  Y  D  P  E  T  G  T  W
2024-09-17 08:11:57,940 - INFO - Processed template 5awl.pdb. Energy score: -32.68
  0  1  2  3  4  5  6  7  8  9 10 11
  |  |  |  |  |  |  |  |     |  |   
  Y  Y  Y  D  P  E  T  G  -  T  W  -
2024-09-17 08:12:00,712 - INFO - Processed template 1le0.pdb. Energy score: -28.12
...

Results

	1crn.pdb	1l2y.pdb	1le0.pdb	1le1.pdb	1vii.pdb	5awl.pdb
	46 aa	20 aa	12 aa	12 aa	36 aa	10 aa
1CRN.fasta	-283.99	-35.37	-10.5	-13.05	-162.04	-9.02
1L2Y.fasta	-14.77	*-67.4*	-19.18	-20.23	-41.71	-17.53
1VII.fasta	-157.37	-24.16	7.61	7.23	*-169.07*	6.95
5AWL.fasta	6.73	-19.18	-28.12	-28.32	3.62	*-32.68*

Table 2: Summary of results from the first example.
"aa" stands for amino acids; bold indicates the best score, italics indicate the correct match.

✌️Example 2: Medium-sized proteins (70-120 amino acids)

Estimated execution time: Over 16 hours

Input

src/config.py :

# Directory paths
TEMPLATES_DIR = 'data/example2/structures/'
SEQUENCES_DIR = 'data/example2/sequences/'

python src/main.py --sequences 1E68.fasta,3E8V.fasta --gap_score 0.1 --output_file results/example2_result.csv

Results

	1e68.pdb	1ubq.pdb	3zow.pdb	3e8v.pdb	1tit.pdb	1tvd.pdb	3zbv.pdb
	70 aa	76 aa	81 aa	82 aa	89 aa	116 aa	118 aa
1E68.fasta	-490.53	-466.0	-472.0	-480.22	-412.13	-316.45	-312.76
3E8V.fasta	-426.15	-479.47	-527.57	-558.23	-493.82	-415.08	-407.47

Table 3: Summary of results from the second example.
"aa" stands for amino acids; bold indicates the best score, italics indicate the correct match.

📊 Evaluate results significance

To determine whether the results obtained are significantly different from those expected by chance, a statistical approach based on the calculation of z-scores is used. The energy scores observed for each alignment can be compared to a distribution generated by a random shuffle of the sequences. The z-score quantifies the deviation from the mean of the random scores: a z-score lower than -1.96 indicates, with 95% confidence, a significant structural match, while a z-score higher than 1.96 signals an alignment significantly worse than chance. A z-score close to zero suggests the absence of a significant deviation.

Usage:

💡Note: Before running the program, ensure the src/config.py file is properly configured with the correct paths for your working directories.
The working directories should correspond to those where the sequence and template files referenced in the energy scores file are stored.

python src/evaluate_significance.py --input_csv <path_to_input_csv> --output_file <path_to_output_csv> \
                                    --gap_score <gap_score> --n_shuffles <number_of_shuffles>

Requirements:

input_csv - the input csv file with calculated energy scores, the file should be in the format produced by the main script described above.

Example:

Input

src/config.py :

# Directory paths
TEMPLATES_DIR = 'data/example1/structures/'
SEQUENCES_DIR = 'data/example1/sequences/'

python src/evaluate_significance.py --input_csv results/example1_result.csv \
                                    --output_file results/shuffle_example1.csv \
                                    --n_shuffle 10 --gap_score 0.2

Output

WARNING - The number of shuffles is less than 30. Shapiro-Wilk test will be performed to check for the normality of the shuffled scores.
WARNING - The distribution of shuffled energy scores for sequence 5AWL.fasta is not normally distributed (p-value = 0.0000).
...

Result

	1crn.pdb	1l2y.pdb	1le0.pdb	1le1.pdb	1vii.pdb	5awl.pdb
1CRN.fasta	-2.13	0.68	0.28	-0.91	1.22	-0.39
1L2Y.fasta	1.12	-2.49	2.96	2.62	1.69	1.66
1VII.fasta	-0.16	0.18	-1.05	-1.40	-1.48	-2.71
5AWL.fasta	-0.97	-0.16	-1.32	-1.53	-1.19	-1.56

Table 4: Summary of z-scores for alignments between different structures in example 1.

⚠️ Warning: The z-scores may not be interpretable as the shuffled energy scores are not normally distributed. This is merely an example. To perform a meaningful significance evaluation, increase the number of shuffles.

🔧 Test Different Gap Penalties

To explore the impact of gap penalties on sequence-structure matching performance, you can use the test_gaps.py script. This script evaluates several gap penalty values to determine the optimal setting for the algorithm. Performance is assessed based on the following scoring system:

Correct Structure Prediction: If the algorithm accurately predicts the structure of the sequence, it earns 2 points.
Similar Structure Bonus: If a similar structure appears among the top two predictions, the algorithm earns an additional 1 point.
Thus, the maximum score for each sequence is 3 points, and the final score is normalized by dividing it by the maximum possible score.

Usage Instructions

💡 Note: Before running the program, ensure that src/config.py is properly configured. Set the paths for your working directories, the gap penalties to test, and similar structures to evaluate performance.
Example modification to src/config.py:
# Gap penalties to test 
gap_scores = [0, 0.1, 0.2] 

# Proteins with similar structures 
homolog_pairs = { 
    'A.fasta': ['b.pdb'], 
    'B.fasta': ['a.pdb', 'c.pdb'], 
    'C.fasta': ['b.pdb'] 
} 

Usage:

python src/test_gaps.py [--program_path PROGRAM_PATH] [--output_dir OUTPUT_DIR] [--result_file RESULT_FILE]

Example:

Input

src/config.py :

# Gap scores to test
gap_scores = [0, 0.1, 0.2, 0.3, 0.5, 1, 2, 5]

# Proteins with similar structures
homolog_pairs = {
    '5AWL.fasta': ['1l2y.pdb', '1vii.pdb', '1crn.pdb'],
    '1VII.fasta': ['1l2y.pdb', '1crn.pdb'],
    '1L2Y.fasta': ['1vii.pdb', '1crn.pdb'],
    '1CRN.fasta': ['1l2y.pdb','1vii.pdb', '1crn.pdb', '1le0.pdb', '1le1.pdb']
}

python src/test_gaps.py --output_dir results/gaps_test --result_file results/performance.csv

Result

Gap Score	Performance	Correctly Guessed	Similar Structure
0.0	0.5	1	4
0.1	0.58	2	3
0.2	0.92	4	3
0.3	0.92	4	3
0.5	0.83	4	2
1.0	0.75	4	1
2.0	0.75	4	1
5.0	0.75	4	1

Table 5: Summary of algorithm performance based on gap penalty.

🔗References

JONES, D. Threader: protein sequence threading by double dynamic programming. Computational Methods in Molecular Biology. Elsevier, 1996. v. 32, cap. 13, p. 312–338

✉️Contact

For questions or issues, please open an issue on GitHub or contact [email protected].

Name		Name	Last commit message	Last commit date
Latest commit History 77 Commits
data		data
doc		doc
results		results
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Protein Structure Prediction Using Threading

🔄Installation

Clone the repository

Setup the conda environment

Load the environment

🧑‍💻️ Usage

⚙️Arguments

🎁Examples

☝️Example 1: Small proteins (<50 amino acids)

Input

Output

Results

✌️Example 2: Medium-sized proteins (70-120 amino acids)

Input

Results

📊 Evaluate results significance

Usage:

Requirements:

Example:

Input

Output

Result

🔧 Test Different Gap Penalties

Usage Instructions

Usage:

Example:

Input

Result

🔗References

✉️Contact

About

Languages

License

zhukovanadezhda/protein-threading

Folders and files

Latest commit

History

Repository files navigation

Protein Structure Prediction Using Threading

🔄Installation

Clone the repository

Setup the conda environment

Load the environment

🧑‍💻️ Usage

⚙️Arguments

🎁Examples

☝️Example 1: Small proteins (<50 amino acids)

Input

Output

Results

✌️Example 2: Medium-sized proteins (70-120 amino acids)

Input

Results

📊 Evaluate results significance

Usage:

Requirements:

Example:

Input

Output

Result

🔧 Test Different Gap Penalties

Usage Instructions

Usage:

Example:

Input

Result

🔗References

✉️Contact

About

Topics

Resources

License

Stars

Watchers

Forks

Languages