This is a public repository for the project in the de novo design of antimicrobial peptides with enhanced activity, created in 2023 by a team of bioengineering students from Sup'Biotech High Engineering School of Biotechnology, Villejuif, France.
Found in all classes of living organisms, antimicrobial peptides (AMPs) play a crucial role in the innate immune response and barrier defence by killing or inhibiting the growth of harmful microorganisms, including bacteria and fungi. The broad-spectrum activity makes them effective against different types of pests and pathogens, thus prompting the use of AMPs as an alternative to pesticides, which are hazardous for both human health and environment.
Even though to date several thousands of AMPs have been isolated from different natural sources, only few of them (nisin, dermaseptin, defensins, cinnamycin) have been translated to the market, primarily to target multidrug-resistant infections or as food preservatives. This is due to several inherent drawbacks of the naturally obtained AMPs, primarily their short half-life owing to the susceptibility to protease degradation, and lower activity as compared to conventional pesticides, resulting in higher production costs.
Synthetic AMPs, strengthened by sequence truncation, mutation, cyclization, or introduction of unnatural amino acids, have been shown to retain or improve the antimicrobial potency along with circumventing the disadvantages of the natural analogues.
This project aims at developing an algorithm for the de novo design of synthetic AMPs with optimized activity, as compared to those present in living organisms. Our ultimate goal would be to validate the nanomolar range activity of the generated sequences against fungal pathogens, by contrast with the micromolar range of natural AMPs.
The following code block allows to setup a virtual environment.
poetry
is a Python packaging and dependency management tool, simplifying the management of a Python project by providing features such as:
- dependency resolution
- virtual environments
- packaging, and publishing.
It allows to define project dependencies in pyproject.toml
and the associated poetry.lock
files and handles the installation and management of those dependencies.
To install poetry
:
poetry install
poetry shell
task check
To add dependencies specified in pyproject.toml
:
poetry add <package-name>
The database used to identify positive AMPs' descriptors, full_positive_db.fasta
, was exported from DRAMP with only alpha-helices selected, due to their enhanced capability to propagate cell membranes.
It was then filtered to select peptides with the length from 3 to 18 amino acids (AA):
filtered_positive_db.fasta
The database used to identify negative AMPs' descriptors, filtered_negative_db.fasta
, was exported from UniProt by selecting the intracellular peptides, since it was assumed that they would not be capable of transmembrane transport.
Selected peptides were subsequently filtered by their length, to keep only short candidates of the length between 3 and 18 AA.
The database of AMPs IC50 AMPs_DB_IC50.xslx
was exported from Fjell et al. (2009).
Original repository by alxdrcirilo
- Reduces the input sequences of 20 AA in accordance with the so-called reduction dictionary, based on the physico-chemical properties of AA. In this work we assume that the optimal classification of AA to reduce the complexity of the protein sequence should be based on their hydrophobicity, size, and charge (RED6):
- Large hydrophobic (A): isoleucine (I), leucine (L), methionine (M), valine (V), tryptophan (W)
- Small hydrophobic (B): alanine (A), cysteine (C), phenylalanine (F), glycine (G), proline (P)
- Positive hydrophilic (P): histidine (H), lysine (K), arginine (R)
- Negative hydrophilic (N): aspartic acid (D), glutamic acid (E)
- Uncharged hydrophilic (U): asparagine (N), glutamine (Q), serine (S), threonine (T), tyrosine (Y).
- Cures positive and negative databases using a size restriction filter. Only AMPs of the length between 3 and 18 AA are selected, due to their enhanced antimicrobial activity, confirmed in the literature.
The k-mer approach is a method commonly used in the analysis of protein sequences. It involves breaking down a sequence into shorter subsequences of length k, known as k-mers, and examining the occurrence and properties of these k-mers within the sequence. A k=5 size was selected for generation of k-mer descriptors because it enabled the integration of all stabilization interaction found in helices (i + 3; i + 4; i + 5) without being too large for interpretation.
- Creates a temporary directory containing
.kmr
files for each peptide sequence with all possible k-mers of size 5 and a maximum of 3 gaps. All generated k-mers are then concatenated to be ranged in accordance with their activity scores.
Scoring function is based on the number of descriptor's occurrences in a positive database as compared to the negative one, and is computed as stated below:
The score is added to each key: the couple of the dictionary value and the according activity score are saved in the file descriptors_activity_scores.tsv
Analyses the frequencies of AA at given positions in positive and negative databases.
Generates a set of peptides with potential antimicrobial properties by:
- A loop approach, comprising of the introduction of random mutations in peptides from the positive database: 1 mutation in 1 AA per iteration, in compliance with the probabilities of the substitution matrix.
- After each iteration, a sequence is monitored for the presence of active and killer descriptors and attributed a score in accordance with descriptors_activity_scores.tsv file, equal to the sum of activity scores of every single descriptor present in the sequence.
- If the calculated sum is higher than the original score, the mutated peptide is added to the candidate AMPs library. The candidates with the highest activity scores are selected as potential AMP candidates for the in vitro testing.
def score_kmers(pep_seq: str, r_dict: int, score_dictionary = None) -> float
- Computing physico-chemical properties for each generated peptide to select those with the highest potential capacity to propagate cell membranes:
def pep_physical_analysis (pep_seq: str) -> list [str, float, float, float]
a. Net charge at pH = 7. Positively charged peptides are preferable due to their attraction to the surfaces of cell membranes.
b. Kyte-Doolittle hydrophobicity profile with auto-correlation transformation is employed to calculate the periodicity of the helix as the average distance between hydrophobic and hydrophilic residues.
c. If the periodicity is close to 4, we suppose that the peptide is an α-helix as consistent with i + 4 periodicity pattern.
Due to the amphiphilic properties of α-helices and their capability of transmembrane transport, they are considered as the most prominent secondary structure for AMPs.
The helical structure can be further validated using helical wheel with the help of helixvis
or AlphaFold 2
.
![img.png](results/Hydrophobicity and helical wheel analysis of databases.png)
A and B line charts represent auto-correlation of peptide hydrophobicity profiles (Eisenberg consensus scale using Biopython Protparam
library) in positive and negative databases, respectively.
Panels C and D represent amino acids distribution along the helical wheels in positive and negative examples, respectively. Color code: gray = hydrophobic; yellow = polar; blue = basic; red = acidic. Generated with helixviz
library.
Plots the distribution of activity scores for the AMPs from filtered positive database (filtered_positive_db.fasta
) and a generated set of candidates (de_novo_peptide_library.xslx
).
Employs Support Vector Classifier (SVC) to discriminate between high and low activity AMPs using available data on IC50 values from Fjell et al. (2009). The model takes into account hydrophobic moments (computed using the Eisenberg formula, imported from JoaoRodrigues/hydrophobic_moment repository), descriptor-based activity scores and in vivo aggregation score (a3v_Sequence_Average) generated by Aggrescan.
The ultimate goal of using SVC is to eliminate peptides of potentially low antimicrobial activity.
SVC is a type of supervised machine learning algorithm used to find an optimal hyperplane that discriminates the data points into different classes or predicts a continuous value.
Here, SVC is used to create a distinction within a training dataset between more than 2 variates, with the estimation of the false positive and false negative classification confidence. As a result, a bivariate density plot is obtained after over 1000 cycles of the SVC repetition.
Load generated data from generate_peptide.py and external tools (Tango and Aggrescan) to select peptides having optimal combination of local and global descriptors. SVC Model was trained and saved to resources using joblib in correlation_score_IC50.py.
Shell command pipeline enabling to generate a number of n peptides with a number i of iteration in the genetical algorithm. It saves data and automatically run Tango cmd to score peptides In-vitro aggregation prediction. Also asks for Aggrescan run file and pertfom selection of the optimal peptides. Multiple modes can be runned : -g enables to generate peptides only and does not perform selection of peptides based on global descriptors --npep n for the number of peptides generated (default 10) --bootstrap i for the number of iterations (default 500) -s enable to perform selection only of a given set of peptides based on global descriptors
Generated peptides are saved in xlsx and fasta format in results directory