This repository contains the code and the trained models for intrinsic protein disorder prediction through deep bidirectional transformers from Peptone Ltd.
ADOPT was introduced in our paper ADOPT: intrinsic protein disorder prediction through deep bidirectional transformers and is also available as a web server at adopt.peptone.io.
Our disorder predictor is made up of two main blocks, namely: a self-supervised encoder and a supervised disorder predictor. We use Facebook’s Evolutionary Scale Modeling (ESM) library to extract dense residue-level representations, which feed the supervised machine-learning-based predictor.
The ESM library exploits a set of deep Transformer encoder models, which process character sequences of amino acids as inputs.
ADOPT makes use of two datasets: the CheZoD “1325” and the CheZoD “117” databases containing 1325 and 117 sequences, respectively, together with their residue level Z-scores.
Model | Pre-trained model | Datasets | Split level | CV |
---|---|---|---|---|
`lasso_esm-1b_cleared_residue` | ESM-1b | Chezod 1325 cleared and Chezod 117 | residue | ❌ |
`lasso_esm-1v_cleared_residue` | ESM-1v | Chezod 1325 cleared and Chezod 117 | residue | ❌ |
`lasso_esm-msa_cleared_residue` | ESM-MSA | Chezod 1325 cleared and Chezod 117 | residue | ❌ |
`lasso_combined_cleared_residue` | Combined | Chezod 1325 cleared and Chezod 117 | residue | ❌ |
`lasso_esm-1b_residue_cv` | ESM-1b | Chezod 1325 | residue | ✅ |
`lasso_esm-1v_residue_cv` | ESM-1v | Chezod 1325 | residue | ✅ |
`lasso_esm-msa_residue_cv` | ESM-MSA | Chezod 1325 | residue | ✅ |
`lasso_esm-1b_cleared_residue_cv` | ESM-1b | Chezod 1325 cleared | residue | ✅ |
`lasso_esm-1v_cleared_residue_cv` | ESM-1v | Chezod 1325 cleared | residue | ✅ |
`lasso_esm-msa_cleared_residue_cv` | ESM-MSA | Chezod 1325 cleared | residue | ✅ |
`lasso_esm-1b_cleared_sequence_cv` | ESM-1b | Chezod 1325 cleared | sequence | ✅ |
`lasso_esm-1v_cleared_sequence_cv` | ESM-1v | Chezod 1325 cleared | sequence | ✅ |
`lasso_esm-msa_cleared_sequence_cv` | ESM-MSA | Chezod 1325 cleared | sequence | ✅ |
Prerequisites (we suggest creating a dedicated Python venv or conda env):

```bash
pip install \
    pandas \
    fair-esm \
    biopython \
    bertviz \
    skl2onnx \
    onnxruntime \
    spacy \
    plotly \
    wandb
```
Install the adopt package:

```bash
git clone https://github.com/PeptoneInc/ADOPT.git
cd ADOPT
git submodule update --init --recursive
python setup.py install
```
Then, you can predict the intrinsic disorder of each residue in a protein sequence as follows:
```python
from adopt import MultiHead, ZScorePred

# Prepare protein sequence and name i.e. brmid
SEQUENCE = "SLQDGVRQSRASDKQTLLPNDQLYQPLKDREDDQYSHLQGNQLRRN"
PROTID = "Protein 18890"

# Choose model type and training strategy
MODEL_TYPE = "esm-1b"
STRATEGY = "train_on_cleared_1325_test_on_117_residue_split"

# Extract residue-level representations
multi_head = MultiHead(MODEL_TYPE)
representation, tokens = multi_head.get_representation(SEQUENCE, PROTID)

# Predict the Z score related to each residue in the sequence specified above
z_score_pred = ZScorePred(STRATEGY, MODEL_TYPE)
predicted_z_scores = z_score_pred.get_z_score(representation)
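The predicted Z scores can then be turned into a per-residue disorder call. A minimal sketch, assuming the commonly used CheZoD threshold of Z < 8 for disorder (the threshold is our assumption here, not part of the ADOPT API):

```python
# Z < 8 is a commonly used disorder cutoff in the CheZoD literature;
# adjust the threshold to your own needs.
Z_THRESHOLD = 8.0

def disorder_mask(z_scores, threshold=Z_THRESHOLD):
    """Return True for residues whose predicted Z score indicates disorder."""
    return [float(z) < threshold for z in z_scores]

# e.g. mask = disorder_mask(predicted_z_scores) with the scores obtained above
mask = disorder_mask([3.2, 10.5, 7.9])
```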
In order to enable the esm-msa-based variant of ADOPT, MSAs for each sequence are also required. We provide a stand-alone, Docker-based tool you must use to exploit all the functionalities of ADOPT for MSA-related tasks.
As a prerequisite, you must have Docker installed.
Clone the ADOPT repository, go to the ADOPT directory and run the MSA scripts you are interested in.
The `$LOCAL_MSA_DIR` in the MSA scripts serves as the main directory for the MSA-related procedures and can initially be empty when running the above scripts. Under the hood, each MSA script will:
- Download the uniclust dataset (in this case "2020.06") into the `$LOCAL_MSA_DIR/databases` subdirectory.

  NOTE: under the hood, ADOPT checks whether uniclust is already in this subdirectory. If not, downloading can take several hours, since this dataset is approximately 180 GB. The download step is skipped only if the `$LOCAL_MSA_DIR/databases` folder is non-empty and the tar file (`UniRef30_2020_06_hhsuite.tar.gz`) is found in the `$LOCAL_MSA_DIR` folder.
- Once the relevant uniclust is there, run a Docker image named `msa-gen-adopt` with the volume `$LOCAL_MSA_DIR` mounted on it.
Note that this setup procedure creates four subfolders:

```
+-- $LOCAL_MSA_DIR
|   +-- databases
|   +-- msas
|   +-- msa_fastas
|   +-- esm_msa_reprs
```

- `databases` will hold the uniclust;
- `msas` is where MSAs (`.a3m` files) will be saved later, see STEP 2 below;
- `msa_fastas` is where `.fasta` files already used for MSA queries will be saved;
- `esm_msa_reprs` is allocated for potential esm-msa representations.
The MSAs will be placed in the `$LOCAL_MSA_DIR/msas` folder.
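If you want to prepare this layout up front (the MSA scripts also create it automatically), the four subfolders can be created in one go; `$LOCAL_MSA_DIR` is whatever path you chose for the MSA working directory:

```shell
# Create the working-directory layout expected by the MSA scripts.
LOCAL_MSA_DIR="${LOCAL_MSA_DIR:-./adopt_msa}"
mkdir -p "$LOCAL_MSA_DIR"/databases \
         "$LOCAL_MSA_DIR"/msas \
         "$LOCAL_MSA_DIR"/msa_fastas \
         "$LOCAL_MSA_DIR"/esm_msa_reprs
```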
You can set `ESM_MODELS_DIR` and `ADOPT_MODELS_DIR` to the paths where the ESM and ADOPT pre-trained models are stored, respectively. All models will be downloaded from public repositories if not found locally.
The scripts directory contains:
- an inference script to predict, in bulk, the disorder of each residue in each protein sequence reported in a FASTA file with ADOPT, where you need to specify:
  - `NEW_PROT_FASTA_FILE_PATH` defining your FASTA file path
  - `NEW_PROT_RES_REPR_DIR_PATH` defining where the residue-level representations will be extracted
- a training script to train ADOPT, where you need to specify:
  - `TRAIN_STRATEGY` defining the training strategy you want to use
- an MSA inference script, which allows you to perform inference with the esm-msa model as well. The predicted Z scores will be written on the host (optional)
- an MSA training script, which allows you to perform training with the esm-msa model as well. The trained models will be written in the `ADOPT/models` directory (optional)
The notebooks directory contains:
- disorder prediction notebook
- multi-head attention weights visualisation notebook
In order to predict the Z score related to each residue in a protein sequence, we have to compute the residue-level representations extracted from the pre-trained model.
In the ADOPT directory run:

```bash
python adopt/embedding.py <fasta_file_path> \
                          <residue_level_representation_dir>
```

Where:
- `<fasta_file_path>` defines the FASTA file containing the proteins for which you want to compute the intrinsic disorder
- `<residue_level_representation_dir>` defines the path where you want to save the residue-level representations
- `--msa` runs the MSA procedure to get esm-msa representations. We suggest you take a look at the MSA inference script as a quick example (optional)
- `-h` shows the help message and exits
A subdirectory containing the residue-level representations extracted from each available pre-trained model will be created under the `residue_level_representation_dir`.
Note that in order to obtain the representations from the esm-msa model as well, the relevant MSAs have to be placed in the root directory `/msas` of the system where ADOPT is running. The MSAs can be created as described in the MSA section above.
Once we have extracted the residue-level representations, we can predict the intrinsic disorder (Z score).
In the ADOPT directory run:

```bash
python adopt/inference.py <inference_fasta_file> \
                          <inference_repr_dir> \
                          <predicted_z_scores_file> \
                          --train_strategy <training_strategy> \
                          --model_type <model_type>
```

Where:
- `<inference_fasta_file>` defines the FASTA file containing the proteins for which you want to compute the intrinsic disorder
- `<inference_repr_dir>` defines the path where you've already saved the residue-level representations
- `<predicted_z_scores_file>` defines the path where you want the Z scores to be saved
- `--train_strategy` defines one of the training strategies listed below
- `--model_type` defines the pre-trained model we want to use. We suggest you use the esm-1b model
- `-h` shows the help message and exits
The output is a `.json` file containing the Z scores related to each residue of each protein in the FASTA file.
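As a quick sanity check, the output file can be inspected with the standard library. The schema below (protein id mapped to a list of per-residue Z scores) is an assumption for illustration; inspect your own output file to confirm it:

```python
import json

# Hypothetical payload mimicking a predicted-Z-scores file; the real schema
# may differ, so treat this as an illustration only.
sample = '{"Protein 18890": [3.1, 5.4, 12.7]}'
z_scores = json.loads(sample)

for prot_id, scores in z_scores.items():
    # Z < 8 is the commonly used CheZoD disorder cutoff (an assumption here)
    disordered = [i for i, z in enumerate(scores) if z < 8.0]
    print(prot_id, "disordered residue indices:", disordered)
```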
Training strategy | Pre-trained models |
---|---|
`train_on_cleared_1325_test_on_117_residue_split` | esm-1b, esm-1v, esm-msa and combined |
`train_on_1325_cv_residue_split` | esm-1b, esm-1v and esm-msa |
`train_on_cleared_1325_cv_residue_split` | esm-1b, esm-1v and esm-msa |
`train_on_cleared_1325_cv_sequence_split` | esm-1b, esm-1v and esm-msa |
`train_on_total` | esm-1b, esm-1v |
Once we have extracted the residue-level representations of the proteins for which we want to predict the intrinsic disorder (Z score), we can train the predictor.
NOTE: This step is not mandatory because we've already trained such models. You can find them in the models bucket.
In the ADOPT directory run:

```bash
python adopt/training.py <train_json_file_path> \
                         <test_json_file_path> \
                         <train_residue_level_representation_dir> \
                         <test_residue_level_representation_dir> \
                         --train_strategy <training_strategy>
```

Where:
- `<train_json_file_path>` defines the JSON file containing the proteins we want to use as the training set
- `<test_json_file_path>` defines the JSON file containing the proteins we want to use as the test set
- `<train_residue_level_representation_dir>` defines the path where we saved the residue-level representations of the proteins in the training set
- `<test_residue_level_representation_dir>` defines the path where we saved the residue-level representations of the proteins in the test set
- `--train_strategy` defines one of the training strategies listed above
- `--msa` runs the MSA procedure to get trained models fed with the esm-msa representations. We suggest you take a look at the MSA training script as a quick example (optional)
- `-h` shows the help message and exits
Once we have extracted the residue-level representations, we can benchmark ADOPT against other methods.
In the ADOPT directory run:

```bash
python adopt/benchmarks.py <benchmark_data_path> \
                           <train_json_file_path> \
                           <test_json_file_path> \
                           <train_residue_level_representation_dir> \
                           <test_residue_level_representation_dir> \
                           --train_strategy <training_strategy>
```

Where:
- `<benchmark_data_path>` defines the directory containing the predictions of the method we want to benchmark against ADOPT
- `-h` shows the help message and exits
We benchmarked ADOPT against AlphaFold2 by computing the Spearman correlations between actual Z scores and predicted pLDDT scores, and between actual Z scores and predicted SASA scores, both obtained from AlphaFold2, for the model evaluated on the CheZoD “117” validation set and described in the ADOPT paper.
As a prerequisite, you must have Docker installed.
Run:

```bash
docker run ghcr.io/peptoneinc/adopt_alphafold2_comparison:1.0.2
```
Here is the script used to extract the correlations and here are the predictions obtained from Alphafold2.
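The correlation computation itself is straightforward. A self-contained sketch of Spearman rank correlation follows; `scipy.stats.spearmanr` is the usual tool, but this stdlib-only version (no tie handling) keeps the example dependency-free, and the arrays are made-up placeholders, not the actual benchmark data:

```python
# Spearman rank correlation between actual Z scores and AlphaFold2-derived
# scores (pLDDT or SASA): Pearson correlation computed on the ranks.
def spearman(x, y):
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank + 1.0
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Placeholder arrays: real per-residue Z scores and pLDDT values would come
# from the benchmark predictions linked above.
actual_z = [2.1, 4.5, 9.8, 12.3, 15.0]
plddt = [30.0, 42.5, 71.0, 85.2, 93.1]
rho = spearman(actual_z, plddt)
print(f"Spearman rho = {rho:.3f}")
```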
If you use this work in your research, please cite the relevant paper:
```bibtex
@article{10.1093/nargab/lqad041,
  author  = {Redl, Istvan and Fisicaro, Carlo and Dutton, Oliver and Hoffmann, Falk and Henderson, Louie and Owens, Benjamin M J and Heberling, Matthew and Paci, Emanuele and Tamiola, Kamil},
  title   = "{ADOPT: intrinsic protein disorder prediction through deep bidirectional transformers}",
  journal = {NAR Genomics and Bioinformatics},
  volume  = {5},
  number  = {2},
  year    = {2023},
  month   = {05},
  abstract = "{Intrinsically disordered proteins (IDPs) are important for a broad range of biological functions and are involved in many diseases. An understanding of intrinsic disorder is key to develop compounds that target IDPs. Experimental characterization of IDPs is hindered by the very fact that they are highly dynamic. Computational methods that predict disorder from the amino acid sequence have been proposed. Here, we present ADOPT (Attention DisOrder PredicTor), a new predictor of protein disorder. ADOPT is composed of a self-supervised encoder and a supervised disorder predictor. The former is based on a deep bidirectional transformer, which extracts dense residue-level representations from Facebook’s Evolutionary Scale Modeling library. The latter uses a database of nuclear magnetic resonance chemical shifts, constructed to ensure balanced amounts of disordered and ordered residues, as a training and a test dataset for protein disorder. ADOPT predicts whether a protein or a specific region is disordered with better performance than the best existing predictors and faster than most other proposed methods (a few seconds per sequence). We identify the features that are relevant for the prediction performance and show that good performance can already be gained with \\<100 features. ADOPT is available as a stand-alone package at https://github.com/PeptoneLtd/ADOPT and as a web server at https://adopt.peptone.io/.}",
  issn    = {2631-9268},
  doi     = {10.1093/nargab/lqad041},
  url     = {https://doi.org/10.1093/nargab/lqad041},
  note    = {lqad041},
  eprint  = {https://academic.oup.com/nargab/article-pdf/5/2/lqad041/50150244/lqad041.pdf},
}
```
This source code is licensed under the MIT license found in the LICENSE
file in the root directory of this source tree.
A new version of ADOPT has been trained as described in Optimizing protein language models with Sentence Transformers, NeurIPS (2023). Code is available at https://github.com/PeptoneLtd/contrastive-finetuning-plms