Attention based DisOrder PredicTor

This repository contains the code and the trained models for intrinsic protein disorder prediction through deep bidirectional transformers from Peptone Ltd.

ADOPT was introduced in our paper ADOPT: intrinsic protein disorder prediction through deep bidirectional transformers, and is also available as a web server at adopt.peptone.io.

Our disorder predictor is made up of two main blocks: a self-supervised encoder and a supervised disorder predictor. We use Facebook’s Evolutionary Scale Modeling (ESM) library to extract dense residue level representations, which feed the supervised machine learning based predictor.

The ESM library exploits a set of deep Transformer encoder models, which process character sequences of amino acids as inputs.

ADOPT makes use of two datasets: the CheZoD “1325” and the CheZoD “117” databases containing 1325 and 117 sequences, respectively, together with their residue level Z-scores.

Intrinsic disorder trained models

| Model | Pre-trained model | Datasets | Split level | CV |
| --- | --- | --- | --- | --- |
| lasso_esm-1b_cleared_residue | ESM-1b | CheZoD 1325 cleared and CheZoD 117 | residue | no |
| lasso_esm-1v_cleared_residue | ESM-1v | CheZoD 1325 cleared and CheZoD 117 | residue | no |
| lasso_esm-msa_cleared_residue | ESM-MSA | CheZoD 1325 cleared and CheZoD 117 | residue | no |
| lasso_combined_cleared_residue | Combined | CheZoD 1325 cleared and CheZoD 117 | residue | no |
| lasso_esm-1b_residue_cv | ESM-1b | CheZoD 1325 | residue | yes |
| lasso_esm-1v_residue_cv | ESM-1v | CheZoD 1325 | residue | yes |
| lasso_esm-msa_residue_cv | ESM-MSA | CheZoD 1325 | residue | yes |
| lasso_esm-1b_cleared_residue_cv | ESM-1b | CheZoD 1325 cleared | residue | yes |
| lasso_esm-1v_cleared_residue_cv | ESM-1v | CheZoD 1325 cleared | residue | yes |
| lasso_esm-msa_cleared_residue_cv | ESM-MSA | CheZoD 1325 cleared | residue | yes |
| lasso_esm-1b_cleared_sequence_cv | ESM-1b | CheZoD 1325 cleared | sequence | yes |
| lasso_esm-1v_cleared_sequence_cv | ESM-1v | CheZoD 1325 cleared | sequence | yes |
| lasso_esm-msa_cleared_sequence_cv | ESM-MSA | CheZoD 1325 cleared | sequence | yes |

Usage

Quick start

Prerequisites (we suggest creating a dedicated python venv or conda env)

pip install \
  pandas \
  fair-esm \
  biopython \
  bertviz \
  skl2onnx \
  onnxruntime \
  spacy \
  plotly \
  wandb 

Install the adopt package:

Run

git clone https://github.com/PeptoneInc/ADOPT.git
cd ADOPT
git submodule update --init --recursive
python setup.py install

Then you can predict the intrinsic disorder of each residue in a protein sequence as follows:

from adopt import MultiHead, ZScorePred

# Prepare the protein sequence and its name, i.e. brmid
SEQUENCE = "SLQDGVRQSRASDKQTLLPNDQLYQPLKDREDDQYSHLQGNQLRRN"
PROTID = "Protein 18890"

# Choose model type and training strategy
MODEL_TYPE = "esm-1b"
STRATEGY = "train_on_cleared_1325_test_on_117_residue_split"

# Extract residue level representations
multi_head = MultiHead(MODEL_TYPE)
representation, tokens = multi_head.get_representation(SEQUENCE, PROTID)

# Predict the Z score related to each residue in the sequence specified above
z_score_pred = ZScorePred(STRATEGY, MODEL_TYPE)
predicted_z_scores = z_score_pred.get_z_score(representation)
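
The prediction yields one Z score per residue. As a minimal sketch of how those values might be consumed, the helper below pairs each residue with its score and attaches a coarse label. It is not part of the adopt package: the name `label_residues` and the default cutoff of 8.0 are assumptions made for illustration (lower Z scores indicate more disorder), so choose the threshold that suits your analysis.

```python
from typing import List, Sequence, Tuple


def label_residues(
    sequence: str,
    z_scores: Sequence[float],
    threshold: float = 8.0,  # assumed cutoff, not taken from the adopt package
) -> List[Tuple[int, str, float, str]]:
    """Pair each residue with its Z score and a coarse order/disorder label."""
    if len(sequence) != len(z_scores):
        raise ValueError("expected exactly one Z score per residue")
    labelled = []
    for position, (residue, z) in enumerate(zip(sequence, z_scores), start=1):
        label = "disordered" if z < threshold else "ordered"
        labelled.append((position, residue, float(z), label))
    return labelled


# Illustrative values only; in practice pass SEQUENCE and predicted_z_scores.
example = label_residues("SLQD", [1.2, 9.5, 7.9, 8.0])
for position, residue, z, label in example:
    print(f"{position}\t{residue}\t{z:.2f}\t{label}")
```

The length check is there because a mismatch between the sequence and the score vector usually means the wrong representation was passed in.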

MSA setting (optional)

To enable the esm-msa based variant of ADOPT, an MSA is also required for each sequence. We provide a standalone, Docker-based tool, which you must use to access the MSA-related functionality of ADOPT.

First time setup

As a prerequisite, you must have Docker installed.

Clone the ADOPT repository, go to the ADOPT directory and run the MSA scripts you are interested in.

Notes

The $LOCAL_MSA_DIR in the MSA scripts serves as the main directory for the MSA related procedures and can be empty initially when running the above scripts. Under the hood, each MSA script will:

  1. Download the uniclust dataset (in this case "2020.06") into the $LOCAL_MSA_DIR/databases subdirectory.
    NOTE: ADOPT first checks whether uniclust is already in this subdirectory. If it is not, the download can take several hours, since the dataset is approximately 180 GB. The download step is skipped only if the $LOCAL_MSA_DIR/databases folder is non-empty and the tar file (UniRef30_2020_06_hhsuite.tar.gz) is found in the $LOCAL_MSA_DIR folder.

  2. Once the relevant uniclust is there, a docker image named msa-gen-adopt is run with the volume $LOCAL_MSA_DIR mounted on it.
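
Because the download is large, it can help to check up front whether it will be triggered. The sketch below mirrors the skip rule exactly as worded in the note above; it is not the logic from the MSA scripts themselves, and the function name `uniclust_download_needed` is hypothetical.

```python
from pathlib import Path

# File name quoted in the note above.
UNICLUST_TAR = "UniRef30_2020_06_hhsuite.tar.gz"


def uniclust_download_needed(local_msa_dir: str) -> bool:
    """Return True unless both skip conditions described in the note are met."""
    root = Path(local_msa_dir)
    databases = root / "databases"
    # Condition 1: the databases subfolder exists and holds at least one entry.
    databases_non_empty = databases.is_dir() and any(databases.iterdir())
    # Condition 2: the tar file sits directly in the main MSA directory.
    tar_present = (root / UNICLUST_TAR).is_file()
    return not (databases_non_empty and tar_present)
```

Calling `uniclust_download_needed("/path/to/local_msa_dir")` before launching an MSA script gives a quick indication of whether to expect the multi-hour download.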

Note that this setup procedure creates four subfolders:

+-- $LOCAL_MSA_DIR
|   +-- databases
|   +-- msas
|   +-- msa_fastas
|   +-- esm_msa_reprs

  • databases will hold the uniclust
  • msas is where the MSAs (.a3m files) will be saved later, see STEP 2 below
  • msa_fastas is where the .fasta files already used for MSA queries will be saved
  • esm_msa_reprs is allocated for potential esm-msa representations

The MSAs will be placed in the $LOCAL_MSA_DIR/msas folder.
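
To inspect the state of the MSA directory after a run, a small helper along these lines may be useful. It is only a sketch built on the folder layout described above; `describe_msa_dir` and `list_msas` are hypothetical names, not functions shipped with ADOPT.

```python
from pathlib import Path
from typing import Dict, List

# The four subfolders listed above.
MSA_SUBFOLDERS = ("databases", "msas", "msa_fastas", "esm_msa_reprs")


def describe_msa_dir(local_msa_dir: str) -> Dict[str, int]:
    """Count the entries in each expected subfolder (-1 if it is missing)."""
    root = Path(local_msa_dir)
    summary = {}
    for name in MSA_SUBFOLDERS:
        folder = root / name
        summary[name] = len(list(folder.iterdir())) if folder.is_dir() else -1
    return summary


def list_msas(local_msa_dir: str) -> List[Path]:
    """Return the .a3m files found in the msas subfolder, sorted by path."""
    return sorted((Path(local_msa_dir) / "msas").glob("*.a3m"))
```

A count of -1 distinguishes a folder that was never created from one that exists but is still empty.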

More notes

You can set ESM_MODELS_DIR and ADOPT_MODELS_DIR to the paths where the ESM and ADOPT pretrained models are stored, respectively. Any model not found locally will be downloaded from public repositories.

Scripts

The scripts directory contains:

  • inference script to predict, in bulk, the disorder of each residue of each protein sequence in a FASTA file with ADOPT, where you need to specify:
    • NEW_PROT_FASTA_FILE_PATH defining your FASTA file path
    • NEW_PROT_RES_REPR_DIR_PATH defining where the residue level representations will be extracted
  • training script to train ADOPT, where you need to specify:
    • TRAIN_STRATEGY defining the training strategy you want to use
  • MSA inference script (optional), which also allows inference with the esm-msa model. The predicted Z scores will be written on the host
  • MSA training script (optional), which also allows training with the esm-msa model. The trained models will be written in the ADOPT/models directory

Notebooks

The notebooks directory contains:

Compute residue level representations

In order to predict the Z score related to each residue in a protein sequence, we have to compute the residue level representations, extracted from the pretrained model.

In the ADOPT directory run:

python adopt/embedding.py <fasta_file_path> \
                          <residue_level_representation_dir>

Where:

  • <fasta_file_path> defines the FASTA file containing the proteins for which you want to compute the intrinsic disorder
  • <residue_level_representation_dir> defines the path where you want to save the residue level representations
  • --msa runs the MSA procedure to get esm-msa representations. We suggest you take a look at the MSA inference script as a quick example (optional)
  • -h shows the help message and exits

A subdirectory containing the residue level representations extracted from each available pre-trained model will be created under the residue_level_representation_dir.

Note that, to obtain representations from the esm-msa model as well, the relevant MSAs have to be placed in the root directory /msas of the system where ADOPT is running. The MSAs can be created as described in the MSA setting above.
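
Before computing representations for a large FASTA file, it can be worth confirming which records it contains. The following is a minimal standard-library sketch for that check; `read_fasta` is a hypothetical helper, it takes the first word after `>` as the record name, and it says nothing about how ADOPT itself derives protein names from headers.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {record name: sequence}."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            if current in records:
                raise ValueError(f"duplicate record name: {current!r}")
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequences may be wrapped over several lines; join them.
            records[current] += line
    return records


example = read_fasta([">prot1 test", "SLQDGV", "RQSRAS", ">prot2", "MKT"])
for name, sequence in example.items():
    print(name, len(sequence))
```

For a file on disk, pass the open file handle, for example `read_fasta(open(path))`, since iterating over it yields lines.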

Predict intrinsic disorder with ADOPT

Once we have extracted the residue level representations we can predict the intrinsic disorder (Z score).

In the ADOPT directory run:

python adopt/inference.py <inference_fasta_file> \
                          <inference_repr_dir> \
                          <predicted_z_scores_file> \
                          --train_strategy <training_strategy> \
                          --model_type <model_type> 

Where:

  • <inference_fasta_file> defines the FASTA file containing the proteins for which you want to compute the intrinsic disorder
  • <inference_repr_dir> defines the path where you've already saved the residue level representations
  • <predicted_z_scores_file> defines the path where you want the Z scores to be saved
  • --train_strategy selects one of the training strategies listed below
  • --model_type defines the pre-trained model we want to use. We suggest you use the esm-1b model
  • -h shows the help message and exits

The output is a .json file containing the Z scores for each residue of each protein in the FASTA file you provided.

| Training strategy | Pre-trained models |
| --- | --- |
| train_on_cleared_1325_test_on_117_residue_split | esm-1b, esm-1v, esm-msa and combined |
| train_on_1325_cv_residue_split | esm-1b, esm-1v and esm-msa |
| train_on_cleared_1325_cv_residue_split | esm-1b, esm-1v and esm-msa |
| train_on_cleared_1325_cv_sequence_split | esm-1b, esm-1v and esm-msa |
| train_on_total | esm-1b, esm-1v |
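
Not every pre-trained model is available for every strategy, so it may be convenient to validate the pair before launching inference. The sketch below transcribes the table above into a lookup and assembles the argument list for adopt/inference.py using only the arguments documented in this section. `check_combination` and `inference_command` are hypothetical helper names and are not part of the adopt package.

```python
import sys
from typing import Dict, FrozenSet, List

_SINGLE_MODELS = frozenset({"esm-1b", "esm-1v", "esm-msa"})

# Transcribed from the strategy table above.
SUPPORTED_MODELS: Dict[str, FrozenSet[str]] = {
    "train_on_cleared_1325_test_on_117_residue_split": _SINGLE_MODELS | {"combined"},
    "train_on_1325_cv_residue_split": _SINGLE_MODELS,
    "train_on_cleared_1325_cv_residue_split": _SINGLE_MODELS,
    "train_on_cleared_1325_cv_sequence_split": _SINGLE_MODELS,
    "train_on_total": frozenset({"esm-1b", "esm-1v"}),
}


def check_combination(train_strategy: str, model_type: str) -> None:
    """Raise ValueError if the table does not list this strategy/model pair."""
    if train_strategy not in SUPPORTED_MODELS:
        raise ValueError(f"unknown training strategy: {train_strategy!r}")
    if model_type not in SUPPORTED_MODELS[train_strategy]:
        allowed = ", ".join(sorted(SUPPORTED_MODELS[train_strategy]))
        raise ValueError(
            f"{model_type!r} is not listed for {train_strategy!r}; listed: {allowed}"
        )


def inference_command(
    fasta_file: str,
    repr_dir: str,
    z_scores_file: str,
    train_strategy: str,
    model_type: str = "esm-1b",
) -> List[str]:
    """Build the argument list for adopt/inference.py after validating the pair."""
    check_combination(train_strategy, model_type)
    return [
        sys.executable, "adopt/inference.py",
        fasta_file, repr_dir, z_scores_file,
        "--train_strategy", train_strategy,
        "--model_type", model_type,
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from within the ADOPT directory. Keeping validation separate from execution makes it possible to reject an unsupported pair before any representations are loaded.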

Train ADOPT disorder predictor

Once we have extracted the residue level representations of the protein for which we want to predict the intrinsic disorder (Z score), we can train the predictor.

NOTE: This step is not mandatory because we've already trained such models. You can find them in the models bucket.

In the ADOPT directory run:

python adopt/training.py <train_json_file_path> \
                         <test_json_file_path> \
                         <train_residue_level_representation_dir> \
                         <test_residue_level_representation_dir> \
                         --train_strategy <training_strategy> 

Where:

  • <train_json_file_path> defines the JSON containing the proteins we want to use as training set
  • <test_json_file_path> defines the JSON containing the proteins we want to use as test set
  • <train_residue_level_representation_dir> defines the path where we saved the residue level representations of the proteins in the training set
  • <test_residue_level_representation_dir> defines the path where we saved the residue level representations of the proteins in the test set
  • --train_strategy selects one of the training strategies listed above
  • --msa runs the MSA procedure to get trained models fed with the esm-msa representations. We suggest you take a look at the MSA training script as a quick example (optional)
  • -h shows the help message and exits

Run benchmarks

Once we have extracted the residue level representations we can benchmark ADOPT against other methods.

In the ADOPT directory run:

python adopt/benchmarks.py <benchmark_data_path> \
                           <train_json_file_path> \
                           <test_json_file_path> \
                           <train_residue_level_representation_dir> \
                           <test_residue_level_representation_dir> \
                           --train_strategy <training_strategy> 

Where:

  • <benchmark_data_path> defines the directory containing the predictions of the method we want to benchmark against ADOPT
  • -h shows the help message and exits

AlphaFold2 benchmarks (optional)

We benchmarked ADOPT against AlphaFold2 by computing the Spearman correlations between the actual Z-scores and the predicted pLDDT5 scores, and between the actual Z-scores and the predicted SASA5 scores. Both sets of scores were obtained from AlphaFold2 for the task evaluated on the CheZoD “117” validation set, as described in the ADOPT paper.

As a prerequisite, you must have Docker installed.

Run:

docker run ghcr.io/peptoneinc/adopt_alphafold2_comparison:1.0.2

Here is the script used to extract the correlations and here are the predictions obtained from AlphaFold2.
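
For readers who want to see what the correlation measures, here is a self-contained sketch of Spearman's rank correlation: both series are converted to ranks (ties receive the mean of their ranks) and the Pearson correlation of the ranks is returned. This is not the script used for the benchmark, and the numbers in the example are made up for illustration.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up per-residue values, for illustration only.
toy_z_scores = [2.0, 5.5, 9.1, 12.3]
toy_plddt = [35.0, 60.2, 88.0, 91.5]
print(spearman(toy_z_scores, toy_plddt))
```

Because only ranks enter the calculation, the result reflects monotonic agreement between the two series rather than a linear relationship.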

Citations

If you use this work in your research, please cite the relevant paper:

@article{10.1093/nargab/lqad041,
    author = {Redl, Istvan and Fisicaro, Carlo and Dutton, Oliver and Hoffmann, Falk and Henderson, Louie and Owens, Benjamin M J and Heberling, Matthew and Paci, Emanuele and Tamiola, Kamil},
    title = "{ADOPT: intrinsic protein disorder prediction through deep bidirectional transformers}",
    journal = {NAR Genomics and Bioinformatics},
    volume = {5},
    number = {2},
    year = {2023},
    month = {05},
    abstract = "{Intrinsically disordered proteins (IDPs) are important for a broad range of biological functions and are involved in many diseases. An understanding of intrinsic disorder is key to develop compounds that target IDPs. Experimental characterization of IDPs is hindered by the very fact that they are highly dynamic. Computational methods that predict disorder from the amino acid sequence have been proposed. Here, we present ADOPT (Attention DisOrder PredicTor), a new predictor of protein disorder. ADOPT is composed of a self-supervised encoder and a supervised disorder predictor. The former is based on a deep bidirectional transformer, which extracts dense residue-level representations from Facebook’s Evolutionary Scale Modeling library. The latter uses a database of nuclear magnetic resonance chemical shifts, constructed to ensure balanced amounts of disordered and ordered residues, as a training and a test dataset for protein disorder. ADOPT predicts whether a protein or a specific region is disordered with better performance than the best existing predictors and faster than most other proposed methods (a few seconds per sequence). We identify the features that are relevant for the prediction performance and show that good performance can already be gained with <100 features. ADOPT is available as a stand-alone package at https://github.com/PeptoneLtd/ADOPT and as a web server at https://adopt.peptone.io/.}",
    issn = {2631-9268},
    doi = {10.1093/nargab/lqad041},
    url = {https://doi.org/10.1093/nargab/lqad041},
    note = {lqad041},
    eprint = {https://academic.oup.com/nargab/article-pdf/5/2/lqad041/50150244/lqad041.pdf},
}

Licence

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

Update

A new version of ADOPT has been trained as described in Optimizing protein language models with Sentence Transformers, NeurIPS (2023). Code is available at https://github.com/PeptoneLtd/contrastive-finetuning-plms