
Commit

Merge pull request #20 from PeptoneInc/develop
Develop
CFisicaro authored Nov 22, 2021
2 parents 09e6182 + 23dec76 commit 30813e2
Showing 6 changed files with 191 additions and 177 deletions.
4 changes: 2 additions & 2 deletions CITATION.cff
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
cff-version: 0.2.1
cff-version: 0.3.0
message: "If you use this software, please cite it as below."
authors:
- given-names: "Kamil Tamiola"
@@ -7,7 +7,7 @@ authors:
affiliation: "Peptone Ltd."
orcid: ""
title: "Attention based DisOrder PredicTor"
version: 0.2.1
version: 0.3.0
doi:
date-released:
url: "https://github.com/PeptoneInc/ADOPT"
46 changes: 23 additions & 23 deletions README.md
@@ -8,7 +8,7 @@ ADOPT has been introduced in our paper [ADOPT: intrinsic protein disorder predic

<img src="https://github.com/PeptoneInc/ADOPT/blob/main/media/adopt_attention.gif" width="600"/>

Our disorder predictor is made up of two main blocks, namely: a **self-supervised encoder** and a **supervised disorder predictor**. We use [Facebook’s Evolutionary Scale Modeling (ESM)](https://github.com/facebookresearch/esm) library to extract dense residue evel representations, which feed the supervised machine learning based predictor.
Our disorder predictor is made up of two main blocks, namely: a **self-supervised encoder** and a **supervised disorder predictor**. We use [Facebook’s Evolutionary Scale Modeling (ESM)](https://github.com/facebookresearch/esm) library to extract dense residue level representations, which feed the supervised machine learning based predictor.

The ESM library exploits a set of deep Transformer encoder models, which process character sequences of amino acids as inputs.

@@ -54,7 +54,7 @@ Install the **adopt** package:
Clone the ADOPT repository, go to the ADOPT directory and run

```bash
$ python setup.py install
python setup.py install
```

Then, you can predict the intrinsic disorder of each residue in a protein sequence, as follows:
@@ -82,6 +82,7 @@ predicted_z_scores = z_score_pred.get_z_score(representation)
### Scripts

The [scripts](scripts) directory contains:

* [inference](scripts/adopt_inference.sh) script to predict, in bulk, the disorder of each residue in each protein sequence reported in a FASTA file, with ADOPT where you need to specify:
- `NEW_PROT_FASTA_FILE_PATH` defining your FASTA file path
- `NEW_PROT_RES_REPR_DIR_PATH` defining where the residue level representations will be extracted
@@ -91,45 +92,47 @@ The [scripts](scripts) directory contains:
### Notebooks

The [notebooks](notebooks) directory contains:
* [disorder prediction](notebooks/adopt_disorder_prediction.ipynb) notebook

* [disorder prediction](notebooks/adopt_disorder_prediction.ipynb) notebook
* [multi-head attention weights visualisation](notebooks/adopt_attention_viz.ipynb) notebook

### Compute residue level representations

In order to predict the **Z score** related to each residue in a protein sequence, we have to compute the residue level representations, extracted from the pretrained model.
In order to predict the **Z score** related to each residue in a protein sequence, we have to compute the residue level representations, extracted from the pretrained model.

In the ADOPT directory run:

```bash
$ python embedding.py -f <fasta_file_path>
-r <residue_level_representation_dir>
python embedding.py -f <fasta_file_path> \
-r <residue_level_representation_dir>
```

Where:

* `-f` defines the FASTA file containing the proteins for which you want to compute the intrinsic disorder
* `-r` defines the path where you want to save the residue level representations

A subdirectory for each available pre-trained model, containing the residue level representations extracted from it, will be created under `residue_level_representation_dir`.
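The embedding step above can also be driven from Python. The following is a minimal sketch, assuming the command is launched from the ADOPT directory exactly as the README describes; the helper names `build_embedding_command` and `run_embedding` are hypothetical and are not part of the ADOPT package.

```python
import subprocess
import sys


def build_embedding_command(fasta_path, repr_dir):
    """Build the argument list for the documented embedding.py call.

    Hypothetical helper: only the -f and -r flags come from the README.
    """
    return [sys.executable, "embedding.py", "-f", str(fasta_path), "-r", str(repr_dir)]


def run_embedding(fasta_path, repr_dir):
    """Run the embedding step; check=True raises CalledProcessError on failure."""
    subprocess.run(build_embedding_command(fasta_path, repr_dir), check=True)


# Show the arguments that would follow the interpreter path, without running anything.
print(build_embedding_command("proteins.fasta", "representations")[1:])
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.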


### Predict intrinsic disorder with ADOPT

Once we have extracted the residue level representations we can predict the intrinsic disorder (Z score).

In the ADOPT directory run:

```bash
$ python inference.py -s <training_strategy>
-m <model_type>
-f <inference_fasta_file>
-r <inference_repr_dir>
-p <predicted_z_scores_file>
python inference.py -s <training_strategy> \
-m <model_type> \
-f <inference_fasta_file> \
-r <inference_repr_dir> \
-p <predicted_z_scores_file>
```

Where:
* `-s` defines the **training strategies** defined belowe

* `-s` defines the **training strategies** defined below
* `-m` defines the pre-trained model we want to use. We suggest you use the `esm-1b` model.
* `-f` defines the FASTA file containing the proteins for which you want to compute the intrinsic disorder
* `-m` defines the residue level representation of the pre-trained models we want to use. We suggest you use the `esm-1b` model.
* `-r` defines the path where you've already saved the residue level representations
* `-p` defines the path where you want the Z scores to be saved
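Since `inference.py` writes its results with `to_json(..., orient="records")`, the file given to `-p` should hold a JSON array with one object per row. The sketch below reads such a file with the standard library. The field names `protein_id` and `z_scores` are placeholders chosen for illustration; the real column names are not shown in this diff.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records standing in for the real output of inference.py.
records = [
    {"protein_id": "example_1", "z_scores": [12.3, 8.1, -0.4]},
    {"protein_id": "example_2", "z_scores": [3.2, 5.6]},
]

with tempfile.TemporaryDirectory() as tmp:
    # Write a stand-in file, then read it back the way a consumer would.
    path = Path(tmp) / "predicted_z_scores.json"
    path.write_text(json.dumps(records))
    loaded = json.loads(path.read_text())

# A records-oriented file loads as a list of dictionaries.
for record in loaded:
    print(record["protein_id"], len(record["z_scores"]))
```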

@@ -142,7 +145,6 @@ The output is a `.json` file contains the Z scores related to each residue of ea
| `train_on_cleared_1325_cv_residue_split`| `esm-1b` and `esm-1v` |
| `train_on_cleared_1325_cv_sequence_split`| `esm-1b` and `esm-1v` |


### Train ADOPT disorder predictor

Once we have extracted the residue level representations of the protein for which we want to predict the intrinsic disorder (Z score), we can train the predictor.
@@ -152,21 +154,21 @@ Once we have extracted the residue level representations of the protein for whic
In the ADOPT directory run:

```bash
$ python training.py -s <training_strategy>
-t <train_json_file_path>
-e <test_json_file_path>
-r <train_residue_level_representation_dir>
-p <test_residue_level_representation_dir>
python training.py -s <training_strategy> \
-t <train_json_file_path> \
-e <test_json_file_path> \
-r <train_residue_level_representation_dir> \
-p <test_residue_level_representation_dir>
```

Where:

* `-s` defines the **training strategies** defined above
* `-t` defines the JSON containing the proteins we want to use as *training set*
* `-e` defines the JSON containing the proteins we want to use as *test set*
* `-r` defines the path where we saved the residue level representations of the proteins in the *training set*
* `-p` defines the path where we saved the residue level representations of the proteins in the *test set*


## Citations <a name="citations"></a>

If you use this work in your research, please cite the relevant paper:
@@ -175,8 +177,6 @@ If you use this work in your research, please cite the the relevant paper:
@article{redl2021adopt}
```


## Licence

This source code is licensed under the MIT license found in the `LICENSE` file in the root directory of this source tree.

57 changes: 28 additions & 29 deletions adopt/embedding.py
@@ -3,14 +3,37 @@
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import getopt
import argparse
import subprocess
import sys
from pathlib import Path

from adopt import constants


def create_parser():
    parser = argparse.ArgumentParser(
        description="Extract residue level representations"
    )

    parser.add_argument(
        "-f",
        "--fasta_path",
        type=str,
        metavar="",
        required=True,
        help="FASTA file containing the proteins for which you want to compute the intrinsic disorder",
    )
    parser.add_argument(
        "-r",
        "--repr_dir",
        type=str,
        metavar="",
        required=True,
        help="Residue level representation directory",
    )
    return parser


# extract residue level representations of each protein sequence in the fasta file
def get_representations(fasta_file, repr_dir):
for esm_model in constants.esm_models:
@@ -40,31 +63,7 @@ def get_representations(fasta_file, repr_dir):
output, error = process.communicate()


def main(argv):
    try:
        opts, args = getopt.getopt(argv, "hf:r:", ["fasta_file=", "repr_dir="])
    except getopt.GetoptError:
        print(
            "usage: embedding.py"
            "-f <fasta_file_path>"
            "-r <residue_level_representation_dir>"
        )
        sys.exit(2)
    for opt, arg in opts:
        if opt == "-h":
            print(
                "usage: embedding.py"
                "-f <fasta_file_path>"
                "-r <residue_level_representation_dir>"
            )
            sys.exit()
        elif opt in ("-f", "--fasta_dir"):
            fasta_dir = arg
        elif opt in ("-r", "--repr_dir"):
            repr_dir = arg

    get_representations(fasta_dir, repr_dir)


if __name__ == "__main__":
    main(sys.argv[1:])
    parser = create_parser()
    args = parser.parse_args()
    get_representations(args.fasta_path, args.repr_dir)
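One practical effect of moving from `getopt` to `argparse` is that the parser can be exercised without touching `sys.argv`. The sketch below rebuilds a parser with the same two options as `create_parser()` in `embedding.py` (help strings shortened) and passes an explicit argument list; the file and directory names are made up for the example.

```python
import argparse


def create_parser():
    # Same options as the parser added to embedding.py in this commit.
    parser = argparse.ArgumentParser(
        description="Extract residue level representations"
    )
    parser.add_argument(
        "-f", "--fasta_path", type=str, metavar="", required=True,
        help="FASTA file path",
    )
    parser.add_argument(
        "-r", "--repr_dir", type=str, metavar="", required=True,
        help="Residue level representation directory",
    )
    return parser


# An explicit list replaces sys.argv[1:]; short and long flags can be mixed.
args = create_parser().parse_args(
    ["-f", "proteins.fasta", "--repr_dir", "representations"]
)
print(args.fasta_path, args.repr_dir)  # prints: proteins.fasta representations
```

Because both options are declared with `required=True`, omitting either one makes `argparse` print a usage message and exit with status 2, which replaces the hand-written usage strings of the old `main(argv)`.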
140 changes: 75 additions & 65 deletions adopt/inference.py
@@ -3,7 +3,7 @@
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import getopt
import argparse
import os
import sys

@@ -14,6 +14,54 @@
from adopt import constants, utils


def create_parser():
    parser = argparse.ArgumentParser(
        description="Predict the intrinsic disorder (Z score)"
    )

    parser.add_argument(
        "-s",
        "--train_strategy",
        type=str,
        metavar="",
        required=True,
        help="Training strategies",
    )
    parser.add_argument(
        "-m",
        "--model_type",
        type=str,
        metavar="",
        required=True,
        help="pre-trained model we want to use",
    )
    parser.add_argument(
        "-f",
        "--fasta_path",
        type=str,
        metavar="",
        required=True,
        help="FASTA file containing the proteins for which you want to compute the intrinsic disorder",
    )
    parser.add_argument(
        "-r",
        "--repr_dir",
        type=str,
        metavar="",
        required=True,
        help="Residue level representation directory",
    )
    parser.add_argument(
        "-p",
        "--pred_z_scores_path",
        type=str,
        metavar="",
        required=True,
        help="Path where you want the Z scores to be saved",
    )
    return parser


class ZScorePred:
    def __init__(self, strategy, model_type):
        self.strategy = strategy
@@ -93,73 +141,35 @@ def get_z_score_from_fasta(
df_results.to_json(predicted_z_scores_path, orient="records")


def main(argv):
    try:
        opts, args = getopt.getopt(
            argv,
            "hs:m:f:r:p:",
            [
                "train_strategy=",
                "model_type=",
                "infer_fasta_file=",
                "infer_repr_dir=",
                "pred_z_scores_file",
            ],
        )
    except getopt.GetoptError:
def main(args):
    if args.train_strategy not in constants.train_strategies:
        print("The training strategies are:")
        print(*constants.train_strategies, sep="\n")
        sys.exit(2)

    if (args.model_type not in constants.model_types) and (
        args.model_type != "combined"
    ):
        print("The pre-trained models are:")
        print(*constants.model_types, sep="\n")
        print("combined")
        sys.exit(2)

    if (args.train_strategy != "train_on_cleared_1325_test_on_117_residue_split") and (
        args.model_type == "combined"
    ):
        print(
            "usage: inference.py"
            "-s <training_strategy>"
            "-m <model_type>"
            "-f <inference_fasta_file>"
            "-r <inference_repr_dir>"
            "-p <predicted_z_scores_file>"
            "Only the train_on_cleared_1325_test_on_117_residue_split strategy "
            "is allowed with the <combined> model"
        )
        sys.exit(2)
    for opt, arg in opts:
        if opt == "-h":
            print(
                "usage: inference.py"
                "-s <training_strategy>"
                "-m <model_type>"
                "-f <inference_fasta_file>"
                "-r <inference_repr_dir>"
                "-p <predicted_z_scores_file>"
            )
            sys.exit()
        elif opt in ("-s", "--train_strategy"):
            train_strategy = arg
            if train_strategy not in constants.train_strategies:
                print("The training strategies are:")
                print(*constants.train_strategies, sep="\n")
                sys.exit(2)
        elif opt in ("-m", "--model_type"):
            model_type = arg
            if (model_type not in constants.model_types) and (model_type != "combined"):
                print("The pre-trained models are:")
                print(*constants.model_types, sep="\n")
                print("combined")
                sys.exit(2)
            if (
                train_strategy != "train_on_cleared_1325_test_on_117_residue_split"
            ) and (model_type == "combined"):
                print(
                    "Only the train_on_cleared_1325_test_on_117_residue_split strategy"
                    "is allowed with the <combined> model"
                )
                sys.exit()
        elif opt in ("-f", "--infer_fasta_file"):
            infer_fasta_file = arg
        elif opt in ("-r", "--infer_repr_dir"):
            infer_repr_dir = arg
        elif opt in ("-p", "--pred_z_scores_file"):
            pred_z_scores_file = arg

    z_score_pred = ZScorePred(train_strategy, model_type)
    z_score_pred.get_z_score_from_fasta(
        infer_fasta_file, infer_repr_dir, pred_z_scores_file
    )


if __name__ == "__main__":
    main(sys.argv[1:])
    parser = create_parser()
    args = parser.parse_args()
    main(args)
    z_score_pred = ZScorePred(args.train_strategy, args.model_type)
    z_score_pred.get_z_score_from_fasta(
        args.fasta_path, args.repr_dir, args.pred_z_scores_path
    )
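The three checks in the new `main(args)` can be read as one pure function that returns an error message instead of printing and exiting, which makes the rules easy to test. This is a sketch of that logic, not the code in the commit. The lists below are stand-ins assembled from names visible in the README table and in `inference.py`; the actual contents of `constants.train_strategies` and `constants.model_types` may differ.

```python
# Assumed stand-ins for adopt.constants; the real lists may contain more entries.
TRAIN_STRATEGIES = [
    "train_on_cleared_1325_test_on_117_residue_split",
    "train_on_cleared_1325_cv_residue_split",
    "train_on_cleared_1325_cv_sequence_split",
]
MODEL_TYPES = ["esm-1b", "esm-1v"]
COMBINED_STRATEGY = "train_on_cleared_1325_test_on_117_residue_split"


def validate(train_strategy, model_type):
    """Return an error message, or None when the combination is accepted."""
    if train_strategy not in TRAIN_STRATEGIES:
        return "The training strategies are:\n" + "\n".join(TRAIN_STRATEGIES)
    if model_type not in MODEL_TYPES and model_type != "combined":
        return "The pre-trained models are:\n" + "\n".join(MODEL_TYPES + ["combined"])
    if model_type == "combined" and train_strategy != COMBINED_STRATEGY:
        # Adjacent string literals are joined with no separator,
        # so the first one needs its own trailing space.
        return (
            f"Only the {COMBINED_STRATEGY} strategy "
            "is allowed with the <combined> model"
        )
    return None


print(validate("train_on_cleared_1325_cv_residue_split", "combined"))
```

A caller could print the returned message and call `sys.exit(2)` when it is not `None`, keeping the exit behaviour of `main(args)`.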