From 4adf0e6a82698e086970eef20c597b66ec8f74f2 Mon Sep 17 00:00:00 2001 From: Carlo Fisicaro <66260709+CFisicaro@users.noreply.github.com> Date: Mon, 22 Nov 2021 14:40:32 +0100 Subject: [PATCH 01/16] docs(README): no dollars in commands --- README.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/README.md b/README.md index 4e4a82d..2a261f3 100644 --- a/README.md +++ b/README.md @@ -54,7 +54,7 @@ Install the **adopt** package: Clone the ADOPT repository, go to the ADOPT directory and run ```bash -$ python setup.py install +python setup.py install ``` Then, you can predict the intrinsic disorder of each reesidue in a protein sequence, as follows: @@ -101,8 +101,8 @@ In order to predict the **Z score** related to each residue in a protein sequenc In the ADOPT directory run: ```bash -$ python embedding.py -f - -r +python embedding.py -f + -r ``` Where: @@ -119,11 +119,11 @@ Once we have extracted the residue level representations we can predict the intr In the ADOPT directory run: ```bash -$ python inference.py -s - -m - -f - -r - -p +python inference.py -s + -m + -f + -r + -p ``` Where: @@ -152,11 +152,11 @@ Once we have extracted the residue level representations of the protein for whic In the ADOPT directory run: ```bash -$ python training.py -s - -t - -e - -r - -p +python training.py -s + -t + -e + -r + -p ``` Where: From 224b42c05504372023a15ffebe75688ee33bca0e Mon Sep 17 00:00:00 2001 From: Carlo Fisicaro <66260709+CFisicaro@users.noreply.github.com> Date: Mon, 22 Nov 2021 15:22:09 +0100 Subject: [PATCH 02/16] style(README): lines --- README.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/README.md b/README.md index 2a261f3..df9aacd 100644 --- a/README.md +++ b/README.md @@ -14,6 +14,7 @@ The ESM library exploits a set of deep Transformer encoder models, which process ADOPT makes use of two datasets: the [CheZoD “1325” and the CheZoD “117”](https://github.com/protein-nmr/CheZOD) databases containing 1325 
and 117 sequences, respectively, together with their residue level **Z-scores**. + ## Table of Contents - [Attention based DisOrder PredicTor](#attention-based-disorder-predictor) @@ -43,6 +44,7 @@ ADOPT makes use of two datasets: the [CheZoD “1325” and the CheZoD “117 | `lasso_esm-1b_cleared_sequence_cv` | ESM-1b | **Chezod 1325 cleared** | residue | :white_check_mark: | | `lasso_esm-1v_cleared_sequence_cv` | ESM-1v | **Chezod 1325 cleared** | sequence | :white_check_mark: | + ## Usage ### Quick start @@ -79,6 +81,7 @@ z_score_pred = ZScorePred(STRATEGY, MODEL_TYPE) predicted_z_scores = z_score_pred.get_z_score(representation) ```` + ### Scripts The [scripts](scripts) directory contains: @@ -88,12 +91,14 @@ The [scripts](scripts) directory contains: * [training](scripts/adopt_chezod_training.sh) script to train the ADOPT where you need to specify: - `TRAIN_STRATEGY` defining the training strategy you want to use + ### Notebooks The [notebooks](notebooks) directory contains: * [disorder prediction](notebooks/adopt_disorder_prediction.ipynb) notebook * [multi-head attention weights visualisation](notebooks/adopt_attention_viz.ipynb) notebook + ### Compute residue level representations In order to predict the **Z score** related to each residue in a protein sequence, we have to compute the residue level representations, extracted from the pretrained model. 
From 8a7fd3946d82cb1d28c29590d0dc4ca4972560ca Mon Sep 17 00:00:00 2001 From: Carlo Fisicaro <66260709+CFisicaro@users.noreply.github.com> Date: Mon, 22 Nov 2021 15:39:54 +0100 Subject: [PATCH 03/16] style(README): spaces --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index df9aacd..0ff6305 100644 --- a/README.md +++ b/README.md @@ -30,6 +30,7 @@ ADOPT makes use of two datasets: the [CheZoD “1325” and the CheZoD “117 - [Citations](#citations) - [Licence](#licence) + ## Intrinsic disorder trained models | Model | Pre-trained model | Datasets | Split level | CV | @@ -184,4 +185,3 @@ If you use this work in your research, please cite the the relevant paper: ## Licence This source code is licensed under the MIT license found in the `LICENSE` file in the root directory of this source tree. - From 198e04a543f6bc054b3b230f162ccdcb0003f46c Mon Sep 17 00:00:00 2001 From: Carlo Fisicaro <66260709+CFisicaro@users.noreply.github.com> Date: Mon, 22 Nov 2021 15:46:56 +0100 Subject: [PATCH 04/16] style(README): spaces From 87dfb774f30824de4e9b93aaad2499bd89064fe3 Mon Sep 17 00:00:00 2001 From: Carlo Fisicaro <66260709+CFisicaro@users.noreply.github.com> Date: Mon, 22 Nov 2021 15:48:48 +0100 Subject: [PATCH 05/16] build(lint): remove md check --- .github/workflows/linter.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/linter.yml b/.github/workflows/linter.yml index cead368..9c8086d 100644 --- a/.github/workflows/linter.yml +++ b/.github/workflows/linter.yml @@ -56,5 +56,5 @@ jobs: IGNORE_GENERATED_FILES: true VALIDATE_PYTHON_BLACK: false VALIDATE_PYTHON_ISORT: false - FILTER_REGEX_EXCLUDE: /esm + FILTER_REGEX_EXCLUDE: /esm, *.md From 9a45a76958a9227696140bbd7435872720814ad9 Mon Sep 17 00:00:00 2001 From: Carlo Fisicaro <66260709+CFisicaro@users.noreply.github.com> Date: Mon, 22 Nov 2021 16:14:06 +0100 Subject: [PATCH 06/16] docs(readme): grammar --- README.md | 2 +- 1 file changed, 1 
insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 0ff6305..858c00b 100644 --- a/README.md +++ b/README.md @@ -133,7 +133,7 @@ python inference.py -s ``` Where: -* `-s` defines the **training strategies** defined belowe +* `-s` defines the **training strategies** defined below * `-f` defines the FASTA file containing the proteins for which you want to compute the intrinsic disorder * `-m` defines the residue level representation of the pre-trained models we want to use. We suggest you use the `esm-1b` model. * `-r` defines the path where you've already saved the residue level representations From f17dc812f47e0a765016b190a06f733597ab50e6 Mon Sep 17 00:00:00 2001 From: Carlo Fisicaro <66260709+CFisicaro@users.noreply.github.com> Date: Mon, 22 Nov 2021 16:27:09 +0100 Subject: [PATCH 07/16] docs(readme): better description --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 858c00b..ad4706f 100644 --- a/README.md +++ b/README.md @@ -134,8 +134,8 @@ python inference.py -s Where: * `-s` defines the **training strategies** defined below +* `-m` defines the pre-trained model we want to use. We suggest you use the `esm-1b` model. * `-f` defines the FASTA file containing the proteins for which you want to compute the intrinsic disorder -* `-m` defines the residue level representation of the pre-trained models we want to use. We suggest you use the `esm-1b` model. 
* `-r` defines the path where you've already saved the residue level representations * `-p` defines the path where you want the Z scores to be saved From b9b63f7412d8aee89a3949e96b1dc6ed413d6b70 Mon Sep 17 00:00:00 2001 From: Carlo Fisicaro Date: Mon, 22 Nov 2021 17:13:07 +0100 Subject: [PATCH 08/16] refactor: add argparse routines --- adopt/embedding.py | 52 +++++++++---------- adopt/inference.py | 126 ++++++++++++++++++++++----------------------- adopt/training.py | 108 +++++++++++++++++++------------------- 3 files changed, 136 insertions(+), 150 deletions(-) diff --git a/adopt/embedding.py b/adopt/embedding.py index 04d20a4..051df0c 100644 --- a/adopt/embedding.py +++ b/adopt/embedding.py @@ -3,14 +3,32 @@ # This source code is licensed under the MIT license found in the # LICENSE file in the root directory of this source tree. -import getopt +import argparse import subprocess -import sys from pathlib import Path from adopt import constants +def create_parser(): + parser = argparse.ArgumentParser(description='Extract residue level representations') + + parser.add_argument('-f', + '--fasta_path', + type=str, metavar='', + required=True, + help='FASTA file containing the proteins for which you want to compute the intrinsic disorder') + + parser.add_argument('-r', + '--repr_dir', + type=str, + metavar='', + required=True, + help='Residue level representation directory') + + return parser + + # extract residue level representations of each protein sequence in the fasta file def get_representations(fasta_file, repr_dir): for esm_model in constants.esm_models: @@ -40,31 +58,7 @@ def get_representations(fasta_file, repr_dir): output, error = process.communicate() -def main(argv): - try: - opts, args = getopt.getopt(argv, "hf:r:", ["fasta_file=", "repr_dir="]) - except getopt.GetoptError: - print( - "usage: embedding.py" - "-f " - "-r " - ) - sys.exit(2) - for opt, arg in opts: - if opt == "-h": - print( - "usage: embedding.py" - "-f " - "-r " - ) - sys.exit() - elif 
opt in ("-f", "--fasta_dir"): - fasta_dir = arg - elif opt in ("-r", "--repr_dir"): - repr_dir = arg - - get_representations(fasta_dir, repr_dir) - - if __name__ == "__main__": - main(sys.argv[1:]) + parser = create_parser() + args = parser.parse_args() + get_representations(args) diff --git a/adopt/inference.py b/adopt/inference.py index 9caf5f0..43465fd 100644 --- a/adopt/inference.py +++ b/adopt/inference.py @@ -3,7 +3,7 @@ # This source code is licensed under the MIT license found in the # LICENSE file in the root directory of this source tree. -import getopt +import argparse import os import sys @@ -14,6 +14,43 @@ from adopt import constants, utils +def create_parser(): + parser = argparse.ArgumentParser(description='Predict the intrinsic disorder (Z score)') + + parser.add_argument('-s', + '--train_strategy', + type=str, + metavar='', + required=True, + help='Training strategies') + + parser.add_argument('-m', + '--model_type', + type=str, + metavar='', + required=True, + help='pre-trained model we want to use') + + parser.add_argument('-f', + '--fasta_path', + type=str, metavar='', + required=True, + help='FASTA file containing the proteins for which you want to compute the intrinsic disorder') + + parser.add_argument('-r', + '--repr_dir', + type=str, metavar='', + required=True, + help='Residue level representation directory') + + parser.add_argument('-p', + '--pred_z_scores_path', + type=str, metavar='', + required=True, + help='Path where you want the Z scores to be saved') + + return parser + class ZScorePred: def __init__(self, strategy, model_type): self.strategy = strategy @@ -93,73 +130,32 @@ def get_z_score_from_fasta( df_results.to_json(predicted_z_scores_path, orient="records") -def main(argv): - try: - opts, args = getopt.getopt( - argv, - "hs:m:f:r:p:", - [ - "train_strategy=", - "model_type=", - "infer_fasta_file=", - "infer_repr_dir=", - "pred_z_scores_file", - ], - ) - except getopt.GetoptError: +def main(args): + if args.train_strategy not 
in constants.train_strategies: + print("The training strategies are:") + print(*constants.train_strategies, sep="\n") + sys.exit(2) + + if (args.model_type not in constants.model_types) and (args.model_type != "combined"): + print("The pre-trained models are:") + print(*constants.model_types, sep="\n") + print("combined") + sys.exit(2) + + if (args.train_strategy != "train_on_cleared_1325_test_on_117_residue_split") and (args.model_type == "combined"): print( - "usage: inference.py" - "-s " - "-m " - "-f " - "-r " - "-p " + "Only the train_on_cleared_1325_test_on_117_residue_split strategy" + "is allowed with the model" ) sys.exit(2) - for opt, arg in opts: - if opt == "-h": - print( - "usage: inference.py" - "-s " - "-m " - "-f " - "-r " - "-p " - ) - sys.exit() - elif opt in ("-s", "--train_strategy"): - train_strategy = arg - if train_strategy not in constants.train_strategies: - print("The training strategies are:") - print(*constants.train_strategies, sep="\n") - sys.exit(2) - elif opt in ("-m", "--model_type"): - model_type = arg - if (model_type not in constants.model_types) and (model_type != "combined"): - print("The pre-trained models are:") - print(*constants.model_types, sep="\n") - print("combined") - sys.exit(2) - if ( - train_strategy != "train_on_cleared_1325_test_on_117_residue_split" - ) and (model_type == "combined"): - print( - "Only the train_on_cleared_1325_test_on_117_residue_split strategy" - "is allowed with the model" - ) - sys.exit() - elif opt in ("-f", "--infer_fasta_file"): - infer_fasta_file = arg - elif opt in ("-r", "--infer_repr_dir"): - infer_repr_dir = arg - elif opt in ("-p", "--pred_z_scores_file"): - pred_z_scores_file = arg - - z_score_pred = ZScorePred(train_strategy, model_type) - z_score_pred.get_z_score_from_fasta( - infer_fasta_file, infer_repr_dir, pred_z_scores_file - ) if __name__ == "__main__": - main(sys.argv[1:]) + parser = create_parser() + args = parser.parse_args() + main(args) + z_score_pred = 
ZScorePred(args.train_strategy, args.model_type) + z_score_pred.get_z_score_from_fasta( + args.fasta_file, args.repr_dir, args.pred_z_scores_file + ) + diff --git a/adopt/training.py b/adopt/training.py index fb90d76..cdb9ee1 100644 --- a/adopt/training.py +++ b/adopt/training.py @@ -3,7 +3,7 @@ # This source code is licensed under the MIT license found in the # LICENSE file in the root directory of this source tree. -import getopt +import argparse import sys import numpy as np @@ -13,8 +13,44 @@ from adopt import CheZod, constants, utils -# disorder predictor training +# disorder predictor training +def create_parser(): + parser = argparse.ArgumentParser(description='Train ADOPT') + + parser.add_argument('-s', + '--train_strategy', + type=str, + metavar='', + required=True, + help='Training strategies') + + parser.add_argument('-t', + '--train_json_file', + type=str, + metavar='', + required=True, + help='JSON file containing the proteins we want to use as training set') + + parser.add_argument('-e', + '--test_json_file', + type=str, metavar='', + required=True, + help='JSON file containing the proteins we want to use as test set') + + parser.add_argument('-r', + '--train_repr_dir', + type=str, metavar='', + required=True, + help='Training set residue level representation directory') + + parser.add_argument('-p', + '--test_repr_dir', + type=str, metavar='', + required=True, + help='Test set residue level representation directory') + + return parser class DisorderPred: def __init__( @@ -331,68 +367,28 @@ def cleared_sequence_cv(self): ) -def main(argv): - try: - opts, args = getopt.getopt( - argv, - "hs:t:e:r:p:", - [ - "train_strategy=", - "train_json_file=", - "test_json_file=", - "train_repr_dir=", - "test_repr_dir=", - ], - ) - except getopt.GetoptError: - print( - "usage: training.py" - "-s " - "-t " - "-e " - "-r " - "-p " - ) +def main(args): + if args.train_strategy not in constants.train_strategies: + print("The training strategies are:") + 
print(*constants.train_strategies, sep="\n") sys.exit(2) - for opt, arg in opts: - if opt == "-h": - print( - "usage: training.py" - "-s " - "-t " - "-e " - "-r " - "-p " - ) - sys.exit() - elif opt in ("-s", "--train_strategy"): - train_strategy = arg - if train_strategy not in constants.train_strategies: - print("The training strategies are:") - print(*constants.train_strategies, sep="\n") - sys.exit(2) - elif opt in ("-t", "--train_json_file"): - train_sequences = arg - elif opt in ("-e", "--test_json_file"): - test_sequences = arg - elif opt in ("-r", "--train_repr_dir"): - train_repr_dir = arg - elif opt in ("-p", "--test_repr_dir"): - test_repr_dir = arg + +if __name__ == "__main__": + parser = create_parser() + args = parser.parse_args() + main(args) disorder_pred = DisorderPred( - train_sequences, test_sequences, train_repr_dir, test_repr_dir + args.train_sequences, args.test_sequences, args.train_repr_dir, args.test_repr_dir ) - - if train_strategy == "train_on_cleared_1325_test_on_117_residue_split": + if args.train_strategy == "train_on_cleared_1325_test_on_117_residue_split": disorder_pred.cleared_residue() - elif train_strategy == "train_on_1325_cv_residue_split": + elif args.train_strategy == "train_on_1325_cv_residue_split": disorder_pred.residue_cv() - elif train_strategy == "train_on_cleared_1325_cv_residue_split": + elif args.train_strategy == "train_on_cleared_1325_cv_residue_split": disorder_pred.cleared_residue_cv() else: disorder_pred.cleared_sequence_cv() -if __name__ == "__main__": - main(sys.argv[1:]) + From e48b3ae94bfc1b7de42af9e076d0179736f18787 Mon Sep 17 00:00:00 2001 From: Carlo Fisicaro Date: Mon, 22 Nov 2021 17:18:44 +0100 Subject: [PATCH 09/16] style: indentation --- README.md | 14 +++---- adopt/embedding.py | 33 ++++++++++------- adopt/inference.py | 91 ++++++++++++++++++++++++++++------------------ adopt/training.py | 88 +++++++++++++++++++++++++------------------- 4 files changed, 131 insertions(+), 95 deletions(-) diff --git 
a/README.md b/README.md index ad4706f..3926b57 100644 --- a/README.md +++ b/README.md @@ -14,7 +14,6 @@ The ESM library exploits a set of deep Transformer encoder models, which process ADOPT makes use of two datasets: the [CheZoD “1325” and the CheZoD “117”](https://github.com/protein-nmr/CheZOD) databases containing 1325 and 117 sequences, respectively, together with their residue level **Z-scores**. - ## Table of Contents - [Attention based DisOrder PredicTor](#attention-based-disorder-predictor) @@ -30,7 +29,6 @@ ADOPT makes use of two datasets: the [CheZoD “1325” and the CheZoD “117 - [Citations](#citations) - [Licence](#licence) - ## Intrinsic disorder trained models | Model | Pre-trained model | Datasets | Split level | CV | @@ -45,7 +43,6 @@ ADOPT makes use of two datasets: the [CheZoD “1325” and the CheZoD “117 | `lasso_esm-1b_cleared_sequence_cv` | ESM-1b | **Chezod 1325 cleared** | residue | :white_check_mark: | | `lasso_esm-1v_cleared_sequence_cv` | ESM-1v | **Chezod 1325 cleared** | sequence | :white_check_mark: | - ## Usage ### Quick start @@ -82,10 +79,10 @@ z_score_pred = ZScorePred(STRATEGY, MODEL_TYPE) predicted_z_scores = z_score_pred.get_z_score(representation) ```` - ### Scripts The [scripts](scripts) directory contains: + * [inference](scripts/adopt_inference.sh) script to predict, in bulk, the disorder of each residue in each protein sequence reported in a FASTA file, with ADOPT where you need to specify: - `NEW_PROT_FASTA_FILE_PATH` defining your FASTA file path - `NEW_PROT_RES_REPR_DIR_PATH` defining where the residue level representations will be extracted @@ -96,10 +93,10 @@ The [scripts](scripts) directory contains: ### Notebooks The [notebooks](notebooks) directory contains: + * [disorder prediction](notebooks/adopt_disorder_prediction.ipynb) notebook * [multi-head attention weights visualisation](notebooks/adopt_attention_viz.ipynb) notebook - ### Compute residue level representations In order to predict the **Z score** related to each 
residue in a protein sequence, we have to compute the residue level representations, extracted from the pretrained model. @@ -112,12 +109,12 @@ python embedding.py -f ``` Where: + * `-f` defines the FASTA file containing the proteins for which you want to compute the intrinsic disorder * `-r` defines the path where you want to save the residue level representations A subdirectory containing the residue level representation extracted from each pre-trained model available will be created under both the `residue_level_representation_dir`. - ### Predict intrinsic disorder with ADOPT Once we have extracted the residue level representations we can predict the intrinsic disorder (Z score). @@ -133,6 +130,7 @@ python inference.py -s ``` Where: + * `-s` defines the **training strategies** defined below * `-m` defines the pre-trained model we want to use. We suggest you use the `esm-1b` model. * `-f` defines the FASTA file containing the proteins for which you want to compute the intrinsic disorder @@ -148,7 +146,6 @@ The output is a `.json` file contains the Z scores related to each residue of ea | `train_on_cleared_1325_cv_residue_split`| `esm-1b` and `esm-1v` | | `train_on_cleared_1325_cv_sequence_split`| `esm-1b` and `esm-1v` | - ### Train ADOPT disorder predictor Once we have extracted the residue level representations of the protein for which we want to predict the intrinsic disorder (Z score), we can train the predictor. 
@@ -166,13 +163,13 @@ python training.py -s ``` Where: + * `-s` defines the **training strategies** defined above * `-t` defines the JSON containing the proteins we want to use as *training set* * `-e` defines the JSON containing the proteins we want to use as *test set* * `-r` defines the path where we saved the residue level representations of the proteins in the *training set* * `-p` defines the path where we saved the residue level representations of the proteins in the *test set* - ## Citations If you use this work in your research, please cite the the relevant paper: @@ -181,7 +178,6 @@ If you use this work in your research, please cite the the relevant paper: @article{redl2021adopt} ``` - ## Licence This source code is licensed under the MIT license found in the `LICENSE` file in the root directory of this source tree. diff --git a/adopt/embedding.py b/adopt/embedding.py index 051df0c..12ecc87 100644 --- a/adopt/embedding.py +++ b/adopt/embedding.py @@ -11,21 +11,28 @@ def create_parser(): - parser = argparse.ArgumentParser(description='Extract residue level representations') + parser = argparse.ArgumentParser( + description="Extract residue level representations" + ) - parser.add_argument('-f', - '--fasta_path', - type=str, metavar='', - required=True, - help='FASTA file containing the proteins for which you want to compute the intrinsic disorder') + parser.add_argument( + "-f", + "--fasta_path", + type=str, + metavar="", + required=True, + help="FASTA file containing the proteins for which you want to compute the intrinsic disorder", + ) + + parser.add_argument( + "-r", + "--repr_dir", + type=str, + metavar="", + required=True, + help="Residue level representation directory", + ) - parser.add_argument('-r', - '--repr_dir', - type=str, - metavar='', - required=True, - help='Residue level representation directory') - return parser diff --git a/adopt/inference.py b/adopt/inference.py index 43465fd..3e8dea6 100644 --- a/adopt/inference.py +++ 
b/adopt/inference.py @@ -15,42 +15,58 @@ def create_parser(): - parser = argparse.ArgumentParser(description='Predict the intrinsic disorder (Z score)') - - parser.add_argument('-s', - '--train_strategy', - type=str, - metavar='', - required=True, - help='Training strategies') - - parser.add_argument('-m', - '--model_type', - type=str, - metavar='', - required=True, - help='pre-trained model we want to use') - - parser.add_argument('-f', - '--fasta_path', - type=str, metavar='', - required=True, - help='FASTA file containing the proteins for which you want to compute the intrinsic disorder') - - parser.add_argument('-r', - '--repr_dir', - type=str, metavar='', - required=True, - help='Residue level representation directory') - - parser.add_argument('-p', - '--pred_z_scores_path', - type=str, metavar='', - required=True, - help='Path where you want the Z scores to be saved') + parser = argparse.ArgumentParser( + description="Predict the intrinsic disorder (Z score)" + ) + + parser.add_argument( + "-s", + "--train_strategy", + type=str, + metavar="", + required=True, + help="Training strategies", + ) + + parser.add_argument( + "-m", + "--model_type", + type=str, + metavar="", + required=True, + help="pre-trained model we want to use", + ) + + parser.add_argument( + "-f", + "--fasta_path", + type=str, + metavar="", + required=True, + help="FASTA file containing the proteins for which you want to compute the intrinsic disorder", + ) + + parser.add_argument( + "-r", + "--repr_dir", + type=str, + metavar="", + required=True, + help="Residue level representation directory", + ) + + parser.add_argument( + "-p", + "--pred_z_scores_path", + type=str, + metavar="", + required=True, + help="Path where you want the Z scores to be saved", + ) return parser + class ZScorePred: def __init__(self, strategy, model_type): self.strategy = strategy @@ -136,13 +152,17 @@ def main(args): print(*constants.train_strategies, sep="\n") sys.exit(2) - if (args.model_type not in 
constants.model_types) and (args.model_type != "combined"): + if (args.model_type not in constants.model_types) and ( + args.model_type != "combined" + ): print("The pre-trained models are:") print(*constants.model_types, sep="\n") print("combined") sys.exit(2) - if (args.train_strategy != "train_on_cleared_1325_test_on_117_residue_split") and (args.model_type == "combined"): + if (args.train_strategy != "train_on_cleared_1325_test_on_117_residue_split") and ( + args.model_type == "combined" + ): print( "Only the train_on_cleared_1325_test_on_117_residue_split strategy" "is allowed with the model" @@ -158,4 +178,3 @@ def main(args): z_score_pred.get_z_score_from_fasta( args.fasta_file, args.repr_dir, args.pred_z_scores_file ) - diff --git a/adopt/training.py b/adopt/training.py index cdb9ee1..fe85322 100644 --- a/adopt/training.py +++ b/adopt/training.py @@ -16,42 +16,56 @@ # disorder predictor training def create_parser(): - parser = argparse.ArgumentParser(description='Train ADOPT') - - parser.add_argument('-s', - '--train_strategy', - type=str, - metavar='', - required=True, - help='Training strategies') - - parser.add_argument('-t', - '--train_json_file', - type=str, - metavar='', - required=True, - help='JSON file containing the proteins we want to use as training set') - - parser.add_argument('-e', - '--test_json_file', - type=str, metavar='', - required=True, - help='JSON file containing the proteins we want to use as test set') - - parser.add_argument('-r', - '--train_repr_dir', - type=str, metavar='', - required=True, - help='Training set residue level representation directory') - - parser.add_argument('-p', - '--test_repr_dir', - type=str, metavar='', - required=True, - help='Test set residue level representation directory') + parser = argparse.ArgumentParser(description="Train ADOPT") + + parser.add_argument( + "-s", + "--train_strategy", + type=str, + metavar="", + required=True, + help="Training strategies", + ) + + parser.add_argument( + "-t", + 
"--train_json_file", + type=str, + metavar="", + required=True, + help="JSON file containing the proteins we want to use as training set", + ) + + parser.add_argument( + "-e", + "--test_json_file", + type=str, + metavar="", + required=True, + help="JSON file containing the proteins we want to use as test set", + ) + + parser.add_argument( + "-r", + "--train_repr_dir", + type=str, + metavar="", + required=True, + help="Training set residue level representation directory", + ) + + parser.add_argument( + "-p", + "--test_repr_dir", + type=str, + metavar="", + required=True, + help="Test set residue level representation directory", + ) return parser + class DisorderPred: def __init__( self, @@ -379,7 +393,10 @@ def main(args): args = parser.parse_args() main(args) disorder_pred = DisorderPred( - args.train_sequences, args.test_sequences, args.train_repr_dir, args.test_repr_dir + args.train_sequences, + args.test_sequences, + args.train_repr_dir, + args.test_repr_dir, ) if args.train_strategy == "train_on_cleared_1325_test_on_117_residue_split": disorder_pred.cleared_residue() @@ -389,6 +406,3 @@ def main(args): disorder_pred.cleared_residue_cv() else: disorder_pred.cleared_sequence_cv() - - - From e7aaadf121d0537ac3ed6578b5e6a09c85b6493a Mon Sep 17 00:00:00 2001 From: Carlo Fisicaro Date: Mon, 22 Nov 2021 19:29:20 +0100 Subject: [PATCH 10/16] fix(parsing): arguments --- .github/workflows/linter.yml | 2 +- adopt/embedding.py | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/.github/workflows/linter.yml b/.github/workflows/linter.yml index 9c8086d..cead368 100644 --- a/.github/workflows/linter.yml +++ b/.github/workflows/linter.yml @@ -56,5 +56,5 @@ jobs: IGNORE_GENERATED_FILES: true VALIDATE_PYTHON_BLACK: false VALIDATE_PYTHON_ISORT: false - FILTER_REGEX_EXCLUDE: /esm, *.md + FILTER_REGEX_EXCLUDE: /esm diff --git a/adopt/embedding.py b/adopt/embedding.py index 12ecc87..1ca6edb 100644 --- a/adopt/embedding.py +++ b/adopt/embedding.py @@ -23,7 +23,6 @@ 
def create_parser(): required=True, help="FASTA file containing the proteins for which you want to compute the intrinsic disorder", ) - parser.add_argument( "-r", "--repr_dir", @@ -68,4 +67,4 @@ def get_representations(fasta_file, repr_dir): if __name__ == "__main__": parser = create_parser() args = parser.parse_args() - get_representations(args) + get_representations(args.fasta_path, args.repr_dir) From c0c2de98a49edf614f0112a74d4a95a1b9b768db Mon Sep 17 00:00:00 2001 From: Carlo Fisicaro Date: Mon, 22 Nov 2021 19:45:14 +0100 Subject: [PATCH 11/16] fix: arguments --- README.md | 20 ++++++++++---------- adopt/embedding.py | 1 - adopt/inference.py | 5 ----- adopt/training.py | 9 ++------- 4 files changed, 12 insertions(+), 23 deletions(-) diff --git a/README.md b/README.md index 3926b57..ea00173 100644 --- a/README.md +++ b/README.md @@ -104,8 +104,8 @@ In order to predict the **Z score** related to each residue in a protein sequenc In the ADOPT directory run: ```bash -python embedding.py -f - -r +python embedding.py -f \ + -r ``` Where: @@ -122,10 +122,10 @@ Once we have extracted the residue level representations we can predict the intr In the ADOPT directory run: ```bash -python inference.py -s - -m - -f - -r +python inference.py -s \ + -m \ + -f \ + -r \ -p ``` @@ -155,10 +155,10 @@ Once we have extracted the residue level representations of the protein for whic In the ADOPT directory run: ```bash -python training.py -s - -t - -e - -r +python training.py -s \ + -t \ + -e \ + -r \ -p ``` diff --git a/adopt/embedding.py b/adopt/embedding.py index 1ca6edb..0cebf28 100644 --- a/adopt/embedding.py +++ b/adopt/embedding.py @@ -31,7 +31,6 @@ def create_parser(): required=True, help="Residue level representation directory", ) - return parser diff --git a/adopt/inference.py b/adopt/inference.py index 3e8dea6..df931f8 100644 --- a/adopt/inference.py +++ b/adopt/inference.py @@ -27,7 +27,6 @@ def create_parser(): required=True, help="Training strategies", ) - 
     parser.add_argument(
         "-m",
         "--model_type",
@@ -36,7 +35,6 @@ def create_parser():
         required=True,
         help="pre-trained model we want to use",
     )
-
     parser.add_argument(
         "-f",
         "--fasta_path",
@@ -45,7 +43,6 @@ def create_parser():
         required=True,
         help="FASTA file containing the proteins for which you want to compute the intrinsic disorder",
     )
-
     parser.add_argument(
         "-r",
         "--repr_dir",
@@ -54,7 +51,6 @@ def create_parser():
         required=True,
         help="Residue level representation directory",
     )
-
     parser.add_argument(
         "-p",
         "--pred_z_scores_path",
@@ -63,7 +59,6 @@ def create_parser():
         required=True,
         help="Path where you want the Z scores to be saved",
     )
-
     return parser
 
 
diff --git a/adopt/training.py b/adopt/training.py
index fe85322..53b99f9 100644
--- a/adopt/training.py
+++ b/adopt/training.py
@@ -26,7 +26,6 @@ def create_parser():
         required=True,
         help="Training strategies",
     )
-
     parser.add_argument(
         "-t",
         "--train_json_file",
@@ -35,7 +34,6 @@ def create_parser():
         required=True,
         help="JSON file containing the proteins we want to use as training set",
     )
-
     parser.add_argument(
         "-e",
         "--test_json_file",
@@ -44,7 +42,6 @@ def create_parser():
         required=True,
         help="JSON file containing the proteins we want to use as test set",
     )
-
     parser.add_argument(
         "-r",
         "--train_repr_dir",
@@ -53,7 +50,6 @@ def create_parser():
         required=True,
         help="Training set residue level representation directory",
     )
-
     parser.add_argument(
         "-p",
         "--test_repr_dir",
@@ -62,7 +58,6 @@ def create_parser():
         required=True,
         help="Test set residue level representation directory",
     )
-
     return parser
 
 
@@ -393,8 +388,8 @@ def main(args):
     args = parser.parse_args()
     main(args)
     disorder_pred = DisorderPred(
-        args.train_sequences,
-        args.test_sequences,
+        args.train_json_file,
+        args.test_json_file,
         args.train_repr_dir,
         args.test_repr_dir,
     )

From f7cd4bbe14b8c4919cc68c1dcc7b7f3c8926f4fb Mon Sep 17 00:00:00 2001
From: Carlo Fisicaro
Date: Mon, 22 Nov 2021 19:49:32 +0100
Subject: [PATCH 12/16] docs(readme): spaces

---
 README.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/README.md b/README.md
index ea00173..404354f 100644
--- a/README.md
+++ b/README.md
@@ -89,7 +89,6 @@ The [scripts](scripts) directory contains:
 * [training](scripts/adopt_chezod_training.sh) script to train the ADOPT where you need to specify:
     - `TRAIN_STRATEGY` defining the training strategy you want to use
 
-
 ### Notebooks
 
 The [notebooks](notebooks) directory contains:

From 454408b947518cdfb034abc0135952f5df96e969 Mon Sep 17 00:00:00 2001
From: Carlo Fisicaro
Date: Mon, 22 Nov 2021 20:27:10 +0100
Subject: [PATCH 13/16] fix(readme): spaces

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 404354f..15a7e7e 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@ ADOPT has been introduced in our paper [ADOPT: intrinsic protein disorder predic
-Our disorder predictor is made up of two main blocks, namely: a **self-supervised encoder** and a **supervised disorder predictor**. We use [Facebook’s Evolutionary Scale Modeling (ESM)](https://github.com/facebookresearch/esm) library to extract dense residue evel representations, which feed the supervised machine learning based predictor.
+Our disorder predictor is made up of two main blocks, namely: a **self-supervised encoder** and a **supervised disorder predictor**. We use [Facebook’s Evolutionary Scale Modeling (ESM)](https://github.com/facebookresearch/esm) library to extract dense residue evel representations, which feed the supervised machine learning based predictor.
 The ESM library exploits a set of deep Transformer encoder models, which processes character sequences of amino acids as inputs.
From e40a5911f0e14afae8e0d900e440c6a703439239 Mon Sep 17 00:00:00 2001
From: Carlo Fisicaro
Date: Mon, 22 Nov 2021 20:32:16 +0100
Subject: [PATCH 14/16] fix(parser): argument

---
 README.md          | 4 ++--
 adopt/inference.py | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 15a7e7e..f4c7d81 100644
--- a/README.md
+++ b/README.md
@@ -93,12 +93,12 @@ The [scripts](scripts) directory contains:
 
 The [notebooks](notebooks) directory contains:
 
-* [disorder prediction](notebooks/adopt_disorder_prediction.ipynb) notebook
+* [disorder prediction](notebooks/adopt_disorder_prediction.ipynb) notebook
 * [multi-head attention weights visualisation](notebooks/adopt_attention_viz.ipynb) notebook
 
 ### Compute residue level representations
 
-In order to predict the **Z score** related to each residue in a protein sequence, we have to compute the residue level representations, extracted from the pretrained model.
+In order to predict the **Z score** related to each residue in a protein sequence, we have to compute the residue level representations, extracted from the pretrained model.
 In the ADOPT directory run:

diff --git a/adopt/inference.py b/adopt/inference.py
index df931f8..53264bf 100644
--- a/adopt/inference.py
+++ b/adopt/inference.py
@@ -171,5 +171,5 @@ def main(args):
     main(args)
     z_score_pred = ZScorePred(args.train_strategy, args.model_type)
     z_score_pred.get_z_score_from_fasta(
-        args.fasta_file, args.repr_dir, args.pred_z_scores_file
+        args.fasta_path, args.repr_dir, args.pred_z_scores_file
     )

From 47ee472acd5b56c8ba9aeb64733103af62ff9762 Mon Sep 17 00:00:00 2001
From: Carlo Fisicaro
Date: Mon, 22 Nov 2021 20:35:02 +0100
Subject: [PATCH 15/16] fix(parser): argument

---
 adopt/inference.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/adopt/inference.py b/adopt/inference.py
index 53264bf..e67f94a 100644
--- a/adopt/inference.py
+++ b/adopt/inference.py
@@ -171,5 +171,5 @@ def main(args):
     main(args)
     z_score_pred = ZScorePred(args.train_strategy, args.model_type)
     z_score_pred.get_z_score_from_fasta(
-        args.fasta_path, args.repr_dir, args.pred_z_scores_file
+        args.fasta_path, args.repr_dir, args.pred_z_scores_path
     )

From 23dec763035017647b7381a276d1c2a24b2c5619 Mon Sep 17 00:00:00 2001
From: Carlo Fisicaro
Date: Mon, 22 Nov 2021 20:40:29 +0100
Subject: [PATCH 16/16] build: package version

---
 CITATION.cff     | 4 ++--
 adopt/version.py | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/CITATION.cff b/CITATION.cff
index aa94975..e4678ab 100644
--- a/CITATION.cff
+++ b/CITATION.cff
@@ -1,4 +1,4 @@
-cff-version: 0.2.1
+cff-version: 0.3.0
 message: "If you use this software, please cite it as below."
 authors:
   - given-names: "Kamil Tamiola"
@@ -7,7 +7,7 @@ authors:
     affiliation: "Peptone Ltd."
     orcid: ""
 title: "Attention based DisOrder PredicTor"
-version: 0.2.1
+version: 0.3.0
 doi:
 date-released:
 url: "https://github.com/PeptoneInc/ADOPT"
diff --git a/adopt/version.py b/adopt/version.py
index 7a1faf3..ef9fb5e 100644
--- a/adopt/version.py
+++ b/adopt/version.py
@@ -3,4 +3,4 @@
 # This source code is licensed under the MIT license found in the
 # LICENSE file in the root directory of this source tree.
 
-version = "0.2.1"
+version = "0.3.0"
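
Note on the `fix(parser): argument` patches (14/16 and 15/16): unless `dest=` is given, `argparse` names each attribute after the long option, so `--fasta_path` is read back as `args.fasta_path` and `--pred_z_scores_path` as `args.pred_z_scores_path`. The sketch below is a minimal illustration of that behaviour, not the ADOPT parser itself: it assumes the two options are declared without an explicit `dest=` (the diff does not show those lines), and the file names are hypothetical.

```python
import argparse


def create_parser():
    # Reduced stand-in for the parser in adopt/inference.py: only the two
    # options touched by the fix, assumed to have no explicit dest=.
    parser = argparse.ArgumentParser(description="argparse attribute-name sketch")
    parser.add_argument("-f", "--fasta_path", type=str, required=True)
    parser.add_argument("-p", "--pred_z_scores_path", type=str, required=True)
    return parser


# Hypothetical file names, used only for illustration.
args = create_parser().parse_args(["-f", "proteins.fasta", "-p", "z_scores.json"])

# The attribute name is the long option without its leading dashes.
fasta_path = args.fasta_path
pred_path = args.pred_z_scores_path

# The names used before the fix are never defined on the namespace,
# so reading them would raise AttributeError.
has_old_fasta_name = hasattr(args, "fasta_file")
has_old_pred_name = hasattr(args, "pred_z_scores_file")

print(fasta_path, pred_path, has_old_fasta_name, has_old_pred_name)
```

The same reasoning would apply to the `args.train_json_file` and `args.test_json_file` rename in `adopt/training.py`, where the attributes follow `--train_json_file` and `--test_json_file`.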