UDSMProt is an algorithm for classifying proteins based on their amino acid sequence alone. Its key component is a self-supervised pretraining step based on a language modeling task. The pretrained model is then finetuned on specific classification tasks. In our paper we considered enzyme class classification, gene ontology prediction and remote homology detection, showcasing the excellent performance of UDSMProt.
For technical details and experimental results, please refer to our paper:
Universal Deep Sequence Models for Protein Classification
Nils Strodthoff, Patrick Wagner, Markus Wenzel, and Wojciech Samek
bioRxiv preprint 2019
This is the accompanying code repository, where we also provide links to pretrained language models.
for training/evaluation: pytorch, fastai, fire
for dataset creation: numpy, pandas, scikit-learn, biopython, sentencepiece, lxml
We recommend using conda as Python package and environment manager.
Either install the environment using the provided proteomics.yml by running
conda env create -f proteomics.yml
or follow the steps below:
- Create and activate the conda environment:
conda create -n proteomics
conda activate proteomics
- Install pytorch:
conda install pytorch -c pytorch
- Install fastai:
conda install -c fastai fastai=1.0.52
- Install fire:
conda install fire -c conda-forge
- Install scikit-learn:
conda install scikit-learn
- Install Biopython:
conda install biopython -c conda-forge
- Install sentencepiece:
pip install sentencepiece
- Install lxml:
conda install lxml
Optionally (for support of threshold 0.4 clusters), install cd-hit and add it to the default search path.
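A quick way to confirm that the environment is complete is to check that each dependency can be imported. The sketch below is not part of the repository; the helper names `missing_modules` and `cdhit_available` are made up for illustration, and it assumes the usual import names of the packages listed above (for example `torch` for pytorch, `sklearn` for scikit-learn and `Bio` for Biopython) and that the optional cd-hit binary is called `cd-hit`.

```python
import importlib.util
import shutil

# Import names of the packages listed above (assumed: the usual top-level names).
REQUIRED_MODULES = [
    "torch", "fastai", "fire",                       # training/evaluation
    "numpy", "pandas", "sklearn", "Bio",             # dataset creation
    "sentencepiece", "lxml",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def cdhit_available():
    """Return True if a `cd-hit` executable is on the search path (optional)."""
    return shutil.which("cd-hit") is not None


if __name__ == "__main__":
    print("missing modules:", missing_modules() or "none")
    print("cd-hit on search path:", cdhit_available())
```

Run it inside the activated `proteomics` environment; an empty list of missing modules suggests that the installation steps succeeded.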
- Download and extract the desired Swiss-Prot release (by default we use 2017_03) from the UniProt ftp server. Save the contained uniprot_sprot.xml as uniprot_sprot_YEAR_MONTH.xml in the ./data directory.
- Download and extract the desired UniRef release (by default we use 2017_03) from the UniProt ftp server. Save the contained uniref50.xml as uniref50_YEAR_MONTH.xml in the ./data directory. As an alternative and for full reproducibility, we also provide pickled cluster files cdhit04_uniprot_sprot_2016_07.pkl and uniref50_2017_03_uniprot_sprot_2017_03.pkl, to be placed under ./tmp_data, that avoid downloading the full UniRef file or running cd-hit.
- Or just call our provided script
./download_swissprot_uniref.sh 2017 03
which manages everything for you.
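The naming convention described above can be checked before starting the data preparation. The following sketch is an illustration rather than repository code: the function names are hypothetical, and it only encodes the file names stated above (uniprot_sprot_YEAR_MONTH.xml and uniref50_YEAR_MONTH.xml in ./data).

```python
from pathlib import Path


def expected_release_files(year, month, data_dir="./data"):
    """Build the paths the renamed Swiss-Prot and UniRef50 files should have."""
    release = f"{int(year):04d}_{int(month):02d}"  # e.g. 2017_03
    data_dir = Path(data_dir)
    return [
        data_dir / f"uniprot_sprot_{release}.xml",
        data_dir / f"uniref50_{release}.xml",
    ]


def missing_release_files(year, month, data_dir="./data"):
    """Return the expected files that are not present on disk."""
    return [p for p in expected_release_files(year, month, data_dir) if not p.exists()]


if __name__ == "__main__":
    for path in missing_release_files(2017, 3):
        print("missing:", path)
```

If the pickled cluster files are used instead of the full UniRef download, the UniRef entry is expected to show up as missing.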
- Preprocessed versions of the DEEPre and ECPred datasets are already contained in the ./git_data folder of the repository.
- The custom EC40 and EC50 datasets will be created from Swiss-Prot data directly.
- Follow the instructions and use the preprocessing scripts provided by the DeeProtein repository (download.sh, datasets.sh, and datasets_up.sh) to create the files train_cafa3_original.shuffled.csv, filtered_sp_cdhitted_05.csv.shuffled, test_cafa3.shuffled.csv and test_cafa3_deepgo_comparison.shuffled.csv, and place them in the ./data/ folder.
- Download the superfamily and fold datasets and extract them into the ./data folder.
- Run the data preparation script
cd code
./create_datasets.sh
- The output is structured as follows:
  - tok.npy: sequences as list of numerical indices (mapping is provided by tok_itos.npy)
  - label.npy: (if applicable) label as list of numerical indices (mapping is provided by label_itos.npy)
  - train_IDs.npy/val_IDs.npy/test_IDs.npy: numerical indices identifying training/validation/test set by specifying rows in tok.npy
  - train_IDs_prev.npy/val_IDs_prev.npy/test_IDs_prev.npy: original non-numerical IDs for all entries that were ever assigned to the respective sets (used to obtain consistent splits for downstream tasks)
  - ID.npy: original non-numerical IDs for all entries in tok.npy
- The approach is easily extendable to further downstream classification or regression tasks. This only requires implementing a corresponding preprocessing method, similar to the ones provided for the existing tasks in preprocessing_proteomics.py.
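To inspect a prepared dataset, the output files described above can be read back with NumPy. This is a minimal sketch and not repository code: `decode_split` is a hypothetical helper, and passing `allow_pickle=True` is an assumption based on the sequences having different lengths, which NumPy typically stores as object arrays.

```python
from pathlib import Path

import numpy as np


def decode_split(folder, split="train"):
    """Return the sequences of one split ("train", "val" or "test") as token strings.

    Assumes the layout described above: tok.npy holds index sequences,
    tok_itos.npy maps indices to tokens, and {split}_IDs.npy lists rows of tok.npy.
    """
    folder = Path(folder)
    # allow_pickle=True is an assumption (ragged sequences saved as object arrays).
    tok = np.load(folder / "tok.npy", allow_pickle=True)
    itos = np.load(folder / "tok_itos.npy", allow_pickle=True)
    rows = np.load(folder / f"{split}_IDs.npy", allow_pickle=True)
    return [[str(itos[i]) for i in tok[row]] for row in rows]
```

For example, `decode_split("datasets/clas_ec/clas_ec_ec50_level1", "val")[0]` would give the tokens of the first validation sequence, provided that working folder has been created by the data preparation script.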
We provide some basic usage information for the most common tasks:
- Language model pretraining (or skip this step and use the provided pretrained LMs, i.e. forward and backward models trained on Swiss-Prot 2017_03)
cd code
python modelv1.py language_model --epochs=60 --lr=0.01 --working_folder=datasets/lm/lm_sprot_dirty/ --export_preds=False --eval_on_val_test=True
- Finetuning for enzyme class classification (here for level 1 and EC50 dataset; assuming the pretrained folder is located at datasets/lm/lm_sprot_uniref_fwd)
cd code
python modelv1.py classification --from_scratch=False --pretrained_folder=datasets/lm/lm_sprot_uniref_fwd --epochs=30 --metrics=["accuracy","macro_f1"] --lr=0.001 --lr_fixed=True --bs=32 --lr_slice_exponent=2.0 --working_folder=datasets/clas_ec/clas_ec_ec50_level1 --export_preds=True --eval_on_val_test=True
- Finetuning for gene ontology prediction
cd code
python modelv1.py classification --from_scratch=False --pretrained_folder=datasets/lm/lm_sprot_uniref_fwd --epochs=30 --lr=0.001 --lr_fixed=True --bs=32 --lr_slice_exponent=2.0 --metrics=[] --working_folder=datasets/clas_go/clas_go_deeprotein_sp_train_deepgo_test --export_preds=True --eval_on_val_test=True
- Finetuning for remote homology detection (here for superfamily level and a single dataset)
cd code
python modelv1.py classification --from_scratch=False --pretrained_folder=datasets/lm/lm_sprot_uniref_fwd --epochs=10 --bs=128 --metrics=["binary_auc","binary_auc50","accuracy"] --early_stopping=binary_auc --bs=64 --lr=0.05 --fit_one_cycle=False --working_folder=datasets/clas_scop/clas_scop0 --export_preds=True --eval_on_val_test=True
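When several runs are launched from a script, the commands above can be assembled as argument lists, which sidesteps shell quoting of the bracketed `--metrics` value. The sketch below is an illustration under stated assumptions: `build_command` is a hypothetical helper, not part of the repository, and it only reproduces flags that appear in the commands above.

```python
import json


def build_command(task, **options):
    """Assemble `python modelv1.py <task> --name=value ...` as an argument list."""
    argv = ["python", "modelv1.py", task]
    for name, value in options.items():
        if isinstance(value, (list, tuple)):
            # Render lists in the bracketed form used above, e.g. ["accuracy","macro_f1"].
            rendered = json.dumps(list(value), separators=(",", ":"))
        else:
            rendered = str(value)
        argv.append(f"--{name}={rendered}")
    return argv


# Enzyme class classification, level 1, EC50 dataset (values copied from above).
ec_level1 = build_command(
    "classification",
    from_scratch=False,
    pretrained_folder="datasets/lm/lm_sprot_uniref_fwd",
    epochs=30,
    metrics=["accuracy", "macro_f1"],
    lr=0.001,
    lr_fixed=True,
    bs=32,
    lr_slice_exponent=2.0,
    working_folder="datasets/clas_ec/clas_ec_ec50_level1",
    export_preds=True,
    eval_on_val_test=True,
)

# To launch the run from the repository root:
#   import subprocess
#   subprocess.run(ec_level1, cwd="code", check=True)
```

Because the list is passed to the process directly, no shell is involved and the quotes inside the metrics list reach the script unchanged.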
The output is logged in logfile.log in the working directory. For convenience, the final results are exported as result.npy, and individual predictions, which can be used for example for ensembling forward and backward models, are exported as preds_valid.npz and preds_test.npz (in case export_preds is set to true).
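As an illustration of ensembling forward and backward models, the sketch below averages two arrays of per-class probabilities. Several points are assumptions: the internal layout of the exported .npz files is not described here (the key names can be listed with `np.load(path).files`), the arrays are taken to have shape (n_samples, n_classes) with rows in the same order, and plain averaging is just one common ensembling choice, not necessarily the one used in the paper. `ensemble_mean` is a hypothetical helper.

```python
import numpy as np


def ensemble_mean(prob_fwd, prob_bwd):
    """Average forward and backward class probabilities.

    Returns the averaged (n_samples, n_classes) array and the index of the
    highest-scoring class for each sample.
    """
    prob_fwd = np.asarray(prob_fwd, dtype=float)
    prob_bwd = np.asarray(prob_bwd, dtype=float)
    if prob_fwd.shape != prob_bwd.shape:
        raise ValueError("forward and backward predictions must have the same shape")
    mean = (prob_fwd + prob_bwd) / 2.0
    return mean, mean.argmax(axis=1)
```

Loading the inputs is left to the caller: open the two exported prediction files with `np.load`, inspect `.files`, and pass the arrays that hold the predictions.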