ProteoRift utlizes attention and multitask deep-network which can predict multiple peptide properties (length, missed cleavages, and modification status) directly from spectra. We demonstrate that ProteoRift can predict these properties with up to 97% accuracy resulting in search-space reduction by more than 90%. As a result, our end-to-end pipeline, utlizing Specollate as the underlying engine, is shown to exhibit 8x to 12x speedups with peptide deduction accuracy comparable to algorithmic techniques.
If you use ProteoRift in your work, please cite the following publications:
Full documentation and further functionality are still a work in progress. A step-by-step how-to for training or running our trained version of ProteoRift on your data is available below. Please check back soon for an updated tool!
- A Computer with Ubuntu 16.04 (or later) or CentOS 8.1 (or later).
- Cuda enabled GPU with at least 12 GBs of memory.
- OpenMS tool for creating custom peptide database. (Optional)
Step by Step Guide to Install Anaconda
- Fork the repository to your own account.
- Clone your fork to your machine.
cd ProteoRift
conda env create --file proteorift_env.yml
(It would take some minutes to install dependencis)
conda activate proteorift
Our end-to-end pipeline uses two models Specollate and ProteoRift.
- Use mgf files for spectra in
sample_data/spectra
. Or you can use your own spectra files in mgf format. - Use human peptidome subset in
sample_data/peptide_database
. You can provide your own peptide database file created using the Digestor tool provided by OpenMS. - Download the weights for specollate and proteorift model here under the Assets section.
- Set the following parameters in the [search] section of the
config.ini
file:model_name
: Absolute path to the proteorift model (called proteorift_model_weights.pt that you downloaded from here under the Assets section).specollate_model_path
: Absolute path to the specollate model (called specollate_model_weights.pt that you downloaded from here under the Assets section).mgf_dir
: Absolute path to the directory containing mgf files to be searched.prep_dir
: Absolute path to the directory where preprocessed mgf files will be saved.pep_dir
: Absolute path to the directory containing peptide database.out_pin_dir
: Absolute path to a directory where percolator pin files will be saved. The directory must exist; otherwise, the process will exit with an error.- Set database search parameters
- Run
python read_spectra.py -t u
. It would preprocess the spectra files and place in the prep_dir. - Run
python run_search.py
. It would generate the embeddings for spectra and peptides and it would predict the filters for spectra and perform the search. It would generate the output(e.g target.pin, decoy.pin).
The database search would output two files (target.pin, decoy.pin). target.pin
contains the information about Target Peptide Spectrum Match. decoy.pin
contains the information about Decoy Peptide Spectrum Match. Both .pin file would have the features given below for Peptide-Spectrum Match.
Once the search is complete and .pin are generated; you can analyze the percolator files using the crux percolator tool:
cd <out_pin_dir>
crux percolator target.pin decoy.pin --list-of-files T --overwrite T
Execution time is dependent on many factors including your machines, size of the data, size of the spectra, and what kind of search-space reduction was achieved. As an example, for a 3.9GB database our proposed method completes the search in 1.65 hours with filters.
To perform the uncertainty Analysis, open notebook uncertainty_analysis/uncertainty-analysis-specs.ipynb
.
- Set the following parameters in the notebook:
model_path
: Absolute path to the specollate model weights (called specollate_model_weights.pt that you downloaded from here under the Assets section)in_tensor_dir
: Path to your data folder, it should contain your data after preprocessing (Review the comments in the notebook to identify the files generated after preprocessing.)
- Install dependencies specified in the notebook
- Run the notebook
You can retrain the ProteoRift model if you wish.
- Prepare the spectra data (mgf format).
- Open the config.ini file in your favorite text editor and set the following parameters:
mgf_dir
: Absolute path of the mgf files.prep_dir
Absolute path to the directory where preprocessed mgf files will be saved.- other parameters in the [ml] section: You can adjust different hyperparameters in the [ml] section, e.g., learning_rate, dropout, etc.
- Setup the wandb account. Create a project name
proteorift
. Then login to the project usingwandb login.
It would store the logs for training. - Run
python read_spectra.py -t l
. It would preprocess the spectra files and split them (training, validation, test) and place in the prep_dir. - Run the specollate_train file
python run_train.py
. The model weights would be saved in an output dir.