This is the code repository for the paper Deep generative modeling of the human proteome reveals over a hundred novel genes involved in rare genetic disorders, a collaboration between the Marks Lab and the Dias and Frazer Group. We also provide a website, pop.evemodel.org, for exploring model predictions on a protein-by-protein basis and for bulk downloads.
popEVE is a model designed to place missense variants on a proteome-wide, human-specific spectrum of pathogenicity. The figure below provides a summary of the full popEVE framework.
Genetic variation seen in the human population and across the tree of life provides complementary information for building a proteome-wide model of pathogenicity. Cross-species data enables missense-resolution predictions, while variation seen in the human population can be used to obtain a proteome-wide, human-specific measure of constraint. The code provided here is designed to achieve this second step. It takes as input predictions for all single amino acid substitutions from a model trained on cross-species data, together with an indicator of whether each variant has been seen in a given cohort of interest. It then trains a new model to predict the presence or absence of a variant in that cohort, conditioned on the score from the input model.
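The idea of predicting presence or absence in a cohort conditioned on an input score can be illustrated with a minimal sketch. This is not the model in this repository, which is built on gpytorch; it swaps in a plain-Python logistic regression on synthetic data purely for illustration. The function name fit_presence_model, the toy data, and the convention that a higher score means a more damaging variant are all assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_presence_model(scores, seen, lr=0.1, epochs=500):
    """Fit P(seen = 1 | score) = sigmoid(w * score + b) by gradient descent.

    Hypothetical stand-in for the repository's model: a logistic regression
    rather than a Gaussian process.
    """
    w, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for s, y in zip(scores, seen):
            err = sigmoid(w * s + b) - y  # gradient of the log loss
            grad_w += err * s
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Synthetic toy data (assumption): one cross-species score per variant,
# where higher means more damaging, so such variants are seen less often.
rng = random.Random(0)
scores = [rng.uniform(-3.0, 3.0) for _ in range(400)]
seen = [1 if rng.random() < sigmoid(-1.5 * s) else 0 for s in scores]

w, b = fit_presence_model(scores, seen)
p_benign = sigmoid(w * -2.0 + b)    # predicted P(seen) for a low score
p_damaging = sigmoid(w * 2.0 + b)   # predicted P(seen) for a high score
print(f"w={w:.2f}, P(seen | low score)={p_benign:.2f}, "
      f"P(seen | high score)={p_damaging:.2f}")
```

In this toy setting the fitted slope is negative, so variants with higher input scores are predicted to be less likely to appear in the cohort.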
In the paper, we used scores from EVE and ESM-1v as our cross-species scores, and UK Biobank data was used as our human cohort. Example training files can be found in the data folder. However, this code can be used with any model and any human cohort.
An example bash script for running this code is train_popEVE_models.sh. All output will appear in the results directory.
The entire codebase is written in Python. Package requirements are as follows:
- python
- pytorch
- gpytorch
- pandas
- tqdm
The corresponding environment can be created via conda with the popeve_env_linux.yml (or popeve_env_macos.yml) file as follows:
conda env create -f popeve_env_linux.yml
conda activate popeve_env
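After activating the environment, a quick check that the listed packages can be imported may be useful. The snippet below is a minimal sketch using only the Python standard library. The helper name check_requirements is hypothetical and not part of this repository. Note that pytorch is imported under the module name torch.

```python
import importlib.util

# Package name from the requirements list -> importable module name.
REQUIRED = {
    "pytorch": "torch",
    "gpytorch": "gpytorch",
    "pandas": "pandas",
    "tqdm": "tqdm",
}


def check_requirements(required=REQUIRED):
    """Return {package name: True/False} depending on whether it is importable."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in required.items()
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```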
A bash script for installing all dependencies on a clean Ubuntu 24.04 system is available in linux_setup.sh.
This project is available under the MIT license.
If you use this code, please cite the following paper:
@article{orenbuch2023deep,
title={Deep generative modeling of the human proteome reveals over a hundred novel genes involved in rare genetic disorders.},
author={Orenbuch, Rose and Kollasch, Aaron W and Spinner, Hansen D and Shearer, Courtney A and Hopf, Thomas A and Franceschi, Dinko and Dias, Mafalda and Frazer, Jonathan and Marks, Debora S},
journal={medRxiv},
pages={2023--11},
year={2023},
publisher={Cold Spring Harbor Laboratory Press}
}