ProCyon is an open-source model for predicting protein phenotypes across scales. This repository provides the official implementation of the model as described in our overview page and our paper. Model weights and datasets are available in our associated HuggingFace collection:
- Dataset: ProCyon-Instruct
- Full model: ProCyon-Full
- Benchmarking model: ProCyon-Split
- Binding prediction model: ProCyon-Bind
We recommend installing with `uv`, but installation can also be done via `pip` alone. The `procyon` package, used to interact with pre-trained models or train new models, can be installed via:
```bash
cd /path/to/ProCyon

# OPTIONAL: create virtual environment
python3 -m venv ./procyon_venv
source ./procyon_venv/bin/activate

# RECOMMENDED: use uv to install
python3 -m pip install uv
python3 -m uv sync --extra build
# Syncing a second time lets the `compile` extras build against
# packages installed by the first sync
python3 -m uv sync --extra build --extra compile
python3 -m uv pip install -e .

# OR if omitting uv
python3 -m pip install -e .
```
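As a quick sanity check that the editable install succeeded, the package should now import cleanly from the active environment:

```bash
python3 -c "import procyon"
```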
We encourage installation within a virtual environment. Installation with `uv` should take less than 10 minutes, depending on the speed of your internet connection for downloading packages.
In addition to the package code, ProCyon also requires pre-trained weights for associated models (e.g., Llama-3, ESM2) as well as access to the ProCyon-Instruct dataset. These dependencies will all be stored in a single directory, which we denote `DATA_DIR`:
```bash
DATA_DIR=/path/to/data
mkdir $DATA_DIR
cd $DATA_DIR

# Clone ProCyon-Instruct dataset from HuggingFace
git clone git@hf.co:datasets/mims-harvard/ProCyon-Instruct

# Clone model weights for associated Llama models from HuggingFace
# into the corresponding directories within the cloned dataset
# Llama-3-8b for ProCyon-Full
cd ProCyon-Instruct/model_weights/llama-3-8b
git clone git@hf.co:meta-llama/Meta-Llama-3-8B

# Llama-2-7b for ProCyon-Split
cd ../llama-2-7b-hf
git clone git@hf.co:meta-llama/Llama-2-7b-hf

# Add a `.env` file which the `procyon` package will use to find the `DATA_DIR`
cd /path/to/ProCyon
echo "DATA_DIR=\"$DATA_DIR\"" > .env
echo "HOME_DIR=\"$(pwd)\"" >> .env
```
For the core capabilities of ProCyon models, please see the provided demo notebooks. Both examples should run in less than 5 minutes, depending on the speed of your GPU.
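If Jupyter is not already available in your environment, something like the following will get the demo notebooks running (the exact notebook locations depend on your checkout; `jupyter lab` is one option among several):

```bash
python3 -m uv pip install jupyterlab
jupyter lab  # then open the demo notebooks from the file browser
```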
- Additional notebooks with analysis examples
- Reproduction code from the manuscript
- Full training documentation and tutorial
If you use ProCyon in your work, please cite our paper:

```bibtex
@article{Queen2024.12.10.627665,
    author = {Queen, Owen and Huang, Yepeng and Calef, Robert and Giunchiglia, Valentina and Chen, Tianlong and Dasoulas, George and Tai, LeAnn and Ektefaie, Yasha and Noori, Ayush and Brown, Joseph and Cobley, Tom and Hrovatin, Karin and Hartvigsen, Tom and Theis, Fabian and Pentelute, Bradley L. and Khurana, Vikram and Kellis, Manolis and Zitnik, Marinka},
    title = {ProCyon: A multimodal foundation model for protein phenotypes},
    elocation-id = {2024.12.10.627665},
    year = {2024},
    doi = {10.1101/2024.12.10.627665},
    URL = {https://www.biorxiv.org/content/early/2024/12/15/2024.12.10.627665},
    eprint = {https://www.biorxiv.org/content/early/2024/12/15/2024.12.10.627665.full.pdf},
    journal = {bioRxiv}
}
```