A map of proteins for exploration and discovery, built specifically for exploring many local parts of proteins at once. An accompanying paper explains everything in depth.
This code uses Foldseek's 3Di representation instead of amino acids to train a sequence model. The embeddings from the sequence model are then fed into UMAP for a global visualization.
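As a rough illustration of the first step, here is a minimal sketch of how 3Di strings could be turned into integer tokens for a character-level sequence model. It is not the code in this repository: the names `THREE_DI_ALPHABET`, `encode`, `decode`, and `pad_batch` are hypothetical, and it assumes that 3Di states are written with the same 20 letters used for amino acids and that index 0 is reserved for padding.

```python
# Hypothetical character-level tokenizer for 3Di strings (a sketch, not the repo's code).
# Assumption: the 20 3Di states are written with the 20 amino-acid letters.
THREE_DI_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
PAD_ID = 0  # assumption: index 0 is reserved for padding

# Map each 3Di letter to an integer id starting at 1, and back again.
stoi = {ch: i + 1 for i, ch in enumerate(THREE_DI_ALPHABET)}
itos = {i: ch for ch, i in stoi.items()}


def encode(seq: str) -> list[int]:
    """Convert a 3Di string into a list of token ids."""
    ids = []
    for ch in seq.upper():
        if ch not in stoi:
            raise ValueError(f"unexpected 3Di character: {ch!r}")
        ids.append(stoi[ch])
    return ids


def decode(ids: list[int]) -> str:
    """Convert token ids back into a 3Di string, skipping padding."""
    return "".join(itos[i] for i in ids if i != PAD_ID)


def pad_batch(seqs: list[str], block_size: int) -> list[list[int]]:
    """Encode several 3Di strings, truncating or right-padding each to block_size."""
    batch = []
    for seq in seqs:
        ids = encode(seq)[:block_size]
        batch.append(ids + [PAD_ID] * (block_size - len(ids)))
    return batch
```

A batch produced this way could be fed to any token-based sequence model. The embeddings that model produces would then be reduced to two dimensions with UMAP in a separate step.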
What makes this system different? Here I explicitly model each protein as the interactions of its internal 3D structure, then compare across many different proteins to produce a global visualization.
If you want to reproduce these results, check the training code in the training/ directory.
Note that the UMAP transformation was done in Python notebooks, not in the Python code.
The weights are saved as checkpoint-large-3.pt in this Google Drive, along with additional training data.
See the paper, protein-scatter.pdf, for further references beyond the code references below.
- Foldseek: van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist C, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, doi:10.1038/s41587-023-01773-0 (2023). Used to convert sequences into their 3Di representation for training.
- nanoGPT: Andrej Karpathy. Used directly, with modifications, for the causal self-attention torch blocks.
- USalign: Chengxin Zhang, Morgan Shine, Anna Marie Pyle, Yang Zhang. Used in the website backend to visualize PDB proteins superimposed on the query proteins.