Feature representations for new Proteins in DiG #184
Comments
Please use AlphaFold's representations.
@sai-advaith Hi, I assume you downloaded the datasets and checkpoints successfully; the token expired in May because of Microsoft policy. Would you mind sharing what you have downloaded? Thanks very much!
Same!!! @sai-advaith please share!!! Or @LifeWorks, do you have it?
I wrote a script (based on AlphaFlow) to extract Evoformer representations. This code will help you get the single and pair representations you'll need to run Graphormer: https://github.com/sai-advaith/evoformer_representation Is this what you wanted, @LifeWorks @amelie-iska? (Feel free to star it if it's relevant, and let me know if you have any trouble running it.)
Thanks for the prompt reply. I wanted to get the checkpoints and dataset used by DiG to predict the distributions: https://github.com/microsoft/Graphormer/blob/main/distributional_graphormer/README.md Did you happen to download all these datasets and checkpoints before the token expired? If so, would you mind resharing the dataset and checkpoints through a Google share or something similar? Thanks very much!
@LifeWorks and @sai-advaith, if either of you has the datasets and checkpoints, please let me know. I think @sai-advaith has a very useful repo, but it's unclear to me at the moment whether it is enough for running DiG. I think we need the dataset too, no? And the checkpoint isn't available now either? 😢 Let me know if either of you has time to discuss how to get DiG running. I had it running a couple of months ago, before they took down the datasets and checkpoints.
The dataset consisted of the protein FASTA sequences (which you can get online) and the Evoformer representations (from the repo I shared). I will get back to you regarding the model weights.
I see. Thanks very much! I'm looking forward to the model weights!
Thanks so much @sai-advaith and @LifeWorks! I really appreciate the help getting the weights (and the excellent repo for getting the single and pair representations from Evoformer)! I'd like the protein-only weights, but also the protein-ligand weights if you have them or if either of you is able to get them. Please let me know how you would like to share the weights too.
The model weights and data are still private. Would anyone (@sai-advaith, @LifeWorks, @amelie-iska) be able to kindly share them with us?
I wish I had them @pujaltes. If you get them, please let me know. I still don't have them.
Hi @sai-advaith, thanks for creating this useful repo! As a sanity check, I tried generating the Evoformer representations for one of the proteins (6lu7) for which the representations were already shared by the authors in this repo. I found that the representations produced by OpenFold are slightly different from those provided (for example, for the single representations, the cosine similarity averaged across residues is around 0.995, and the ratio of the norms is on average 1.03). Did you find that these differences were minor enough to still yield good samples for the proteins that you tried?
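For anyone who wants to repeat this kind of sanity check, here is a minimal sketch of the two metrics mentioned above (per-residue cosine similarity and per-residue norm ratio, both averaged over residues). It assumes both representations are already loaded as arrays of the same shape, e.g. `[N_res, C]` for the single representation; the function name `compare_representations` is hypothetical and is not part of OpenFold, DiG, or the linked repo.

```python
import numpy as np


def compare_representations(generated, reference, eps=1e-12):
    """Return (mean cosine similarity, mean norm ratio) over the last axis.

    Hypothetical helper: `generated` and `reference` are assumed to be
    arrays of identical shape, e.g. [N_res, C] single representations.
    The norm ratio is ||generated|| / ||reference|| per residue.
    """
    generated = np.asarray(generated, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    if generated.shape != reference.shape:
        raise ValueError(
            f"shape mismatch: {generated.shape} vs {reference.shape}"
        )

    # Norms are taken along the channel axis, one value per residue.
    norm_gen = np.linalg.norm(generated, axis=-1)
    norm_ref = np.linalg.norm(reference, axis=-1)

    # eps guards against division by zero for all-zero feature vectors.
    dot = np.sum(generated * reference, axis=-1)
    cosine = dot / np.maximum(norm_gen * norm_ref, eps)
    ratio = norm_gen / np.maximum(norm_ref, eps)

    return float(cosine.mean()), float(ratio.mean())
```

The same function works on the pair representation if it is passed as `[N_res, N_res, C]`, since the metrics are computed along the last axis and then averaged over everything else.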
Hi,
This is regarding protein generation in DiG.
I wanted to know how you obtained the features present in the protein pickle files. As per Appendix B.1 of the paper, the single and pair representations are simply outputs of a pre-trained Evoformer model from AlphaFold given the corresponding protein's Fasta sequence and MSAs.
I set up OpenFold on our systems and saved the representations from Evoformer in a pickle file for the corresponding protein. I used the `single` and `pair` keys in the `output` dictionary in this link. Also, to get the MSAs for the FASTA sequence, I queried the ColabFold server.
Unfortunately, the representations I received from OpenFold's Evoformer and the representations in the dataset's pickle file were quite different.
Can you please let me know the exact method you used to obtain the single and pair representations for the respective protein fasta sequence?
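For reference, the saving step described above looks roughly like the sketch below. It assumes `output` is a dictionary that already holds the `single` and `pair` representations as NumPy arrays (torch tensors would need `.detach().cpu().numpy()` first). The helper names and the pickle layout are hypothetical; the exact layout of the DiG pickle files is not known from this thread.

```python
import pickle

import numpy as np


def save_representations(output, path):
    """Save the `single` and `pair` entries of `output` to a pickle file.

    Hypothetical helper. Assumes output["single"] has shape [N_res, C_s]
    and output["pair"] has shape [N_res, N_res, C_z].
    """
    single = np.asarray(output["single"])
    pair = np.asarray(output["pair"])

    # Basic consistency check: both must describe the same residues.
    n_res = single.shape[0]
    if pair.ndim != 3 or pair.shape[0] != n_res or pair.shape[1] != n_res:
        raise ValueError(
            f"inconsistent shapes: single {single.shape}, pair {pair.shape}"
        )

    record = {"single": single, "pair": pair}
    with open(path, "wb") as handle:
        pickle.dump(record, handle)
    return record


def load_representations(path):
    """Load a pickle written by `save_representations`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Loading the result back and comparing shapes and dtypes against one of the provided pickle files is a quick way to spot layout differences before comparing values.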