Feature representations for new Proteins in DiG #184

Open
sai-advaith opened this issue Apr 23, 2024 · 12 comments

Comments

@sai-advaith

sai-advaith commented Apr 23, 2024

Hi,

This is regarding protein generation in DiG.

I wanted to know how you obtained the features in the protein pickle files. According to Appendix B.1 of the paper, the single and pair representations are simply the outputs of a pre-trained Evoformer model from AlphaFold, given the corresponding protein's FASTA sequence and MSAs.

I set up OpenFold on our systems and saved the Evoformer representations for each protein in a pickle file. I used the `single` and `pair` keys in the output dictionary in this link. To get the MSAs for the FASTA sequence, I queried the ColabFold server.
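
For readers reproducing this step, here is a minimal sketch of saving the two representations. It assumes the model output is a dictionary whose `single` and `pair` entries have already been converted to NumPy arrays. The function names and the two-key pickle layout are hypothetical and may not match the format of the DiG dataset files.

```python
import pickle

import numpy as np


def save_representations(output, path):
    """Write the `single` and `pair` entries of an output dict to a pickle file.

    Assumed shapes (not confirmed by the DiG authors):
      single: (n_res, c_s)
      pair:   (n_res, n_res, c_z)
    """
    reps = {
        "single": np.asarray(output["single"], dtype=np.float32),
        "pair": np.asarray(output["pair"], dtype=np.float32),
    }
    n_res = reps["single"].shape[0]
    # The pair representation must be square in the residue dimension.
    if reps["pair"].shape[:2] != (n_res, n_res):
        raise ValueError("pair representation does not match the sequence length")
    with open(path, "wb") as f:
        pickle.dump(reps, f)
    return reps


def load_representations(path):
    """Read back a dict written by save_representations."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Checking the shapes before writing catches a mismatch between the sequence length and the pair representation early, before the file is passed to a downstream model.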

Unfortunately, the representations I received from OpenFold's Evoformer and the representations in the dataset's pickle file were quite different.

Can you please let me know the exact method you used to obtain the single and pair representations for the respective protein fasta sequence?

@zhengsx
Contributor

zhengsx commented May 27, 2024

Please use AlphaFold's representations.

@LifeWorks

@sai-advaith Hi, I assume you downloaded the datasets and checkpoints successfully. The token expired in May because of Microsoft policy. Would you mind sharing what you have downloaded? Thanks very much!

@amelie-iska

amelie-iska commented Aug 2, 2024

Same!!! @sai-advaith please share!!! Or @LifeWorks do you have it?

@sai-advaith
Author

sai-advaith commented Aug 2, 2024

I wrote a script (based on AlphaFlow) to extract Evoformer representations. This code will help you get the single and pair representations you'll need to run graphormer.

https://github.com/sai-advaith/evoformer_representation

Is this what you wanted @LifeWorks @amelie-iska ? (Feel free to star if it's relevant and let me know if you have any trouble running it)

@LifeWorks

> I wrote a script (based on AlphaFlow) to extract Evoformer representations. This code will help you get the single and pair representations you'll need to run graphormer.
>
> https://github.com/sai-advaith/evoformer_representation
>
> Is this what you wanted @LifeWorks @amelie-iska ? (Feel free to star if it's relevant and let me know if you have any trouble running it)

Thanks for the prompt reply.

I wanted to get the checkpoints and dataset used by DiG to predict the distributions: https://github.com/microsoft/Graphormer/blob/main/distributional_graphormer/README.md
In DiG's README, they give a SAS token to download DiG's trained model, but the token has expired and the authors haven't posted any new share links yet.

Did you happen to download all these datasets and checkpoints before the token expired? If so, would you mind resharing them through a Google share or something similar?

https://github.com/microsoft/Graphormer/tree/main/distributional_graphormer/protein#trained-parameters

Thanks very much!

@amelie-iska

@LifeWorks and @sai-advaith, if either of you has the datasets and checkpoints, please let me know. I think @sai-advaith has a very useful repo, but it's unclear to me at the moment whether it is enough for running DiG. I think we need the dataset too, no? And the checkpoint isn't available now either? 😢 Let me know if either of you has time to discuss how to get DiG running. I had it running a couple of months ago, before they took down the datasets and checkpoints.

@sai-advaith
Author

The dataset consisted of protein fasta sequence (which you can get online) and evoformer representation (from the repo I shared).

I will get back to you regarding the model weights.

@LifeWorks

> The dataset consisted of protein fasta sequence (which you can get online) and evoformer representation (from the repo I shared).
>
> I will get back to you regarding the model weights.

I see. Thanks very much! I'm looking forward to the model weights!

@amelie-iska

Thanks so much @sai-advaith and @LifeWorks! I really appreciate the help getting the weights (and the excellent repo for getting the single and pair representations from EvoFormer)! I'd like the protein only weights, but also the protein-ligand weights if you have them or if either of you are able to get them. Please let me know how you would like to share the weights too.

@pujaltes

pujaltes commented Aug 9, 2024

The model weights and data are still private. Would anyone (@sai-advaith, @LifeWorks, @amelie-iska) be able to kindly share them with us?

@amelie-iska

I wish I had them @pujaltes. If you get them, please let me know. I still don't have them.

@jeevster

jeevster commented Oct 18, 2024

Hi @sai-advaith, thanks for creating this useful repo! As a sanity check, I tried generating the Evoformer representations for one of the proteins (6lu7) for which the representations were already shared by the authors in this repo. I found that the representations produced by OpenFold are slightly different from those provided. For example, for the single representations, the cosine similarity averaged across residues is around 0.995, and the ratio of the norms is on average 1.03. Did you find that these differences were minor enough to still yield good samples for the proteins you tried?
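
A comparison of this kind could be sketched as follows. This is only an illustration, assuming both single representations are NumPy arrays of shape `(n_res, channels)`. The function name `compare_single` is hypothetical, and the direction of the norm ratio (first argument over second) is an assumption, since the comment does not say which way it was computed.

```python
import numpy as np


def compare_single(a, b, eps=1e-12):
    """Return (mean per-residue cosine similarity, mean per-residue norm ratio).

    a, b: arrays of shape (n_res, channels), one row per residue.
    The norm ratio is computed as ||a_i|| / ||b_i|| (assumed direction).
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")

    # One L2 norm per residue.
    norm_a = np.linalg.norm(a, axis=-1)
    norm_b = np.linalg.norm(b, axis=-1)

    # Per-residue cosine similarity; eps guards against all-zero rows.
    cosine = np.sum(a * b, axis=-1) / np.maximum(norm_a * norm_b, eps)
    ratio = norm_a / np.maximum(norm_b, eps)

    return float(cosine.mean()), float(ratio.mean())
```

Usage would look like `cos, ratio = compare_single(openfold_single, provided_single)`, where both arrays are hypothetical variables holding the two versions of the single representation for the same protein.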
