Distance prediction threshold #4

Open
simonMoisselin opened this issue Mar 7, 2019 · 6 comments

@simonMoisselin

Hello,

Nice work!

In the notebook predicting_distances, why did you choose to predict distance classes instead of the distance values directly?

@hypnopump
Owner

Hi,
First of all, thanks for the kind words.

To answer your question: since we were training on frames of 200x200, I couldn't find a better way for the model to "ignore" the padding than converting the task into a classification problem and giving the padding class a very small weight.
Also, AlphaFold used classes rather than direct distances as predictions. The reason is that we don't want the model to output the exact distance between each pair of AAs, since that is pretty impractical; instead, the outputs are used as constraints for a folding algorithm such as Rosetta's (I'm still not exactly sure how to pass the outputs as constraints to this kind of system, but the method is reported in this paper, which got SOTA results, and it seems to me that it was referenced in the AlphaFold blog post).
I'll try to train a model for direct distance prediction with MSE (Mean Squared Error) as the loss function once I have the 64x64 crops system working.
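
As a rough illustration of the padding idea described above, here is a minimal sketch that pads a distance map to a fixed frame and converts each cell to a class label, with one class reserved for padding. The bin edges, the padding marker and the function names are all assumptions made for this example; the thread does not state the values used in the notebook.

```python
from bisect import bisect_right

# Hypothetical bin edges in Angstrom (not the notebook's actual values).
# Five edges give six distance classes; the last one is a catch-all.
BIN_EDGES = [4.0, 8.0, 12.0, 16.0, 20.0]
PAD_CLASS = 0      # class index reserved for padded cells (assumption)
PAD_VALUE = -1.0   # marker for "no residue pair here" (assumption)


def distance_to_class(distance):
    """Map one distance to a class index; negative values mean padding."""
    if distance < 0:
        return PAD_CLASS
    # Classes 1..len(BIN_EDGES)+1 cover the real distances.
    return 1 + bisect_right(BIN_EDGES, distance)


def pad_and_bin(dist_map, frame_size):
    """Place an n x n distance map in a frame_size x frame_size label grid."""
    n = len(dist_map)
    labels = [[PAD_CLASS] * frame_size for _ in range(frame_size)]
    for i in range(n):
        for j in range(n):
            labels[i][j] = distance_to_class(dist_map[i][j])
    return labels


# Toy 2-residue protein padded to a 3x3 frame.
toy_labels = pad_and_bin([[0.0, 5.0], [5.0, 0.0]], frame_size=3)
```

Cells outside the real protein end up in `PAD_CLASS`, so a per-class weight close to zero for that class keeps them from dominating the loss.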

@jgreener64

To add my opinion to the mix: when you predict a distance, you need a degree of uncertainty associated with the prediction to use it effectively as a constraint. Predicting distances in bins is a useful way to do this. It is unclear how you would train a system that predicts a distance and an uncertainty value together.

Also, distance predictions above a certain threshold (perhaps 20 Angstrom) are not accurate when using covariation data, as they just tell you the residues are not close in the protein. You wouldn't want a strong constraint on that. Predicting into distance bins lets you have a catch-all last bin that takes account of this.
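
To make the point about uncertainty concrete, here is a small sketch of how a predicted distribution over bins could be turned into a distance estimate plus a spread, while refusing to produce a constraint when most of the probability sits in the catch-all bin. The bin centers, the cutoff and the function name are assumptions for illustration only, not something taken from this repository or from any folding tool.

```python
import math

# Hypothetical centers (Angstrom) of the finite bins; the catch-all
# last bin has no meaningful center and is handled separately.
BIN_CENTERS = [3.0, 6.0, 10.0, 14.0, 18.0]


def summarize_bins(probs, far_cutoff=0.5):
    """Return (mean, std) over the finite bins, or None if no constraint.

    probs has len(BIN_CENTERS) + 1 entries; the last one is the
    catch-all "far apart" bin.
    """
    if probs[-1] >= far_cutoff:
        return None  # the model only says "not close": no strong constraint
    near = probs[:-1]
    total = sum(near)
    if total == 0:
        return None
    mean = sum(p * c for p, c in zip(near, BIN_CENTERS)) / total
    var = sum(p * (c - mean) ** 2 for p, c in zip(near, BIN_CENTERS)) / total
    return mean, math.sqrt(var)


confident = summarize_bins([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
uncertain = summarize_bins([0.0, 0.5, 0.5, 0.0, 0.0, 0.0])
far_apart = summarize_bins([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
```

A sharply peaked distribution gives a small spread (a tight constraint), a flat one gives a large spread, and a pair that falls mostly in the last bin yields no constraint at all.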

@simonMoisselin
Author

OK, thank you! Now it makes sense to me.
And how did you choose the threshold values? I am guessing they are derived from existing literature.

@hypnopump
Owner

The threshold values are an arbitrary decision (although some constraints may apply), so they could be replaced with other ones.
In general, predictions of distances >~20 Angstrom (A) may be inaccurate. Some papers use bins of 0.5-1A between approximately 4A and 20A. My problem was that the classes are not equally represented in the data, so in order for the model to output a "visually pleasant" image, I had to set weights for the classes. As you can imagine, an optimization problem with 7 variables is much easier than one with 20 of them.
In addition, I couldn't store such big tensors (with 20 classes) in memory if I wanted a decent number of proteins (at least 100) to train the model, so I had to reduce the number of classes. Since my network was not very deep, I also wanted the problem to be an "easy" one, which was another reason to reduce them.
Right now I'm working on a method to load the dataset from disk instead of RAM, so I expect to free some memory and perhaps increase the number of classes or the depth of the model.
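
For readers facing the same class imbalance, one common starting point before tuning by hand is to weight each class by the inverse of its frequency. The sketch below is only that heuristic, not the procedure used in this repository; the function name, the `pad_class` convention and the default padding weight are assumptions for the example.

```python
from collections import Counter


def inverse_frequency_weights(labels, n_classes, pad_class=None, pad_weight=0.0):
    """Weight each class by total / (n_present * count), ignoring padding.

    labels is a flat list of class indices. Classes that never occur
    get weight 0.0; the padding class gets pad_weight.
    """
    counts = Counter(labels)
    present = [c for c in range(n_classes)
               if c != pad_class and counts[c] > 0]
    total = sum(counts[c] for c in present)
    weights = [0.0] * n_classes
    for c in present:
        weights[c] = total / (len(present) * counts[c])
    if pad_class is not None:
        weights[pad_class] = pad_weight
    return weights


# Toy label set: class 0 is padding, class 1 is common, class 2 is rare.
toy_weights = inverse_frequency_weights(
    [0, 0, 0, 0, 1, 1, 1, 2], n_classes=3, pad_class=0
)
```

Rare classes receive larger weights and the padding class is pinned to a small value, which gives a reasonable baseline to adjust from.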

@Guocanyong

How did you choose the weight values for the weighted_categorical_crossentropy loss function used in the distance prediction model?

@hypnopump
Owner

Trial and error with different values. You're encouraged to share your weights if you find a combination that produces better results!
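
For context on what those weights do, here is a plain-Python sketch of a weighted categorical cross-entropy: each class's log-loss term is scaled by its weight, so a weight near zero effectively removes that class (for example padding) from the loss. This is an illustration of the general technique written without any framework, not the implementation used in this repository; the argument layout is an assumption.

```python
import math


def weighted_categorical_crossentropy(y_true, y_pred, weights, eps=1e-7):
    """Mean over cells of -sum_c weights[c] * y_true[c] * log(y_pred[c]).

    y_true: list of one-hot vectors, one per cell.
    y_pred: list of predicted probability vectors, same shape.
    weights: one weight per class.
    """
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for w, t, p in zip(weights, true_row, pred_row):
            # Clip probabilities so log() never sees 0 or 1 exactly.
            p = min(max(p, eps), 1.0 - eps)
            total -= w * t * math.log(p)
    return total / len(y_true)


# Two cells, two classes, uniform predictions; class 1 counts double.
toy_loss = weighted_categorical_crossentropy(
    y_true=[[1, 0], [0, 1]],
    y_pred=[[0.5, 0.5], [0.5, 0.5]],
    weights=[1.0, 2.0],
)
```

Trying different `weights` lists against a validation set is one way to carry out the trial-and-error search described above.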
