[WIP] Restructure and rewrite of structure and function predictions #1003

Open
wants to merge 9 commits into
base: master

@sacdallago sacdallago commented Mar 16, 2020

Please: consider this a draft to see the direction I'm taking. Help would be very welcome!

While writing, I noticed that the restructuring that was needed was more extensive than I had originally intended & stated in #1000

  1. I'm happy with the general introduction ("Protein structure and function predictions")
  2. I'm happy with "Secondary structure"
  3. "Contact and distance maps" seems also quite alright to me.
  4. "3D structure from sequence alone" is intended as a short section mentioning that most methods use contact/distance maps to fold proteins, but some newer methods (see in manuscript) try to go directly from sequence to structure (in an end-to-end fashion). I'm not well read enough in this to have the confidence to write it yet, so I'm asking a colleague who is.
  5. "Quaternary structure and protein-protein interactions" I haven't really touched on yet, but here too I might ask two colleagues to look at this, since they work on exactly this.
  6. I would like to add another section on function prediction where I'd mention subcellular localization & GO annotation prediction (these could actually be 2 or more sections, but I want to keep it easy for now).

REF #1000

@sacdallago
Author

Ping @j3xugit

@AppVeyorBot

AppVeyor build 1.0.74 for commit c2e0e86 by @sacdallago is now complete. The rendered manuscript from this build is temporarily available for download at:

@cgreene
Member

cgreene commented Mar 16, 2020

Hi @sacdallago : I see that you've eliminated the one-sentence-per-line style. This makes it very hard to comment on individual sentences and to track changes in that way. Can you add the line breaks after each sentence back to your PR?

@sacdallago
Author

Oh, I must have missed that in the instructions! Sorry @cgreene :) Will amend now...

@AppVeyorBot

AppVeyor build 1.0.76 for commit 90f941f by @sacdallago is now complete. The rendered manuscript from this build is temporarily available for download at:

@AppVeyorBot

AppVeyor build 1.0.77 for commit a233b25 by @sacdallago is now complete. The rendered manuscript from this build is temporarily available for download at:

@sacdallago
Author

Just bumping this up again :) ( @j3xugit )

@j3xugit
Contributor

j3xugit commented Mar 31, 2020 via email

@j3xugit
Contributor

j3xugit commented Mar 31, 2020

The new version reads well, but I do want to do some minor revisions and possibly add a few more references. Do I have the write permission now? How can I do the revision?

@cgreene
Member

cgreene commented Mar 31, 2020

@j3xugit great question! the GitHub suggest interface was designed for exactly this!

  1. Mouse over the line you want to change and click the plus sign.
  2. Click the "suggest" button.
  3. Change the content within the backticks to what you want it to say.

If you want to change one and only one line you can do "single comment". Otherwise, to batch them up, make all the suggestions you want to make, then select "review changes" and "comment" and submit that.

Thanks!

content/04.study.md: 3 outdated review threads (resolved)

The improvement obtained by these methods may be mainly due to the ability of convolutional neural fields to capture long-range information.
Top methods still heavily rely on the creation of profiles from Multiple Sequence Alignments (MSAs).
Models relying on LMs have yet to reach the accuracy of evolutionary based methods, but are able to deliver results for proteins for which MSAs can't be computed and in general execute at a fracion of the time with respect to evolutionary based approaches [@doi:10.1186/s12859-019-3220-8].
Contributor

@j3xugit j3xugit Apr 1, 2020

Suggested change
- Models relying on LMs have yet to reach the accuracy of evolutionary based methods, but are able to deliver results for proteins for which MSAs can't be computed and in general execute at a fracion of the time with respect to evolutionary based approaches [@doi:10.1186/s12859-019-3220-8].
+ Models relying on LMs have yet to reach the accuracy of evolutionary based methods even for proteins for which good MSAs cannot be built, but these LM methods in general execute at a fraction of the time with respect to evolutionary based approaches [@doi:10.1186/s12859-019-3220-8].

Author

@j3xugit

I think your addition has not yet been experimentally validated (even for proteins for which good MSAs cannot be built), and is, in large part, a repetition of the first part of the sentence (Models relying on LMs have yet to reach the accuracy of evolutionary based methods), which is the best (to my knowledge) one can say about the comparison in accuracy between evo vs. LMs.

My original intent (but are able to deliver results for proteins for which MSAs can't be computed) was meant in the sense that evo models (at least the ones I know) actually fail to produce results if MSAs can't be built.

If we can't agree, I'm happy to leave out this second part of the sentence. I would also rewrite the second part into its own sentence and make it clearer. Here is my suggestion:

Suggested change
- Models relying on LMs have yet to reach the accuracy of evolutionary based methods, but are able to deliver results for proteins for which MSAs can't be computed and in general execute at a fracion of the time with respect to evolutionary based approaches [@doi:10.1186/s12859-019-3220-8].
+ Models relying on LMs have yet to reach the accuracy of evolutionary based methods.
+ On the upside, LMs require a fraction of the resources for inference compared to evolutionary based approaches [@doi:10.1186/s12859-019-3220-8].

Contributor

@j3xugit j3xugit Aug 9, 2020

It is incorrect to say that evo models fail to produce results if MSAs can't be built. In fact, my deep model (and other similar ones) works well on some proteins for which MSAs cannot be built. The Baker group has also shown that trRosetta, the deep model developed by his group (which is similar to mine), works well on a good portion of human-designed proteins even though MSAs cannot be built for them.

Collaborator

Are there specific references we should have in mind to support the discussion in these sentences? That might help make sure all of us are looking at the same models and results.

Author

Right, I might have been too black and white here :) sorry about that!

My belief is that if a model bases predictions only on PSSMs (or evo couplings) as inputs, but for a certain dark protein there simply isn't an MSA to start with, those predictions can't be trusted (I picture this as having an input matrix with all zeroes, just to give an idea). It's more of a "conceptual" point than the reality of things.
Models out there today do all sorts of things, including combining sequence-based features (often learned, e.g. via CNNs) with MSA-extracted features, so mine would be a simplification.

I'm happy either way :) The more fundamental point of this sentence, for me, is that we still don't have a clear understanding of how well LMs work on those proteins for which MSA-based methods also perform arguably well, and how much coverage we are buying ourselves by using these models instead (or combinations of the two). But results for that will be out in Dec with CASP, I feel.

Collaborator

@sacdallago can you please edit these sentences to take into account @j3xugit's comment and the trRosetta results for de novo proteins? Or @j3xugit could suggest new text for @sacdallago to review.

https://www.pnas.org/content/117/3/1496 (doi:10.1073/pnas.1914677117) could also be a relevant reference to add.

Author

If it's ok, I'll edit this the week of Aug 31st (I'm currently on holidays and it's hard to get into the writing headspace -- especially without a laptop & context overview)

Collaborator

Sending a brief reminder about these edits @sacdallago. It shouldn't require anything too extensive.

Author

@sacdallago sacdallago Sep 9, 2020

@agitter

  1. Thanks very much for the reminder -- unfortunately I had to attach 1 week leave of absence to my holidays due to some unfortunate family circumstances. AKA: sorry for the delay
  2. I don't quite know how to integrate the TrRosetta paper in exactly this section. TrRosetta doesn't use language models to extract additional features; it still relies exclusively on MSAs (from my understanding). I did add it to an earlier sentence where it appeared to make more sense. Additionally, this section is about secondary structure, while TrRosetta is more about 3D structure prediction. I'll see where I can put it.
  3. I updated some other sections, which in the meantime have seen some new pre-prints and work.

I'll update the PR in about 10 min with the latest changes

content/04.study.md: 4 outdated review threads (resolved)
@sacdallago
Author

Thanks for the comments @j3xugit and explanation @cgreene ; I'll make time on the weekend to go over the changes and integrate them :)

@AppVeyorBot

AppVeyor build 1.0.81 for commit 4ff4b51 by @sacdallago is now complete. The rendered manuscript from this build is temporarily available for download at:

@AppVeyorBot

AppVeyor build 1.0.82 for commit 204c23a by @sacdallago is now complete. The rendered manuscript from this build is temporarily available for download at:

@AppVeyorBot

AppVeyor build 1.0.83 for commit a75f45e by @sacdallago is now complete. The rendered manuscript from this build is temporarily available for download at:

@AppVeyorBot

AppVeyor build 1.0.84 for commit 45d6ed9 by @sacdallago is now complete. The rendered manuscript from this build is temporarily available for download at:

Author

@sacdallago sacdallago left a comment

@j3xugit Thanks for your help :) I accepted most suggestions but have one open discussion item :)

@sacdallago
Author

Sorry about the delay & for committing the suggestions individually; I figured out only midway through that there was in fact a way of committing multiple changes via the "files" tab!

Thanks for the suggestions, anyway! :)

@AppVeyorBot

AppVeyor build 1.0.85 for commit c77cc95 by @sacdallago is now complete. The rendered manuscript from this build is temporarily available for download at:

@sacdallago
Author

A ping @j3xugit

@agitter
Collaborator

agitter commented Aug 9, 2020

@sacdallago it looks like these edits already went through one round of review and almost everything has been addressed. Is there only one point of discussion to resolve before this is ready to merge?

I'll do a light review for style and copy editing after the scientific questions are all resolved.

@j3xugit
Contributor

j3xugit commented Aug 9, 2020 via email

@agitter
Collaborator

agitter commented Aug 9, 2020

@j3xugit I believe this comment above (#1003 (comment)) may be waiting for your feedback. @sacdallago proposed a change to line 321 to see if you agree with that rephrasing.

@j3xugit
Contributor

j3xugit commented Aug 10, 2020 via email

@sacdallago
Author

@agitter

I think this pass is fine by me. Thanks for pinging and looking into this ;)

@sacdallago
Author

Important realization: the 3D prediction section is still just a draft. I have no extensive expertise in this, but I'm happy to contribute what I can in the following months.

If it's up to me: I will make this a quite short paragraph expanding on the bullet points which I laid out in this section.

@AppVeyorBot

AppVeyor build 1.0.103 for commit 980f5d1 by @sacdallago is now complete. The rendered manuscript from this build is temporarily available for download at:

@agitter agitter mentioned this pull request Dec 9, 2020
@agitter agitter mentioned this pull request Jan 17, 2021