Sidechainnet for CASP 13 to CASP 15 #57
Comments
Hi Harsh,
Thanks for your interest! This is something that I would love to do (and I'm sure other users would be interested in), but it's unfortunately delayed and I don't have info on when I can add it. I'm working on adding slightly different functionality to SidechainNet at the moment.
Why? The trouble is that SidechainNet directly extends ProteinNet (and thereby uses ProteinNet's pretty sophisticated protein sequence clustering and filtering methods). Since ProteinNet does not support CASPs newer than CASP 12 to my knowledge (specifically the clustering info), I am prevented from adding later CASP datasets to SidechainNet for now. I must either develop the code to split the training data in the same way as AlQuraishi et al. have done, ask the authors for access to that code, or hope that the authors would be willing to generate the same kind of dataset splits for CASPs > 12 and share them.
The good news is that you can manually specify proteins for a custom SidechainNet dataset. See Section 5 of the Colab Walkthrough linked to in the README. You'd just need to define a list of train, validation, and test set proteins using the SidechainNet naming scheme, and those protein chains will be acquired and parsed into SidechainNet's data structures. For the CASP test set proteins, however, you would need to identify the RCSB PDB IDs that they correspond to, so that SidechainNet can download them correctly from the RCSB PDB.
Please let me know if you have any questions or concerns, and I'd be happy to help as much as I can.
Best,
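To make the manual route a bit more concrete, here is a minimal sketch of turning a hand-curated table of CASP targets into chain identifiers. Everything in it is an assumption for illustration: the `CASP_TO_PDB` table, the helper names, the placeholder entries (which are not real CASP/PDB correspondences), and the `<PDB ID>_<model>_<chain>` layout. It also leaves out any extra marker SidechainNet may need to tag an entry as validation or test, so please check the Colab Walkthrough for the exact format.

```python
from typing import Dict, List, Tuple

# Hypothetical, hand-curated mapping from a CASP target name to the
# (PDB ID, model number, chain ID) it corresponds to in the RCSB PDB.
# The entries below are placeholders, NOT real correspondences.
CASP_TO_PDB: Dict[str, Tuple[str, int, str]] = {
    "TARGET_A": ("1abc", 1, "A"),
    "TARGET_B": ("2xyz", 1, "B"),
}


def to_chain_id(pdb_id: str, model: int = 1, chain: str = "A") -> str:
    """Join the three parts as '<PDB ID>_<model>_<chain>' (assumed layout)."""
    return f"{pdb_id.upper()}_{model}_{chain}"


def build_chain_ids(mapping: Dict[str, Tuple[str, int, str]]) -> List[str]:
    """Build one chain identifier per CASP target, in insertion order."""
    return [to_chain_id(*entry) for entry in mapping.values()]


test_ids = build_chain_ids(CASP_TO_PDB)
print(test_ids)
```

Keeping the mapping in one table means the manual lookup (CASP target to PDB entry) is done once and can be reviewed separately from the code that formats the identifiers.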
Hey Jonathaking,
I'm really glad it has been helpful to you! Let's see, let me try to break this down a bit.
1. To begin (apologies if you already know this), you should be aware that SidechainNet (as well as many other models and datasets like ProteinNet or even AlphaFold) treats proteins not as multi-chain entities, but rather operates on each protein chain independently. So, in SidechainNet, we use a naming scheme that includes not only the 4-character RCSB PDB ID, but also a "model number" (usually 1 is appropriate if you don't have a reason to use something else), as well as the very important chain ID. What you're effectively doing is trying to download model 1 and chain A from all of those proteins. Model 1 probably exists for all of them, as does chain A, but neither is guaranteed.
2. I'm not positive, but I think your code is not running on the Colab notebook because some of the IDs you've provided are not valid. To me it looks like your code doesn't bother downloading sidechainnet data for any of the items you requested (it says
If you want me to look closer at your error, can you please expand the error traceback where it says "3 Frames"?
3. I think I understand what you want, but SidechainNet doesn't have all the tools to get you there at the moment. If you can come up with a way to properly generate all of the sidechainnet-formatted IDs that you need, then my code should be able to handle that. SidechainNet specifies proteins as being part of the validation or test sets by this naming convention (i.e.
There is also functionality that's not fully tested where, if you have the PDB file, you can load the protein into a SCNProtein. However, this doesn't work for proteins with gaps in their sequences, and the PDB file must only have a single chain.
Please let me know if I can help any more!
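Since invalid IDs seem to be the likely culprit, a quick format check before handing the list to SidechainNet might save some debugging. The sketch below is only that: the `<PDB ID>_<model>_<chain>` pattern is my assumption of the layout described above, and `split_valid_invalid` is a hypothetical helper, not part of SidechainNet. It checks the shape of each string only; it cannot tell you whether that model or chain actually exists in the RCSB PDB.

```python
import re
from typing import Iterable, List, Tuple

# Assumed layout: 4-character PDB ID, model number, chain ID, joined by "_".
_ID_PATTERN = re.compile(r"^[A-Za-z0-9]{4}_\d+_[A-Za-z0-9]+$")


def split_valid_invalid(ids: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate IDs that match the assumed layout from those that do not."""
    valid: List[str] = []
    invalid: List[str] = []
    for raw in ids:
        candidate = raw.strip()  # tolerate stray whitespace from copy/paste
        if _ID_PATTERN.match(candidate):
            valid.append(candidate)
        else:
            invalid.append(candidate)
    return valid, invalid


# Placeholder inputs: a bare PDB ID and a 5-character ID should be rejected.
valid, invalid = split_valid_invalid(
    ["1ABC_1_A", "1ABC", "2xyz_1_B ", "ABCDE_1_A"]
)
print("valid:", valid)
print("invalid:", invalid)
```

Anything that lands in `invalid` (for example a bare PDB ID with no model or chain) would be worth fixing by hand before retrying the download.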
Hey, thanks for your reply. Here's a copy of the Colab notebook: https://colab.research.google.com/drive/1X-Z7qcDUyQxIXnBYyWd042BcQr3UsF-z?usp=sharing. Kindly let me know if you can find the issue or suggest how it can be fixed :)
I get the same error when running your notebook. I think it's because of the reasons I mentioned above (improper SidechainNet IDs). Please let me know if I can clarify further.
Gotcha. Thanks. I'll try to fetch the correct IDs and post if I encounter any other issues.
Hi!
I am trying to do masked modelling using sequence and structural data from your curated dataset. I was wondering whether you could add the data for CASP 13 to CASP 15, or share how I can do the same on my own.
Kind regards,
Harsh