Generating SidechainNet angles for arbitrary PDB files #44
Hi Andreea, Thank you for your interest in SidechainNet! I'm glad you found it could be of use. If you continue your work on MP-Nerf, I would be interested in learning how you used SidechainNet with it. I realize that the documentation on the master branch is a bit lacking with regard to your questions. I have just pushed an update to the master branch that should describe in detail how the data is organized. For your convenience, the file is here, and I have copied and pasted the relevant excerpt below. I believe that it should answer your first 3 questions.
To answer your final question, I am happy to push an update with the required code. But before I do, can you verify what you need the function to do? Here is my guess at what the function should do. Please let me know if I interpreted your request incorrectly. You could also use this function directly in your own code without waiting for me to update SidechainNet. The code is longer than necessary because I wanted to make sure it could be easily understood.

    import prody as pr
    import sidechainnet as scn
    from sidechainnet.utils.download import get_resolution_from_pdbid


    def process_pdb(filename, pdbid="", include_resolution=False):
        """Return a dictionary containing SidechainNet-relevant data for a given PDB file.

        Args:
            filename (str): Path to existing PDB file.
            pdbid (str): 4-letter string representing the PDB Identifier.
            include_resolution (bool, default=False): If True, query the PDB for the
                protein structure resolution based on the given pdbid.

        Returns:
            scndata (dict): A dictionary holding the parsed data attributes of the
                protein structure. Below is a description of the keys:

                The key 'seq' is a 1-letter amino acid sequence.
                The key 'coords' is a (L x NUM_COORDS_PER_RES) x 3 numpy array.
                The key 'angs' is a L x NUM_ANGLES numpy array.
                The key 'is_nonstd' is a L x 1 numpy array with binary values. 1
                    represents that the amino acid at that position was a
                    non-standard amino acid that has been modified by SidechainNet
                    into its standard form.
                The key 'unmodified_seq' refers to the original amino acid sequence
                    of the protein structure. Some non-standard amino acids are
                    converted into their standard form by SidechainNet before
                    measurement. In this case, the unmodified_seq variable will
                    contain the original (3-letter code) seq.
                The key 'resolution' is the resolution of the structure as listed
                    on the PDB.
        """
        # First, use Prody to parse the PDB file
        chain = pr.parsePDB(filename)

        # Next, use SidechainNet to make the relevant measurements given the Prody chain obj
        (dihedrals_np, coords_np, observed_sequence, unmodified_sequence,
         is_nonstd) = scn.utils.measure.get_seq_coords_and_angles(chain, replace_nonstd=True)

        scndata = {
            "coords": coords_np,
            "angs": dihedrals_np,
            "seq": observed_sequence,
            "unmodified_seq": unmodified_sequence,
            "is_nonstd": is_nonstd
        }

        # If requested, look up the resolution of the given PDB ID
        if include_resolution:
            assert pdbid, "You must provide a PDB ID to look up the resolution."
            scndata['resolution'] = get_resolution_from_pdbid(pdbid)

        return scndata

I am happy to help if you have any more thoughts or concerns!

Cheers,
Hi Jonathan,

Many thanks for your prompt reply and for your explanations!

Regarding the function that takes a PDB file and generates SidechainNet data -- I think it is missing some of the fields (that I wasn't aware of before) that the official SidechainNet data has. MP-Nerf is using a SidechainNet Dataloader which is built on a SidechainNet ProteinDataset. The

Do you maybe also have the code for extracting/generating these or, if not, could you kindly help us figure these out?

Many thanks again!

PS: I will follow up in a few months if we manage to get some results using SidechainNet and MP-Nerf :)
Hey Andreea,

You're welcome! Your question has actually raised a few questions on my end. I can partially answer your question, but I'm sorry to say that I've just realized that the function above is not completely correct. In particular, the function above does not handle PDB files with multiple peptide chains, or SEQRES records (which are necessary to determine missing residue locations and

Would you mind sharing more information about the files you are using? Can you point me towards them or share an example? I'm curious whether they contain SEQRES records and/or multiple chains.

Best,

Partial Answer

Some data fields are not generated by SidechainNet's own utilities. This is because SidechainNet extends earlier work called ProteinNet. Two of the fields you mentioned are only included in SidechainNet by way of borrowing the data directly from ProteinNet:
I would like to have my own methods for generating these fields, but I have not created any at this time. The key The
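As a rough illustration of why SEQRES records matter here: they list every residue of each chain, including residues that have no resolved coordinates, so they provide the full sequence against which missing positions can be marked. Below is a minimal sketch of reading them with only the Python standard library. The function name `read_seqres` is made up for this example and is not part of SidechainNet or ProDy; the column positions assume the fixed-width PDB format.

```python
def read_seqres(pdb_text):
    """Return {chain_id: [three-letter residue names]} from SEQRES records.

    Hypothetical helper for illustration only; not a SidechainNet function.
    """
    chains = {}
    for line in pdb_text.splitlines():
        # Skip everything that is not a SEQRES record with residue names.
        if not line.startswith("SEQRES") or len(line) <= 19:
            continue
        chain_id = line[11]             # PDB column 12: chain identifier
        residue_names = line[19:].split()  # PDB columns 20+: residue names
        chains.setdefault(chain_id, []).extend(residue_names)
    return chains
```

A file without SEQRES records yields an empty dictionary, which is one quick way to check whether the full sequence is available at all.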
Hi Jonathan, Thanks again for your help! To answer your questions:
Actually, on closer inspection, for MP-Nerf, I think the important missing field would be
Thanks for the follow-up and for sharing some examples! Here are some comments.

SidechainNet and ProteinNet treat all protein chains independently: one chain per data entry. Our proposed function would need to return multiple entries if it encounters multiple chains. That should not be too hard to implement.

With regard to the files, here's the issue that perhaps we can figure out together. The

You tell me that the files have missing atoms. How can one know this by looking at the files? If you know which atoms are missing a priori, then we can use this information to compute the binary missing residue mask that is required.

Please let me know your thoughts!
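To make the two ideas above concrete (one entry per chain, and a binary mask of missing residues), here is a minimal standard-library sketch. Both function names are hypothetical and neither is part of SidechainNet. The mask helper assumes residue numbers are contiguous and start at `first_number`, which real PDB files often violate (insertion codes, numbering offsets); a robust version would align the observed residues against the SEQRES sequence instead.

```python
def observed_residues_by_chain(pdb_text):
    """Map chain ID -> ordered list of (residue number, residue name) from ATOM records.

    Hypothetical helper; column positions assume the fixed-width PDB format.
    """
    chains = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 26:
            continue
        res_name = line[17:20]       # PDB columns 18-20
        chain_id = line[21]          # PDB column 22
        res_num = int(line[22:26])   # PDB columns 23-26
        residues = chains.setdefault(chain_id, [])
        # Record each residue once, even though it spans several ATOM lines.
        if not residues or residues[-1] != (res_num, res_name):
            residues.append((res_num, res_name))
    return chains


def missing_residue_mask(full_length, observed_numbers, first_number=1):
    """Return a list with 1 where a residue was observed and 0 where it is missing.

    ASSUMPTION: numbering is contiguous and starts at first_number.
    """
    observed = set(observed_numbers)
    return [1 if first_number + i in observed else 0 for i in range(full_length)]
```

Each key of the returned dictionary would correspond to one data entry, and `full_length` would come from the chain's full sequence (for example, from SEQRES records) when that information is present in the file.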
Hi,
I am interested in using MP-Nerf for a custom dataset and in order to do this I would need to have the functionality of transforming arbitrary PDB entries in the 'SidechainNet format'. Thus, I am looking to learn more about the data processing in SidechainNet, in particular about the angles.
From your documentation, I know that the angles for a protein with L residues will have the shape L x 12, where:
My questions would be:
Many thanks!
Andreea