5.1.1 Creating a list of ProteinNet IDs - error #54
Thank you for your interest and your patience as I try to address your concerns. To begin, can you please provide me the code you are trying to run? Also, have you seen my example on creating a custom dataset in the Google Colab notebook linked in the README?
Hello, here are some issues (not critical):
a. Working fine:

```python
d = scn.create_custom(pnids=training_ids + valid32_ids + valid96_ids,
```

b. Not working: `training_ids = [list]`

Error:

```
Traceback (most recent call last):
  File "test_test.py", line 35, in <module>
    d = scn.create_custom(pnids=training_ids + valid32_ids + valid96_ids,
  File "/home/groot/.local/lib/python3.8/site-packages/sidechainnet/create.py", line 354, in create_custom
    sc_only_data, sc_filename = download_sidechain_data(
  File "/home/groot/.local/lib/python3.8/site-packages/sidechainnet/utils/download.py", line 130, in download_sidechain_data
    sc_data, pnids_errors = get_sidechain_data(new_pnids, limit)
  File "/home/groot/.local/lib/python3.8/site-packages/sidechainnet/utils/download.py", line 189, in get_sidechain_data
    list(
  File "/home/groot/.local/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
OSError: /home/groot/.local/lib/python3.8/site-packages/sidechainnet/resources/proteinnet_parsed/targets/7PBC_1_A.pdb is not a valid filename or a valid PDB identifier.
```

This error occurs every time for all "TEST" dataset prefixes and any PDB ID. It can be worked around by assigning the test proteins to a validation set that is unused during training.
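Since these test-set targets cannot be downloaded, one practical workaround is to filter them out of the ID list before calling `create_custom`. The helper below is a hypothetical sketch, not a SidechainNet API, and the prefix strings are assumptions based on this thread; adjust them to match the test-set prefixes in your own ID list:

```python
def drop_casp_targets(pnids, test_prefixes=("TBM#", "FM#", "TBM-hard#")):
    """Remove CASP test-set entries from a list of ProteinNet IDs.

    NOTE: the prefix strings above are assumptions; replace them with the
    actual prefixes that appear in your test-set IDs.
    """
    return [p for p in pnids if not p.startswith(test_prefixes)]

ids = ["TBM#T0865", "1ABC_1_A", "FM#T0912"]
print(drop_casp_targets(ids))  # → ['1ABC_1_A']
```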
```python
d = scn.load(local_scn_path="sidechainnet_data/800p.pkl",
             with_pytorch="dataloaders",
             batch_size=32,
             dynamic_batching=False)
training_d = d['train']
train_batch = next(iter(training_d))
print(train_batch.pids)
```

This returns the same first PDB ID in the train set, repeated batch_size times. With any "valid-n" dataset it returns valid, distinct PDB IDs. Question: is there a way to query batch 1, 2, 3, etc., not just the first, to verify the data is different every time?
Thank you!
Train batches have a single PDB ID every time. Returning the PDB ID for each epoch:
Thanks for following up! Okay, there are a lot of things here to address. I'm going to go through your messages and insert my comments inline.
I think what is happening here is that by providing a PDB ID with the prefix "TBM", you are signifying that the protein is in fact a test-set protein. These are targets used in the CASP competition, and I currently have no programmatic way to download them via SidechainNet. Simply put: by using this prefix, you are asking to make a SidechainNet dataset containing a protein that SidechainNet does not know how to add.
Are you using a custom dataset? If so, how many proteins are in it?
Currently retrieving the i'th batch is not supported.
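As a workaround, the i-th batch can still be reached by iterating from the start. Here is a minimal sketch using `itertools.islice`; it works with any iterable, including a PyTorch DataLoader such as `d['train']` (the helper name is mine, not a SidechainNet API):

```python
from itertools import islice

def get_ith_batch(dataloader, i):
    """Return the i-th (0-indexed) batch by consuming the iterator up to i.

    Note: this re-iterates from the beginning on each call, so it is O(i),
    and with a shuffling loader the i-th batch changes between iterations.
    """
    return next(islice(iter(dataloader), i, None))

# Toy iterable standing in for a DataLoader: three batches of 32 integers.
batches = [list(range(k, k + 32)) for k in range(0, 96, 32)]
print(get_ith_batch(batches, 2)[:3])  # → [64, 65, 66]
```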
First, please note that I'm not exactly sure what you would like to know. PDB IDs and ProteinNet/SidechainNet IDs are simply two different ways to name protein entries. Can you please clarify your question?
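For illustration, a SidechainNet/ProteinNet-style ID such as `7PBC_1_A` (seen in the traceback earlier in this thread) carries a PDB ID in its first underscore-separated field. A minimal, hypothetical parser, assuming that format holds for your IDs:

```python
def pdb_id_from_pnid(pnid):
    """Extract the leading PDB code from an underscore-separated ID
    like '7PBC_1_A'. The format is an assumption based on the ID seen
    in this thread; IDs from other splits may differ."""
    return pnid.split("_")[0]

print(pdb_id_from_pnid("7PBC_1_A"))  # → 7PBC
```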
Yes, though this is not implemented in the main branch. Please see this issue for an example function you can use for the time being (PDB files only). Note that if the structure has gaps, you will need to carefully handle the protein for several nuanced reasons.
SidechainNet does not support exporting proteins in the ProteinNet text format. It does, however, support exporting SCNProtein objects to pdb files (see SCNProtein.to_pdb).
I love this question, as it is very closely related to my current research :) However, I simply don't know the answer to any of these components. I wish I did!
At the moment, no, I'm sorry. The code is currently written to start from a SidechainNet dataset object and then make predictions from there. You can browse the examples directory for some model examples if you would like to know more than what the Colab notebook shows. I'm sorry that this is not more helpful, but at the moment SidechainNet has more functionality regarding data handling and less regarding specific models/training setups. Something like this would be the idea (assuming you've parsed the FASTA files into strings, and that you have a trained model that takes an SCNProtein as input and produces a protein as output):

```python
def make_scn_from_seq(seq, name):
    return SCNProtein(seq=seq, id=name)

def predict(model, protein):
    return model(protein)

my_proteins = [make_scn_from_seq(s, name) for (s, name) in my_sequences]

for p in my_proteins:
    pred = predict(model, p)
    pred.to_pdb(f"{p.id}.pdb")
```
I think these issues may be related to your item 2 above. Can you share more of your code? It is possible that shuffling may not be working as expected.
Thank you very much for the detailed answers! The main issue now is batch shuffling, so here is the setup:
Even though "train" and "valid-10" are identical, they produce different batches:

```python
for epoch in range(1000):
    # print(f"Epoch {epoch}")
    # progress_bar = tqdm(total=len(d['train']), smoothing=0)
    i = 5
    for batch in d['valid-10']:
        i -= 1
        if i <= 0:
            break
        print(f"Model Input = {tuple(batch.seq_evo_sec.shape)}; Total residues = "
              f"{batch.seq_evo_sec.shape[0] * batch.seq_evo_sec.shape[1]}.")

        # Prepare variables and create a mask of missing angles (padded with zeros).
        # Note the mask is repeated in the last dimension to match the sin/cos representation.
        seq_evo_sec = batch.seq_evo_sec.to(device)
        true_angles_sincos = scn.structure.trig_transform(batch.angs).to(device)
        mask = (batch.angs.ne(0)).unsqueeze(-1).repeat(1, 1, 1, 2)

        # Make predictions and optimize
        pred_angles_sincos = pssm_model(seq_evo_sec)
        loss = mse_loss(pred_angles_sincos[mask], true_angles_sincos[mask])
        loss.backward()
        torch.nn.utils.clip_grad_norm_(pssm_model.parameters(), 2)
        optimizer.step()

        # Housekeeping
        batch_losses.append(float(loss))
        # progress_bar.update(1)
        # progress_bar.set_description(f"\rRMSE Loss = {np.sqrt(float(loss)):.4f}")

    valid_d = d['valid-10']
    valid10_batch = next(iter(valid_d))
    print(valid10_batch._fields)
    print(valid10_batch.pids)
    print("Protein IDs\n ", batch.pids)
```

Questions:
1. Could SidechainNet's sequence-length-based batch loader be bypassed, especially if the input proteins are all approximately the same size?
2. How can a custom dataset be shuffled directly based on batch size? E.g., if the batch size is 32, how to make batches go 1-32, 33-64, 65-96, etc.?
3. Should the code above show a new batch per epoch, or just the first batch every time? There is probably a code error here, but in any case the "train" batches contain one single protein chain repeated n times.
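On making batches go 1-32, 33-64, and so on: fixed-size sequential batches can be built without a length-based sampler. Below is a plain-Python sketch (the helper name is hypothetical, not a SidechainNet API); with a PyTorch DataLoader, the equivalent would be passing a custom `batch_sampler`:

```python
import random

def fixed_size_batches(items, batch_size, shuffle_batches=False, seed=None):
    """Split items into consecutive fixed-size batches (the last one may be
    smaller), optionally shuffling the order of whole batches so that each
    epoch sees the same batches in a different order.

    Plain-Python sketch for an indexable dataset; not a SidechainNet API.
    """
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if shuffle_batches:
        random.Random(seed).shuffle(batches)
    return batches

ids = [f"protein_{n}" for n in range(96)]
b = fixed_size_batches(ids, 32)
print(len(b), b[1][0])  # → 3 protein_32
```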
Hello,
There is an issue with the 5.1.1 instructions:
First, they contain an actual error: the text says "the testing set from CASP11," but the code is:

```python
test_ids = scn.get_proteinnet_ids(casp_version=12, split="test")
```
Also, after `d = scn.create_custom`, no custom/additional proteins are included in the train dataset (when following the instructions).
Is there any actual instruction on including custom PDB IDs in the train/validation/test sets, or on constructing new sets from scratch using only new PDB IDs?
Thank you