Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Swap asym ids #1773

Merged
merged 23 commits into from
Oct 17, 2023
Merged

Swap asym ids #1773

merged 23 commits into from
Oct 17, 2023

Conversation

jamesmkrieger
Copy link
Contributor

@jamesmkrieger jamesmkrieger commented Oct 12, 2023

Response to #1772

Other fixes include:

  • trimming away semicolons on the edges of mmcif entries
  • adjusting unite_chains (copying larger units stored in segnames to chids) to handle header and biomols
  • separating copy into another step in buildBiomolecules in case the selection is None
  • removing relabelling of segments based on biomolecule number, which is unneeded and interferes with unite_chains

@jamesmkrieger
Copy link
Contributor Author

Also needs tests

3o21 seems like a good system for this as it has 2 clear biomols (dimers) and is generally not that big

@jamesmkrieger jamesmkrieger requested a review from dkoes October 12, 2023 19:14
@jamesmkrieger
Copy link
Contributor Author

@dkoes, this and #1771 together fix the issue with 2VVJ

@jamesmkrieger
Copy link
Contributor Author

@dkoes, any comments on this or do you think it's ok?

@jamesmkrieger
Copy link
Contributor Author

@dkoes, this and #1771 together fix the issue with 2VVJ

this has been included here

@jamesmkrieger
Copy link
Contributor Author

Actually we need to update segments when there are duplicates. 3h5v is a good example of this with chain C getting copied in a dimer and tetramer

@jamesmkrieger jamesmkrieger marked this pull request as draft October 13, 2023 08:59
Copy link
Collaborator

@dkoes dkoes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really am not familiar enough with CIF to understand what is going on or why it fixes 2VVJ.. but if it does, great!

prody/proteins/ciffile.py Show resolved Hide resolved
@jamesmkrieger
Copy link
Contributor Author

Ok. Thanks

@dkoes
Copy link
Collaborator

dkoes commented Oct 14, 2023

Okay, I'm still trying to figure out what is going on with the code. The overall issue is some PDBs have an "auth" chain id of "bb" (specifically, these PDBs: 7M57, 7VRT, 7ZAS, 7ATA, 5EXC, 5BKN, 7M2T, 5BKL, 7VS5, 7M3T, 8H2I, 7ANM,7M50).

Taking 7ZAS as an example (because it is relatively small), if I do

p,h = prody.parsePDB('7ZAS',header=True,biomol=True)

I get:
[<AtomGroup: 7ZAS biomolecule 1 (6488 atoms)>, <AtomGroup: 7ZAS biomolecule 2 (6575 atoms)>]

These are the correct bioassemblys (I wrote them out and compared to what is in PDB).
The header biomoltrans has replaced the auth chain ids with whatever the non-auth ids are called:

{'1': [['A', 'B', 'G', 'H', 'I', 'L', 'M', 'N', 'S', 'T'],
  '1.0000000000 0.0000000000 0.0000000000 0.0000000000',
  '0.0000000000 1.0000000000 0.0000000000 0.0000000000',
  '0.0000000000 0.0000000000 1.0000000000 0.0000000000'],
 '2': [['C', 'D', 'J', 'O', 'P'],
  '1.0000000000 0.0000000000 0.0000000000 0.0000000000',
  '0.0000000000 1.0000000000 0.0000000000 0.0000000000',
  '0.0000000000 0.0000000000 1.0000000000 0.0000000000',
  ['E', 'F', 'K', 'Q', 'R'],
  '-1.0000000000 0.0000000000 0.0000000000 65.6990000000',
  '0.0000000000 1.0000000000 0.0000000000 69.2070000000',
  '0.0000000000 0.0000000000 -1.0000000000 0.0000000000']}

However, the AtomGroups still have the auth chain ids:

set(p[0].getChids()),set(p[1].getChids())

({'A', 'D', 'aa', 'dd'}, {'B', 'C', 'bb', 'cc'})

I have yet to figure out how the bioassemblies were correctly created.

If you build the bioassemblies manually:

p,h = prody.parseMMCIF('7ZAS',header=True)
bm = prody.buildBiomolecules(h,p)

You get something different:
[<AtomGroup: 7ZAS biomolecule 1 (5913 atoms)>, <AtomGroup: 7ZAS biomolecule 2 (5930 atoms)>]
Note the different numbers of atoms. These are not correct (don't match what is in the PDB).

If you rename the bb chain you can also get the correct assemblies:

m,hm = prody.parseMMTF('7ZAS',header=True)
chs = p.getChids()
chs[chs == 'bb'] = '_bb'
p.setChids(chs)
hm['biomoltrans']['2'][0][1] = '_bb'
bm = prody.buildBiomolecules(hm,m)

[<AtomGroup: 7ZAS biomolecule 1 (6488 atoms)>, <AtomGroup: 7ZAS biomolecule 2 (6269 atoms)>]

I don't think there should be a mismatch in the result of calling parseMMCIF with biomol=True and manually building with buildBiomolecules. The fact that the parser doesn't understand that bb is not the backbone selector in the context of a chain selector is the fundamental bug. If that is too difficult to fix, renaming bb chains to _bb wouldn't be a horrible work around.

The best fix is to fix the parser to understand that in the context of a chain selection, bb does not mean backbone. A simpler fix would be to rename the chain to bb_ or something.

@jamesmkrieger
Copy link
Contributor Author

Ok, yes, it should know that bb is a chain id and not backbone. This would be a general bug and not just something affecting biomol assembly.

@dkoes
Copy link
Collaborator

dkoes commented Oct 14, 2023

Here is a fix for the parser. It is very specific to the chain problem. Really should identify all the high precedance operators that take an arbitrary string and avoid converting the string to a flag for all of them:

--- a/prody/atomic/select.py
+++ b/prody/atomic/select.py
@@ -1341,8 +1341,10 @@ class Select(object):
         isDataLabel = atoms.isDataLabel
         append = None
         wasand = False
+        token = ''
         while tokens:
             # check whether token is an array to avoid array == str comparison
+            last_was_chain = token == 'chain'
             token = tokens.pop(0)
             try:
                 dtype = token.dtype
@@ -1356,7 +1358,7 @@ class Select(object):
                     wasand = True
                     continue
 
-                elif isFlagLabel(token):
+                elif isFlagLabel(token) and not last_was_chain:
                     flags.append(token)
                     append = None

@dkoes
Copy link
Collaborator

dkoes commented Oct 14, 2023

I should point out this is just a partial band-aid fix. Something like "chain A bb" will still break. But I don't want to dive into the parser any more and need to stop spending time on this for now.

@jamesmkrieger
Copy link
Contributor Author

Thanks. I understand

I’ll try and look into it. That’s already a very helpful start

@dkoes
Copy link
Collaborator

dkoes commented Oct 14, 2023

Maybe what is needed it to check if firsttoken is chain? It's not clear to me what the precondition for and2 being called is.

@jamesmkrieger
Copy link
Contributor Author

Maybe what is needed it to check if firsttoken is chain? It's not clear to me what the precondition for and2 being called is.

Me either. I’ll have to look very carefully sometime

@jamesmkrieger
Copy link
Contributor Author

We may also want to look at brackets and see what it does with them and whether we can mimic it

@jamesmkrieger
Copy link
Contributor Author

I suppose we’ll also have problems with chains b, x, y and z that will be fixed by noticing that we have arrays of values for a flag like chain

@jamesmkrieger
Copy link
Contributor Author

Being able to build biomol assemblies afterwards is also a good reason for setting unite chains as False by default, unless we update the header.

It always seems like big structures and mmCIF format cause lots of complications

@jamesmkrieger
Copy link
Contributor Author

jamesmkrieger commented Oct 16, 2023

ok, now we have a fix for the parser. If there was a data label beforehand and we haven't had an "and" since then, then it doesn't check if it could be a flag. This should help with chain or segment bb and also ca and sc.

@jamesmkrieger jamesmkrieger marked this pull request as ready for review October 16, 2023 14:23
@jamesmkrieger
Copy link
Contributor Author

I think everything should be good now

@jamesmkrieger
Copy link
Contributor Author

Thanks

@jamesmkrieger jamesmkrieger merged commit a15eaf7 into prody:master Oct 17, 2023
4 checks passed
@jamesmkrieger jamesmkrieger deleted the swap_asym_id branch October 17, 2023 07:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants