-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE] improved/corrected pdb_selaltloc
tool
#156
base: master
Are you sure you want to change the base?
Conversation
TODO: confirm tests
Actions will work after #157 |
altloc_lines = dict() | ||
res_per_loc = dict() | ||
# flush lines because we enter a new chain | ||
if chain != prev_chain: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd really avoid storing all residues in a chain, especially since this operation is residue based. You can trigger a flush when you see a new residue. That saves a lot of memory, not that it matters much because chains are tiny, but keeps with the idea of generators/iterators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expanding on this a little, I wonder if you can simplify the overall logic by doing the following. All ATOM/HETATM/ANISOU records have the same identifiers (chain, resnum, inscode). You can record these are you iterate since they must be in order. If they're not, it's really not our problem.
Once you hit a new residue, look back at the lines you accumulated for the residue. Track altlocs, if any in any of those lines. If there are any altlocs we want to keep/discard, do it then while flushing and ignoring those lines. This requires 3x iterations for a residue block if there are altlocs - once to store, once to process, once to flush - and you could cut this down to two by doing the processing as you store (basically keep a dict of {pdbname: {altloc: occupancy}}).
I think this might be simpler than what you have and has the added benefit that you also filter ANISOU lines that belong to discarded altlocs. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd really avoid storing all residues in a chain, especially since this operation is residue based. You can trigger a flush when you see a new residue. That saves a lot of memory, not that it matters much because chains are tiny, but keeps with the idea of generators/iterators.
While this is formally true, it is also true that there are some cases where altlocs are organized by altloc instead of residues. See test example. I know this is not the official format but if we added it to tests it is because we have seen it at some point.
Expanding on this a little (...)
You are right; we may optimize the code further, but I don't know if it can be simplified. In this PR, I aimed to reach the most robust algorithm, disregarding fine-tuned performance. I am positive there won't be memory issues because lines will be flushed on every chain
, model
or terminal line. Because pdb-tools
is for PDBs and not for CIFs, there won't be more than 100k lines stored simultaneously in the dictionary; that's about 10MB of memory.
Considering the pressing nature of this issue, I suggest we merge this PR with [FEATURE]
so all bugs are addressed, and then we work calmly on performance issues. I quickly measured the performance on a large PDB:
· time 'pdb_selaltloc 7bqx.pdb'
pdb_selaltloc 7bqx.pdb: command not found
real 0m0,110s
user 0m0,090s
sys 0m0,025s
What do you think? 😉 🎾
hey there, I don't know whether this PR is closed, but I found a bug in the new implementation of the code. Take pdb file 2qxf: if you run the new selaltloc you'll get duplicate rows for all the residues that have 3 alternative locations, two of which with equal probability weight. An example is MET A 235:
becomes
I think that the problem lies here https://github.com/joaomcteixeira/pdb-tools/blob/0981c5a174e79490068bec327dc337cc3cb492d6/pdbtools/pdb_selaltloc.py#L287-L292 hope it helps! |
Improves
pdb_selaltloc
tool:I think we should bump
minor
because of the importance of this update.Thanks to everyone participating in the discussions 👏
I cannot add you as a reviewer @mgiulini; consider yourselves a reviewer 😉