question about memory use with mzIdentML reader #145
It's a stupid question I think, it's just coz the SpectrumIdentificationList is the biggest thing |
I'm looking for a way to iterate over SpectrumIdentificationResults without losing the information about which SpectrumIdentificationList they come from. |
This can be done using XPath:

    from pyteomics import mzid

    reader = mzid.MzIdentML("path/to/file.mzid", use_index=True)
    sil_ids = list(reader.index['SpectrumIdentificationList'].keys())
    for sil_id in sil_ids:
        reader.reset()
        for sir in reader.iterfind(f"//SpectrumIdentificationList[@id=\"{sil_id}\"]/*"):
            # do something
            pass

If you also need to access the other attributes on the SpectrumIdentificationList:

    reader.reset()
    sil_attributes = list(reader.iterfind("SpectrumIdentificationList", recursive=False))

This gives you a dictionary of the |
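To make the shape of that selector concrete, here is a minimal sketch using the standard library's `xml.etree.ElementTree` instead of pyteomics. The miniature document and the `children_of` helper are made up for illustration, and real mzIdentML files are namespaced, so this is not a drop-in replacement for `reader.iterfind()`. It only shows that the `/*` step returns every child of the matching list, not just the results.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document (no namespace, unlike real mzIdentML).
DOC = """
<AnalysisData>
  <SpectrumIdentificationList id="SIL_1">
    <FragmentationTable/>
    <SpectrumIdentificationResult id="SIR_1"/>
  </SpectrumIdentificationList>
  <SpectrumIdentificationList id="SIL_2">
    <SpectrumIdentificationResult id="SIR_2"/>
    <SpectrumIdentificationResult id="SIR_3"/>
  </SpectrumIdentificationList>
</AnalysisData>
""".strip()

root = ET.fromstring(DOC)


def children_of(sil_id, child="*"):
    """Return the tags of the children of the list with the given id.

    ``child`` is the last step of the path: ``*`` matches every child,
    an element name matches only children with that tag.
    """
    path = f".//SpectrumIdentificationList[@id='{sil_id}']/{child}"
    return [el.tag for el in root.findall(path)]


all_children = children_of("SIL_1")
only_results = children_of("SIL_1", "SpectrumIdentificationResult")
print(all_children)
print(only_results)
```

Note that `ET.fromstring` builds the whole tree in memory, so this illustrates the selection logic only, not a memory-efficient way to read a large file.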
Hi, I can't yet get it to work... my code looks like this now:
First, I need to change the XPath selector because not all subelements of SpectrumIdentificationList are SpectrumIdentificationResults. However, when I replace the wildcard asterisk with SpectrumIdentificationResult it doesn't return anything, which seems odd because I tested it with an online XPath testing tool and I think it worked. (I'm far from being an expert on XPath.)

The other problem is this: if I keep the wildcard so that iterfind() returns something, the memory use is the same as before. Does iterfind() also load the whole SIL into memory?

The exact changes to my code are here: Rappsilber-Laboratory/xi-mzidentml-converter@1fc8613 I also changed the initialisation of the mzId reader.

Does anyone have any further advice, either on how to fix my XPath issue or on whether .iterfind() will actually also load the whole SIL into memory? Apologies in advance for silly mistakes, @mobiusklein. P.S. Sorry I missed your talk at the PSI meeting. |
I made a separate GH issue for my XPath question - #146. I'll look for a better way to examine what's going on with memory and iterfind(). |
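One possible way to examine this kind of question is Python's built-in `tracemalloc` module. The sketch below is an assumption-laden illustration, not a measurement of pyteomics: it uses the standard library's `xml.etree.ElementTree` on a synthetic document, and the helper names (`make_document`, `count_full_tree`, `count_streaming`, `peak_bytes`) are invented for the example. It compares the peak traced memory of building the full tree against streaming with `iterparse` and clearing each result once it has been handled.

```python
import io
import tracemalloc
import xml.etree.ElementTree as ET


def make_document(n_results=2000):
    """Build a synthetic XML document (hypothetical layout) as bytes."""
    parts = ["<SpectrumIdentificationList id='SIL_1'>"]
    for i in range(n_results):
        parts.append(
            f"<SpectrumIdentificationResult id='SIR_{i}'>"
            f"<SpectrumIdentificationItem id='SII_{i}_1' rank='1'/>"
            f"<SpectrumIdentificationItem id='SII_{i}_2' rank='2'/>"
            "</SpectrumIdentificationResult>"
        )
    parts.append("</SpectrumIdentificationList>")
    return "".join(parts).encode("utf-8")


def count_full_tree(data):
    """Parse everything into one tree, then count the results."""
    root = ET.parse(io.BytesIO(data)).getroot()
    return sum(1 for _ in root.iter("SpectrumIdentificationResult"))


def count_streaming(data):
    """Count results while streaming, clearing each one after use."""
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "SpectrumIdentificationResult":
            count += 1
            elem.clear()  # drop children and attributes of the handled result
    return count


def peak_bytes(func, data):
    """Run ``func(data)`` and return (result, peak traced bytes)."""
    tracemalloc.start()
    try:
        result = func(data)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


data = make_document()
full_count, full_peak = peak_bytes(count_full_tree, data)
stream_count, stream_peak = peak_bytes(count_streaming, data)
print(f"full tree: {full_count} results, peak {full_peak} bytes")
print(f"streaming: {stream_count} results, peak {stream_peak} bytes")
```

`tracemalloc` only sees allocations made through Python's memory allocators, so the absolute numbers will differ from what a process monitor reports, and they may not carry over to lxml-based readers. The relative comparison between two strategies is the useful part.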
@mobiusklein - I confirm it's like you say: iterfind() uses less memory.
But it's not actually that much less memory, and I think the increase in processing time will introduce more problems than the reduction in memory use will solve. Dunno if this makes sense as a third comparison, but
what would be needed to get performance like |
Yes, there's something wrong with the ... Here's a helper function that will make things easier:

    from pyteomics.xml import _local_name
    from lxml import etree

    def iterfind_when(source, target_name, condition_name, stack_predicate, **kwargs):
        """
        Iteratively parse XML stream in ``source``, yielding XML elements
        matching ``target_name`` as long as earlier in the tree a ``condition_name`` element
        satisfies ``stack_predicate``, a callable that takes a single :class:`etree.Element` and returns
        a :class:`bool`.

        Parameters
        ----------
        source: file-like
            A file-like object over an XML document
        target_name: str
            The name of the XML tag to parse until
        condition_name: str
            The name to start parsing at when `stack_predicate` evaluates to true on this element.
        stack_predicate: callable
            A function called with a single `etree.Element` that determines if the sub-tree should be parsed
        **kwargs:
            Additional arguments passed to :meth:`source._get_info_smart`

        Yields
        ------
        lxml.etree.Element
        """
        g = etree.iterparse(source, ("start", "end"))
        state = False
        for event, tag in g:
            lc_name = _local_name(tag)
            if event == "start":
                if lc_name == condition_name:
                    state = stack_predicate(tag)
                if lc_name == target_name and state:
                    yield source._get_info_smart(tag, **kwargs)
                else:
                    tag.clear()
            else:
                tag.clear()

    from pyteomics import mzid

    reader = mzid.MzIdentML(r"./tests/test.mzid", use_index=True)
    for e in iterfind_when(
        reader,
        "SpectrumIdentificationResult",
        "SpectrumIdentificationList",
        lambda x: x.attrib["id"] == "SEQUEST_results",
        retrieve_refs=False
    ):
        print(e)

It has the memory characteristic of ...

RE PS - Thank you, but it was just re-hashing the need to factor the modification database out of the mzIdentML document and that I need to spend more time finding implementers. We did want to find out if someone in your group would be willing to clean up a few things in XLMOD though. |
@mobiusklein - I haven't integrated your function into my code yet, but from looking at it in test cases it appears to perform amazingly well. I'll let you know when I have it working in my larger piece of code. I'll mention following up those two things from the PSI meeting to Juan (XLMOD maintenance has been asked for before; I guess "refactor modification database out of mzIdentML document" means getting rid of, or reducing the need for, the ModificationParam's). |
@mobiusklein - thanks btw, I didn't say that before. I think I see some odd behaviour when I'm using the function. The example code is sometimes printing out "Has no SpectrumIdentificationItem?" but when I look up the SIR id in the XML file, it seems it does have one? |
I see the issue. My tests were small, and contained in whatever buffering ...

    def iterfind_when(source, target_name, condition_name, stack_predicate, **kwargs):
        """
        Iteratively parse XML stream in ``source``, yielding XML elements
        matching ``target_name`` as long as earlier in the tree a ``condition_name`` element
        satisfies ``stack_predicate``, a callable that takes a single :class:`etree.Element` and returns
        a :class:`bool`.

        Parameters
        ----------
        source: file-like
            A file-like object over an XML document
        target_name: str
            The name of the XML tag to parse until
        condition_name: str
            The name to start parsing at when `stack_predicate` evaluates to true on this element.
        stack_predicate: callable
            A function called with a single `etree.Element` that determines if the sub-tree should be parsed
        **kwargs:
            Additional arguments passed to :meth:`source._get_info_smart`

        Yields
        ------
        lxml.etree.Element
        """
        g = etree.iterparse(source, ("start", "end"))
        state = False
        for event, tag in g:
            lc_name = _local_name(tag)
            if event == "start":
                if lc_name == condition_name:
                    state = stack_predicate(tag)
                if not (lc_name == target_name and state):
                    tag.clear()
            else:
                if lc_name == target_name and state:
                    yield source._get_info_smart(tag, **kwargs)
                tag.clear()
|
thanks for looking at it again, for me, with the new function, code like:
is printing out:
|
After actually interacting with the test script in the debugger, I found the issue. I can't say I understand why it's behaving this way. Here is a functional version:

    def iterfind_when(source, target_name, condition_name, stack_predicate, **kwargs):
        """
        Iteratively parse XML stream in ``source``, yielding XML elements
        matching ``target_name`` as long as earlier in the tree a ``condition_name`` element
        satisfies ``stack_predicate``, a callable that takes a single :class:`etree.Element` and returns
        a :class:`bool`.

        Parameters
        ----------
        source: file-like
            A file-like object over an XML document
        target_name: str
            The name of the XML tag to parse until
        condition_name: str
            The name to start parsing at when `stack_predicate` evaluates to true on this element.
        stack_predicate: callable
            A function called with a single `etree.Element` that determines if the sub-tree should be parsed
        **kwargs:
            Additional arguments passed to :meth:`source._get_info_smart`

        Yields
        ------
        lxml.etree.Element
        """
        g = etree.iterparse(source, ("start", "end"))
        state = False
        history = []
        for event, tag in g:
            lc_name = _local_name(tag)
            if event == "start":
                if lc_name == condition_name:
                    state = stack_predicate(tag)
            else:
                if lc_name == target_name and state:
                    value = source._get_info_smart(tag, **kwargs)
                    for t in history:
                        t.clear()
                    history.clear()
                    yield value
                elif state:
                    history.append(tag)
                elif not state:
                    tag.clear()

I'll spend more time working out why the state dependency is behaving this way. |
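For readers who want to experiment with the pattern without a pyteomics reader, here is a small standard-library sketch of the same idea. It is not the pyteomics implementation: it uses `xml.etree.ElementTree.iterparse`, a made-up miniature document, and a hypothetical generator name `iter_results_with_list_id`. The point it illustrates is that the enclosing list's attributes can be read on its "start" event, while a result should only be read on its "end" event, once its children have been parsed.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature document (no namespace, unlike real mzIdentML).
DOC = b"""<AnalysisData>
  <SpectrumIdentificationList id="SIL_1">
    <SpectrumIdentificationResult id="SIR_1"/>
  </SpectrumIdentificationList>
  <SpectrumIdentificationList id="SIL_2">
    <SpectrumIdentificationResult id="SIR_2">
      <SpectrumIdentificationItem id="SII_2_1"/>
    </SpectrumIdentificationResult>
    <SpectrumIdentificationResult id="SIR_3"/>
  </SpectrumIdentificationList>
</AnalysisData>"""


def iter_results_with_list_id(source):
    """Yield (list_id, result_id, n_items) for each result in the stream."""
    current_list_id = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == "SpectrumIdentificationList":
                # Attributes are available as soon as the start tag is seen.
                current_list_id = elem.get("id")
        elif elem.tag == "SpectrumIdentificationResult":
            # On "end" the whole sub-tree of the result has been parsed.
            n_items = len(elem.findall("SpectrumIdentificationItem"))
            yield current_list_id, elem.get("id"), n_items
            elem.clear()  # release the children once the caller is done


rows = list(iter_results_with_list_id(io.BytesIO(DOC)))
for row in rows:
    print(row)
```

Reading the result on its "start" event instead would risk seeing it before its SpectrumIdentificationItem children have arrived, which may be related to the "Has no SpectrumIdentificationItem?" symptom reported earlier in this thread.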
@mobiusklein - yep, the above works. Awesome, thanks. There was a delay getting back to you because the tests on the project were broken and I wanted to see the tests passing before declaring it worked. They passed - it works! Can close this now, unless you're keeping it open whilst you try to work out "why the state dependency is behaving this way" |
Hi Joshua (@mobiusklein) - I think I've found a problem with the above code. There's a test case here: https://github.com/colin-combe/pyteomics-test/blob/master/test_iterfind_when.py You'll see it's reading this file: https://github.com/colin-combe/pyteomics-test/blob/master/multiple_spectra_per_id_1_3_0_draft.mzid which you'll recognise as an example file from mzId 1.3, so the obvious thing to think is that it doesn't work because it's MzIdentML 1.3.0. But I've changed it (changed the schema reference and deleted the cv param for the extension document), so I think it should be valid 1.2.0. I don't think this is the problem, but I could be mistaken. It crashes with error:
any ideas? |
This is because ... Replace

    g = etree.iterparse(source, ("start", "end"))

with

    g = etree.iterparse(source, ("start", "end"), remove_comments=True)

and it should work. |
That does indeed fix it. |
Hi,
I'm using pyteomics' great mzIdentML parser to read crosslink data out of mzIdentML.
There's one particular point in the code where memory use suddenly climbs:
https://github.com/PRIDE-Archive/xi-mzidentml-converter/blob/python3/parser/MzIdParser.py#L584
It's when it executes the following (actually the second line):
Why is it here that the memory use climbs, and is there a better way to iterate over SpectrumIdentificationLists?
Just wondering,
thanks,
Colin