Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

question about memory use with mzIdentML reader #145

Open
colin-combe opened this issue Apr 5, 2024 · 16 comments
Open

question about memory use with mzIdentML reader #145

colin-combe opened this issue Apr 5, 2024 · 16 comments

Comments

@colin-combe
Copy link

Hi,
I'm using pyteomics' great mzIdentML parser to read crosslink data out of mzIdentML.
There's one particular point in the code where memory use suddenly climbs:-
https://github.com/PRIDE-Archive/xi-mzidentml-converter/blob/python3/parser/MzIdParser.py#L584
its when it executes the following (actually the second line):

        for sil_id in self.mzid_reader._offset_index["SpectrumIdentificationList"].keys():
            sil = self.mzid_reader.get_by_id(sil_id, tag_id='SpectrumIdentificationList')

why is it here the memory use climbs and is there a better way to iterate over SpectrumIdentificationLists?

Just wondering,
thanks,
Colin

@colin-combe
Copy link
Author

its a stupid question i think, it's just coz the SpectrumIdentificationList is the biggest thing

@colin-combe
Copy link
Author

i'm looking for a way to iterate of over SpectrumIdentificationResults but not lose the information about which SpectrumIdentificationList they come from.
I can't find a way to do this so I'm getting each SpectrumIdentificationList and iterating through the SpectrumIdentifcationResults within it. Then i know the SIL they come from.
But this means loading the whole SIL into memory and it becomes problematic for large files.
Is there a better way? A way to iterate the SIR for one SIL without loading the whole SIL?

@levitsky levitsky reopened this Apr 6, 2024
@mobiusklein
Copy link
Contributor

This can be done using XPath:

from pyteomics import mzid

reader = mzid.MzIdentML("path/to/file.mzid", use_index=True)

sil_ids = list(reader.index['SpectrumIdentificationList'].keys())

for sil_id in sil_ids:
    reader.reset()
    for  sir in reader.iterfind(f"//SpectrumIdentificationList[@id=\"{sil_id}\"]/*):
        # do something

If you need to also access the other attributes on the SpectrumIdentificationList:

reader.reset()
sil_attributes = list(reader.iterfind("SpectrumIdentificationList", recursive=False))

This gives you a dictionary of the id attribute as well as whatever else the writer included.

@colin-combe
Copy link
Author

Hi,
thanks for the reply Joshua, this would be great!

I can't yet get it to work...

my code looks like this now:

        sil_ids = list(self.mzid_reader.index['SpectrumIdentificationList'].keys())
        # iterate over all the spectrum identification lists
        for sil_id in sil_ids:
            self.mzid_reader.reset()
            for sid_result in self.mzid_reader.iterfind(f"//SpectrumIdentificationList[@id=\"{sil_id}\"]/SpectrumIdentificationResult"):

First, I need to change the XPath selector because not all subelements of SpectrumIdentificationList are SpectrumIdentificationResults. However, when i replace the wildcard asterisk with SpectrumIdentificationResult it doesn't return anything. Which seems odd because i tested it with an online Xpath testing tool and i think it worked. (I'm far from being an expert on XPath.)

But the other problem is this - if I keep the wild card so stuff is returned by iterfind(), the memory use is the same as before. Like the iterfind() also loads the whole SIL into memory?

The exact changes to my code are here: Rappsilber-Laboratory/xi-mzidentml-converter@1fc8613

I changed the initialisation of the mzId reader to mzid.MzIdentML(self.mzid_path, use_index=True) so it was just the same as yours and commented out some function calls that only work with retreiveRefs = False (which is a bit mysterious to me).

Does anyone have any further advice on either how to fix my Xpath issue or on whether .iterfind() will actually also load whole SIL into memory?

Apologies in advance for silly mistakes,
best wishes,
Colin

@mobiusklein p.s. sorry i missed your talk at PSI meeting.

@colin-combe
Copy link
Author

i made a separate GH issue for my xpath question - #146

i'll look for a better way to examine whats going on with memory and iterfind()

@colin-combe
Copy link
Author

@mobiusklein - i confirm it's like you say. iterfind() uses less memory.

Test 1 - iterfind()

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     9     78.7 MiB     78.7 MiB           1   @profile
    10                                         def test1():
    11   1732.7 MiB      0.0 MiB           2       for sil_id in sil_ids:
    12     78.7 MiB      0.0 MiB           1           print("Processing SpectrumIdentificationList", sil_id)
    13     78.7 MiB      0.0 MiB           1           reader.reset()
    14     78.7 MiB      0.0 MiB           1           xpath = f"//SpectrumIdentificationList[@id=\"{sil_id}\"]/*"
    15     78.7 MiB      0.0 MiB           1           print("XPATH: ", xpath)
    16   1732.7 MiB   1654.0 MiB        3032           for sir in reader.iterfind(xpath):
    17                                                     # do something
    18                                                     # print(sir)
    19   1731.3 MiB      0.0 MiB        3031               pass
    20   1732.7 MiB      0.0 MiB           1       print("Test 1 done!")

Time taken: 1613.2808635234833

Test 2 - get_by_id()
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     9     80.1 MiB     80.1 MiB           1   @profile
    10                                         def test2():
    11   2098.2 MiB      0.0 MiB           2       for sil_id in sil_ids:
    12     80.1 MiB      0.0 MiB           1           print("Processing SpectrumIdentificationList", sil_id)
    13   2098.2 MiB   2018.0 MiB           1           sil = reader.get_by_id(sil_id, tag_id='SpectrumIdentificationList')
    14   2098.2 MiB      0.0 MiB        3031           for sid_result in sil['SpectrumIdentificationResult']:
    15                                                     # do something
    16                                                     # print(sir)
    17   2098.2 MiB      0.0 MiB        3030               pass
    18   2098.2 MiB      0.0 MiB           1       print("Test 2 done!")

Time taken: 210.73944902420044

but not actually that much less memory, and i think the increase in processing time will introduce more problems than the reduction in memory use will solve.

Dunno if this makes sense as a third comparison, but in reader performs best:

Test 3 - sid in reader

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     9     79.0 MiB     79.0 MiB           1   @profile
    10                                         def test3():
    11                                             # for sil_id in sil_ids:
    12                                             #     print("Processing SpectrumIdentificationList", sil_id)
    13                                             #     sil = reader.get_by_id(sil_id, tag_id='SpectrumIdentificationList')
    14     90.9 MiB     11.9 MiB        3031       for sid_result in reader:
    15                                                 # do something
    16                                                 # print(sir)
    17     90.9 MiB      0.0 MiB        3030           pass
    18     90.9 MiB      0.0 MiB           1       print("Test 3 done!")

Time taken: 166.0890302658081

what would be needed to get performance like in reader but without losing the information about which SIL the SIR came from?

@mobiusklein
Copy link
Contributor

Yes, there's something wrong with the xpath matching, and because of how lxml implements xpath, it can't be used in a memory-efficient manner (https://lxml.de/xpathxslt.html).

Here's a helper function that will make things easier:

from pyteomics.xml import _local_name
from lxml import etree

def iterfind_when(source, target_name, condition_name, stack_predicate, **kwargs):
    """
    Iteratively parse XML stream in ``source``, yielding XML elements
    matching ``target_name`` as long as earlier in the tree a ``condition_name`` element
    satisfies ``stack_predicate``, a callable that takes a single :class:`etree.Element` and returns
    a :class:`bool`.

    Parameters
    ----------
    source: file-like
        A file-like object over an XML document
    target_name: str
        The name of the XML tag to parse until
    condition_name: str
        The name to start parsing at when `stack_predicate` evaluates to true on this element.
    stack_predicate: callable
        A function called with a single `etree.Element` that determines if the sub-tree should be parsed
    **kwargs:
        Additional arguments passed to :meth:`source._get_info_smart`

    Yields
    ------
    lxml.etree.Element
    """
    g = etree.iterparse(source, ("start", "end"))
    state = False
    for event, tag in g:
        lc_name = _local_name(tag)
        if event == "start":
            if lc_name == condition_name:
                state = stack_predicate(tag)
            if lc_name == target_name and state:
                yield source._get_info_smart(tag, **kwargs)
            else:
                tag.clear()
        else:
            tag.clear()

from pyteomics import mzid

reader = mzid.MzIdentML(r"./tests/test.mzid", use_index=True)
for e in iterfind_when(
    reader,
    "SpectrumIdentificationResult",
    "SpectrumIdentificationList",
    lambda x: x.attrib["id"] == "SEQUEST_results",
    retrieve_refs=False
):
    print(e)

It has the memory characteristic of iterfind, but lets you selectively filter on a single parent element similar to what you're looking for. More complex filtering could be done, but it's not clear if we need to go that far.

RE PS - Thank you, but it was just re-hashing the need to factor the modification database out of the mzIdentML document and that I need to spend more time finding implementers. We did want to find out if someone in your group would be willing to clean up a few things in XLMOD though.

@colin-combe
Copy link
Author

@mobiusklein - i haven't integrated your function into my code yet, but from looking at it in test cases it appears to perform amazingly well. I'll let you know when i have it working in my larger piece of code.

I'll mention following up those two things from PSI meetinging to Juan (XLMOD maintenance has been asked for before, I guess "refactor modification database out of mzIdentML document" means getting rid of/reducing need for the ModificationParam's).

@colin-combe
Copy link
Author

@mobiusklein - thanks btw, i didn't say that before.

I think i see some odd behaviour when i'm using the iterfind_when function.
Occasionally the SpectrumIdentificationResults are missing their SpectrumIdentificationItems?

I think this example code:
https://github.com/colin-combe/pyteomics-test/blob/master/test_iterfind_when.py
will illustrate it if pointed at this file:
https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/10/PXD021417/XLpeplib_Beveridge_QEx-HFX_DSS_R3.mzid
i can provide a shorter example file if you want (but its not short enough to upload to GH or put in GH project).

The example code is sometimes printing out "Has no SpectrumIdentificationItem?" but when i look up the SIR id in the xml file, it seems it does?

@mobiusklein
Copy link
Contributor

I see the issue. My tests were small, and contained in whatever buffering lxml was doing. The following version of iterfind_when covers 100% of cases in the example file you shared:

def iterfind_when(source, target_name, condition_name, stack_predicate, **kwargs):
    """
    Iteratively parse XML stream in ``source``, yielding XML elements
    matching ``target_name`` as long as earlier in the tree a ``condition_name`` element
    satisfies ``stack_predicate``, a callable that takes a single :class:`etree.Element` and returns
    a :class:`bool`.

    Parameters
    ----------
    source: file-like
        A file-like object over an XML document
    target_name: str
        The name of the XML tag to parse until
    condition_name: str
        The name to start parsing at when `stack_predicate` evaluates to true on this element.
    stack_predicate: callable
        A function called with a single `etree.Element` that determines if the sub-tree should be parsed
    **kwargs:
        Additional arguments passed to :meth:`source._get_info_smart`

    Yields
    ------
    lxml.etree.Element
    """
    g = etree.iterparse(source, ("start", "end"))
    state = False
    for event, tag in g:
        lc_name = _local_name(tag)
        if event == "start":
            if lc_name == condition_name:
                state = stack_predicate(tag)
            if not (lc_name == target_name and state):
                tag.clear()
        else:
            if lc_name == target_name and state:
                yield source._get_info_smart(tag, **kwargs)
            tag.clear()

@colin-combe
Copy link
Author

thanks for looking at it again,
the SpectrumIdentificationItems are all empty objects now?

for me, with the new function, code like:

        for e in iterfind_when(
            reader,
            "SpectrumIdentificationResult",
            "SpectrumIdentificationList",
            lambda x: x.attrib["id"] == sil_id,
            retrieve_refs=False
        ):
            try:
               e['SpectrumIdentificationItem']
               print(f"Has SpectrumIdentificationItem?: {e['SpectrumIdentificationItem']}")
            except KeyError:
                print (f"Has no SpectrumIdentificationItem?: {e}")

is printing out:

Has SpectrumIdentificationItem?: [{}, {}, {}, {}, {}, {}, {}, {}]
Has SpectrumIdentificationItem?: [{}, {}, {}, {}, {}, {}, {}, {}, {}, {}]
Has SpectrumIdentificationItem?: [{}, {}]
Has SpectrumIdentificationItem?: [{}, {}, {}, {}, {}, {}, {}, {}]
Has SpectrumIdentificationItem?: [{}, {}, {}, {}]
etc.

@mobiusklein
Copy link
Contributor

After actually interacting with the test script in the debugger, I found the issue. I can't say I understand why it's behaving this way. Here is a functional version:

def iterfind_when(source, target_name, condition_name, stack_predicate, **kwargs):
    """
    Iteratively parse XML stream in ``source``, yielding XML elements
    matching ``target_name`` as long as earlier in the tree a ``condition_name`` element
    satisfies ``stack_predicate``, a callable that takes a single :class:`etree.Element` and returns
    a :class:`bool`.

    Parameters
    ----------
    source: file-like
        A file-like object over an XML document
    target_name: str
        The name of the XML tag to parse until
    condition_name: str
        The name to start parsing at when `stack_predicate` evaluates to true on this element.
    stack_predicate: callable
        A function called with a single `etree.Element` that determines if the sub-tree should be parsed
    **kwargs:
        Additional arguments passed to :meth:`source._get_info_smart`

    Yields
    ------
    lxml.etree.Element
    """
    g = etree.iterparse(source, ("start", "end"))
    state = False
    history = []
    for event, tag in g:
        lc_name = _local_name(tag)
        if event == "start":
            if lc_name == condition_name:
                state = stack_predicate(tag)
        else:
            if lc_name == target_name and state:
                value = source._get_info_smart(tag, **kwargs)
                for t in history:
                    t.clear()
                history.clear()
                yield value
            elif state:
                history.append(tag)
            elif not state:
                tag.clear()

I'll spend more time working out why the state dependency is behaving this way.

@colin-combe
Copy link
Author

@mobiusklein - yep, the above works. Awesome, thanks.

There was a delay getting back to you because the tests on the project were broken and I wanted to see the tests passing before declaring it worked. They passed - it works!

Can close this now, unless you're keeping it open whilst you try to work out "why the state dependency is behaving this way"

@colin-combe
Copy link
Author

Hi Joshua (@mobiusklein) - i think i've found a problem with the above code.

there's a test case here: https://github.com/colin-combe/pyteomics-test/blob/master/test_iterfind_when.py

you'll see it's reading this file: https://github.com/colin-combe/pyteomics-test/blob/master/multiple_spectra_per_id_1_3_0_draft.mzid

which you'll recognise as an example file from mzId 1.3, so the obvious thing to think it that it doesn't work because it's MzIdentML 1.3.0. But I've changed it (schema reference and delete the cv param for extension document) so I think it should be valid 1.2.0. I don't think this is the problem, but could be mistaken.

It crashes with error:

Processing SpectrumIdentificationList sil_HCD
Traceback (most recent call last):
  File "/home/cc/work/pyteomics-test/test_iterfind_when.py", line 80, in <module>
    test3()
  File "/home/cc/work/pyteomics-test/test_iterfind_when.py", line 61, in test3
    for e in iterfind_when(
  File "/home/cc/work/pyteomics-test/test_iterfind_when.py", line 41, in iterfind_when
    value = source._get_info_smart(tag, **kwargs)
  File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/mzid.py", line 158, in _get_info_smart
    return self._get_info(element,
  File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/xml.py", line 435, in _get_info
    self._get_info_smart(child, ename=cname, **kwargs))
  File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/mzid.py", line 158, in _get_info_smart
    return self._get_info(element,
  File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/xml.py", line 424, in _get_info
    cname = _local_name(child)
  File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/xml.py", line 53, in _local_name
    if tag and tag[0] == '{':
TypeError: '_cython_3_0_10.cython_function_or_method' object is not subscriptable

any ideas?
cheers,
Colin

@mobiusklein
Copy link
Contributor

This is because lxml treats comments as first class elements of the document, but they don't have the same interface as etree.Element. The fix is easy, fortunately. In iterfind_when replace

g = etree.iterparse(source, ("start", "end"))

with

g = etree.iterparse(source, ("start", "end"), remove_comments=True)

and it should work.

@colin-combe
Copy link
Author

That does indeed fix it.
Thanks,
Colin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants