Memory leak when using docx.Document to parse large word file #1428

Open
joyme123 opened this issue Sep 9, 2024 · 1 comment


joyme123 commented Sep 9, 2024

Similar issue: #1364

Code to reproduce:

import gc
import os

import psutil
from docx import Document
from memory_profiler import profile


@profile
def main():
    file = "test.docx"
    document = Document(file)

    del document
    gc.collect()

    print(
        "current process memory: %.4f GB"
        % (psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024 / 1024),
    )


if __name__ == "__main__":
    main()

File to reproduce:
新建 DOCX 文档.docx ("New DOCX Document.docx")

Output:

❯ python test.py
current process memory: 2.6228 GB
Filename: test.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     9     28.3 MiB     28.3 MiB           1   @profile
    10                                         def main():
    11     28.3 MiB      0.0 MiB           1       file = "test.docx"
    12   2685.7 MiB   2657.4 MiB           1       document = Document(file)
    13
    14   2685.7 MiB      0.0 MiB           1       del document
    15   2685.7 MiB      0.0 MiB           1       gc.collect()
    16
    17   2685.7 MiB      0.0 MiB           2       print(
    18   2685.7 MiB      0.0 MiB           2           "current process memory: %.4f GB"
    19   2685.7 MiB      0.0 MiB           1           % (psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024 / 1024),
    20                                             )
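The profile above shows `Document(file)` building the whole document tree in memory. If only the paragraph text is needed, one way to keep peak memory lower is to stream `word/document.xml` and clear each paragraph once it has been read. This is a minimal sketch, not python-docx's API and not a fix for the behaviour reported here; `iter_paragraph_text` is a hypothetical helper name, and the sketch assumes paragraphs are not nested inside one another (text boxes can break that assumption).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by w:p and w:t elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P_TAG = f"{{{W_NS}}}p"
T_TAG = f"{{{W_NS}}}t"


def iter_paragraph_text(docx_file):
    """Yield the text of each w:p element (hypothetical helper).

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zipf:
        # Decompress as a stream instead of reading the whole part at once.
        with zipf.open("word/document.xml") as stream:
            for _event, elem in ET.iterparse(stream, events=("end",)):
                if elem.tag == P_TAG:
                    yield "".join(t.text or "" for t in elem.iter(T_TAG))
                    # Drop the paragraph's children and text. The emptied
                    # element itself stays attached to its parent.
                    elem.clear()
```

A caller would loop over `iter_paragraph_text("test.docx")`. Whether this helps in practice depends on the document, so it is worth profiling the same way as in the report.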

@joyme123
Author

The memory leak happens in lxml's etree.fromstring(). Using xml.etree.ElementTree instead is fine.

import gc
import os
import zipfile

import psutil
from docx import Document
from memory_profiler import profile

@profile
def xml_test():
    import xml.etree.ElementTree as ET

    docx = "test.docx"
    zipf = zipfile.ZipFile(docx)
    doc_xml = "word/document.xml"
    xml = zipf.read(doc_xml)
    root = ET.fromstring(xml)

    del root
    gc.collect()


@profile
def lxml_test():
    from lxml import etree

    docx = "test.docx"
    zipf = zipfile.ZipFile(docx)
    doc_xml = "word/document.xml"
    xml = zipf.read(doc_xml)
    root = etree.fromstring(xml)

    del root
    gc.collect()


if __name__ == "__main__":
    lxml_test()
    # xml_test()

lxml test case (lxml_test):

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    48     39.1 MiB     39.1 MiB           1   @profile
    49                                         def lxml_test():
    50     39.1 MiB      0.0 MiB           1       from lxml import etree
    51
    52     39.1 MiB      0.0 MiB           1       docx = "test.docx"
    53     39.1 MiB      0.0 MiB           1       zipf = zipfile.ZipFile(docx)
    54     39.1 MiB      0.0 MiB           1       doc_xml = "word/document.xml"
    55    213.6 MiB    174.5 MiB           1       xml = zipf.read(doc_xml)
    56   2869.5 MiB   2655.9 MiB           1       root = etree.fromstring(xml)
    57
    58   2427.2 MiB   -442.3 MiB           1       del root
    59   2427.2 MiB      0.0 MiB           1       gc.collect()

xml.etree.ElementTree test case (xml_test):

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    34     39.0 MiB     39.0 MiB           1   @profile
    35                                         def xml_test():
    36     39.0 MiB      0.0 MiB           1       import xml.etree.ElementTree as ET
    37
    38     39.0 MiB      0.0 MiB           1       docx = "test.docx"
    39     39.0 MiB      0.0 MiB           1       zipf = zipfile.ZipFile(docx)
    40     39.0 MiB      0.0 MiB           1       doc_xml = "word/document.xml"
    41    213.5 MiB    174.5 MiB           1       xml = zipf.read(doc_xml)
    42   2626.4 MiB   2413.0 MiB           1       root = ET.fromstring(xml)
    43
    44    232.8 MiB  -2393.7 MiB           1       del root
    45    221.1 MiB    -11.7 MiB           1       gc.collect()
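To make the comparison between the two tables explicit, the arithmetic can be sketched as below. The figures are copied from the "Mem usage" column of the profiles above; the dictionary keys and the `summarize` helper are names made up for this illustration.

```python
# Memory usage in MiB, copied from the two memory_profiler tables above:
# after zipf.read(), after fromstring(), and after del root + gc.collect().
PROFILES = {
    "lxml.etree": {
        "before_parse": 213.6, "after_parse": 2869.5, "after_gc": 2427.2,
    },
    "xml.etree.ElementTree": {
        "before_parse": 213.5, "after_parse": 2626.4, "after_gc": 221.1,
    },
}


def summarize(profile):
    """Return (allocated by parse, still held after gc, percent released)."""
    allocated = profile["after_parse"] - profile["before_parse"]
    retained = profile["after_gc"] - profile["before_parse"]
    released_pct = 100 * (1 - retained / allocated)
    return round(allocated, 1), round(retained, 1), round(released_pct, 1)


for name, profile in PROFILES.items():
    allocated, retained, released_pct = summarize(profile)
    print(
        f"{name}: parse +{allocated} MiB, "
        f"still held after del/gc {retained} MiB ({released_pct}% released)"
    )
```

Small differences from the "Increment" column (for example 2412.9 here against 2413.0 in the table) come from rounding in the profiler output.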
