Memory leak when using docx.Document to parse large word file #1428

joyme123 · 2024-09-09T08:41:44Z

similar issue: #1364

Code to reproduce:

import gc
import os

import psutil
from docx import Document
from memory_profiler import profile


@profile
def main():
    file = "test.docx"
    document = Document(file)

    del document
    gc.collect()

    print(
        "current process memory: %.4f GB"
        % (psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024 / 1024),
    )


if __name__ == "__main__":
    main()

File to reproduce:
新建 DOCX 文档.docx ("New DOCX Document.docx")

output:

❯ python test.py
current process memory: 2.6228 GB
Filename: test.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     9     28.3 MiB     28.3 MiB           1   @profile
    10                                         def main():
    11     28.3 MiB      0.0 MiB           1       file = "test.docx"
    12   2685.7 MiB   2657.4 MiB           1       document = Document(file)
    13
    14   2685.7 MiB      0.0 MiB           1       del document
    15   2685.7 MiB      0.0 MiB           1       gc.collect()
    16
    17   2685.7 MiB      0.0 MiB           2       print(
    18   2685.7 MiB      0.0 MiB           2           "current process memory: %.4f GB"
    19   2685.7 MiB      0.0 MiB           1           % (psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024 / 1024),
    20                                             )
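A single run like the one above cannot tell a true leak (memory grows on every load) from memory the allocator keeps after the first load and reuses later. A minimal sketch for checking this, assuming psutil is installed as in the script above; `rss_mib` and `rss_after_each_run` are hypothetical helper names, not part of python-docx:

```python
import gc
import os


def rss_mib():
    """Resident set size of this process in MiB (uses psutil, as in the script above)."""
    import psutil  # imported lazily so the helper below can be used with another probe

    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024


def rss_after_each_run(work, runs=3, measure=rss_mib):
    """Call work() several times and record process memory after each run."""
    readings = []
    for _ in range(runs):
        work()          # e.g. lambda: Document("test.docx"); the result is discarded
        gc.collect()    # collect unreachable Python objects before measuring
        readings.append(measure())
    return readings
```

Usage would be something like `rss_after_each_run(lambda: Document("test.docx"))`. Readings that level off after the first run would suggest retained but reusable memory; readings that keep climbing would suggest a real leak.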

joyme123 · 2024-09-12T03:24:35Z

The memory leak happens in lxml's etree.fromstring(). Using xml.etree.ElementTree instead is fine:

import gc
import os
import zipfile

import psutil
from docx import Document
from memory_profiler import profile

@profile
def xml_test():
    import xml.etree.ElementTree as ET

    docx = "test.docx"
    zipf = zipfile.ZipFile(docx)
    doc_xml = "word/document.xml"
    xml = zipf.read(doc_xml)
    root = ET.fromstring(xml)

    del root
    gc.collect()


@profile
def lxml_test():
    from lxml import etree

    docx = "test.docx"
    zipf = zipfile.ZipFile(docx)
    doc_xml = "word/document.xml"
    xml = zipf.read(doc_xml)
    root = etree.fromstring(xml)

    del root
    gc.collect()


if __name__ == "__main__":
    lxml_test()
    # xml_test()

lxml test case:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    48     39.1 MiB     39.1 MiB           1   @profile
    49                                         def lxml_test():
    50     39.1 MiB      0.0 MiB           1       from lxml import etree
    51
    52     39.1 MiB      0.0 MiB           1       docx = "test.docx"
    53     39.1 MiB      0.0 MiB           1       zipf = zipfile.ZipFile(docx)
    54     39.1 MiB      0.0 MiB           1       doc_xml = "word/document.xml"
    55    213.6 MiB    174.5 MiB           1       xml = zipf.read(doc_xml)
    56   2869.5 MiB   2655.9 MiB           1       root = etree.fromstring(xml)
    57
    58   2427.2 MiB   -442.3 MiB           1       del root
    59   2427.2 MiB      0.0 MiB           1       gc.collect()

xml.etree.ElementTree test case:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    34     39.0 MiB     39.0 MiB           1   @profile
    35                                         def xml_test():
    36     39.0 MiB      0.0 MiB           1       import xml.etree.ElementTree as ET
    37
    38     39.0 MiB      0.0 MiB           1       docx = "test.docx"
    39     39.0 MiB      0.0 MiB           1       zipf = zipfile.ZipFile(docx)
    40     39.0 MiB      0.0 MiB           1       doc_xml = "word/document.xml"
    41    213.5 MiB    174.5 MiB           1       xml = zipf.read(doc_xml)
    42   2626.4 MiB   2413.0 MiB           1       root = ET.fromstring(xml)
    43
    44    232.8 MiB  -2393.7 MiB           1       del root
    45    221.1 MiB    -11.7 MiB           1       gc.collect()
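Since the standard-library parser releases its memory here, one possible workaround when only the text is needed is to skip python-docx and stream word/document.xml with xml.etree.ElementTree.iterparse, so the full tree is never held at once. This is a sketch under assumptions, not python-docx behaviour: `iter_paragraph_text` is a hypothetical helper, and it only handles plain w:p paragraphs (text in paragraphs nested inside another paragraph, such as text boxes, would be reported with the inner paragraph only).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def iter_paragraph_text(docx_path_or_file):
    """Yield the text of each w:p element in word/document.xml, one at a time."""
    with zipfile.ZipFile(docx_path_or_file) as zipf:
        # Read the XML as a stream instead of loading it fully with zipf.read()
        with zipf.open("word/document.xml") as stream:
            for _event, elem in ET.iterparse(stream, events=("end",)):
                if elem.tag == f"{{{W_NS}}}p":
                    yield "".join(t.text or "" for t in elem.iter(f"{{{W_NS}}}t"))
                    # Drop the paragraph's children so they can be freed;
                    # the emptied element itself stays attached to its parent.
                    elem.clear()
```

For example, `for text in iter_paragraph_text("test.docx"): print(text)` would print one paragraph per line without building a Document object.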
