
improve cleaner memory usage #6

Merged 2 commits into main on Aug 23, 2024
Conversation

lopuhin (Contributor) commented Aug 18, 2024

Don't hold removed elements in memory, to reduce memory usage: they are merged into parent elements and the element text keeps growing. On some pathological HTML documents (a few MB in size), this leads to a difference of hundreds of MB.

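The idea behind the change can be sketched as follows. `merge_into_parent` is a hypothetical helper, not the actual clear-html code, and the stdlib `ElementTree` API is used to keep the example self-contained (the real cleaner works on lxml trees, which behave similarly): once a removed element's text is folded into its parent, keeping no Python reference to the removed subtree lets it be freed immediately instead of accumulating.

```python
import xml.etree.ElementTree as ET


def merge_into_parent(parent, el):
    # Fold the element's text (and tail) into its parent, then detach it.
    # Crucially, the caller keeps no reference to `el` afterwards, so the
    # removed subtree can be garbage-collected right away.
    parent.text = (parent.text or "") + (el.text or "") + (el.tail or "")
    parent.remove(el)


root = ET.fromstring("<div>keep <span>me</span></div>")
merge_into_parent(root, root.find("span"))
print(root.text)  # keep me
```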
lopuhin commented Aug 23, 2024
Thanks @Gallaecio. Just before closing, let me see if I can add a standalone repro that demonstrates the improvement.

lopuhin commented Aug 23, 2024

I was able to create a repro showing a difference of a few hundred MB, but it needs a custom HTML5 parser and a custom HTML file. With a completely stand-alone repro the difference is much smaller; here it is:

$ cat ../clear-html/memory_usage.py 
import resource

from lxml.html import fromstring
from clear_html import clean_node, cleaned_node_to_html


def main():
    # ru_maxrss is the peak resident set size; on Linux it is reported in KiB.
    start_memory_usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('memory usage at the start', start_memory_usage)
    for _ in range(3):
        # Build a deeply nested <div> structure with long text nodes.
        depth = 250
        html = "<article>" + "\n".join(
            (" " * i) + "<div>" + (f"{i} some very long text" * 10000)
            for i in range(depth)
        ) + "</div>" * depth + "\n</article>"
        node = fromstring(html)
        cleaned_node = clean_node(node)
        cleaned_html = cleaned_node_to_html(cleaned_node)
    print('original html len:', len(html), 'cleaned html len:', len(cleaned_html))
    print('html', html[:100])
    print('cleaned html', cleaned_html[:100])
    end_memory_usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('memory usage at the end', end_memory_usage)
    print('memory usage growth', end_memory_usage - start_memory_usage)


if __name__ == '__main__':
    main()

Running on main:

$ python ../clear-html/memory_usage.py 
memory usage at the start 20568
original html len: 56434144 cleaned html len: 56402271
html <article><div>0 some very long text0 some very long text0 some very long text0 some very long text0 
cleaned html <article>

<p>0 some very long text0 some very long text0 some very long text0 some very long text0 
memory usage at the end 501676
memory usage growth 481108

Running on this branch:

$ python ../clear-html/memory_usage.py 
memory usage at the start 20584
original html len: 56434144 cleaned html len: 56402271
html <article><div>0 some very long text0 some very long text0 some very long text0 some very long text0 
cleaned html <article>

<p>0 some very long text0 some very long text0 some very long text0 some very long text0 
memory usage at the end 459280
memory usage growth 438696

So the difference here is only around 42 MB; I guess reproducing a larger difference requires a more elaborate HTML structure.
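For reference, since `ru_maxrss` is reported in KiB on Linux (it is bytes on macOS, per getrusage(2)), the growth figures above can be converted like this:

```python
# Growth figures taken from the two runs above, in KiB (Linux ru_maxrss units).
main_growth = 481108    # run on main
branch_growth = 438696  # run on this branch

print(f"main:   {main_growth / 1024:.0f} MiB")    # ~470 MiB
print(f"branch: {branch_growth / 1024:.0f} MiB")  # ~428 MiB
print(f"saved:  {(main_growth - branch_growth) / 1024:.0f} MiB")  # ~41 MiB
```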

@lopuhin lopuhin merged commit 8b97613 into main Aug 23, 2024
16 checks passed
@lopuhin lopuhin deleted the less-memory-pressure branch August 23, 2024 17:31