
improve cleaner memory usage #6

Merged 2 commits into main on Aug 23, 2024
Conversation

lopuhin (Contributor) commented Aug 18, 2024

Don't hold removed elements in memory, to reduce memory usage: they are merged into parent elements and the element text keeps growing. On some pathological HTML documents (a few MB in size), this leads to a difference of hundreds of MB.

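The idea behind the change can be sketched as follows. `merge_into_parent` is a hypothetical helper, not the actual clear-html code, and the stdlib `ElementTree` API is used to keep the example self-contained (the real cleaner works on lxml trees, which behave similarly): once a removed element's text is folded into its parent, keeping no Python reference to the removed subtree lets it be freed immediately instead of accumulating.

```python
import xml.etree.ElementTree as ET


def merge_into_parent(parent, el):
    # Fold the element's text (and tail) into its parent, then detach it.
    # Crucially, the caller keeps no reference to `el` afterwards, so the
    # removed subtree can be garbage-collected right away.
    parent.text = (parent.text or "") + (el.text or "") + (el.tail or "")
    parent.remove(el)


root = ET.fromstring("<div>keep <span>me</span></div>")
merge_into_parent(root, root.find("span"))
print(root.text)  # keep me
```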
lopuhin commented Aug 23, 2024
Thanks @Gallaecio. Just before closing, let me see if I can add a standalone repro that demonstrates the improvement.

lopuhin commented Aug 23, 2024

I was able to create a repro showing a difference of a few hundred MB, but it needs a custom HTML5 parser and a custom HTML file. With a completely stand-alone repro the difference is much smaller; here it is:

$ cat ../clear-html/memory_usage.py 
import resource

from lxml.html import fromstring
from clear_html import clean_node, cleaned_node_to_html


def main():
    # ru_maxrss is the peak resident set size; on Linux it is reported in KiB.
    start_memory_usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('memory usage at the start', start_memory_usage)
    for _ in range(3):
        # Build a deeply nested <div> structure with long text nodes.
        depth = 250
        html = "<article>" + "\n".join(
            (" " * i) + "<div>" + (f"{i} some very long text" * 10000)
            for i in range(depth)
        ) + "</div>" * depth + "\n</article>"
        node = fromstring(html)
        cleaned_node = clean_node(node)
        cleaned_html = cleaned_node_to_html(cleaned_node)
    print('original html len:', len(html), 'cleaned html len:', len(cleaned_html))
    print('html', html[:100])
    print('cleaned html', cleaned_html[:100])
    end_memory_usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('memory usage at the end', end_memory_usage)
    print('memory usage growth', end_memory_usage - start_memory_usage)


if __name__ == '__main__':
    main()

Running on main:

$ python ../clear-html/memory_usage.py 
memory usage at the start 20568
original html len: 56434144 cleaned html len: 56402271
html <article><div>0 some very long text0 some very long text0 some very long text0 some very long text0 
cleaned html <article>

<p>0 some very long text0 some very long text0 some very long text0 some very long text0 
memory usage at the end 501676
memory usage growth 481108

Running on this branch:

$ python ../clear-html/memory_usage.py 
memory usage at the start 20584
original html len: 56434144 cleaned html len: 56402271
html <article><div>0 some very long text0 some very long text0 some very long text0 some very long text0 
cleaned html <article>

<p>0 some very long text0 some very long text0 some very long text0 some very long text0 
memory usage at the end 459280
memory usage growth 438696

So the difference here is only around 42 MB; I guess reproducing a larger difference requires a more elaborate HTML structure.
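For reference, since `ru_maxrss` is reported in KiB on Linux (it is bytes on macOS, per getrusage(2)), the growth figures above can be converted like this:

```python
# Growth figures taken from the two runs above, in KiB (Linux ru_maxrss units).
main_growth = 481108    # run on main
branch_growth = 438696  # run on this branch

print(f"main:   {main_growth / 1024:.0f} MiB")    # ~470 MiB
print(f"branch: {branch_growth / 1024:.0f} MiB")  # ~428 MiB
print(f"saved:  {(main_growth - branch_growth) / 1024:.0f} MiB")  # ~41 MiB
```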

@lopuhin lopuhin merged commit 8b97613 into main Aug 23, 2024
16 checks passed
@lopuhin lopuhin deleted the less-memory-pressure branch August 23, 2024 17:31