Consider using object pools for internal state keeping #8901

Open
fjetter opened this issue Oct 21, 2024 · 3 comments
Labels
discussion Discussing a topic with no specific actions yet

Comments

@fjetter
Member

fjetter commented Oct 21, 2024

Dask creates many small objects, and not just for very large graphs; this is pretty much built in. It is one of, if not the, most prominent reasons why dask was originally built as a "tuple of tuple of tuple ..." machinery. Even the scheduler only adopted custom classes internally a couple of years ago, because instantiation can be very costly. However, modern Python versions have gotten much better at managing this overhead, so it is negligible for most classes.

The one type of object we're still affected by, both in terms of memory and in terms of runtime, is sets. Yes, sets! (I might share profiles, but this issue is mostly to preserve an idea.) One way to work around the instantiation cost of sets is an object pool design. Effectively, we'd "disable" garbage collection and resurrect objects on finalization in a way that allows us to reuse them.

A minimal version of this would look like the following:

import sys

class SetPool:
    def __init__(self):
        self._sets = []

    def add(self, obj):
        obj.clear()
        self._sets.append(obj)

    def get(self):
        try:
            return self._sets.pop()
        except IndexError:
            return None

    def stored_size(self):
        return sum(map(sys.getsizeof, self._sets))

globalpool = SetPool()

class PooledSet(set):

    def __new__(cls, *iterables):
        # Reuse a recycled set if one is available, otherwise allocate.
        obj = globalpool.get()
        if obj is not None:
            return obj
        return super().__new__(cls, *iterables)

    def __del__(self):
        # Resurrect the set into the pool instead of letting it die.
        # (Caveat: since PEP 442, CPython runs __del__ at most once per
        # object, so a recycled set won't re-enter the pool a second
        # time; a real implementation would need an explicit release.)
        print(f"Resurrecting {id(self)}")
        globalpool.add(self)


This could be expanded as needed, e.g. by hinting at whether the requested set would be empty, very large, or somewhat normal.

I haven't tried out what the actual impact would be, but it is a fun concept that could help if we actually want/need to optimize for this.
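The size-hinting idea could be sketched roughly like this: keep separate free lists per size class so callers can ask for a recycled set of about the right capacity. Everything here (the class name `BucketedSetPool`, the `SMALL` cutoff, the `size_hint` parameter) is hypothetical, just illustrating the shape such an API might take:

```python
# Hypothetical sketch of the size-hinting idea: two free lists, one for
# small sets and one for large ones, selected by a size hint.

class BucketedSetPool:
    SMALL = 8  # assumed cutoff between buckets; a tuning knob

    def __init__(self):
        self._small = []
        self._large = []

    def put(self, obj):
        # Pick the bucket based on the set's size *before* clearing it.
        bucket = self._large if len(obj) > self.SMALL else self._small
        obj.clear()
        bucket.append(obj)

    def get(self, size_hint=0):
        bucket = self._large if size_hint > self.SMALL else self._small
        try:
            return bucket.pop()
        except IndexError:
            return set()  # pool empty: fall back to a fresh allocation

pool = BucketedSetPool()
s = pool.get(size_hint=100)
s.update(range(100))
pool.put(s)
assert pool.get(size_hint=50) is s  # the recycled set comes back
```

This uses explicit `put`/`get` calls instead of `__del__` resurrection, which sidesteps the finalizer subtleties at the cost of requiring callers to release sets manually.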

@fjetter fjetter added the discussion Discussing a topic with no specific actions yet label Oct 21, 2024
@mrocklin
Member

I'm surprised that set construction is a big deal. (although more like 1-sigma surprised rather than 3-sigma surprised). One thought: it might be worth asking a CPython person about this? Someone like Antoine? (intentionally not pinging his github in case this is a silly idea)

@fjetter
Member Author

fjetter commented Oct 21, 2024

Yeah, it's funny, and I was surprised when I initially saw this (months ago). dicts, for instance, are way better optimized since they are used internally in CPython all over the place, but sets are a bit meh...

For example, size of empty sets vs empty dicts

>>> import sys
>>> sys.getsizeof(set())
216
>>> sys.getsizeof(dict())
64
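The construction overhead can be checked the same way; a quick, hedged sketch (exact sizes and timings vary by CPython version and platform, so the printed numbers are not authoritative):

```python
import sys
import timeit

# Compare the base allocation and construction cost of empty sets vs
# empty dicts. Numbers differ across CPython versions and platforms.
set_size = sys.getsizeof(set())
dict_size = sys.getsizeof(dict())
print(f"empty set:  {set_size} bytes")
print(f"empty dict: {dict_size} bytes")

set_time = timeit.timeit("set()", number=100_000)
dict_time = timeit.timeit("{}", number=100_000)
print(f"100k set() calls:  {set_time:.4f}s")
print(f"100k {{}} literals: {dict_time:.4f}s")
```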

@fjetter
Member Author

fjetter commented Oct 21, 2024

I wouldn't be surprised if we could make this go faster by tuning the Python allocator, but I'm not sure I want to go down that rabbit hole.
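For anyone who does want to poke at the allocator: CPython exposes the `PYTHONMALLOC` environment variable to switch pymalloc on/off at startup, and `sys._debugmallocstats()` dumps small-object arena statistics. Both are documented CPython features; whether tuning here actually helps this workload is untested:

```python
import os
import sys

# Show which allocator was requested at startup (pymalloc is the
# default in release builds unless PYTHONMALLOC overrides it) and
# dump the small-object arena statistics to stderr for inspection.
print("allocator requested:", os.environ.get("PYTHONMALLOC", "pymalloc (default)"))
sys._debugmallocstats()
```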
