Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Realm event cancellation increases memory usage indefinitely for long running systems #1812

Open
serkanturgut opened this issue Dec 23, 2024 · 4 comments
Labels
Realm Issues pertaining to Realm

Comments

@serkanturgut
Copy link

Hi,

We have been investigating some memory leaks in our system that uses Realm runtime events and user-space threads heavily. As part of the computation logic, we have to issue cancellation on certain events. We found out that due to the design choices how event objects are created and recycled, every event has a limit of 16 poisoned generation limit, beyond that, the event object can't be recycled anymore.

Here is the simple program that reproduces the behavior we observed;

#include <iostream>
#include <realm/runtime.h>

using namespace Realm;

int main() {

    Runtime rt;
    int c = 0;
    rt.init(&c, nullptr);
    printf("Realm initialized!\n");

    for (int i = 0; i < 100000000; i++) {
        UserEvent e = UserEvent::create_user_event();
        e.cancel();
        printf("Event create/cancel iteration [%i]\n", i);
    }

}

And I am attaching the malloc profile report (jemalloc) as a demonstration of unbounded memory usage.


realm_eventcancellation_jemalloc

Please only focus on DynamicTable expansion code paths and the event cancellation in the picture. Event cancellation triggers the allocations of the poisoned_generations array to record the poisoned generations and DynamicTable expansion basically due to the fact that after 16 poisons per event, it does not get free'd anymore so new event creation eventually allocated from newly create table leafs.

I understand that it is a design choice that events have to keep a record of the poisoned generations to be able to answer to the inquiries whether a particular event generation was poisoned or not. If this is the case, I feel like that Realm runtime would be unsuitable for systems that rarely goes offline such as database systems.

In our integration with Realm, the events actually have a lifespan and we do not actually care about poisoned generations after the event life ended (by our system logic). So potentially we can keep track of cancelled events in our system and instruct Realm to force-free them to be recycled. However, to my understanding, Realm currently does not have a mechanism to "reset" an event object (free the poisoned_generations array) and reset poisoned generation number, once max_poisons = true; object remains in this state forever (please confirm this).

Would it be an acceptable approach to introduce "reset" API to event class to achieve such objective so the application can decide to end a lifespan of a particular event object?

I may be mistaken with my findings above, please let me know your thoughts and suggestions.

Thank you,
Serkan

@eddy16112 eddy16112 added the Realm Issues pertaining to Realm label Dec 24, 2024
@lightsighter
Copy link
Contributor

lightsighter commented Dec 24, 2024

Hi @serkanturgut. Thank you for the very comprehensive write up of the issue. I think your diagnosis of the problem is correct given the current Realm implementation.

Realm currently does not have a mechanism to "reset" an event object (free the poisoned_generations array) and reset poisoned generation number, once max_poisons = true; object remains in this state forever (please confirm this).

To date this is accurate. For many users tracking event lifetimes is really hard (especially in distributed systems) and Realm tries to provide the illusion that you don't have to do that at the moment. We don't currently have a way to say that you can free up an event and never refer to it again. We effectively allow you to query the status of any event indefinitely, which as you have noticed, can lead to a very small memory "leak" where generational event data structures continue to accumulate and become immutable after they've exhausted their generations. I'll say this is more an artifact of the implementation and the origins of Realm's use in HPC (where jobs are usually 24-48 hours) and not necessarily something inherent to the Realm programming model. We've probably over-fit a bit in our current implementation because of this. More on this in a moment.

Event cancellation triggers the allocations of the poisoned_generations array to record the poisoned generations and DynamicTable expansion basically due to the fact that after 16 poisons per event, it does not get free'd anymore so new event creation eventually allocated from newly create table leafs.

For most Realm applications, cancellations are pretty rare (e.g., in Legion predicated execution often resolves true and doesn't result in cancellation except when predication results false because we're often doing iterative convergence). However, if cancellations are going to be more frequent, then we do need a better way of coping with it rather than defaulting to a maximum of N poisoned generations. If generations are going to be more frequently cancelled I feel like the first step would probably be to start run-length encoding of poisoned generations to ensure a compressed storage and fast query to see if a generation is poisoned. We could probably implement that without too much difficultly. Then the "leak" would be proportional to the number of generations each generational event data structure can hold (default: 2^20) rather than 16 which would buy us some time to implement a more comprehensive "event recycling" infrastructure.

So potentially we can keep track of cancelled events in our system and instruct Realm to force-free them to be recycled.
Would it be an acceptable approach to introduce "reset" API to event class to achieve such objective so the application can decide to end a lifespan of a particular event object?

A comment and three questions: I think we could introduce such a feature, but I think we would want it to be at the granularity of events and not the underlying generational event data structure to avoid exposing implementation details (existence of generational event data structures) through the API. Let's say we instead had an API where you could call free on an event to signal to Realm that you would never query the event or launch anything that depends on it ever again. Realm could then maintain a count of how many generations had been freed in each generational event data structure and once all the generations had been freed, then Realm could implicitly recycle the generational event data structure and start handing out events from it again. (You could even imagine a more aggressive recycling where we run-length encode "freed" generations and start handing them out again in a circular-queue-like style so you wouldn't have to wait for them all to be freed before you start recycling them.) This naturally poses three questions for the client application:

  1. How precise would your "freeing" of events be? Would you be precise in all of them or just some of them? For example, would there be any long-lived events that you might very infrequently or never free for the lifetime of the application because you don't track their lifetime or do so imprecisely? If so, we'll probably need to be more aggressive in tracking freed generations, otherwise we can just do the simple counter approach.
  2. Do you think you can safely free events exactly one time? I want to make sure there are no double deletes or need to query if an event has been freed or not (which can be racy).
  3. Would it be acceptable if the free method actually took an event precondition itself and you would have to chain any dependences from operations that used the event being freed as a precondition? For example, let's say you spawned a task with an event e1 as a precondition for the task to run and this returned e2 as a completion event for the task. Realm's implementation can technically query e1 any time during the process of executing the task, so you would need to do something like e1.free(e2) to make sure that e1 is not freed prematurely before all the queries performed by Realm internally are completed. Do you think you'd be able to track such dependences and chain them into the freeing of events?

If this is the case, I feel like that Realm runtime would be unsuitable for systems that rarely goes offline such as database systems.

At the moment, I agree with you, but I don't think it has to be this way. As mentioned earlier, I think we've probably overfit for HPC a bit, but the programming model is amenable to such long running executions and I believe can be easily modified to run indefinitely, presuming the client is willing to take on some responsibility for recycling events.

@serkanturgut
Copy link
Author

Hi @lightsighter thank you for your detailed response and happy new year!

Realm currently does not have a mechanism to "reset" an event object (free the poisoned_generations array) and reset poisoned generation number, once max_poisons = true; object remains in this state forever (please confirm this).

I'll say this is more an artifact of the implementation and the origins of Realm's use in HPC (where jobs are usually 24-48 hours) and not necessarily something inherent to the Realm programming model. We've probably over-fit a bit in our current implementation because of this. More on this in a moment.

I understand, things made more sense now, I also reread your original paper now I have a better understanding of the event lifetimes.

Event cancellation triggers the allocations of the poisoned_generations array to record the poisoned generations and DynamicTable expansion basically due to the fact that after 16 poisons per event, it does not get free'd anymore so new event creation eventually allocated from newly create table leafs.

For most Realm applications, cancellations are pretty rare (e.g., in Legion predicated execution often resolves true and doesn't result in cancellation except when predication results false because we're often doing iterative convergence). However, if cancellations are going to be more frequent, then we do need a better way of coping with it rather than defaulting to a maximum of N poisoned generations. If generations are going to be more frequently cancelled I feel like the first step would probably be to start run-length encoding of poisoned generations to ensure a compressed storage and fast query to see if a generation is poisoned. We could probably implement that without too much difficultly. Then the "leak" would be proportional to the number of generations each generational event data structure can hold (default: 2^20) rather than 16 which would buy us some time to implement a more comprehensive "event recycling" infrastructure.

I think this was a mistake on our side which is designing computations around cancellation actions (we use that as a signal to waiters that particular operation completed but there was an abnormal completion of the task). However, there is a tight coupling with cancellation actions therefore refactoring the application code poses a big risk, that's why I was exploring the alternatives.

  1. How precise would your "freeing" of events be? Would you be precise in all of them or just some of them? For example, would there be any long-lived events that you might very infrequently or never free for the lifetime of the application because you don't track their lifetime or do so imprecisely? If so, we'll probably need to be more aggressive in tracking freed generations, otherwise we can just do the simple counter approach.

I think at the event instance level fits best for our use case. Let me elaborate the use case a bit. Let's say the system has 2 computations in it, these 2 computations disjoint with each but they are running on the same Realm runtime in the same linux process. These 2 computations may or may not overlap in terms of their start and finish times. Every computation has 100+ events created, at the end of every computation, we know that we will never need to inquire these event generations ever again therefore we can potentially pass a list of events to Realm to say these event and their generations can be recycles.

A comment and three questions: I think we could introduce such a feature, but I think we would want it to be at the granularity of events and not the underlying generational event data structure to avoid exposing implementation details (existence of generational event data structures) through the API. Let's say we instead had an API where you could call free on an event to signal to Realm that you would never query the event or launch anything that depends on it ever again. Realm could then maintain a count of how many generations had been freed in each generational event data structure and once all the generations had been freed, then Realm could implicitly recycle the generational event data structure and start handing out events from it again. (You could even imagine a more aggressive recycling where we run-length encode "freed" generations and start handing them out again in a circular-queue-like style so you wouldn't have to wait for them all to be freed before you start recycling them.)

I think that makes sense and implementation towards this model should be on the mainline of the project if anybody else need a feature like this but it sounds little bit more complex then what we need. Currently, we're only interested in recycling the events hit the cancellation limit but what you describe actually solves the problem for generation counters as well which is great. There is 20 bits for generation counter if I am not wrong, so every event data structure can be re-used 1M times (without cancellation), every DT node is 36MB, event data structure size is 536 bytes, so it can host 70K event data structures every one of which can be reused 1M times. Let's just say every computation takes 100 events, that means we have ~700M computations runway before DT is forced to allocate another block of 36MB.

Do you think you can safely free events exactly one time? I want to make sure there are no double deletes or need to query if an event has been freed or not (which can be racy).

Yes, we can safely guarantee that "free" called only once for that event generation. Thanks to RAII :)

Would it be acceptable if the free method actually took an event precondition itself and you would have to chain any dependences from operations that used the event being freed as a precondition? For example, let's say you spawned a task with an event e1 as a precondition for the task to run and this returned e2 as a completion event for the task. Realm's implementation can technically query e1 any time during the process of executing the task, so you would need to do something like e1.free(e2) to make sure that e1 is not freed prematurely before all the queries performed by Realm internally are completed. Do you think you'd be able to track such dependences and chain them into the freeing of events?

We can track such dependencies however our current use of Realm, we don't necessarily need such complexity with the "free" API. There is no merging or preconditions once we know that a computation finished.

We actually prototyped something like following and tested, I think it lacks some sanity asserts such as checking whether the event already recycled with free_entry. But other than that, it serves the purposes of our application logic. We record the events we cancelled during a computation. Once we know the computation finished, we iterate and call reset() on them which basically resets the ones that reached to poisoned generation limit.

void GenEventImpl::reset() {
      AutoLock<> a(mutex);
      if (num_poisoned_generations.load() == POISONED_GENERATION_LIMIT) {
        if (poisoned_generations) {
          delete[] poisoned_generations;
        }
        poisoned_generations = nullptr;
        num_poisoned_generations.store(0);
        local_triggers.clear();
        get_runtime()->local_event_free_list->free_entry(this);
      }
    }

@lightsighter
Copy link
Contributor

I think this was a mistake on our side which is designing computations around cancellation actions (we use that as a signal to waiters that particular operation completed but there was an abnormal completion of the task).

I actually think that is a fine reason to use event poisoning. We've just overfit again a bit in the implementation for what we needed from Legion. There's no fundamental reason that you shouldn't be able to have lots of cancelled events.

Every computation has 100+ events created, at the end of every computation, we know that we will never need to inquire these event generations ever again therefore we can potentially pass a list of events to Realm to say these event and their generations can be recycles.

That makes sense.

I think that makes sense and implementation towards this model should be on the mainline of the project if anybody else need a feature like this but it sounds little bit more complex then what we need.
We actually prototyped something like following and tested, I think it lacks some sanity asserts such as checking whether the event already recycled with free_entry. But other than that, it serves the purposes of our application logic.

By all means you should do whatever is simplest and keeps your work progressing for now. I think we'll try to prototype a more robust solution for the master branch of Realm, but that should not preclude you from hacking in whatever you need for now.

There is 20 bits for generation counter if I am not wrong

It's configurable in id.h (e.g. if you wanted to support fewer nodes since 2^16 nodes is a lot and use more bits for encoding generations for each generational event data structure you can do that), but yes that is the default configuration.

Let's just say every computation takes 100 events, that means we have ~700M computations runway before DT is forced to allocate another block of 36MB.

That envelope math looks right to me.

We can track such dependencies however our current use of Realm, we don't necessarily need such complexity with the "free" API. There is no merging or preconditions once we know that a computation finished.

I mainly just wanted to make sure we had a clear programming model of what users would need to do before recycling events. It's just a bit tricky because you have to make sure all the operations that used the event as a precondition before the freeing of the event are also done running before the event can actually be freed. It's a non-obvious mental model so I just wanted to call attention to it.

@lightsighter
Copy link
Contributor

Also @serkanturgut would it be possible to have a meeting with you to hear more about your use case for Realm? I think the Realm team would value knowing about another use case and some of its requirements. We also have some news about Realm we'd like to share with you as well. If that's amenable to you, go ahead and shoot me an email at my address from the Realm paper and I'll respond back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Realm Issues pertaining to Realm
Projects
None yet
Development

No branches or pull requests

3 participants