Regent: memory leak #711
Comments
I'll be monitoring it, but assigning to @magnatelee for now. |
This is preventing soleil-x from running to completion. After 4-5 hours of running the hit_to_openchannel test case on 5 nodes of Piz Daint, the program fails with an out-of-memory error. I instrumented legion_c.cc to trace the creation and deletion of Futures. Here is the output for a typical Future: ~/PSAAP/soleil-x-master/src> grep 0x2ab828693f40 run.log Notice there is an extra new that is not matched by a delete. After running 500 steps of soleil-x (master branch), there were 19103 Futures that followed this pattern. That seems like an excessive number. |
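The new/delete matching described above can be automated over a whole trace. The following is a minimal sketch, not the instrumentation actually used: the trace output is not shown in this thread, so the line format `<event> <address>` with events `new` and `delete` is an assumption, and the second address in the sample is made up for illustration.

```python
from collections import Counter


def unmatched_futures(lines):
    """Return {address: surplus} for addresses with more 'new' than 'delete' events.

    Assumes each relevant line looks like "new 0x..." or "delete 0x...".
    This line format is hypothetical; adapt the parsing to the real trace.
    """
    balance = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore lines that are not "<event> <address>"
        event, address = parts
        if event == "new":
            balance[address] += 1
        elif event == "delete":
            balance[address] -= 1
    # Keep only addresses whose creations outnumber their deletions.
    return {addr: n for addr, n in balance.items() if n > 0}


# Hypothetical sample trace: the first address has one extra "new".
sample = [
    "new 0x2ab828693f40",
    "new 0x2ab828693f40",
    "delete 0x2ab828693f40",
    "new 0x2ab8286a0000",
    "delete 0x2ab8286a0000",
]
print(unmatched_futures(sample))
```

Because an allocator can reuse an address, the per-address count is a net balance rather than a list of individual leaked objects; it is still enough to show whether creations and deletions pair up.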
Current best guess is that the culprit is an expression like
And then we'll follow the usual rules for Having said that, it would really help to have more information about where this happens in the original source code; this is a complete guess at the moment as to where the problem may be. Leaving @magnatelee assigned to fix the normalizer issue. |
Attaching a reproducer:
Reports:
|
I think it would be good if @mariodirenzo can also confirm the fix in the HTR. |
Is |
The fix is now available in |
I'm afraid soleil-x is still failing after less than five hours of running. Out of memory. |
What's the error mode? Can you run it with Legion GC enabled and send me a log file? At least the Soleil-X in master was not leaking any futures. Which branch of Soleil-X are you using? To enable Legion GC, you need to re-install Regent with |
-rw-r--r-- 1 aheirich d108 194391853 Feb 10 21:19 gc_0.log |
Legion GC found a cycle in Alan's case, so I suspect this is a runtime bug. Assigning @lightsighter to this issue as well to have him take a look. |
While I agree that there is a cycle in the reference counting graph, it is for an index space and a partition, not a future and is therefore highly unlikely to be the source of the persistent memory leak. I don't see any cyclic references on future objects. |
Did we confirm that the futures are not being leaked any longer? |
Here is the output from Legion GC:
Apparently, there are still some futures that are leaked, but leaks on other handles, such as equivalence sets, seem more severe. |
If the run crashed then those numbers are meaningless. Legion GC only soundly reports leaks if the runtime shuts down correctly, otherwise you're just recording how many live objects there were when the runtime died. |
The reference cycle occurs when a partition is made with a color space that is identical to its parent index space. I've pushed a fix with b188ec6. I guarantee you that it is not the source of the leak. Assigning back to @magnatelee and @aheirich to investigate further. |
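To illustrate why a cycle in a reference-counting graph keeps objects alive, here is a toy model. It is not Legion code: the `Node` class, the `hold` helper, and the assumption that the parent index space holds a reference to its partition while the partition holds a reference to its color space are all hypothetical simplifications of the situation described above.

```python
class Node:
    """Toy reference-counted object (hypothetical, not a Legion type)."""

    def __init__(self, name):
        self.name = name
        self.refs = 0
        self.holds = []  # objects this node keeps alive
        self.freed = False

    def add_ref(self):
        self.refs += 1

    def remove_ref(self):
        self.refs -= 1
        if self.refs == 0:
            # Last reference dropped: free this node and release what it holds.
            self.freed = True
            for other in self.holds:
                other.remove_ref()
            self.holds = []


def hold(owner, target):
    """Make `owner` keep a counted reference to `target`."""
    target.add_ref()
    owner.holds.append(target)


# Acyclic case: the partition's color space is a separate object.
color_space = Node("color_space")
partition = Node("partition")
partition.add_ref()           # the application's handle
hold(partition, color_space)  # partition keeps its color space alive
partition.remove_ref()        # application drops its handle; both are freed

# Cyclic case: the parent index space is also the partition's color space.
index_space = Node("index_space")
cyclic_partition = Node("partition")
index_space.add_ref()                # the application's handle
hold(index_space, cyclic_partition)  # parent keeps its partition alive
hold(cyclic_partition, index_space)  # partition keeps its color space (the parent) alive
index_space.remove_ref()             # application drops its handle; nothing is freed

print(partition.freed, color_space.freed)
print(index_space.freed, cyclic_partition.freed, index_space.refs)
```

In the cyclic case each object still holds one reference to the other after the application lets go, so neither count reaches zero. As noted above, a cycle of this kind involves an index space and a partition, not futures.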
The run did not crash; it exited cleanly. So the numbers are not meaningless. Previously @manopapad and I confirmed that the problem is with a legion_future_from_untyped_pointer that is not matched by a delete. Confirming that the problem still exists. |
Are you deleting all your regions at the end of the run? Leaked regions are guaranteed to cause leaked index spaces, field spaces, partitions, equivalence sets, views, managers, and constraints. At a minimum it would be good to have a new log that confirms the cycle fix. |
Yes |
The stencil leak was application specific. If Soleil has any code like: Then you may need to apply this patch: https://gitlab.com/StanfordLegion/legion/-/commit/d007c1cb748dff1286f9e00349cb5f392522e6c3 |
I'm attaching Valgrind logs for 5, 10 and 20 iterations of Soleil. The leak does increase with the duration of the run. Logs seem to point to |
The only |
I fixed the issue by destroying the PhysicalRegion in detach, and leaks are down, but still scale with run duration. |
Ok, fixed another couple of leaks. It's still growing, and the top leaks don't currently make sense to me. |
Already told this to @elliottslaughter but for everyone else's benefit the growing |
Confirmed that we still have a future leak.
|
I've confirmed the future leak with |
@mariodirenzo Can you please confirm whether you're still seeing this issue, and if so, should it be listed at #1032 ? |
I believe this was fixed a long time ago, but if it's still here, someone would need to confirm and provide a reproducer. |
My solver that is written in Regent tends to leak a lot of memory during its execution.
For instance, I've seen runs that leak 3.5GB of memory per node every 5 minutes.
Following an initial inspection of the issue done with @lightsighter and @magnatelee, the problem should be related to an accumulation of futures that are not deleted during the run.
@magnatelee and maybe @lightsighter should still have some logs related to this issue.
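For a sense of scale, the reported rate of 3.5GB per node every 5 minutes works out to 42GB per node per hour. The sketch below does that arithmetic and estimates how long a node would last at a constant leak rate. The 64GB node memory is a hypothetical value chosen for illustration, not a figure from this report, and the function names are made up.

```python
def leak_rate_gb_per_hour(leak_gb, interval_min):
    """Convert 'leak_gb leaked every interval_min minutes' to GB per hour."""
    return leak_gb * 60 / interval_min


def hours_until_exhausted(free_memory_gb, leak_gb, interval_min):
    """Hours until free_memory_gb is consumed, assuming a constant leak rate."""
    return free_memory_gb / leak_rate_gb_per_hour(leak_gb, interval_min)


rate = leak_rate_gb_per_hour(3.5, 5)  # figures from the report above
print(f"{rate:.1f} GB per node per hour")

# Hypothetical node with 64 GB available to the application.
print(f"{hours_until_exhausted(64, 3.5, 5):.2f} hours to exhaust 64 GB")
```

A real run will not leak at a perfectly constant rate, so this is only a rough bound on how long a job can run before hitting an out-of-memory error.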