Memory leak with Julia 1.11's GC (discovered in SymbolicRegression.jl) #56759
Does this reproduce without Python in the loop? |
Just checked and, yes, it seems to continually increase in memory, albeit slowly. It's slow enough that the OOM error only happens after 11 hours of production runtime.
using SymbolicRegression

X = randn(Float32, 5, 10_000)
y = 2 * cos.(X[4, :]) + X[1, :] .^ 2 .- 2

options = SymbolicRegression.Options(;
    binary_operators=[+, *, /, -],
    unary_operators=[cos, exp],
)

hall_of_fame = equation_search(X, y; niterations=1000000000, options=options, parallelism=:multiprocessing)

To clarify, Python is only used to call a Julia method that runs for 12 hours without passing control back to Python. I don't expect Python to factor into things. The new Julia processes are spawned from Julia and do not interact with Python. Here are some plots @GoldenGoldy made, with memory usage per Julia process over the course of the 12-hour run. Note that I automatically set a conservative heap-size-hint. |
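For context, a minimal sketch of how a conservative heap-size hint can be handed to worker processes spawned from Julia - this is only an illustration, not necessarily the exact mechanism PySR uses, and the 3G value is arbitrary:
using Distributed
# Each worker gets its own --heap-size-hint, nudging its GC to collect
# more aggressively as it approaches that budget.
addprocs(4; exeflags=`--heap-size-hint=3G`)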
If possible, can you take heap snapshots at the start and again after a while, to see what is being allocated? |
What's the best way to do that on macOS? (I think @GoldenGoldy is on Linux though FYI. I'm not sure if the memory leak is OS dependent or not) |
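For reference, a minimal sketch of taking a heap snapshot from a running session, using the Profile stdlib available since Julia 1.9 (the file name here is just an example):
using Profile
# Writes a .heapsnapshot file that can be loaded in the Memory tab of Chrome DevTools.
Profile.take_heap_snapshot("before.heapsnapshot")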
Ok. I made some heap snapshots. Do you want the full files? (250 MB each). Not sure what I am supposed to look for though. |
Are the start one and the end one the same size? 🤔 But yeah, could you zip them and send them to me? |
Maybe it's hard to see much of a difference in ~5 minutes (it might even go down - see plot above); it's only increased slightly over that time. One more question: would the heap snapshot even record the memory leak? Or do you just want to see if it's a real leak or actual allocations somewhere? |
The heap snapshot will see everything that is live. I want to see if it's an actual leak, i.e. a pointer we forgot to call free on, or extra live objects hanging around for some reason. |
So the heap snapshots are about the same size, even over a 10 minute interval. And I am looking at the memory breakdown and they seem quite similar in where memory is allocated too. But despite this, the process's allocated memory continues to increase. So does this mean it's a real leak? |
Might be useful to check whether the memory increase/possible leak is stemming from an increase in pool-allocated pages or an increase in malloc'd memory, i.e. #55794 (comment). Also, does this only reproduce when using multiple GC threads, or does it also reproduce with a single GC thread? |
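As a sketch of the single-GC-thread experiment, assuming Julia 1.10+ (the script name repro.jl is a placeholder):
# Start Julia with a single GC thread:
#   julia --gcthreads=1 repro.jl
# Then verify inside the session:
Threads.ngcthreads()  # should report 1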
Over a 30 minute interval, the heap snapshots are still pretty much the same. But the memory usage of the actual Julia process is greater. @d-netto would it track that? If it doesn't show up in the live objects then I would have thought that means the GC is unaware of it existing? It should be easy to reproduce locally with the code above if you want to poke at certain things. |
Copying @GoldenGoldy's comment from here. Basically, the use of multiprocessing seems not to be relevant for the leak; it seems to happen regardless of processing mode:
(They are using a VM with 240 GB of memory which is why it climbs so high before an OOM) Note the two other colors are kernel and disk data, which remain flat in both cases. |
Do you see a leak on macOS as well? On my laptop it's been running for around 30 minutes and it's still at around 1.2 GB of memory usage. |
I'm on macOS, yes. Note that if you run the original script above, it launches additional Julia processes - those are the ones with the blow-up in memory, while the head worker is fairly flat. But you could run the following to just have multithreading instead:
using SymbolicRegression

X = randn(Float32, 5, 10_000)
y = 2 * cos.(X[4, :]) + X[1, :] .^ 2 .- 2

options = SymbolicRegression.Options(;
    binary_operators=[+, *, /, -],
    unary_operators=[cos, exp],
)

hall_of_fame = equation_search(X, y; niterations=1000000000, options=options)

The memory fluctuates a lot for me, but it still does trend higher. Is yours constant at 1.2 GiB? However, I wouldn't expect something identical to @GoldenGoldy's results because they use slightly different settings. They couldn't share the full script - presumably due to it being company code - but I can definitely reproduce a memory leak on my machine with this code. |
It looks to be using multiprocessing in that example; can you try multithreading instead? I.e.:
using SymbolicRegression

X = randn(Float32, 5, 10_000)
y = randn(Float32, 10_000)
options = SymbolicRegression.Options(binary_operators=[+, *, /, -], unary_operators=[cos, exp])
hall_of_fame = equation_search(X, y; niterations=1000000000, options=options)

Are you using top? You could try btop, which lets you monitor it over time and which I find is a bit more accurate. |
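One aside on the multithreaded variant: Julia has to be started with threads enabled for Threads-based parallelism to kick in. A minimal sketch, with repro.jl standing in for the script above:
#   julia --threads=auto repro.jl
# Verify inside the session:
Threads.nthreads()  # number of threads available to the scheduler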
I'm using both htop and Instruments (the Xcode tool). Will give the multithreaded version a go. OK, I do see a slight increase over time - slower than what that server is showing, but it is there. |
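For anyone wanting to watch this from inside Julia rather than via htop/Instruments, a minimal sketch using only Base APIs (the 30-second interval is arbitrary; run it in a separate process or task so it doesn't block the workload):
# Compare the OS-level memory high-water mark with what the GC believes is live.
while true
    rss_mb  = Sys.maxrss() / 1024^2          # peak resident set size, in MB
    live_mb = Base.gc_live_bytes() / 1024^2  # bytes the GC currently considers live
    println("maxrss = $(round(rss_mb, digits=1)) MB, gc_live = $(round(live_mb, digits=1)) MB")
    sleep(30)
end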
Can you run it with |
I think their server run just accentuated the problem due to larger compute power - i.e., they can generate garbage quicker.
I don't have the background to know what this means, so forgive my naiveté - do you mean this is a memory leak? Or is it something else? |
I'm not sure yet. It looks to me like Julia thinks there are more and more live objects. At least it maps more and more memory. |
(Just sent the heap snapshots on slack by the way) |
Ping on this. Is there anything I can help with or look at? This is a major bug for downstream users so I want to fix it ASAP if at all possible |
One more experiment for the roster: I confirmed that |
Just want to add that using:
the memory usage increases much faster than with:
See MilesCranmer/PySR#764 (comment) for more details and graphs. |
Ran this for 5 min on my M2:
using SymbolicRegression

X = randn(Float32, 2, 2500)
y = randn(Float32, 2500)
options = SymbolicRegression.Options(binary_operators=[+, *, /, -], unary_operators=[cos, exp])
GC.enable_logging(true)
hall_of_fame = equation_search(X, y; niterations=1000000000, options=options)

Could reproduce the fast memory increase. I noticed that the vast majority of the memory reported by
FWIW, another user opened https://discourse.julialang.org/t/dont-understand-why-code-runs-out-of-memory-and-crashes/123559/22. After running their reproducer with Didn't investigate further to know whether it's related, but seems suspicious... |
Thanks @d-netto. Via that thread I also found |
I don't know. But malloc_trim would probably not be useful here. malloc_trim is useful when libc is holding onto pages of memory that have been freed and not giving them back to the OS. What seems to be happening here is that a bunch of malloc'd memory is being considered alive by the GC when it shouldn't be, so free doesn't even get a chance to run. |
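For completeness, on glibc-based Linux malloc_trim can be invoked from Julia via ccall - shown only as a sketch, since, as noted above, it likely wouldn't help while the GC still considers the memory live:
# glibc-only; asks the allocator to hand free heap pages back to the OS.
# Returns 1 if any memory was released, 0 otherwise.
trimmed = ccall(:malloc_trim, Cint, (Csize_t,), 0)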
Does a heap-snapshot show the memory? If so, it should tell us why we think it is being rooted. |
I think I might also be running into this problem, as described in EarthyScience/RQADeforestation.jl#36. I am currently producing heap snapshots with Julia 1.11. Is there something specific that I should be looking for in them? |
@vchuravy I don't know what to look for, but with devtools you can also diff two heap snapshots. |
A bisection using the following script:
using StatsBase
function Simulate()
    Simulations = Int(1e7)
    Size = 1000
    result = Array{Float64}(undef, Simulations, 1)
    Threads.@threads for i = 1:Simulations
        x = randn(Size)
        s = sort(x)
        result[i, 1] = s[1]
    end
    println(median(result))
end

for i in 1:1000
    println(i)
    Simulate()
    GC.gc(true)
    # Print live_bytes
    println("live_bytes in MB: ", Base.gc_live_bytes() / 1024^2)
    sleep(10) # sleep for 10 seconds
end

shows the memory increase appearing with PR 909bcea. |
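A sketch of how the reproducer above might be launched (the script name is hypothetical, and the --threads flag only matters for the Threads.@threads loop; as noted below, multithreading isn't required to reproduce):
# assuming the script is saved as bisect_repro.jl:
julia --threads=8 bisect_repro.jl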
Interesting. Is multithreading required to reproduce? If so, does it mean there's a race condition in the GC? |
Multithreading is not needed |
Is it any object, or only |
So the issue Gabriel found is #55223, but that doesn't seem to be the whole problem. |
Confirmed #56801 fixes this. |
We're seeing memory leaks in PySR/SymbolicRegression.jl that appear related to Julia 1.11's parallel GC. The user (@GoldenGoldy) tried various solutions including heap size hints and other parameter adjustments, but memory usage would steadily climb until OOM crashes occurred after 8-11 hours. The issue vanishes completely when switching to Julia 1.10 - no other changes needed. While we don't yet have a minimal working example in pure Julia, I wanted to raise this as it's causing OOM crashes in production workloads.
Full reproduction steps and details are in MilesCranmer/PySR#764, including detailed diagnostics on the memory usage.
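For anyone hitting this in production before a fix lands, a sketch of pinning the workload to 1.10 with juliaup (assuming juliaup is installed; the script name is a placeholder):
juliaup add 1.10       # install a Julia 1.10 channel
julia +1.10 my_run.jl  # run the workload on 1.10, where the leak does not appear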