
Running rqatrend seems to accumulate memory #36

Open · felixcremer opened this issue Dec 1, 2024 · 14 comments

@felixcremer
Collaborator

Haven't investigated yet, but when running rqatrend for different years and orbits, it fails after a few iterations with an out-of-memory error.

@felixcremer
Collaborator Author

When I am running rqatrend with 10 threads on a single worker, htop reports 2 GB more used memory after the run.

I tried looking into where this extra memory might be, but I haven't found the culprit yet.

@felixcremer
Collaborator Author

This might be a threading issue. I found this issue on the Julia repo, JuliaLang/julia#40626, and I am currently testing with 14 workers, but they are pushing against the total RAM and are also a lot slower than the combination of 3 workers with 10 threads each, apparently by a factor of 4.

@felixcremer
Collaborator Author

After letting it run for a few hours, the memory usage accumulates to 11.7 GB per worker and it throws an error indicating that the worker ran out of memory.

I can see the memory when I check the memory usage of the individual workers, but I can't free it by running GC.gc().
I am still not sure whether this is the threading issue, a Distributed issue, or a combination of both.
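
For reference, a minimal sketch (assuming the workers were added via Distributed's addprocs) of how the per-worker memory can be inspected and a full GC forced on every worker; in this case the reported memory did not go down afterwards:

```julia
using Distributed

# Report the maximum resident set size seen by each worker so far (in MiB).
for w in workers()
    rss_mb = remotecall_fetch(() -> Sys.maxrss() ÷ 2^20, w)
    println("worker $w: max RSS $(rss_mb) MiB")
end

# Force a full garbage collection on every worker; the accumulated memory
# described above was not released by this.
@everywhere GC.gc(true)
```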

@danlooo
Collaborator

danlooo commented Dec 4, 2024

Have you tried Profile.Allocs.@profile together with PProf.jl, e.g. see here?
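
A sketch of what that suggestion could look like (the rqatrend(cube) call is a placeholder for the actual invocation):

```julia
using Profile
import PProf

Profile.Allocs.clear()
# Sample a fraction of all allocations made during the call.
Profile.Allocs.@profile sample_rate = 0.01 rqatrend(cube)  # placeholder call
# Open an interactive pprof view of where the allocations happened.
PProf.Allocs.pprof()
```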

@felixcremer
Collaborator Author

Running it with Julia 1.11.2 still shows the same problem, maybe with a slightly slower increase, and the @time macro in front of the rqatrend call does not report any GC time, which might be a bad sign:

Progress: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:36:29
2189.929099 seconds (227.82 k allocations: 13.929 MiB, 0.00% gc time, 28 lock conflicts)
  0.191193 seconds (13.23 k allocations: 1.426 MiB)
path = "/mnt/felix1/worldmap/data/E048N015T3_rqatrend_VH_146_thresh_3.0_year_2018"
outpath = "/mnt/felix1/worldmap/data/E048N015T3_rqatrend_VH_146_thresh_3.0_year_2018.zarr"
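
A minimal sketch of how one could additionally track whether the GC ever reclaims heap between the yearly runs (the loop and run_single_year are hypothetical placeholders; Base.gc_live_bytes is the stdlib call):

```julia
for year in 2018:2022                     # hypothetical loop over the runs
    before = Base.gc_live_bytes()
    @time run_single_year(year)           # placeholder for the rqatrend/mapCube call
    after = Base.gc_live_bytes()
    println("live heap grew by $((after - before) ÷ 2^20) MiB in $year")
end
```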

@felixcremer
Collaborator Author

I tested it briefly on Julia 1.10 because it might also be a regression. This might be related to this issue:
JuliaLang/julia#56759
I will start the process for a longer time with Julia 1.10 to see whether it also slowly fills up the memory.

One suggestion in the linked issue was to take heap snapshots via Profile.take_heap_snapshot.
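
A sketch of that suggestion, assuming a recent Julia where Profile.take_heap_snapshot is available:

```julia
using Profile

# Writes .heapsnapshot files that can be opened in the Chrome DevTools
# "Memory" tab to inspect which objects keep the memory alive.
Profile.take_heap_snapshot("before_gc.heapsnapshot")
GC.gc(true)
Profile.take_heap_snapshot("after_gc.heapsnapshot")
```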

@MilesCranmer

MilesCranmer commented Dec 6, 2024

@felixcremer did switching to 1.10 help? If you think it's the same issue I was seeing in my package, please share on JuliaLang/julia#56759

@felixcremer
Collaborator Author

felixcremer commented Dec 9, 2024

~~I think this is not~~ Might be related to JuliaLang/julia#56759, because on Julia 1.10 I also see a similar behaviour: after multiple mapCube computations the memory fills up and the Julia process fails with an out-of-memory error. The memory accumulates a bit more slowly than with Julia 1.11, though, and then it does not fail but manages to hover at 95% usage. I am going to rerun it with Julia 1.11 as well to see when it runs out of memory.

Edit: I spoke too soon and assumed it was shortly going to fail with OOM, but it didn't.

@felixcremer
Collaborator Author

The heavy memory usage might also be related to JuliaLang/julia#55794. To rule that out, I am planning to run the mapCube with an in-memory array so that we don't have to deal with IO at all.
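
A sketch of what that in-memory test could look like with YAXArrays (the axis sizes and the inner function are placeholders; the real run would call the rqatrend kernel instead of the sum):

```julia
using YAXArrays, DimensionalData

# A cube fully backed by an in-memory Array, so no GDAL/Zarr IO is involved.
axlist = (Dim{:time}(1:200), Dim{:x}(1:100), Dim{:y}(1:100))
incube = YAXArray(axlist, rand(Float32, 200, 100, 100))

result = mapCube(incube; indims = InDims("time"), outdims = OutDims()) do xout, xin
    xout .= sum(xin)   # placeholder for the per-pixel rqatrend computation
end
```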

@felixcremer
Collaborator Author

Poking at this a bit more, I realized that we are not freeing the IRasterBand pointers from the BufferGDALBand, and that might be the memory that is no longer available to the GC.
I changed the ArchGDAL finalizer for the IRasterBand to print when it is finalized, and this is not shown when setting the gdalcube to nothing and running the GC afterwards.

We might be able to register a finalizer function for every cube we open to close the GDAL bands after the mapCube usage; see the sketch below.

There is also JuliaGeo/GDAL.jl#77, which might be related but is most likely not the main culprit.
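
A hedged sketch of that idea; the bands argument and the use of ArchGDAL.destroy as the cleanup call are assumptions, and it requires the cube object to be mutable so a finalizer can be attached:

```julia
import ArchGDAL

# Register a finalizer on an opened cube that explicitly drops the cached
# IRasterBand handles once the cube itself becomes unreachable.
function register_band_cleanup!(cube, bands)
    finalizer(cube) do _
        for b in bands
            ArchGDAL.destroy(b)   # release the reference to the GDAL raster band
        end
    end
    return cube
end
```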

@lazarusA

But that is the whole point of this approach. Otherwise you could simply use the YAXArrayBase approach.

@meggart
Collaborator

meggart commented Dec 13, 2024

> I changed the ArchGDAL finalizer for the IRasterBand to print when it is finalized and this is not shown when setting the gdalcube to nothing and running GC afterwards.

I think this does not work in general; printing to the terminal from within finalizers does not work. In the past there was usually a warning saying something like "Task switch from finalizers not allowed". So the missing print is not proof that the finalizer is not run. However, I agree with Lazaro: just testing this with the YAXArrayBase GDALBand would be a low-effort way to check whether this is about a defective IRasterBand finalizer.

@felixcremer
Collaborator Author

I am using @async for the printing so that it actually gets executed, and sometimes it does print in between.
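
For context, a sketch of that debugging trick (names are illustrative; printing directly inside a finalizer can fail with a task-switch error, so the message is deferred to a task via @async):

```julia
mutable struct TracedHandle
    ptr::Ptr{Cvoid}
end

function traced_handle(ptr = C_NULL)
    h = TracedHandle(ptr)
    finalizer(h) do obj
        @async println("finalizing handle at ", obj.ptr)  # deferred print
    end
    return h
end
```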

@felixcremer
Collaborator Author

It seems that this is much better on the current Julia master.
I am not yet sure whether this fully fixes it, but I am hopeful.
