-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memory Leak Fixes #28
Conversation
34c7b5b
to
2288e90
Compare
I just cloned the PR code and did a run using Bushland and Phillipsburg initial configs.
For Phillipsburg:
Don't understand why the initial config causes the difference. |
I put the full report here for convenience:
|
Good work!
…On Thu, Jun 13, 2024 at 9:41 AM Nels ***@***.***> wrote:
@stcui007 <https://github.com/stcui007> That config triggered a slightly
different code path which had some temporary allocations that weren't
free'd correctly. I ran the config and was able to fix up that leak as well
in e652bb2
<e652bb2>
.
—
Reply to this email directly, view it on GitHub
<#28 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACA4SRI5AAAMU5AVOXLW3IDZHGVRPAVCNFSM6AAAAABJHN7A4GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNRVHA3TMNRSHA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work everyone! Great that the memory leak issues have been so thoroughly identified / addressed. In future developments, we'll make sure we free dynamically allocated memory.
I've tested the changed code in standalone and in ngen, running multiple catchments from the same realization, and for the most part things seem to be good. The one change I would suggest is removing global_mass_balance from bmi_main_lgar.cxx because it is also called in Finalize (see my comment), so the simulation results are currently printed twice (at least in standalone mode). Please also note that there has been an accepted PR since this one, so I think rebasing might be necessary. Aside from these minor things, this looks great.
This reminds me a similar thing. It seems that in lgar.cxx, it prints our
"Mass balance" info every time step. For multi-processing jobs it does this
for every process. This becomes an issue for large scale processing, on
CONUS, for example, not only slows down the computation but also write out
a big file if you use nohup. Just wonder if this part can be optionally
made inactive when running in the framework.
…On Wed, Jul 3, 2024 at 2:13 PM PeterLaFollette-NOAA < ***@***.***> wrote:
***@***.**** commented on this pull request.
Awesome work everyone! Great that the memory leak issues have been so
thoroughly identified / addressed. In future developments, we'll make sure
we free dynamically allocated linked lists.
I've tested the changed code in standalone and in ngen, running multiple
catchments from the same realization, and for the most part things seem to
be good. The one change I would suggest is removing global_mass_balance
from bmi_main_lgar.cxx because it is also called in Finalize (see my
comment), so the simulation results are currently printed twice. Please
also note that there has been an accepted PR since this one, so I think
rebasing might be necessary. Aside from these minor things, this looks
great.
—
Reply to this email directly, view it on GitHub
<#28 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACA4SRPWZBRQZBMCYO53UW3ZKRENNAVCNFSM6AAAAABJHN7A4GVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDCNJXGE3TSOJUGE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
I'm wondering, what is the verbosity set to in the lgar configs you're
using? The options are:
verbosity=none
verbosity=low
verbosity=high
and if verbosity is set to none then there should only be printing when the
model completes (and not for every time step). The verbosity is set to none
in the Phillipsburg and Bushland examples in the configs folder.
On Wed, Jul 3, 2024 at 1:01 PM Shengting Cui ***@***.***>
wrote:
… This reminds me a similar thing. It seems that in lgar.cxx, it prints our
"Mass balance" info every time step. For multi-processing jobs it does
this
for every process. This becomes an issue for large scale processing, on
CONUS, for example, not only slows down the computation but also write out
a big file if you use nohup. Just wonder if this part can be optionally
made inactive when running in the framework.
On Wed, Jul 3, 2024 at 2:13 PM PeterLaFollette-NOAA <
***@***.***> wrote:
> ***@***.**** commented on this pull request.
>
> Awesome work everyone! Great that the memory leak issues have been so
> thoroughly identified / addressed. In future developments, we'll make
sure
> we free dynamically allocated linked lists.
>
> I've tested the changed code in standalone and in ngen, running multiple
> catchments from the same realization, and for the most part things seem
to
> be good. The one change I would suggest is removing global_mass_balance
> from bmi_main_lgar.cxx because it is also called in Finalize (see my
> comment), so the simulation results are currently printed twice. Please
> also note that there has been an accepted PR since this one, so I think
> rebasing might be necessary. Aside from these minor things, this looks
> great.
>
> —
> Reply to this email directly, view it on GitHub
> <#28 (review)>,
> or unsubscribe
> <
https://github.com/notifications/unsubscribe-auth/ACA4SRPWZBRQZBMCYO53UW3ZKRENNAVCNFSM6AAAAABJHN7A4GVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDCNJXGE3TSOJUGE>
> .
> You are receiving this because you were mentioned.Message ID:
> ***@***.***>
>
—
Reply to this email directly, view it on GitHub
<#28 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AOI6MQHNXH25XF7YLFAZ5HDZKRKCZAVCNFSM6AAAAABJHN7A4GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBXGEZDQOJVGQ>
.
You are receiving this because your review was requested.Message ID:
***@***.***>
|
The verbosity is set to none. The config was generated for me by Ahmad. One
example is as follows.
verbosity=none
soil_params_file=inputsNL/lasam/vG_default_params.dat
layer_thickness=200.0[cm]
initial_psi=2000.0[cm]
timestep=300[sec]
endtime=1000[hr]
forcing_resolution=3600[sec]
ponded_depth_max=0[cm]
use_closed_form_G=false
layer_soil_type=4
wilting_point_psi=15495.0[cm]
field_capacity_psi=340.9[cm]
If what you said is true, I should take another look before saying too
much. But I am pretty sure when I commented out these print statements, the
output file becomes much smaller.
On Wed, Jul 3, 2024 at 3:23 PM PeterLaFollette-NOAA <
***@***.***> wrote:
… I'm wondering, what is the verbosity set to in the lgar configs you're
using? The options are:
verbosity=none
verbosity=low
verbosity=high
and if verbosity is set to none then there should only be printing when
the
model completes (and not for every time step). The verbosity is set to
none
in the Phillipsburg and Bushland examples in the configs folder.
On Wed, Jul 3, 2024 at 1:01 PM Shengting Cui ***@***.***>
wrote:
> This reminds me a similar thing. It seems that in lgar.cxx, it prints
our
> "Mass balance" info every time step. For multi-processing jobs it does
> this
> for every process. This becomes an issue for large scale processing, on
> CONUS, for example, not only slows down the computation but also write
out
> a big file if you use nohup. Just wonder if this part can be optionally
> made inactive when running in the framework.
>
> On Wed, Jul 3, 2024 at 2:13 PM PeterLaFollette-NOAA <
> ***@***.***> wrote:
>
> > ***@***.**** commented on this pull request.
> >
> > Awesome work everyone! Great that the memory leak issues have been so
> > thoroughly identified / addressed. In future developments, we'll make
> sure
> > we free dynamically allocated linked lists.
> >
> > I've tested the changed code in standalone and in ngen, running
multiple
> > catchments from the same realization, and for the most part things
seem
> to
> > be good. The one change I would suggest is removing
global_mass_balance
> > from bmi_main_lgar.cxx because it is also called in Finalize (see my
> > comment), so the simulation results are currently printed twice.
Please
> > also note that there has been an accepted PR since this one, so I
think
> > rebasing might be necessary. Aside from these minor things, this looks
> > great.
> >
> > —
> > Reply to this email directly, view it on GitHub
> > <
#28 (review)>,
>
> > or unsubscribe
> > <
>
https://github.com/notifications/unsubscribe-auth/ACA4SRPWZBRQZBMCYO53UW3ZKRENNAVCNFSM6AAAAABJHN7A4GVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDCNJXGE3TSOJUGE>
>
> > .
> > You are receiving this because you were mentioned.Message ID:
> > ***@***.***>
> >
>
> —
> Reply to this email directly, view it on GitHub
> <#28 (comment)>,
or
> unsubscribe
> <
https://github.com/notifications/unsubscribe-auth/AOI6MQHNXH25XF7YLFAZ5HDZKRKCZAVCNFSM6AAAAABJHN7A4GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBXGEZDQOJVGQ>
> .
> You are receiving this because your review was requested.Message ID:
> ***@***.***>
>
—
Reply to this email directly, view it on GitHub
<#28 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACA4SRODOD67WRC72TPUZKLZKRMSVAVCNFSM6AAAAABJHN7A4GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBXGE4DEOJSHA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Co-authored-by: Phil Miller - NOAA <[email protected]>
Co-authored-by: Phil Miller - NOAA <[email protected]>
* adding the capability for an adaptive time step. If not provided in the config file, it defaults to false. * updating readme * updating readme * updating giuh fxn with adaptive time step * giuh now uses a fixed time step that is hardcoded to be the the same as the smallest adaptive time step for the rest of the model. If the adaptive time step is off, the giuh time step will be the same as the timestep input in the config file. * updating readme in configs * cleaned up code that was used for debugging * updating Phillipsburg config file * cleaned up code that was used for debugging * After chatting with Ahmad, we determined that it is best if the adaptive time step strategy does not alter the function in giuh.c because this function is used in multiple models. So, the adaptive time step method has been altered such that the giuh is computed at the hourly time step rather than at the sub timestep. This allows for the giuh fxn to be unaltered, and has the added bonus of not reqiring resampling for the giuh based on time step. * updating in response to comments, mostly focused on removing hardcoding from min and max time step values * more changes in response to comments, including formatting of giuh.c, removing hardcoding for internal time steps and replacing with user specified minimum time step, and establishing a maximum time step equal to the forcing resolution * removing code that was commented out --------- Co-authored-by: Peter La Follette <[email protected]> Co-authored-by: Peter La Follette <[email protected]>
This should be tracked in a separate issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again great work all. I've tested the updated code in ngen mode (running 1 or multiple catchments from one realization file) and in standalone mode, and the results are all as expected. From my perspective, the branch can be merged to master.
Serious memory leaks exist in this implementation, as noted in #27. Running through the llvm address-sanitizer with leak detection, the total memory leaked by the bmi test using
is nearly 30 MB!
SUMMARY: AddressSanitizer: 26907277 byte(s) leaked in 512522 allocation(s).
This happens predominantly because the linked list data structure supporting the layered wetting fronts dynamically allocates list elements using malloc, but these are never freed/deleted. These lists are then copied using
listCopy
, between each time step to track previous state and current state, but once copied, the old memory is inaccessible and "leaks". This is extremely problematic when used in the NextGen model engine (ngen) since we potentially initialize over 800k models, and then run each for a period of time ranging from hours to decades! These memory leaks at that scale quickly exhaust system resources (as evident by the 10 day CONUS tests where #27 was first identified!).Back of the envelop math implies this could leak up to 24 TB!!!! of memory in only a year long simulation (the length of time of the simulation used in the test above)!
This PR address the bulk of the memory leaks, especially those occurring in the BMI implementation. With these changes, the memory leaks are now minimal and seem to be related mostly to the use of
fprintf
andscanf
and shouldn't be leaking memory for every simulation time step!SUMMARY: AddressSanitizer: 949 byte(s) leaked in 11 allocation(s).
Care should be taken in future developments and work with this customized linked list to ensure that dynamic memory allocations are properly freed. This PR also shows where other model states and temporaries have dynamic memory allocation and had no corresponding cleanup code. Particularly with dynamic allocation in loops (such as in Update which had nested loops in a function intended to be called in a loop) dynamic memory use must be handled correctly.
Additions
Changes
Testing
Notes
Todos
Checklist
Target Environment support
Merging this PR closes #27.