Memory leak issue with ESPEI #262

Open
guannant opened this issue Nov 26, 2024 · 4 comments

@guannant

Hi there,

I was running the tutorial example for Cu-Mg on our HPC system and noticed a significant increase in memory usage as the iterations progressed. Specifically, the memory usage reached approximately 700 GB after 1,713 iterations (see the attached screenshot). This resulted in our system flagging the job due to excessive memory consumption.

It appears that this high memory demand may stem from one or both of the following:

  1. Retention of All Walker Positions: The emcee sampler in ESPEI incrementally retains references to all walker positions.
  2. Accumulation of Intermediate Results: The storage of self.sampler.chain and self.sampler.lnprobability may contribute to the memory growth.

To address this, I believe ESPEI could benefit from a mechanism to periodically save results to disk (e.g., every 100 iterations) and reset the emcee sampler to free memory.
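
As a rough illustration of that idea (not ESPEI's actual sampling loop), the sketch below runs plain emcee in fixed-size chunks, flushes each chunk to disk, and calls `sampler.reset()` so the in-memory chain stays bounded; the log-probability, problem sizes, and file names are placeholders.

```python
# Sketch only: chunked sampling with periodic flush-and-reset.
# log_prob, sizes, and file names are placeholders, not ESPEI's code.
import numpy as np
import emcee

def log_prob(theta):
    return -0.5 * np.sum(theta ** 2)  # stand-in for ESPEI's likelihood

nwalkers, ndim, total_iters, chunk = 32, 5, 2000, 100
state = np.random.randn(nwalkers, ndim)
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)

for start in range(0, total_iters, chunk):
    state = sampler.run_mcmc(state, chunk, progress=False)
    # write this chunk to disk, then drop it from the sampler's memory
    np.save(f"trace_{start:05d}.npy", sampler.get_chain())
    np.save(f"lnprob_{start:05d}.npy", sampler.get_log_prob())
    sampler.reset()
```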

I am happy to contribute by developing an HDF5 output module for ESPEI to replace the current use of numpy.save(). This would enable periodic pruning of the emcee sampler and provide a more memory-efficient workflow.
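
As one possible starting point for that module (not necessarily what ESPEI would adopt), emcee already ships an HDF5 backend that streams the chain and log-probabilities to disk as sampling proceeds; the file name and problem sizes below are placeholders.

```python
# Sketch of emcee's built-in HDF5 backend, which writes each step to disk
# so the full chain never has to be held in memory. Illustrative only.
import numpy as np
import emcee

def log_prob(theta):
    return -0.5 * np.sum(theta ** 2)  # stand-in likelihood

nwalkers, ndim = 32, 5
backend = emcee.backends.HDFBackend("espei_trace.h5")  # hypothetical file name
backend.reset(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, backend=backend)
sampler.run_mcmc(np.random.randn(nwalkers, ndim), 200, progress=False)

# read the trace back after the run instead of keeping it in RAM
print(backend.get_chain().shape)  # (nsteps, nwalkers, ndim)
```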

Let me know your thoughts on this!

[screenshot: memory usage climbing to roughly 700 GB over 1,713 iterations]
@bocklund
Member

Hey, thanks for reporting in and sorry you’re having trouble. I’ve documented some poking around on causes and mitigations to the memory leak in this issue: #230. Given what I’ve seen there, I don’t think emcee or numpy are responsible for the leak.

That said, I have been interested in exploring HDF5 I/O to bundle the trace and lnprob arrays, as well as archive the phase models, datasets, input YAML, and the numpy rng state for reproducibility and smoother restarts.
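
A rough sketch of what such a single-file archive could look like with h5py is below; the file layout, names, and the use of a numpy `Generator` for the RNG state are assumptions for illustration, not an agreed ESPEI format.

```python
# Hypothetical single-file archive: trace and lnprob as datasets, plus the
# input YAML text and the numpy RNG state stored for reproducible restarts.
import json
import h5py
import numpy as np

def archive_run(filename, trace, lnprob, input_yaml_text, rng):
    with h5py.File(filename, "w") as f:
        f.create_dataset("trace", data=trace, compression="gzip")
        f.create_dataset("lnprob", data=lnprob, compression="gzip")
        f.attrs["input_yaml"] = input_yaml_text
        # a numpy Generator's state is a plain dict, so it JSON-serializes
        f.attrs["rng_state"] = json.dumps(rng.bit_generator.state)

# usage with dummy arrays (shapes follow emcee's chain/lnprob conventions)
rng = np.random.default_rng(42)
archive_run("espei_run.h5", rng.normal(size=(100, 32, 5)),
            rng.normal(size=(100, 32)), "system: Cu-Mg ...", rng)
```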

@guannant
Author

> Hey, thanks for reporting in and sorry you’re having trouble. I’ve documented some poking around on causes and mitigations to the memory leak in this issue: #230. Given what I’ve seen there, I don’t think emcee or numpy are responsible for the leak.
>
> That said, I have been interested in exploring HDF5 I/O to bundle the trace and lnprob arrays, as well as archive the phase models, datasets, input YAML, and the numpy rng state for reproducibility and smoother restarts.

I see. What would be the recommended temporary fix here: restarting the scheduler or caching the symbols?

I can help out with the HDF5 output to combine at least the trace and lnprob arrays and make a pull request here. My project at ANL relies heavily on ESPEI, and we are also exploring the integration of different MCMC engines with ESPEI. Hopefully, this can be a good add-on feature to ESPEI in the future.

@bocklund
Member

bocklund commented Dec 6, 2024

Restarting the scheduler is easy and nice because it's entirely controlled in ESPEI. Symbol caching is a little more intrusive because it requires a change in PyCalphad.
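
As a rough illustration of the scheduler-restart mitigation (this is not ESPEI's do_sampling code), the sketch below periodically restarts the dask workers; `client.restart()` is a real dask.distributed call, but the surrounding loop and the 100-iteration cadence are arbitrary.

```python
# Illustrative only: periodically restart dask workers so per-worker memory
# (including anything they have cached) is released.
from dask.distributed import Client

client = Client()  # assumes a local cluster for this sketch

for iteration in range(1, 2001):
    # ... one MCMC iteration whose likelihood evaluations run on the workers ...
    if iteration % 100 == 0:
        client.restart()  # re-spawns the workers, clearing their memory
```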

@guannant
Author

guannant commented Dec 10, 2024

> Restarting the scheduler is easy and nice because it's entirely controlled in ESPEI. Symbol caching is a little more intrusive because it requires a change in PyCalphad.

I’m not sure if I handled this correctly, but restarting the scheduler inside do_sampling only prevents memory growth in the child processes; memory usage in the parent process continues to increase. I had to switch to a high-memory node with 1 TB of RAM just to let ESPEI complete the 2000 iterations set by default in the YAML file. Do you have any suggestions on how we might control the memory leak in the parent process?
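
(For reference, one way to confirm where the growth occurs is to log the resident memory of the parent process and its children around each iteration, e.g. with psutil; this is only a diagnostic sketch, and where it gets called from is up to the user.)

```python
# Diagnostic sketch: log RSS of the parent process and all of its children.
import os
import psutil

def log_memory(tag=""):
    parent = psutil.Process(os.getpid())
    child_rss = sum(c.memory_info().rss for c in parent.children(recursive=True))
    print(f"{tag} parent={parent.memory_info().rss / 1e9:.2f} GB "
          f"children={child_rss / 1e9:.2f} GB")

log_memory("iteration 100:")
```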
