Memory leak issue with ESPEI #262

Open
guannant opened this issue Nov 26, 2024 · 4 comments

@guannant

Hi there,

I was running the tutorial example for Cu-Mg on our HPC system and noticed a significant increase in memory usage as the iterations progressed. Specifically, the memory usage reached approximately 700 GB after 1,713 iterations (see the attached screenshot). This resulted in our system flagging the job due to excessive memory consumption.

It appears that this high memory demand may stem from one or both of the following:

  1. Retention of All Walker Positions: The emcee sampler in ESPEI incrementally retains references to all walker positions.
  2. Accumulation of Intermediate Results: The storage of self.sampler.chain and self.sampler.lnprobability may contribute to the memory growth.

To address this, I believe ESPEI could benefit from a mechanism to periodically save results to disk (e.g., every 100 iterations) and reset the emcee sampler to free memory.
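
As a rough illustration of that idea (not ESPEI's actual sampling loop), the sketch below runs plain emcee in fixed-size chunks, flushes each chunk to disk, and calls `sampler.reset()` so the in-memory chain stays bounded; the log-probability, problem sizes, and file names are placeholders.

```python
# Sketch only: chunked sampling with periodic flush-and-reset.
# log_prob, sizes, and file names are placeholders, not ESPEI's code.
import numpy as np
import emcee

def log_prob(theta):
    return -0.5 * np.sum(theta ** 2)  # stand-in for ESPEI's likelihood

nwalkers, ndim, total_iters, chunk = 32, 5, 2000, 100
state = np.random.randn(nwalkers, ndim)
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)

for start in range(0, total_iters, chunk):
    state = sampler.run_mcmc(state, chunk, progress=False)
    # write this chunk to disk, then drop it from the sampler's memory
    np.save(f"trace_{start:05d}.npy", sampler.get_chain())
    np.save(f"lnprob_{start:05d}.npy", sampler.get_log_prob())
    sampler.reset()
```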

I am happy to contribute by developing an HDF5 output module for ESPEI to replace the current use of numpy.save(). This would enable periodic pruning of the emcee sampler and provide a more memory-efficient workflow.
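
As one possible starting point for that module (not necessarily what ESPEI would adopt), emcee already ships an HDF5 backend that streams the chain and log-probabilities to disk as sampling proceeds; the file name and problem sizes below are placeholders.

```python
# Sketch of emcee's built-in HDF5 backend, which writes each step to disk
# so the full chain never has to be held in memory. Illustrative only.
import numpy as np
import emcee

def log_prob(theta):
    return -0.5 * np.sum(theta ** 2)  # stand-in likelihood

nwalkers, ndim = 32, 5
backend = emcee.backends.HDFBackend("espei_trace.h5")  # hypothetical file name
backend.reset(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, backend=backend)
sampler.run_mcmc(np.random.randn(nwalkers, ndim), 200, progress=False)

# read the trace back after the run instead of keeping it in RAM
print(backend.get_chain().shape)  # (nsteps, nwalkers, ndim)
```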

Let me know your thoughts on this!

[screenshot: memory usage climbing to roughly 700 GB over 1,713 iterations]
@bocklund
Member

Hey, thanks for reporting in and sorry you’re having trouble. I’ve documented some poking around on causes and mitigations to the memory leak in this issue: #230. Given what I’ve seen there, I don’t think emcee or numpy are responsible for the leak.

That said, I have been interested in exploring HDF5 I/O to bundle the trace and lnprob arrays, as well as archive the phase models, datasets, input YAML, and the numpy rng state for reproducibility and smoother restarts.
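
A rough sketch of what such a single-file archive could look like with h5py is below; the file layout, names, and the use of a numpy `Generator` for the RNG state are assumptions for illustration, not an agreed ESPEI format.

```python
# Hypothetical single-file archive: trace and lnprob as datasets, plus the
# input YAML text and the numpy RNG state stored for reproducible restarts.
import json
import h5py
import numpy as np

def archive_run(filename, trace, lnprob, input_yaml_text, rng):
    with h5py.File(filename, "w") as f:
        f.create_dataset("trace", data=trace, compression="gzip")
        f.create_dataset("lnprob", data=lnprob, compression="gzip")
        f.attrs["input_yaml"] = input_yaml_text
        # a numpy Generator's state is a plain dict, so it JSON-serializes
        f.attrs["rng_state"] = json.dumps(rng.bit_generator.state)

# usage with dummy arrays (shapes follow emcee's chain/lnprob conventions)
rng = np.random.default_rng(42)
archive_run("espei_run.h5", rng.normal(size=(100, 32, 5)),
            rng.normal(size=(100, 32)), "system: Cu-Mg ...", rng)
```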

@guannant
Author

> Hey, thanks for reporting in and sorry you’re having trouble. I’ve documented some poking around on causes and mitigations to the memory leak in this issue: #230. Given what I’ve seen there, I don’t think emcee or numpy are responsible for the leak.
>
> That said, I have been interested in exploring HDF5 I/O to bundle the trace and lnprob arrays, as well as archive the phase models, datasets, input YAML, and the numpy rng state for reproducibility and smoother restarts.

I see. What would be the recommended temporary fix here: restarting the scheduler or caching the symbols?

I can help out with the HDF5 output to combine at least the trace and lnprob arrays and make a pull request here. My project at ANL relies heavily on ESPEI, and we are also exploring the integration of different MCMC engines with ESPEI. Hopefully, this can be a good add-on feature to ESPEI in the future.

@bocklund
Member

bocklund commented Dec 6, 2024

Restarting the scheduler is easy and nice because it's entirely controlled in ESPEI. Symbol caching is a little more intrusive because it requires a change in PyCalphad.
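
As a rough illustration of the scheduler-restart mitigation (this is not ESPEI's do_sampling code), the sketch below periodically restarts the dask workers; `client.restart()` is a real dask.distributed call, but the surrounding loop and the 100-iteration cadence are arbitrary.

```python
# Illustrative only: periodically restart dask workers so per-worker memory
# (including anything they have cached) is released.
from dask.distributed import Client

client = Client()  # assumes a local cluster for this sketch

for iteration in range(1, 2001):
    # ... one MCMC iteration whose likelihood evaluations run on the workers ...
    if iteration % 100 == 0:
        client.restart()  # re-spawns the workers, clearing their memory
```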

@guannant
Author

guannant commented Dec 10, 2024

> Restarting the scheduler is easy and nice because it's entirely controlled in ESPEI. Symbol caching is a little more intrusive because it requires a change in PyCalphad.

I’m not sure if I handled this correctly, but restarting the scheduler inside do_sampling only prevents memory growth in the child processes; memory usage in the parent process continues to increase. I had to switch to a high-memory node with 1 TB of RAM just to let ESPEI complete the 2000 iterations set by default in the YAML file. Do you have any suggestions on how we might control the memory leak in the parent process?
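
(For reference, one way to confirm where the growth occurs is to log the resident memory of the parent process and its children around each iteration, e.g. with psutil; this is only a diagnostic sketch, and where it gets called from is up to the user.)

```python
# Diagnostic sketch: log RSS of the parent process and all of its children.
import os
import psutil

def log_memory(tag=""):
    parent = psutil.Process(os.getpid())
    child_rss = sum(c.memory_info().rss for c in parent.children(recursive=True))
    print(f"{tag} parent={parent.memory_info().rss / 1e9:.2f} GB "
          f"children={child_rss / 1e9:.2f} GB")

log_memory("iteration 100:")
```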
