Memory leak issue with ESPEI #262
Comments
Hey, thanks for reporting in and sorry you're having trouble. I've documented some poking around on the causes of and mitigations for the memory leak in this issue: #230. Given what I've seen there, I don't think emcee or numpy are responsible for the leak. That said, I have been interested in exploring HDF5 I/O to bundle the trace and lnprob arrays as well.
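For illustration, a minimal sketch of bundling the two arrays into one HDF5 file with h5py might look like the following (the file name, dataset names, and compression settings here are assumptions, not ESPEI's current output format):

```python
import h5py
import numpy as np

def save_trace_hdf5(filename, trace, lnprob):
    # Bundle both arrays into a single HDF5 file; the dataset names mirror
    # the trace/lnprob arrays mentioned above, but the layout is illustrative.
    with h5py.File(filename, "w") as f:
        f.create_dataset("trace", data=np.asarray(trace), compression="gzip")
        f.create_dataset("lnprob", data=np.asarray(lnprob), compression="gzip")

def load_trace_hdf5(filename):
    with h5py.File(filename, "r") as f:
        return f["trace"][...], f["lnprob"][...]
```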
I see. What would be the recommended temporary fix here: restarting the scheduler or caching the symbols? I can help out with the HDF5 output to combine at least the trace and lnprob arrays and open a pull request here. My project at ANL relies heavily on ESPEI, and we are also exploring the integration of different MCMC engines with it. Hopefully this can be a good add-on feature for ESPEI in the future.
Restarting the scheduler is easy and nice because it's entirely controlled in ESPEI. Symbol caching is a little more intrusive because it requires a change in PyCalphad.
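As a rough sketch of that first option, a Dask client can be restarted between blocks of iterations to release worker memory (the `run_block` helper, client setup, and block size below are hypothetical, not ESPEI's actual control flow):

```python
from dask.distributed import Client

def run_block(client, n_iterations):
    # Hypothetical placeholder for one block of MCMC iterations submitted to
    # the cluster; ESPEI's real sampling loop lives elsewhere (do_sampling).
    pass

client = Client()            # assumption: however the scheduler is normally created
total_iterations = 2000      # the YAML default mentioned in this thread
restart_every = 100          # assumed block size

for _ in range(total_iterations // restart_every):
    run_block(client, restart_every)
    client.restart()         # drop worker state between blocks to free memory
```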
I'm not sure if I handled this correctly, but restarting the scheduler inside do_sampling only prevents memory growth in the child process. The memory usage in the parent process continues to increase. I had to switch to a high-memory node with 1 TB of RAM just to allow ESPEI to complete the 2000 iterations set by default in the YAML file. By any chance, do you have any suggestions on how we might control the memory leak in the parent process?
Hi there,
I was running the tutorial example for Cu-Mg on our HPC system and noticed a significant increase in memory usage as the iterations progressed. Specifically, the memory usage reached approximately 700GB after 1,713 iterations (see the attached screenshot). This resulted in our system flagging the job due to excessive memory consumption.
It appears that this high memory demand may stem from one or both of the following:
To address this, I believe ESPEI could benefit from a mechanism to periodically save results to disk (e.g., every 100 iterations) and reset the emcee sampler to free memory.
I am happy to contribute by developing an HDF5 output module for ESPEI to replace the current use of numpy.save(). This would enable periodic pruning of the emcee sampler and provide a more memory-efficient workflow.
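A minimal sketch of that workflow, using emcee's public API and h5py, might look like the following (the log-probability function, array shapes, and file name are placeholders rather than ESPEI's internals):

```python
import emcee
import h5py
import numpy as np

def log_prob(theta):
    # Placeholder log-probability; ESPEI would supply its own likelihood here.
    return -0.5 * np.sum(theta ** 2)

ndim, nwalkers = 5, 20
chunk, n_chunks = 100, 20              # 100 iterations per chunk, 2000 total
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
state = np.random.randn(nwalkers, ndim)

with h5py.File("trace.h5", "w") as f:
    trace = f.create_dataset("trace", shape=(0, nwalkers, ndim),
                             maxshape=(None, nwalkers, ndim), dtype="f8")
    lnprob = f.create_dataset("lnprob", shape=(0, nwalkers),
                              maxshape=(None, nwalkers), dtype="f8")
    for _ in range(n_chunks):
        state = sampler.run_mcmc(state, chunk)
        chain = sampler.get_chain()        # (chunk, nwalkers, ndim)
        logp = sampler.get_log_prob()      # (chunk, nwalkers)
        # Append this chunk to disk, then prune it from memory.
        trace.resize(trace.shape[0] + chunk, axis=0)
        trace[-chunk:] = chain
        lnprob.resize(lnprob.shape[0] + chunk, axis=0)
        lnprob[-chunk:] = logp
        sampler.reset()                    # drop the in-memory chain
```

emcee also ships an `HDFBackend` that streams the chain to disk incrementally, which could be another starting point for this.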
Let me know your thoughts on this!