
Call .destroy() on DeepSpeedEngine somewhere post training #28178

Closed
1 of 4 tasks
chiragjn opened this issue Dec 21, 2023 · 2 comments
Comments
@chiragjn
System Info

transformers==4.36.2
accelerate==0.25.0
deepspeed==0.12.5

Who can help?

I was using DeepSpeed ZeRO stage 2 with Trainer and accelerate. At the end of training, after the Trainer had been garbage collected, I noticed that my GPU VRAM was not being freed even after aggressively calling gc.collect() and torch.cuda.empty_cache().

I spent some time debugging and narrowed it down to the DeepSpeed optimizer not removing its hooks on PyTorch tensors.
I have submitted a PR on DeepSpeed: microsoft/DeepSpeed#4858
But to invoke that cleanup logic, engine.destroy() must be called somewhere post-training.

For now, I am manually calling it outside the Trainer post-training and can confirm it works. It would be nice if Trainer could take care of this, or if there were at least a note in the docs.
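The manual workaround described above can be sketched roughly as below. The `release_deepspeed` helper name is an assumption of mine, not anything in transformers or DeepSpeed; the only calls taken from the issue are `destroy()`, `gc.collect()`, and `torch.cuda.empty_cache()`. The stand-in engine class just demonstrates the call pattern without requiring DeepSpeed to be installed.

```python
import gc


def release_deepspeed(engine):
    """Best-effort teardown after training (hypothetical helper).

    `engine` is whatever object holds the DeepSpeed hooks. Calling its
    destroy() method (added in microsoft/DeepSpeed#4858) removes the
    optimizer's hooks on parameter tensors so the GC can reclaim them.
    """
    destroy = getattr(engine, "destroy", None)
    if callable(destroy):
        destroy()  # drop the optimizer's tensor hooks
    gc.collect()   # collect the now-unreferenced engine internals
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the driver
    except ImportError:
        pass  # torch not installed; nothing cached to release


# Stand-in engine, only to show the call pattern end to end.
class _FakeEngine:
    def __init__(self):
        self.destroyed = False

    def destroy(self):
        self.destroyed = True


eng = _FakeEngine()
release_deepspeed(eng)
print(eng.destroyed)  # → True
```

In a real run, `engine` would be the DeepSpeed-wrapped model left over from training, and this cleanup would be called once the Trainer is no longer needed.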

@pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  • Train any model with ZeRO stage 2 + gradient accumulation, then delete the Trainer and let it be garbage collected; the model parameters will still linger in GPU memory

Expected behavior

GPU memory should be reclaimable post-training.

@huggingface deleted a comment from github-actions bot on Jan 22, 2024
@huggingface deleted a comment from github-actions bot on Feb 16, 2024
@amyeroberts
Collaborator

Gentle ping @pacman100 for your thoughts on this feature addition

@chiragjn
Author

This was resolved in huggingface/accelerate#2716

2 participants