Call .destroy() on DeepSpeedEngine somewhere post training #28178
System Info
transformers==4.36.2
accelerate==0.25.0
deepspeed==0.12.5
Who can help?
I was using DeepSpeed stage 2 with Trainer and accelerate. At the end of training, after the Trainer had been garbage collected, I noticed my GPU VRAM was not clearing, even after aggressively calling
gc.collect()
and torch.cuda.empty_cache()
I spent some time debugging and narrowed it down to the DeepSpeed optimizer not removing its hooks on PyTorch tensors.
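For illustration, here is a minimal sketch of the kind of check that shows the symptom; the trainer variable is a placeholder and the printed number is not from a real run:

```python
import gc
import torch

# After training finishes, drop every reference to the Trainer and force cleanup.
del trainer
gc.collect()
torch.cuda.empty_cache()

# VRAM stays allocated: DeepSpeed's hooks still hold references to the parameters,
# so the CUDA caching allocator cannot release that memory.
print(f"allocated after cleanup: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
```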
I have submitted a PR to DeepSpeed: microsoft/DeepSpeed#4858
But to invoke this logic,
engine.destroy()
must be called somewhere post-training. For now, I am manually calling it outside the Trainer after training and can confirm it works. It would be nice if Trainer could take care of this, or if there were a note in the docs.
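For reference, a minimal sketch of that manual workaround might look like the following; it assumes the DeepSpeedEngine is reachable via trainer.model_wrapped, which is an assumption about how Trainer wraps the model in this version rather than something stated in this report:

```python
import gc
import torch

trainer.train()

# When DeepSpeed is enabled, the wrapped model is assumed to be the DeepSpeedEngine
# (verify the attribute for your transformers version).
engine = trainer.model_wrapped
if hasattr(engine, "destroy"):
    engine.destroy()  # removes DeepSpeed's tensor hooks so memory can actually be freed

# Drop remaining references and let the CUDA caching allocator release the VRAM.
del trainer, engine
gc.collect()
torch.cuda.empty_cache()
```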
@pacman100
Expected behavior
GPU memory should be reclaimable after training completes.