
Call .destroy() on DeepSpeedEngine somewhere post training #28178

Closed
1 of 4 tasks
chiragjn opened this issue Dec 21, 2023 · 2 comments
Comments
@chiragjn
System Info

transformers==4.36.2
accelerate==0.25.0
deepspeed==0.12.5

Who can help?

I was using DeepSpeed ZeRO stage 2 with Trainer and accelerate. At the end of training, after the Trainer had been garbage collected, I noticed that my GPU VRAM was not being freed even after aggressively calling gc.collect() and torch.cuda.empty_cache().

I spent some time debugging and narrowed it down to the DeepSpeed optimizer not removing its hooks on PyTorch tensors.
I have submitted a PR on DeepSpeed: microsoft/DeepSpeed#4858
But to invoke that cleanup logic, engine.destroy() must be called somewhere post-training.

For now, I am manually calling it outside the Trainer post-training and can confirm it works. It would be nice if Trainer could take care of this, or if there were at least a note in the docs.
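The manual workaround described above can be sketched roughly as below. The `release_deepspeed` helper name is an assumption of mine, not anything in transformers or DeepSpeed; the only calls taken from the issue are `destroy()`, `gc.collect()`, and `torch.cuda.empty_cache()`. The stand-in engine class just demonstrates the call pattern without requiring DeepSpeed to be installed.

```python
import gc


def release_deepspeed(engine):
    """Best-effort teardown after training (hypothetical helper).

    `engine` is whatever object holds the DeepSpeed hooks. Calling its
    destroy() method (added in microsoft/DeepSpeed#4858) removes the
    optimizer's hooks on parameter tensors so the GC can reclaim them.
    """
    destroy = getattr(engine, "destroy", None)
    if callable(destroy):
        destroy()  # drop the optimizer's tensor hooks
    gc.collect()   # collect the now-unreferenced engine internals
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the driver
    except ImportError:
        pass  # torch not installed; nothing cached to release


# Stand-in engine, only to show the call pattern end to end.
class _FakeEngine:
    def __init__(self):
        self.destroyed = False

    def destroy(self):
        self.destroyed = True


eng = _FakeEngine()
release_deepspeed(eng)
print(eng.destroyed)  # → True
```

In a real run, `engine` would be the DeepSpeed-wrapped model left over from training, and this cleanup would be called once the Trainer is no longer needed.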

@pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  • Train any model with ZeRO stage 2 + gradient accumulation, then delete the Trainer and let it be garbage collected; the model parameters will still linger in GPU memory

Expected behavior

GPU memory should be reclaimable post-training.

@huggingface deleted a comment from github-actions bot on Jan 22, 2024
@huggingface deleted a comment from github-actions bot on Feb 16, 2024
@amyeroberts
Collaborator

Gentle ping @pacman100 for your thoughts on this feature addition

@chiragjn
Author

This was resolved in huggingface/accelerate#2716

2 participants