Torch gradient_checkpoint_scope potential memory leak #1582

Open · albertz opened this issue Jul 12, 2024 · 0 comments

albertz (Member) commented Jul 12, 2024

Training was running fine for 29 subepochs but then crashed with a CPU OOM.

While I sometimes see CPU OOMs in my setups, that is usually only after much longer trainings, so hitting one this early was unexpected. I'm not sure whether this is due to gradient_checkpoint_scope or something else, but the use of gradient_checkpoint_scope is the main difference from my earlier setups. It could still be a random hiccup, so let's see whether this occurs more frequently now.
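For context, a minimal sketch of the kind of usage in question (the import path and the weight-noise pattern here are assumptions for illustration, not copied from the crashing config):

```python
import torch
# Assumed module path for RETURNN's Torch gradient checkpointing utility.
from returnn.torch.util.gradient_checkpoint import gradient_checkpoint_scope


class LinearWithWeightNoise(torch.nn.Module):
    """Hypothetical linear layer with weight noise. The noisy weight is computed
    inside gradient_checkpoint_scope, so it is not kept alive for backprop but
    recomputed when the backward pass needs it."""

    def __init__(self, in_dim: int, out_dim: int, noise_std: float = 0.01):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_dim, in_dim) * in_dim ** -0.5)
        self.noise_std = noise_std

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with gradient_checkpoint_scope():
            # Tensors created in this scope are recomputed in backprop
            # instead of being stored, trading compute for memory.
            noisy_weight = self.weight + self.noise_std * torch.randn_like(self.weight)
        return torch.nn.functional.linear(x, noisy_weight)
```

If the scope kept references to such intermediate tensors (or to its own bookkeeping state) alive across steps, that would accumulate over training and could eventually show up as a CPU OOM like the one above.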
