
Optimize memory when loading checkpoint #246

Merged · 17 commits into main from nouamane/optim-state-cpu-offload · Nov 25, 2024

Conversation

@NouamaneTazi commented on Nov 21, 2024

Offload memory states to CPU to avoid memory peak when loading checkpoint, and only move to GPU after first fwd-bwd
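
For illustration, here is a minimal PyTorch sketch of that idea. The helper names and the resume flow below are assumptions for this sketch, not the actual nanotron implementation:

```python
import torch


def offload_optimizer_state_to_cpu(optimizer: torch.optim.Optimizer) -> None:
    """Move every optimizer state tensor (e.g. Adam's exp_avg/exp_avg_sq) to CPU."""
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to("cpu")


def move_optimizer_state_to(optimizer: torch.optim.Optimizer, device: str) -> None:
    """Bring the optimizer state back to the training device."""
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)


# Hypothetical resume flow: load the checkpoint, keep the optimizer state on CPU
# during the first fwd-bwd (the allocation peak), then move it to GPU for the step.
# offload_optimizer_state_to_cpu(optimizer)
# loss = model(batch); loss.backward()        # first fwd-bwd without the state on GPU
# move_optimizer_state_to(optimizer, "cuda")
# optimizer.step(); optimizer.zero_grad()
```

With the state kept on CPU during the first fwd-bwd, the resume peak should line up with the from-scratch peak shown in the traces below.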

Details

It seems that peak reserved memory is higher when we load a checkpoint because by the time we do the first fwd-bwd, which is the heaviest phase in terms of memory allocation, the optimizer's state is already loaded.
When we train from scratch, the first fwd-bwd runs before the optimizer's state exists: we do fwd-bwd, free some memory, and only then allocate the optimizer's state.

Case: training from scratch
Memory usage: 9793.18MiB. Peak allocated 10465.18MiB. Peak reserved: 10972.00MiB # before fwd-bwd
Memory usage: 9927.32MiB. **Peak allocated 21428.54MiB**. Peak reserved: 21926.00MiB # after fwd-bwd
Memory usage: 9927.32MiB. Peak allocated 10695.32MiB. Peak reserved: 21926.00MiB # before optim step
Memory usage: 16456.16MiB. Peak allocated 16648.16MiB. Peak reserved: 21926.00MiB # after optim step + zero_grad
>>iter2
Memory usage: 16456.16MiB. Peak allocated 17128.16MiB. Peak reserved: 21926.00MiB # before fwd-bwd


Case: resuming from a checkpoint
Memory usage: 16321.23MiB. Peak allocated 17153.23MiB. Peak reserved: 17392.00MiB # before fwd-bwd
Memory usage: 16455.37MiB. **Peak allocated 27956.59MiB**. Peak reserved: 28474.00MiB # after fwd-bwd
Memory usage: 16455.37MiB. Peak allocated 17223.37MiB. Peak reserved: 28474.00MiB # before optim step
Memory usage: 16456.16MiB. Peak allocated 16647.37MiB. Peak reserved: 28474.00MiB # after optim step + zero_grad
>>iter2
Memory usage: 16456.16MiB. Peak allocated 17128.16MiB. Peak reserved: 28428.00MiB # before fwd-bwd
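
For reference, log lines like the ones above can be produced with `torch.cuda` memory stats; this is only a sketch of the measurement, and the exact label format used by nanotron's logger may differ:

```python
import torch


def log_memory(tag: str) -> None:
    """Print allocated / peak-allocated / peak-reserved CUDA memory in MiB."""
    mib = 1024 ** 2
    print(
        f"Memory usage: {torch.cuda.memory_allocated() / mib:.2f}MiB. "
        f"Peak allocated {torch.cuda.max_memory_allocated() / mib:.2f}MiB. "
        f"Peak reserved: {torch.cuda.max_memory_reserved() / mib:.2f}MiB  # {tag}"
    )


# e.g. call log_memory("before fwd-bwd") / log_memory("after fwd-bwd") around each phase
```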

TL;DR: when loading a checkpoint, the peak memory demand exceeds the from-scratch peak by the size of the optimizer's state:

Peak reserved (when loading checkpoint) = Peak reserved (when training from scratch) + Optimizer's state

In the example above:
28 GiB ≈ 22 GiB + 6 GiB
The optimizer's state is ≈ 6.4 GiB = 0.856×10⁹ (local params) × 2 (Adam state tensors) × 4 bytes / 1024³
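
As a quick sanity check of that estimate (a back-of-the-envelope computation, not taken from the PR itself):

```python
local_params = 0.856e9                      # local (per-rank) parameter count
adam_state_bytes = local_params * 2 * 4     # exp_avg + exp_avg_sq, fp32 (4 bytes each)
print(f"{adam_state_bytes / 1024**3:.1f} GiB")   # -> 6.4 GiB, matching 28 GiB - 22 GiB
```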

@NouamaneTazi force-pushed the nouamane/optim-state-cpu-offload branch from 7299fdc to bc25a35 on November 21, 2024 16:15

@xrsrke left a comment


LGTM. Left a small comment.

@NouamaneTazi merged commit e694f6d into main on Nov 25, 2024
3 of 4 checks passed