
DeepSpeedZeroOptimizer: refactor bit16 flattening to support more accelerators #4833

Merged
merged 11 commits into microsoft:master on Jan 11, 2024

Conversation

nelyahu
Contributor

@nelyahu nelyahu commented Dec 18, 2023

Until now, the approach has been to replace the torch.nn.Parameter data with new CPU data storage in order to offload device memory.
All params are then flattened on the host and moved back to the device.
On some accelerators, a torch.nn.Parameter that lives on the device cannot be assigned CPU storage.
This PR instead copies the param data into a new CPU tensor and shrinks the device storage.
Later, when the flat buffer is moved to the device, param.data becomes a view into the flat buffer.
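For context, here is a minimal sketch of the offload pattern this PR replaces, written in plain PyTorch with illustrative names (this is not the actual DeepSpeed code):

```python
import torch

# Previous pattern (illustrative): a Parameter created on the accelerator
# has its .data re-bound to a freshly allocated CPU copy to free device
# memory before flattening. Some accelerator backends reject assigning
# CPU storage to a parameter object that was created on the device.
device = "cuda" if torch.cuda.is_available() else "cpu"
param = torch.nn.Parameter(torch.randn(4, 4, device=device))
param.data = param.data.to("cpu")  # rejected on accelerators that forbid this
```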

nelyahu and others added 2 commits December 12, 2023 13:40
Today, DeepSpeedZeroOptimizer flattens the FP16 weights (which are on the
device) by moving param.data to the CPU while maintaining the same
param object. This practice does not work with all device types.
This commit introduces a new approach that avoids sharing storage
between CPU and device (a sketch of the steps follows this list):
1. copy each param.data to a new CPU tensor
2. keep the param object on the device, and shrink its storage
3. flatten the CPU tensors into a host flat tensor
4. move the flat tensor to the device
5. resize the device params back to their original shapes
6. point each param to its offset in the flat buffer
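A minimal sketch of these six steps in plain PyTorch, assuming torch's internal _flatten_dense_tensors utility as a stand-in for DeepSpeed's flatten helper; the function name and structure are illustrative, not the actual DeepSpeedZeroOptimizer code:

```python
import torch
from torch._utils import _flatten_dense_tensors


def flatten_bit16_params(params, device):
    """Illustrative sketch of the flattening steps described above."""
    # 1. Copy each param.data to a new CPU tensor.
    cpu_copies = [p.data.to("cpu", copy=True) for p in params]
    shapes = [p.shape for p in params]

    # 2. Keep the param objects on the device, but shrink their storage.
    for p in params:
        p.data = torch.empty(0, dtype=p.dtype, device=device)

    # 3. Flatten the CPU copies into a single host flat tensor.
    flat_cpu = _flatten_dense_tensors(cpu_copies)

    # 4. Move the flat buffer to the device.
    flat_device = flat_cpu.to(device)

    # 5 + 6. Resize each device param back to its original shape and point
    # it at its offset in the flat buffer, so param.data becomes a view.
    offset = 0
    for p, shape in zip(params, shapes):
        numel = shape.numel()
        p.data = flat_device.narrow(0, offset, numel).view(shape)
        offset += numel
    return flat_device
```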
@nelyahu nelyahu changed the title DeepSpeedZeroOptimizer: refactor bit16 flatenning to support more accelerators DeepSpeedZeroOptimizer: refactor bit16 flattening to support more accelerators Dec 18, 2023
@tjruwase tjruwase self-assigned this Dec 18, 2023
@nelyahu
Contributor Author

nelyahu commented Jan 10, 2024

Hi @tjruwase - I fixed a failing UT issue; can you please re-trigger the workflow?

@tjruwase tjruwase added this pull request to the merge queue Jan 10, 2024
Merged via the queue into microsoft:master with commit ade9836 Jan 11, 2024
13 checks passed
@nelyahu nelyahu deleted the zeroOptParamsFlatenning branch February 4, 2024 11:58
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
envsp added a commit to envsp/DeepSpeed that referenced this pull request Jun 26, 2024