OptimizedLinear updates #5791

Merged
merged 20 commits · Aug 14, 2024

Conversation

@jeffra (Collaborator) commented Jul 23, 2024

This is a refresh of OptimizedLinear, with the following features to improve performance and usability:

  • More efficient sharing of base weights using all_gather_into_tensor (see the all-gather sketch below)
  • Flattened sharded weights
  • Selective offload of frozen weights to CPU
  • deepspeed.linear.Init, which allows injecting OptimizedLinear during model construction (similar to zero.Init); see the sketch after this list
  • Support for load_state_dict directly in OptimizedLinear, which allows loading HF model weights correctly into sharded params
  • Various bug fixes for the previously introduced LoRA implementation
  • Several new unit tests
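
To make the construction path concrete, here is a minimal sketch of how deepspeed.linear.Init might be used, by analogy to zero.Init; the config field names and Init keyword arguments below are assumptions for illustration, not taken from this PR:

```python
import torch.nn as nn
from deepspeed.linear import Init, LoRAConfig, QuantizationConfig

# Hypothetical configs -- the field names here are assumptions.
lora_cfg = LoRAConfig(lora_r=64, lora_alpha=16)
quant_cfg = QuantizationConfig(q_bits=8)

# Inside the context, Linear layers created during model construction
# are injected as OptimizedLinear (analogous to deepspeed.zero.Init).
with Init(lora_config=lora_cfg, quant_config=quant_cfg):
    model = nn.Sequential(
        nn.Linear(4096, 11008),
        nn.GELU(),
        nn.Linear(11008, 4096),
    )

# With load_state_dict now supported directly on OptimizedLinear, an HF
# checkpoint can then be loaded into the sharded params the usual way:
#   model.load_state_dict(hf_state_dict)  # hf_state_dict assumed to exist
```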

Builds on top of @RezaYazdaniAminabadi's previous FP8 updates (#5764) to support FP8 quantization for dense models.

Example of using this to fine-tune Llama 3.1 405B on a single node: https://github.com/Snowflake-Labs/snowflake-arctic/tree/main/training/llama3.1
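
For the base-weight sharing bullet above, the collective pattern implied by flattened, sharded weights looks roughly like the following. torch.distributed.all_gather_into_tensor is the real collective named in the description, but the function, names, and shapes here are illustrative, not the PR's actual implementation:

```python
import torch
import torch.distributed as dist

def gather_base_weight(shard: torch.Tensor, group=None) -> torch.Tensor:
    """Reassemble the full flattened base weight from per-rank shards.

    Each rank holds a contiguous 1-D shard of the flattened weight. A
    single all_gather_into_tensor fills one preallocated output buffer,
    avoiding the per-tensor overhead of list-based all_gather.
    """
    world_size = dist.get_world_size(group)
    full = torch.empty(world_size * shard.numel(),
                       dtype=shard.dtype, device=shard.device)
    dist.all_gather_into_tensor(full, shard, group=group)
    # Caller reshapes to (out_features, in_features) before the matmul.
    return full
```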

@jeffra (Collaborator, Author) commented Aug 10, 2024

nv-accelerate-v100 results on H100 (torch 2.4 + cu12.2):
[screenshot: nv-accelerate-v100 test results]

nv-torch-latest-v100 results on H100 (torch 2.4 + cu12.2):

pytest --forked -n 4 unit/ --torch_ver="2.4" --cuda_ver="12.1" &> run1.log
pytest --forked -m 'sequential' unit/ --torch_ver="2.4" --cuda_ver="12.1" &> run2.log

[screenshot: nv-torch-latest-v100 test results]

@jeffra (Collaborator, Author) commented Aug 13, 2024

I'm able to get both the nv-accelerate-v100 and nv-torch-latest-v100 workflows to pass with this branch on my local H100 node (see previous comment). /cc @tjruwase @HeyangQin @loadams. Okay to force merge?

@loadams (Contributor) commented Aug 13, 2024

> I'm able to get both the nv-accelerate-v100 and nv-torch-latest-v100 workflows to pass with this branch on my local H100 node (see previous comment). /cc @tjruwase @HeyangQin @loadams. Okay to force merge?

I believe I've fixed our runners @jeffra - I'll monitor it today to be sure it gets merged.

@loadams enabled auto-merge Aug 13, 2024 23:04
@loadams added this pull request to the merge queue Aug 13, 2024
Merged via the queue into microsoft:master with commit 6e5d58d Aug 14, 2024
13 checks passed