Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[zero++] Synchronize at the end of secondary partitioning and simplify the logic #5216

Merged
merged 5 commits into from
Mar 1, 2024

Conversation

ByronHsu
Copy link
Contributor

@ByronHsu ByronHsu commented Mar 1, 2024

1. Why?

We have a very long thread investigating the issue. To summarize, this is because

a. The 2nd partitioning is asynchronous because it copies device-to-device from full tensor to 2nd tensor
b. When using prefetching, the all-gather of 2nd tensor can happen before 2nd partitioning ends. At that moment, the value of 2nd tensor might contain bad values.

image

Also, we found that the logic of copying is wrong and lengthy, so we simplified it to only two lines.

Kudos to @yundai424, Haowen Ning, @samadejacobs for the investigation effort.

2. What?

After multiple careful tests, we found patching get_accelerator().synchronize() to ensure all cuda stream finished before 2nd partitioning can prevent the issue

3. Tests

I validated the correctness of the simplification of 2nd partition logic. The loss is "exactly" the same before and after simplification under the same random seed.

Before

[
  {"loss": 2.0731},
  {"loss": 2.0288},
  {"loss": 1.927},
  {"loss": 1.8347},
  {"loss": 1.8347},
  {"loss": 1.7896},
  {"loss": 1.602},
  {"loss": 1.766},
  {"loss": 1.8751},
  {"loss": 1.6776}
]

After

[
  {"loss": 2.0731},
  {"loss": 2.0288},
  {"loss": 1.927},
  {"loss": 1.8347},
  {"loss": 1.8347},
  {"loss": 1.7896},
  {"loss": 1.602},
  {"loss": 1.766},
  {"loss": 1.8751},
  {"loss": 1.6776}
]


4. TODO

We need further investigation on the issue @samadejacobs

  1. Revisit ZeRO-3 prefetch design
  2. Refactor hpz to reuse primary tensor for secondary partition.

@tjruwase
Copy link
Contributor

tjruwase commented Mar 1, 2024

@GuanhuaWang, FYI!

@ByronHsu ByronHsu changed the title [zero++] Synchronize at the end of secondary partitioning [zero++] Synchronize at the end of secondary partitioning and simplify the logic Mar 1, 2024
@samadejacobs samadejacobs added this pull request to the merge queue Mar 1, 2024
Merged via the queue into microsoft:master with commit 4578c24 Mar 1, 2024
12 checks passed
ShellyNR pushed a commit to ShellyNR/DeepSpeed that referenced this pull request Mar 11, 2024
…y the logic (microsoft#5216)

## 1. Why?

We have a very long thread investigating [the
issue](microsoft#5059). To
summarize, this is because

a. The 2nd partitioning is asynchronous because it copies
device-to-device from full tensor to 2nd tensor
b. When using prefetching, the all-gather of 2nd tensor can happen
before 2nd partitioning ends. At that moment, the value of 2nd tensor
might contain bad values.


![image](https://github.com/microsoft/DeepSpeed/assets/24364830/ad6ee6a2-8e1e-4214-a0d2-ee5314b252b8)

Also, we found that the logic of copying is wrong and lengthy, so we
simplified it to only two lines.

Kudos to @yundai424, Haowen Ning, @samadejacobs for the investigation
effort.

## 2. What? 

After multiple careful tests, we found patching
`get_accelerator().synchronize()` to ensure all cuda stream finished
before 2nd partitioning can prevent the issue

## 3. Tests

I validated the correctness of the simplification of 2nd partition
logic. The loss is "exactly" the same before and after simplification
under the same random seed.

Before

```
[
  {"loss": 2.0731},
  {"loss": 2.0288},
  {"loss": 1.927},
  {"loss": 1.8347},
  {"loss": 1.8347},
  {"loss": 1.7896},
  {"loss": 1.602},
  {"loss": 1.766},
  {"loss": 1.8751},
  {"loss": 1.6776}
]

```

After

```
[
  {"loss": 2.0731},
  {"loss": 2.0288},
  {"loss": 1.927},
  {"loss": 1.8347},
  {"loss": 1.8347},
  {"loss": 1.7896},
  {"loss": 1.602},
  {"loss": 1.766},
  {"loss": 1.8751},
  {"loss": 1.6776}
]


```

## 4. TODO

We need further investigation on the issue @samadejacobs 
1) Revisit ZeRO-3 prefetch design 
2) Refactor hpz to reuse primary tensor for secondary partition.

---------

Signed-off-by: byhsu <[email protected]>
Co-authored-by: byhsu <[email protected]>
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
…y the logic (microsoft#5216)

## 1. Why?

We have a very long thread investigating [the
issue](microsoft#5059). To
summarize, this is because

a. The 2nd partitioning is asynchronous because it copies
device-to-device from full tensor to 2nd tensor
b. When using prefetching, the all-gather of 2nd tensor can happen
before 2nd partitioning ends. At that moment, the value of 2nd tensor
might contain bad values.


![image](https://github.com/microsoft/DeepSpeed/assets/24364830/ad6ee6a2-8e1e-4214-a0d2-ee5314b252b8)

Also, we found that the logic of copying is wrong and lengthy, so we
simplified it to only two lines.

Kudos to @yundai424, Haowen Ning, @samadejacobs for the investigation
effort.

## 2. What? 

After multiple careful tests, we found patching
`get_accelerator().synchronize()` to ensure all cuda stream finished
before 2nd partitioning can prevent the issue

## 3. Tests

I validated the correctness of the simplification of 2nd partition
logic. The loss is "exactly" the same before and after simplification
under the same random seed.

Before

```
[
  {"loss": 2.0731},
  {"loss": 2.0288},
  {"loss": 1.927},
  {"loss": 1.8347},
  {"loss": 1.8347},
  {"loss": 1.7896},
  {"loss": 1.602},
  {"loss": 1.766},
  {"loss": 1.8751},
  {"loss": 1.6776}
]

```

After

```
[
  {"loss": 2.0731},
  {"loss": 2.0288},
  {"loss": 1.927},
  {"loss": 1.8347},
  {"loss": 1.8347},
  {"loss": 1.7896},
  {"loss": 1.602},
  {"loss": 1.766},
  {"loss": 1.8751},
  {"loss": 1.6776}
]


```

## 4. TODO

We need further investigation on the issue @samadejacobs 
1) Revisit ZeRO-3 prefetch design 
2) Refactor hpz to reuse primary tensor for secondary partition.

---------

Signed-off-by: byhsu <[email protected]>
Co-authored-by: byhsu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants