get_grad_norm_direct: fix a case of empty norm group (#5148)

Fix for #5145.

An empty norm group creates a norm tensor with shape=[1], while the other norms are shapeless (0-dim). torch.stack does not support stacking tensors of different shapes. This change makes the empty-group norm shapeless as well, instead of shape=[1].

Co-authored-by: Lev Kurilenko <[email protected]>
microsoft · Feb 20, 2024 · f062a1b
1 parent fa3662f
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/deepspeed/runtime/zero/stage_1_and_2.py b/deepspeed/runtime/zero/stage_1_and_2.py
@@ -1686,7 +1686,7 @@ def get_grad_norm_direct(self, gradients, params, norm_type=2):
             if len(all_norms) > 0:
                 total_norm = torch.stack(all_norms).square().sum().float()
             else:
-                total_norm = torch.FloatTensor([0.0]).to(self.device)
+                total_norm = torch.tensor(0.0, dtype=torch.float32).to(self.device)
             # Sum across all model parallel Device.
             dist.all_reduce(total_norm, op=dist.ReduceOp.SUM, group=self.dp_process_group)
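
The shape mismatch described in the commit message can be reproduced outside DeepSpeed. The following is a minimal sketch, not the library's code: `total_sq_norm` is a hypothetical helper that mirrors only the `if`/`else` branch in the diff, and stacking the per-group results afterwards is an assumption about how a caller might combine them. It omits the `.to(self.device)` move and the `dist.all_reduce` call.

```python
import torch


def total_sq_norm(all_norms, empty_fallback):
    """Hypothetical helper mirroring the branch in the diff above."""
    if len(all_norms) > 0:
        # Stacking 0-dim norms and reducing with sum() yields a 0-dim tensor.
        return torch.stack(all_norms).square().sum().float()
    return empty_fallback


# Tensor.norm() returns 0-dim (shapeless) tensors.
grads = [torch.ones(3), torch.ones(4)]
norms = [g.norm(2) for g in grads]

old_empty = torch.FloatTensor([0.0])                # shape [1] (before the fix)
new_empty = torch.tensor(0.0, dtype=torch.float32)  # shape []  (after the fix)

non_empty_total = total_sq_norm(norms, new_empty)   # 0-dim tensor
old_total = total_sq_norm([], old_empty)            # shape [1]
new_total = total_sq_norm([], new_empty)            # shape []

# Assumed caller behaviour: stack the totals of several groups.
try:
    torch.stack([non_empty_total, old_total])
    old_stack_ok = True
except RuntimeError:
    # torch.stack requires every input tensor to have the same shape.
    old_stack_ok = False

stacked = torch.stack([non_empty_total, new_total])  # works: both are 0-dim
```

With the old fallback the stack raises a `RuntimeError` because a 0-dim tensor is mixed with a shape-[1] tensor. With the new fallback both inputs are 0-dim and the result has shape `[2]`.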