
uniform deepspeed overflow check #5424

Open
wants to merge 2 commits into
base: master

Conversation

GuanhuaWang
Member

@GuanhuaWang GuanhuaWang commented Apr 16, 2024

Before: Overflow checks are scattered and duplicated across the codebase.

This PR:

  • Provides a single interface, the CheckOverflow class, which abstracts and unifies the overflow check across ZeRO, ZeRO-Offload, Pipeline Parallelism, and BF16_optimizer.
  • Skips the step() operation in BF16_optimizer when a gradient overflow is detected (avoids polluting checkpoints, etc.).
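To illustrate the second bullet, here is a minimal sketch of skipping an optimizer step when gradients contain inf or NaN. It uses plain Python floats rather than tensors, and `ToyOptimizer` and `has_overflow` are hypothetical names for illustration only; this is not the code in this PR or DeepSpeed's BF16_Optimizer.

```python
import math


def has_overflow(grads):
    """Return True if any gradient value is inf or NaN."""
    return any(math.isinf(g) or math.isnan(g) for g in grads)


class ToyOptimizer:
    """Hypothetical stand-in for an optimizer wrapper (assumption, not DeepSpeed code)."""

    def __init__(self, params, lr=0.1):
        self.params = list(params)
        self.lr = lr
        self.skipped_steps = 0

    def step(self, grads):
        # Skip the update entirely on overflow so the parameters
        # (and anything checkpointed from them) stay unpolluted.
        if has_overflow(grads):
            self.skipped_steps += 1
            return False
        # Plain SGD update, used here only to make the skip observable.
        self.params = [p - self.lr * g for p, g in zip(self.params, grads)]
        return True
```

The return value lets the caller know whether the update was applied, which is one way to keep the overflow state local to the step instead of storing it on the object.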

cc @tjruwase

@Anhelor

Anhelor commented Apr 20, 2024

Why not use tensor.isnan() and tensor.isinf()?
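For context on this question, the two styles of check can be compared on plain Python floats. This is only a sketch: `overflow_elementwise` mirrors what an isnan/isinf test does per element, and `overflow_via_sum` reduces to one scalar first. Both function names are hypothetical, and which style the PR's CheckOverflow actually uses is not shown in this thread.

```python
import math


def overflow_elementwise(values):
    """Flag overflow if any single element is NaN or inf."""
    return any(math.isnan(v) or math.isinf(v) for v in values)


def overflow_via_sum(values):
    """Flag overflow by reducing to one scalar and testing only that scalar."""
    total = 0.0
    for v in values:
        total += v  # inf or NaN in any element propagates into the total
    return math.isinf(total) or math.isnan(total)
```

The two checks agree whenever an element is already inf or NaN. They can differ when every element is finite but the sum itself overflows, in which case only the sum-based check reports an overflow.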

@@ -181,12 +181,13 @@ def get_norm_with_moe_layers_fast(all_groups_norm, group):
 class CheckOverflow(object):
     '''Checks for overflow in gradient across parallel process'''

-    def __init__(self, param_groups=None, mpu=None, zero_reduce_scatter=False, deepspeed=None):
+    def __init__(self, param_groups=None, mpu=None, zero_reduce_scatter=False, deepspeed=None, partition_grads=False):
Contributor


Passing the deepspeed engine into a submodule is not a good design and creates all sorts of cyclic reference issues. It is better to pass only the specific attributes that are needed, such as enable_backward_allreduce.
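A minimal sketch of the suggested design, assuming the checker needs only a couple of flags. `OverflowChecker` and `ToyEngine` are hypothetical names, and the attribute set is an assumption based on the names mentioned in this review, not the actual DeepSpeed signature.

```python
class OverflowChecker:
    """Hypothetical checker that receives only the flags it needs."""

    def __init__(self, enable_backward_allreduce=True, partition_grads=False):
        # Plain values are stored; no reference back to the engine is kept,
        # so engine -> checker -> engine reference cycles cannot form.
        self.enable_backward_allreduce = enable_backward_allreduce
        self.partition_grads = partition_grads


class ToyEngine:
    """Hypothetical engine that owns the checker (assumption for illustration)."""

    def __init__(self, enable_backward_allreduce=True):
        self._enable_backward_allreduce = enable_backward_allreduce
        # Pass the needed attribute by value instead of passing `self`.
        self.checker = OverflowChecker(
            enable_backward_allreduce=self._enable_backward_allreduce)
```

One trade-off of this approach is that the checker sees a snapshot of the flag; if the engine can change it later, the new value has to be pushed to the checker explicitly.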

@@ -1473,6 +1473,7 @@ def _configure_bf16_optimizer(self, optimizer):
         timers = self.timers if self.wall_clock_breakdown() else NoopTimer()
         optimizer = BF16_Optimizer(optimizer,
                                    self.param_names,
+                                   deepspeed=self,
Contributor


Don't pass self into submodules.

@@ -92,6 +94,10 @@ def __init__(self,
         if self.using_real_optimizer:
             self._setup_for_real_optimizer()
+
+        # Overflow check init
+        self.overflow = False
Contributor


Should self.overflow be a class member since it seems to be only used once?


3 participants