Hi everyone, I am using AmpOptimWrapper for automatic mixed precision (AMP) training, but I get NaN after about ten epochs of training.
I tried to debug and found that the GradScaler in AmpOptimWrapper scales the loss by 0, leading to NaN weight updates. What should I do?
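For reference, here is a minimal way to watch the dynamic loss scale during training. This is a sketch using plain PyTorch AMP rather than the MMEngine wrapper, and the model, optimizer, and data are placeholders; it only illustrates how `GradScaler.get_scale()` can be logged to confirm that the scale collapses to 0:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Placeholder model/optimizer just to make the snippet self-contained.
model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler()  # default init_scale is 2.**16

for step in range(100):
    inputs = torch.randn(8, 10, device='cuda')
    targets = torch.randn(8, 1, device='cuda')
    optimizer.zero_grad()
    with autocast():
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    # If the scale ever reaches 0, all later weight updates are corrupted.
    print(step, loss.item(), scaler.get_scale())
```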
The optimizer wrapper I use is:
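(The original config snippet did not survive here; the following is only an illustrative MMEngine-style config, and the optimizer type and hyperparameters are assumptions, not necessarily what was used.)

```python
# Illustrative optim_wrapper config (assumed values, not the original post's).
optim_wrapper = dict(
    type='AmpOptimWrapper',
    # 'dynamic' uses torch.cuda.amp.GradScaler with its default init_scale.
    loss_scale='dynamic',
    optimizer=dict(type='AdamW', lr=1e-4, weight_decay=0.05),
)
```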
The zero scale is calculated by the following sub-function in GradScaler:
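(The snippet originally quoted here is missing; the relevant initialization path looks roughly like this simplified excerpt from PyTorch's `torch/cuda/amp/grad_scaler.py`, which may differ slightly across versions.)

```python
# Simplified excerpt from torch.cuda.amp.GradScaler (version-dependent):
# the scale tensor is created lazily from self._init_scale on first use.
def _lazy_init_scale_growth_tracker(self, dev):
    assert self._growth_tracker is None, "_growth_tracker initialized before _scale"
    self._scale = torch.full((1,), self._init_scale,
                             dtype=torch.float32, device=dev)
    self._growth_tracker = torch.full((1,), self._init_growth_tracker,
                                      dtype=torch.int32, device=dev)
```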
where `self._init_scale` is initialized to `2.**16` but is `0.0` inside `_lazy_init_scale_growth_tracker`.