It might be desirable to monitor/detect vanishing gradients during training. Note that I of course mean the "stochastic gradient" here, as estimated from the training samples used in the current epoch; since a single batch may be too small to excite all king/piece positions, preferably track the mean or max absolute gradient over a window of multiple epochs.
This would have detected the anomalies in the input layer (dead weights for some king positions) in vondele's run84run3, see #53.
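For illustration, a minimal sketch of such a monitor for a plain PyTorch training loop (the class name, the window size and the "dead" threshold are made up for this example, not part of the trainer):

```python
import torch

class GradientMonitor:
    """Track per-parameter max |grad| over a window of optimizer steps."""

    def __init__(self, model, window=1000, eps=1e-12):
        self.model = model
        self.window = window  # number of optimizer steps per report
        self.eps = eps        # |grad| below this counts as "dead"
        self.steps = 0
        # Running max |grad| per parameter tensor, kept on the CPU.
        self.max_abs = {name: torch.zeros_like(p, device='cpu')
                        for name, p in model.named_parameters()}

    def after_backward(self):
        # Call once per optimizer step, after loss.backward().
        for name, p in self.model.named_parameters():
            if p.grad is not None:
                self.max_abs[name] = torch.maximum(
                    self.max_abs[name], p.grad.detach().abs().cpu())
        self.steps += 1
        if self.steps % self.window == 0:
            for name, m in self.max_abs.items():
                dead = (m < self.eps).float().mean().item()
                print(f'{name}: max|grad|={m.max().item():.3e}, '
                      f'fraction ~zero={dead:.2%}')
                m.zero_()  # reset for the next window
```

In a Lightning module the `after_backward()` call could go into the `on_after_backward` hook; for the input layer, feature columns whose max |grad| stays ~zero over a whole window would flag dead king/piece positions.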
Note that with GC (gradient centralization) we cannot resort to investigating a mere difference of two checkpoints: the centralized gradient by definition contains a contribution equal to the mean of the gradient vectors over the neurons of a layer (see equation (1) of https://arxiv.org/pdf/2004.01461v2), so even a weight whose own gradient is zero still gets nudged by that mean term and will differ between checkpoints.
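For reference, equation (1) of that paper (quoting its notation from memory) centralizes the gradient of each weight vector $w_i$ with $M$ entries by subtracting its mean:

$$\Phi_{GC}(\nabla_{w_i} L) = \nabla_{w_i} L - \mu_{\nabla_{w_i} L}, \qquad \mu_{\nabla_{w_i} L} = \frac{1}{M}\sum_{j=1}^{M} \nabla_{w_{i,j}} L .$$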
As a "work-around", continued training without GC (use_gc=False in Ranger) on a checkpoint and then comparing/visualizing the difference between a later checkpoint should also do the trick I think.
See also https://discuss.pytorch.org/t/how-to-check-for-vanishing-exploding-gradients/9019