XLA_DISABLE_FUNCTIONALIZATION=0 with ZeRO-1 diverges for Mistral on NxD #26
Labels: bug
It seems that, depending on the `XLA_DISABLE_FUNCTIONALIZATION` flag and whether ZeRO-1 is enabled, the loss either fails to converge or we OOM.

System info

I ran the same training job with 4 settings, `XLA_DISABLE_FUNCTIONALIZATION = 0 | 1` combined with ZeRO-1 enabled / disabled (a minimal sketch of the setup is shown after the list):

1. `XLA_DISABLE_FUNCTIONALIZATION=0` and ZeRO-1: the loss diverges.
   Note: since I am using Optimum Neuron, I am not sure whether this comes from my integration of the `ZeroRedundancyOptimizer`, or whether it is an actual bug on your end and/or in `torch_xla`.
2. `XLA_DISABLE_FUNCTIONALIZATION=1` and ZeRO-1: the loss diverges to `inf`.
3. `XLA_DISABLE_FUNCTIONALIZATION=0` and the regular optimizer: we OOM.
4. `XLA_DISABLE_FUNCTIONALIZATION=1` and the regular optimizer: the loss converges.