-
Hi @GorkemP, this does seem like an exploding gradient issue. We will investigate to see whether there is anything we can put in place in the short term to address it. Tagging @alexey-gruzdev and @psfoley. Cheers,
-
I ran some experiments and observed that if there are 4 participating collaborators in a federation round and the local_tensors values are like [array([118.], dtype=float32), array([118.], dtype=float32), array([118.], dtype=float32), array([118.], dtype=float32)], then the weights returned are nan: [nan, nan, nan, nan]. Can you verify? What could be a potential reason for getting such local_tensors values, and how can we resolve this issue? @GorkemP @psfoley @sbakas @sarthakpati
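For reference, here is a minimal NumPy sketch of a NaN-guarded weighted average on the aggregation side. It is an illustration only, not the challenge's aggregation API: `local_tensors` is assumed to be a plain list of arrays (one per collaborator) and `weights` their relative aggregation weights, and the function name is hypothetical.

```python
import numpy as np

def robust_weighted_average(local_tensors, weights):
    """Weighted average of collaborator tensors that skips non-finite contributions.

    local_tensors: list of np.ndarray, one update per collaborator (same shape).
    weights: list of floats, e.g. proportional to each collaborator's sample count.
    """
    kept_tensors, kept_weights = [], []
    for tensor, weight in zip(local_tensors, weights):
        # Drop any collaborator whose tensor contains NaN or Inf.
        if np.all(np.isfinite(tensor)):
            kept_tensors.append(tensor)
            kept_weights.append(weight)

    if not kept_tensors:
        raise ValueError("All collaborator tensors were non-finite; nothing to aggregate.")
    if not np.isfinite(kept_weights).all() or np.sum(kept_weights) <= 0:
        raise ValueError("Aggregation weights are non-finite or sum to zero.")

    # np.average normalizes the weights internally.
    return np.average(np.stack(kept_tensors), axis=0, weights=kept_weights)
```

Note that even with all-finite local tensors such as the four 118.0 values above, a weighted average can still return nan if the weights themselves contain nan or sum to zero, so it is worth validating the weights as well as the tensors.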
-
Dear organizers,
We sometimes get the following error just after a new round starts:
It seems like an exploding gradient problem. In our aggregation method we check the returned aggregated values and there is no nan value, but we know that nan can still appear during the forward pass. What we would like to ask is: is there anything we can do on our side to prevent this? Normally this issue can be prevented with gradient/value clipping; we applied those, but it did not work.
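For illustration, the clipping we applied was of this kind (a minimal PyTorch sketch, not our exact training code; `model`, `optimizer`, `loss_fn`, and `train_loader` are placeholders):

```python
import torch

def train_one_epoch(model, optimizer, loss_fn, train_loader, max_norm=1.0, device="cuda"):
    """Standard training loop with a NaN guard on the loss and gradient-norm clipping."""
    model.train()
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

        # Skip batches whose loss is already non-finite instead of propagating NaN gradients.
        if not torch.isfinite(loss):
            continue

        loss.backward()
        # Clip the global gradient norm before the optimizer step.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
```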
We always get this nan value error in the output tensor; therefore, we also suspect that there might be some function in the final output layer causing this behaviour.
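One way to narrow down which layer first produces the nan is to attach forward hooks that flag non-finite outputs. This is a PyTorch sketch; `attach_nan_detectors` is a hypothetical helper and `model` is a placeholder for the actual network.

```python
import torch

def attach_nan_detectors(model):
    """Register forward hooks that print every module whose output contains NaN/Inf.

    Forward hooks fire in execution order, so the first message printed points
    at the layer where the non-finite values originate.
    """
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"Non-finite output detected in module: {name}")
        return hook

    handles = [module.register_forward_hook(make_hook(name))
               for name, module in model.named_modules()]
    return handles  # call handle.remove() on each to detach the hooks later
```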
Thank you.