-
Hi @GorkemP, this does seem like an exploding gradient issue. We will investigate to see whether there is anything we can put in place in the short term to address it. Tagging @alexey-gruzdev and @psfoley. Cheers,
-
I ran some experiments and observed that if there are 4 participating collaborators in a federation round and the local_tensors values are like [array([118.], dtype=float32), array([118.], dtype=float32), array([118.], dtype=float32), array([118.], dtype=float32)], then the weights returned are nan: [nan, nan, nan, nan]. Can you verify? What could be a potential reason for getting such local_tensors values, and how can we resolve this issue? @GorkemP @psfoley @sbakas @sarthakpati
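For reference, here is a minimal NumPy sketch of a NaN-guarded weighted average on the aggregation side. It is an illustration only, not the challenge's aggregation API: `local_tensors` is assumed to be a plain list of arrays (one per collaborator) and `weights` their relative aggregation weights, and the function name is hypothetical.

```python
import numpy as np

def robust_weighted_average(local_tensors, weights):
    """Weighted average of collaborator tensors that skips non-finite contributions.

    local_tensors: list of np.ndarray, one update per collaborator (same shape).
    weights: list of floats, e.g. proportional to each collaborator's sample count.
    """
    kept_tensors, kept_weights = [], []
    for tensor, weight in zip(local_tensors, weights):
        # Drop any collaborator whose tensor contains NaN or Inf.
        if np.all(np.isfinite(tensor)):
            kept_tensors.append(tensor)
            kept_weights.append(weight)

    if not kept_tensors:
        raise ValueError("All collaborator tensors were non-finite; nothing to aggregate.")
    if not np.isfinite(kept_weights).all() or np.sum(kept_weights) <= 0:
        raise ValueError("Aggregation weights are non-finite or sum to zero.")

    # np.average normalizes the weights internally.
    return np.average(np.stack(kept_tensors), axis=0, weights=kept_weights)
```

Note that even with all-finite local tensors such as the four 118.0 values above, a weighted average can still return nan if the weights themselves contain nan or sum to zero, so it is worth validating the weights as well as the tensors.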
-
Dear organizers,
We sometimes get the following error just after a new round starts:
It seems like an exploding gradient problem. In our aggregation method we check the returned aggregated values and there is no nan value, but we know that nan can still appear during the forward pass. What we would like to ask is: is there anything we can do on our side to prevent this? Normally this issue can be prevented with gradient/value clipping; we applied those, but it did not work.
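For illustration, the clipping we applied was of this kind (a minimal PyTorch sketch, not our exact training code; `model`, `optimizer`, `loss_fn`, and `train_loader` are placeholders):

```python
import torch

def train_one_epoch(model, optimizer, loss_fn, train_loader, max_norm=1.0, device="cuda"):
    """Standard training loop with a NaN guard on the loss and gradient-norm clipping."""
    model.train()
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

        # Skip batches whose loss is already non-finite instead of propagating NaN gradients.
        if not torch.isfinite(loss):
            continue

        loss.backward()
        # Clip the global gradient norm before the optimizer step.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
```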
We always get this nan value error in the output tensor; therefore, we also suspect that there might be some function in the final output layer causing this behaviour.
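One way to narrow down which layer first produces the nan is to attach forward hooks that flag non-finite outputs. This is a PyTorch sketch; `attach_nan_detectors` is a hypothetical helper and `model` is a placeholder for the actual network.

```python
import torch

def attach_nan_detectors(model):
    """Register forward hooks that print every module whose output contains NaN/Inf.

    Forward hooks fire in execution order, so the first message printed points
    at the layer where the non-finite values originate.
    """
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"Non-finite output detected in module: {name}")
        return hook

    handles = [module.register_forward_hook(make_hook(name))
               for name, module in model.named_modules()]
    return handles  # call handle.remove() on each to detach the hooks later
```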
Thank you.