gradient of Flux.normalise return NaN when std is zero #2096

Closed

chengchingwen opened this issue Nov 1, 2022 · 8 comments
@chengchingwen
Member

The $\epsilon$ argument of Flux.normalise only prevents division by zero in the forward pass, but there is also a division by $\sigma$ in the pullback of std. We might need a custom rrule for Flux.normalise.

julia> using Flux, Zygote

julia> Zygote.gradient(x -> sum(sin.(Flux.normalise(x; dims=1))), ones(3,3))
([NaN NaN NaN; NaN NaN NaN; NaN NaN NaN],)
@mcabbott
Member

mcabbott commented Nov 1, 2022

Xref JuliaML/MLUtils.jl#123 (about moving & renaming) and #1992 (about NaN from batch of 1).

@ToucheSir
Member

Other frameworks have implemented this using sqrt + var + the eps instead of using std directly.
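
A rough sketch of that approach, assuming a hypothetical normalise_var (not Flux's API) in which ϵ is folded into the variance before the square root:

using Statistics: mean, var
using Zygote

# Hypothetical variant: ϵ goes inside the sqrt, so the sqrt pullback divides by
# 2*sqrt(var + ϵ) ≥ 2*sqrt(ϵ) and never hits zero.
normalise_var(x; dims=1, ϵ=1e-5) =
    (x .- mean(x; dims=dims)) ./ sqrt.(var(x; dims=dims, corrected=false) .+ ϵ)

Zygote.gradient(x -> sum(sin.(normalise_var(x; dims=1))), ones(3,3))
# finite (here all-zero) gradient instead of NaN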

@chengchingwen
Member Author

> and #1992 (about NaN from batch of 1).

I'm not sure #1992 is related. BatchNorm doesn't use normalise, and this issue isn't caused by batch size.

> Other frameworks have implemented this using sqrt + var + the eps instead of using std directly.

FWIW, PyTorch's LayerNorm adds the eps to the variance, stores var + eps, and then uses that directly in the pullback.
One quick and dirty solution could be to update the std value in place with eps (with AD ignoring the update).

This also raises the question of the error between the true value and the value with eps. Since we are dividing by $\sqrt{var} + \epsilon$ while they are dividing by $\sqrt{var + \epsilon}$, the resulting values could differ on the order of $\epsilon$. I'm not sure which is better (IMO it should be handled by a branch).
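
For a sense of scale, take $\epsilon = 10^{-5}$: at $var = 1$ the two denominators are $\sqrt{1} + \epsilon = 1.00001$ versus $\sqrt{1 + \epsilon} \approx 1.000005$, a difference of order $\epsilon$; at $var = 0$ they are $\epsilon = 10^{-5}$ versus $\sqrt{\epsilon} \approx 3.2 \times 10^{-3}$, a much larger gap.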

@ToucheSir
Member

I think we'd have to go with $\sqrt{var + \epsilon}$ because the rule for sqrt(x) divides by 2*sqrt(x).
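
Concretely, $\frac{d}{dv}\sqrt{v} = \frac{1}{2\sqrt{v}}$ blows up as $v \to 0$, whereas $\frac{d}{dv}\sqrt{v + \epsilon} = \frac{1}{2\sqrt{v + \epsilon}} \le \frac{1}{2\sqrt{\epsilon}}$ stays bounded.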

@chengchingwen
Member Author

You can replace 2sqrt(x) with 2sqrt(x) + ϵ if you wrap everything in a single rrule.
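
A minimal sketch of what such a fused rule could look like, assuming a hypothetical normalise_eps (not Flux's implementation) that puts ϵ inside the sqrt and reuses the regularised denominator in the pullback:

using Statistics: mean, var
using ChainRulesCore

# Hypothetical fused normalise: uncorrected variance, ϵ inside the sqrt.
function normalise_eps(x::AbstractArray; dims=1, ϵ=1e-5)
    μ = mean(x; dims=dims)
    σ² = var(x; dims=dims, mean=μ, corrected=false)
    return (x .- μ) ./ sqrt.(σ² .+ ϵ)
end

function ChainRulesCore.rrule(::typeof(normalise_eps), x::AbstractArray; dims=1, ϵ=1e-5)
    μ = mean(x; dims=dims)
    σ² = var(x; dims=dims, mean=μ, corrected=false)
    s = sqrt.(σ² .+ ϵ)              # regularised denominator, reused below
    y = (x .- μ) ./ s
    function normalise_eps_pullback(ȳ)
        Δ = unthunk(ȳ)
        # standard layer-norm backward: only divides by s ≥ √ϵ, so no NaN when var ≈ 0
        x̄ = (Δ .- mean(Δ; dims=dims) .- y .* mean(Δ .* y; dims=dims)) ./ s
        return NoTangent(), x̄
    end
    return y, normalise_eps_pullback
end

This mirrors the PyTorch approach mentioned above: the ϵ-regularised statistic is computed once and reused in the backward pass.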

@ToucheSir
Member

For sure, but I'm loath to create an rrule just for this. I actually have a WIP PR bringing the norm functions to NNlib, so @chengchingwen, if you want to continue this design discussion, I can publish it.

@chengchingwen
Member Author

I would be interested. I actually have a function for computing the gradient of a layer norm directly in NAlib. It is the best (in terms of both performance and memory efficiency) I can get without writing a CUDA kernel. The gradient of normalise can easily be split out from it. So let's see if we can get even more performance from the new design.

RomeoV added a commit to RomeoV/DisentanglingVAE.jl that referenced this issue Mar 16, 2023
There is a problem in normalise that if `std(x) ≈ 0`, then the
chain rule evaluates to NaN. See e.g. here: [FluxML/Flux.jl#2096].

We tried to fix this here by adding some noise to x, although that might
not be the best solution. A later commit also ensures that all images
actually have some noise in the background.
@ToucheSir
Member

#2421 has been merged now.
