Ladder nets #75
base: master
Conversation
Cool! Had a quick look only, some minor comments:
Thanks! The idea behind the double batch normalization comes from the noise injection in the dirty part of the encoder: first normalize the batch by its mean and std, then add noise, and afterwards shift and scale the output by trainable parameters. My idea was not to use deterministic=True, because it would turn off the noise injection (as in dropout) and I still need to calculate the running stats for the clean encoder. So for the second (learnable) batch normalization I hardcoded the mean (0) and inv_std (1) using constant variables and set alpha=1 so that they are not updated with the previous stats values. It worked, so I didn't explore anything else. I might have misunderstood the behaviour intended by the alpha setting, though, so it might not even be required since I am using the constant variables. I will check what it does and update the notebook later.
You can pass … Wait, reading again, do you mean to use the second batch normalization merely for scaling and shifting? It will still normalize by the mini-batch statistics during training even if mean and inv_std are set to constant values (those are only used in the deterministic pass).
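For reference, a minimal sketch (with made-up layer sizes) of how the deterministic flag controls which statistics Lasagne's BatchNormLayer uses; this also shows why avoiding the deterministic pass keeps the noise injection active:

```python
import lasagne

# made-up toy network, for illustration only
l_in = lasagne.layers.InputLayer((None, 784))
l_dense = lasagne.layers.DenseLayer(l_in, 100, nonlinearity=None)
l_bn = lasagne.layers.BatchNormLayer(l_dense)

# training expression: normalizes with the current mini-batch statistics
# (the stored running mean/inv_std are only updated in this pass)
train_out = lasagne.layers.get_output(l_bn, deterministic=False)

# deterministic expression: uses the stored running mean/inv_std instead;
# note this would also switch off any GaussianNoiseLayer or DropoutLayer
test_out = lasagne.layers.get_output(l_bn, deterministic=True)
```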
So it turns out that something is going wrong after using a scale and bias layer instead of the second batch normalization (which made everything smooth by normalizing the batch again). The reconstruction costs of the latent and classification layers become huge, so I have to figure out why that is. My guess is that I am not using the batch mean and inv_std from the first batch normalization correctly, even after keeping only the mini-batch stats by setting alpha=1 instead of using running stats. I need to use the dirty encoder mini-batch stats in the dirty decoder for normalizing the denoised output of the combinator layers, so there must be big differences among those values, which in turn give rise to huge reconstruction costs. Do you have any idea what could be the reason for the Theano function output causing a floating point exception?
Let me point out that I was just guessing at what you were trying to achieve, so please take my advice with a grain of salt. To get things straight, I can see that you've got three networks: the dirty encoder, the decoder, and the clean encoder. All of them share their weight matrices. A given minibatch … Now where does batch normalization come into play? Which batch normalization parts do you want to share between networks, and between which ones exactly? Where does the noise injection take place?
I think ideally they're meant to be caught by Theano... did your process get terminated with SIGFPE? Some possible causes are listed in this slightly dubious ("there is no way to represent complex numbers in computers") source: https://www.quora.com/What-might-be-the-possible-causes-for-floating-point-exception-error-in-C++
I was trying to say that your point was right: there should not be a second mini-batch normalization. The dirty and clean encoders share the weights and the batch normalization parameters beta and gamma, while the dirty decoder shares the batch normalization means and standard deviations of the clean encoder.

The normalization part of the batch normalization (using the mini-batch mean and std) follows right after the affine transformation (i.e. the dense layer), then the noise is injected, and afterwards the scaling (using learnable beta and gamma) follows. The beta and gamma parameters are shared between the dirty and clean encoders, while the means and std's of the clean encoder are shared by the dirty decoder. Now, sharing weights and beta/gamma parameters between the encoders is straightforward, but the question is how to share those means and std's in one direction (clean encoder -> dirty decoder), or whether it's possible at all without an additional layer (like a custom normalization layer). I will try the additional-layer approach to see if anything changes.
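A minimal Lasagne sketch of the layer ordering described above, not taken from the notebook: dense layer, normalization-only batch norm, Gaussian noise in the dirty path, then a learnable scale and shift shared with the clean encoder. Layer sizes, the noise level, and the use of ScaleLayer/BiasLayer for the learnable beta/gamma are assumptions for illustration.

```python
import lasagne
from lasagne.layers import (InputLayer, DenseLayer, BatchNormLayer,
                            GaussianNoiseLayer, ScaleLayer, BiasLayer,
                            NonlinearityLayer)

l_in = InputLayer((None, 784))

# dirty encoder: affine transform -> normalize only -> noise -> learnable scale/shift
d_dense = DenseLayer(l_in, 500, nonlinearity=None)
d_norm = BatchNormLayer(d_dense, beta=None, gamma=None)  # mini-batch normalization only
d_noise = GaussianNoiseLayer(d_norm, sigma=0.3)          # noise injection (dirty path only)
d_scale = ScaleLayer(d_noise)                            # learnable gamma
d_shift = BiasLayer(d_scale)                             # learnable beta
d_out = NonlinearityLayer(d_shift, lasagne.nonlinearities.rectify)

# clean encoder: same weights, shared beta/gamma, but no noise injection
c_dense = DenseLayer(l_in, 500, W=d_dense.W, b=d_dense.b, nonlinearity=None)
c_norm = BatchNormLayer(c_dense, beta=None, gamma=None)
c_scale = ScaleLayer(c_norm, scales=d_scale.scales)      # share gamma with the dirty encoder
c_shift = BiasLayer(c_scale, b=d_shift.b)                # share beta with the dirty encoder
c_out = NonlinearityLayer(c_shift, lasagne.nonlinearities.rectify)
```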
We did our best!
There's no direct way to access the batch statistics used inside a BatchNormLayer, but you could recompute them from the layer's input with something like this:

```python
import lasagne
import theano.tensor as T

class UndoBatchNormLayer(lasagne.layers.MergeLayer):
    """Reverts the normalization of a given BatchNormLayer, using the
    mini-batch statistics of that layer's input."""
    def __init__(self, incoming, bn_layer, **kwargs):
        super(UndoBatchNormLayer, self).__init__(
                [incoming, bn_layer.input_layer], **kwargs)
        self.axes = bn_layer.axes
        self.epsilon = bn_layer.epsilon

    def get_output_shape_for(self, input_shapes):
        return input_shapes[0]

    def get_output_for(self, inputs, **kwargs):
        input, bn_input = inputs
        # recompute the batch statistics from the batch norm layer's input
        mean = bn_input.mean(self.axes)
        var = bn_input.var(self.axes)
        std = T.sqrt(var + self.epsilon)
        # undo the normalization: scale back up by std and add the mean back
        return input * std + mean
```

I'm assuming that you want the decoder to undo the transformations of the encoder here.
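A possible way to hook it up (all layer names and sizes here are made up for illustration): pass the decoder layer whose output should be de-normalized together with the encoder's BatchNormLayer.

```python
import lasagne

# hypothetical encoder layers, for illustration only
l_in = lasagne.layers.InputLayer((None, 784))
l_enc = lasagne.layers.DenseLayer(l_in, 500, nonlinearity=None)
l_enc_bn = lasagne.layers.BatchNormLayer(l_enc)

# some decoder layer whose output has the same shape as l_enc's output
l_dec = lasagne.layers.DenseLayer(l_enc_bn, 500, nonlinearity=None)

# undo the encoder's batch normalization on the decoder output
l_undone = UndoBatchNormLayer(l_dec, l_enc_bn)
```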
Yes, that's exactly what I did yesterday and it worked! I will finish some extra adjustments and push it later today. Thanks for the feedback! It helped a lot.
```python
to_stats_l = clean_net[enc_bname]
to_norm_l = dirty_net[comb_name]
dirty_net[bname] = SharedNormLayer(to_stats_l, to_norm_l)
```
Now this removes the mean and divides by the standard deviation that was also used in the encoding step -- does this make sense? Shouldn't the decoder be doing the reverse? You also have standard BatchNormLayers in the decoder, maybe you don't need the SharedNormLayers at all? (Disclaimer: I haven't looked at your code or the paper in detail, I'm just wondering. I'm happy to learn why it is implemented the way it is.)
Yes, it does. The output of the denoising layer should be comparable with the output of the corresponding encoder layer in order to calculate the reconstruction cost. If you skim through the algorithm on page 5 in http://arxiv.org/pdf/1507.02672v2.pdf, you will find that the decoder first calculates the affine transformation with batch normalization (i.e. my dense layer and batchnorm layer without learning beta and gamma), then feeds the output to the denoising function, and subsequently normalizes it with the stats from the clean encoder. I need those standard batchnorm layers to learn the beta and gamma parameters in the dirty encoder and share them afterwards with the clean encoder.
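For readers following along, here is a minimal sketch of what a layer with the SharedNormLayer interface from the snippet above might look like, mirroring the UndoBatchNormLayer idea from earlier in the thread but normalizing instead of de-normalizing. This is only my reading of the discussion, not necessarily the PR's actual implementation:

```python
import lasagne
import theano.tensor as T

class SharedNormLayer(lasagne.layers.MergeLayer):
    """Sketch: normalizes `to_norm` with the mini-batch statistics of the
    input of `bn_layer` (a clean-encoder BatchNormLayer). The PR's actual
    layer may differ in details."""
    def __init__(self, bn_layer, to_norm, **kwargs):
        super(SharedNormLayer, self).__init__(
                [bn_layer.input_layer, to_norm], **kwargs)
        self.axes = bn_layer.axes
        self.epsilon = bn_layer.epsilon

    def get_output_shape_for(self, input_shapes):
        return input_shapes[1]

    def get_output_for(self, inputs, **kwargs):
        stats_input, to_norm = inputs
        # batch statistics of the clean encoder's pre-normalization activations
        mean = stats_input.mean(self.axes)
        std = T.sqrt(stats_input.var(self.axes) + self.epsilon)
        # normalize the denoised decoder output with those statistics
        return (to_norm - mean) / std
```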
Just wanted to say this looks great, and thanks for contributing! :)

Sorry for the delay, GitHub doesn't notify about changes, only about comments. Is this ready to merge from your side, @AdrianLsk?

Hi @f0k, not yet. Although this version is working, I still need to push my latest changes. I refactored the code and fixed some pooling-layer inconsistencies with the original ladder nets code. I will do it this weekend and let you know when it's ready.
@f0k I think it's ready for merge.
My ladder net implementation. Comments are more than welcome.