Empirical variances don't match Table 1 #6

Open
gngdb opened this issue Oct 9, 2015 · 5 comments

gngdb commented Oct 9, 2015

The results in Table 1 are very bad: it actually looks like the variance is higher for the local reparameterization than for the single weight samples. Possible reasons for this:

  • The single and separate weight sample implementations scale the variance of the noise on each weight to match the effective variance of the locally reparameterized noise on each unit (specifically, this line in the code; see the sketch after this list). This might not be a valid thing to do, and it would certainly cause problems if there is a mistake around there.
  • Errors in reproducing the local reparameterization: the pre-linear Gaussian dropout from Srivastava #2, or something else.
  • Using the wrong version of the local reparameterization: we're using Variational Dropout A in this experiment, but it seems like it would be equally valid to use B. That really ought not to make a difference, though (there is a proof in the appendix showing this).
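
To make the first bullet concrete, here is a minimal numpy sketch (not the repo's Theano code; all names and numbers are illustrative) of the relation being relied on: per-weight Gaussian noise with variances sigma2 induces a pre-activation variance of (a ** 2) @ sigma2 on each unit, which is exactly the variance that the locally reparameterized version samples directly.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_out, n_samples = 20, 5, 100000

    a = rng.normal(size=n_in)                      # a fixed input activation vector
    mu = rng.normal(size=(n_in, n_out))            # weight means
    sigma2 = rng.uniform(0.1, 1.0, (n_in, n_out))  # per-weight noise variances

    # (1) noise sampled on every weight, then the matrix product
    eps_w = rng.normal(size=(n_samples, n_in, n_out))
    b_weight = np.einsum('i,sij->sj', a, mu + np.sqrt(sigma2) * eps_w)

    # (2) local reparameterization: noise sampled once per output unit
    gamma = a @ mu                                 # pre-activation mean
    delta = (a ** 2) @ sigma2                      # pre-activation variance
    b_local = gamma + np.sqrt(delta) * rng.normal(size=(n_samples, n_out))

    print(b_weight.var(axis=0))                    # both ≈ delta
    print(b_local.var(axis=0))

If the scaling line in the code doesn't reproduce this relation, the two parameterizations aren't matched and the comparison in Table 1 isn't like-for-like.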

gngdb commented Oct 9, 2015

Oh yeah, and I could just be calculating the gradient variances wrong; I am currently using:

T.var(T.grad(expressions.loss_train, l_hidden.W))

Should probably be:

T.mean(T.var(T.grad(expressions.loss_train, l_hidden.W), axis=0))

Repeating with this instead; it will take a while to get the results.
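
For reference, here is a hedged sketch of one way the Table 1 quantity could be estimated, assuming a compiled Theano function grad_fn(X, y) that returns dL/dW for a minibatch with fresh noise on every call (grad_fn and batches are illustrative names, not things that exist in the repo):

    import numpy as np

    def empirical_grad_variance(grad_fn, batches, n_repeats=100):
        """Variance of the stochastic gradient of W, averaged over the weights."""
        grads = np.stack([np.asarray(grad_fn(*batches[i % len(batches)]))
                          for i in range(n_repeats)])
        per_weight_var = grads.var(axis=0)  # variance across repeated noisy draws
        return per_weight_var.mean()        # collapse to a single scalar

The point of difference from the expressions above is that the variance is taken across repeated stochastic draws of the gradient, per weight, and only averaged over the weights afterwards.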

gngdb commented Oct 9, 2015

Unsurprisingly, initial results suggest that's not going to fix it.

gngdb commented Oct 12, 2015

Comment from the talk: we could be having problems with normalisation in this calculation; the variance should probably be normalised in some way before being compared across parameterizations.
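
A purely speculative sketch of what that could mean (my assumption, not something stated in the talk): divide the per-weight gradient variance by the squared mean gradient, so that a relative rather than an absolute variance is compared. Reusing the stacked grads array from the sketch above:

    def normalised_grad_variance(grads, eps=1e-12):
        """grads: array of shape (n_repeats, *W.shape) of sampled gradients."""
        per_weight_var = grads.var(axis=0)
        per_weight_sq_mean = grads.mean(axis=0) ** 2 + eps  # avoid divide-by-zero
        return (per_weight_var / per_weight_sq_mean).mean()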

gngdb commented Oct 21, 2015

Looking at the paper's code that Tim sent me, I have found the following differences:

  • Their architecture uses 4 dense layers in total, with 1024 hidden units in each (apart from the output layer, which has 10).
  • They use T.square where I used T.pow.
  • They store alpha in log space, not logit space, so alpha could exceed 1. To deal with this they explicitly clip it at 0.0 in log space (see the sketch after this list).
  • Instead of scaling up the cross-entropy loss term they scale down the KL divergence term, which is probably better for the ADAM defaults.
  • The independent case has per-theta alphas, while we only have per-unit alphas in both cases.
  • They train only two networks: correlated and independent adaptive. They then use these networks to evaluate the variance, for example by turning off dropout over the test set.
  • They take a running mean of the gradient with respect to each parameter over minibatches, then take the mean of this at the end.
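
For the third and fourth bullets, here is a hedged Theano-style sketch (variable names and the initial value are illustrative, not taken from the paper's code) of storing alpha in log space and clipping it at 0.0 so that alpha cannot exceed 1:

    import numpy as np
    import theano
    import theano.tensor as T

    log_alpha_param = theano.shared(np.float32(-3.0), name='log_alpha')
    log_alpha = T.minimum(log_alpha_param, 0.0)  # clip at 0.0 in log space
    alpha = T.exp(log_alpha)                     # guaranteed alpha <= 1

    # Fourth bullet: scale the KL term down by the dataset size N instead of
    # scaling the cross-entropy up (xentropy, kl_divergence and N are
    # placeholders here, not symbols from the repo):
    # loss = xentropy + kl_divergence / N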

gngdb commented Oct 24, 2015

Have implemented these changes in the code, and the results match a lot better. Unfortunately, there are still some problems, mainly that the variance is increasing after training for 100 epochs. Will have to look at the code again to figure out why this is the case.
