Empirical variances don't match Table 1 #6

Open
gngdb opened this issue Oct 9, 2015 · 5 comments

gngdb commented Oct 9, 2015

The results in Table 1 are very bad: it actually looks like the variance is higher for the local reparameterization than for the single weight samples. Possible reasons for this:

  • The single and separate weight sample implementations scale the variance of the noise on each weight to match the effective variance of the locally reparameterized noise on each unit (specifically, this line in the code; see the sketch after this list). This might not be a valid thing to do, and it would certainly cause problems if there is a mistake around there.
  • Errors in reproducing the local reparameterization: the pre-linear Gaussian dropout from Srivastava #2, or something else.
  • Using the wrong version of the local reparameterization: we're using Variational Dropout A in this experiment, but it seems like it would be equally valid to use B. That really ought not to make a difference, though (there is a proof in the appendix showing this).
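
To make the first bullet concrete, here is a minimal numpy sketch (not the repo's Theano code; all names and numbers are illustrative) of the relation being relied on: per-weight Gaussian noise with variances sigma2 induces a pre-activation variance of (a ** 2) @ sigma2 on each unit, which is exactly the variance that the locally reparameterized version samples directly.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_out, n_samples = 20, 5, 100000

    a = rng.normal(size=n_in)                      # a fixed input activation vector
    mu = rng.normal(size=(n_in, n_out))            # weight means
    sigma2 = rng.uniform(0.1, 1.0, (n_in, n_out))  # per-weight noise variances

    # (1) noise sampled on every weight, then the matrix product
    eps_w = rng.normal(size=(n_samples, n_in, n_out))
    b_weight = np.einsum('i,sij->sj', a, mu + np.sqrt(sigma2) * eps_w)

    # (2) local reparameterization: noise sampled once per output unit
    gamma = a @ mu                                 # pre-activation mean
    delta = (a ** 2) @ sigma2                      # pre-activation variance
    b_local = gamma + np.sqrt(delta) * rng.normal(size=(n_samples, n_out))

    print(b_weight.var(axis=0))                    # both ≈ delta
    print(b_local.var(axis=0))

If the scaling line in the code doesn't reproduce this relation, the two parameterizations aren't matched and the comparison in Table 1 isn't like-for-like.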

gngdb commented Oct 9, 2015

Oh yeah, and I could just be calculating the gradient variances wrong; I am currently using:

T.var(T.grad(expressions.loss_train, l_hidden.W))

Should probably be:

T.mean(T.var(T.grad(expressions.loss_train, l_hidden.W), axis=0))

Repeating with this instead; it will take a while to get the results.
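
For reference, here is a hedged sketch of one way the Table 1 quantity could be estimated, assuming a compiled Theano function grad_fn(X, y) that returns dL/dW for a minibatch with fresh noise on every call (grad_fn and batches are illustrative names, not things that exist in the repo):

    import numpy as np

    def empirical_grad_variance(grad_fn, batches, n_repeats=100):
        """Variance of the stochastic gradient of W, averaged over the weights."""
        grads = np.stack([np.asarray(grad_fn(*batches[i % len(batches)]))
                          for i in range(n_repeats)])
        per_weight_var = grads.var(axis=0)  # variance across repeated noisy draws
        return per_weight_var.mean()        # collapse to a single scalar

The point of difference from the expressions above is that the variance is taken across repeated stochastic draws of the gradient, per weight, and only averaged over the weights afterwards.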

gngdb commented Oct 9, 2015

Unsurprisingly, initial results suggest that's not going to fix it.

gngdb commented Oct 12, 2015

Comment from the talk: we could be having problems with normalisation in this calculation; the variance should probably be normalised in some way before being compared across parameterizations.
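
A purely speculative sketch of what that could mean (my assumption, not something stated in the talk): divide the per-weight gradient variance by the squared mean gradient, so that a relative rather than an absolute variance is compared. Reusing the stacked grads array from the sketch above:

    def normalised_grad_variance(grads, eps=1e-12):
        """grads: array of shape (n_repeats, *W.shape) of sampled gradients."""
        per_weight_var = grads.var(axis=0)
        per_weight_sq_mean = grads.mean(axis=0) ** 2 + eps  # avoid divide-by-zero
        return (per_weight_var / per_weight_sq_mean).mean()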

gngdb commented Oct 21, 2015

Looking at the paper's code that Tim sent me, I have found the following differences:

  • Their architecture uses 4 dense layers in total, with 1024 hidden units in each (apart from the output layer, which has 10).
  • They use T.square where I used T.pow.
  • They store alpha in log space, not logit space, so alpha could exceed 1. To deal with this they explicitly clip it at 0.0 in log space (see the sketch after this list).
  • Instead of scaling up the cross-entropy loss term they scale down the KL divergence term, which is probably better for the ADAM defaults.
  • The independent case has per-theta alphas, while we only have per-unit alphas in both cases.
  • They train only two networks: correlated and independent adaptive. They then use these networks to evaluate the variance, for example by turning off dropout over the test set.
  • They take a running mean of the gradient with respect to each parameter over minibatches, then take the mean of this at the end.
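
For the third and fourth bullets, here is a hedged Theano-style sketch (variable names and the initial value are illustrative, not taken from the paper's code) of storing alpha in log space and clipping it at 0.0 so that alpha cannot exceed 1:

    import numpy as np
    import theano
    import theano.tensor as T

    log_alpha_param = theano.shared(np.float32(-3.0), name='log_alpha')
    log_alpha = T.minimum(log_alpha_param, 0.0)  # clip at 0.0 in log space
    alpha = T.exp(log_alpha)                     # guaranteed alpha <= 1

    # Fourth bullet: scale the KL term down by the dataset size N instead of
    # scaling the cross-entropy up (xentropy, kl_divergence and N are
    # placeholders here, not symbols from the repo):
    # loss = xentropy + kl_divergence / N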

gngdb commented Oct 24, 2015

Have implemented these changes in the code, and the results match a lot better. Unfortunately, there are still some problems, mainly that the variance is increasing after training for 100 epochs. Will have to look at the code again to figure out why this is the case.
