Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed variational output in 2.28.0 #1049

Closed
WardBrian opened this issue Oct 6, 2021 · 29 comments
Closed

Changed variational output in 2.28.0 #1049

WardBrian opened this issue Oct 6, 2021 · 29 comments

Comments

@WardBrian
Copy link
Member

WardBrian commented Oct 6, 2021

Summary:

A CmdStanPy unit test is now failing

Reproducible Steps:

transformed data {
    vector[2]   y[2];
    vector[2]   mu0;
    matrix[2,2] sigma;

    y[1,1]  = 40000.0;
    y[1,2]  = 30000.0;

    y[2,1]  = 30000.0;
    y[2,2]  = 20000.0;

    mu0[1]  = 0.0;
    mu0[2]  = 0.0;

    sigma[1,1] = 10000.0;
    sigma[1,2] = 00000.0;
    sigma[2,1] = 00000.0;
    sigma[2,2] = 10000.0;
}
parameters {
    vector[2] mu;
}
model {
    mu ~ multi_normal(mu0,sigma);

    for (n in 1:2)
        y[n] ~ multi_normal(mu,sigma);
}

The test is here:
https://github.com/stan-dev/cmdstanpy/blob/8596085d31be60cf6a24562b7db92bc1aab50eca/test/test_variational.py#L171
relevant info: algorithm='meanfield', seed=12345
The error:

self.assertAlmostEqual(
            variational.variational_params_np[3], 31.0418, places=2
        )
E       AssertionError: 309.196 != 31.0418 within 2 places (278.1542 difference)

@rok-cesnovar - do you know if this would be expected with this version change?

Current Version:

v2.28.0

@rok-cesnovar
Copy link
Member

With sampling I get the following (all default args, seed = 12345):

2.27.0:

                 Mean     MCSE   StdDev     5%    50%    95%    N_Eff  N_Eff/s    R_hat

lp__           -66668    0.046     0.92 -66670 -66667 -66667      404    26953     1.00
accept_stat__    0.90  3.8e-03  1.2e-01   0.63   0.94    1.0  1.0e+03  6.9e+04  1.0e+00
stepsize__       0.95      nan  3.8e-15   0.95   0.95   0.95      nan      nan      nan
treedepth__       1.8  1.5e-02  4.4e-01    1.0    2.0    2.0  8.1e+02  5.4e+04  1.0e+00
n_leapfrog__      3.1  4.0e-02  1.3e+00    1.0    3.0    7.0  9.9e+02  6.6e+04  1.0e+00
divergent__      0.00      nan  0.0e+00   0.00   0.00   0.00      nan      nan      nan
energy__        66669  6.7e-02  1.3e+00  66667  66668  66671  4.0e+02  2.7e+04  1.0e+00

mu[1]           23328      1.9       56  23236  23327  23418      909    60632     1.00
mu[2]           16664      1.6       55  16573  16664  16753     1178    78516     1.00

2.28.0:

                 Mean     MCSE   StdDev     5%    50%    95%    N_Eff  N_Eff/s    R_hat

lp__           -66668    0.050     0.97 -66669 -66667 -66667      378    18889      1.0
accept_stat__    0.92  2.6e-03  9.1e-02   0.73   0.96    1.0  1.3e+03  6.3e+04  1.0e+00
stepsize__       0.83      nan  4.8e-15   0.83   0.83   0.83      nan      nan      nan
treedepth__       2.0  1.9e-02  6.1e-01    1.0    2.0    3.0  1.0e+03  5.2e+04  1.0e+00
n_leapfrog__      4.0  6.1e-02  2.0e+00    1.0    3.0    7.0  1.0e+03  5.2e+04  1.0e+00
divergent__      0.00      nan  0.0e+00   0.00   0.00   0.00      nan      nan      nan
energy__        66669  7.5e-02  1.4e+00  66667  66668  66671  3.6e+02  1.8e+04  1.0e+00

mu[1]           23335      2.0       56  23241  23334  23428      745    37257     1.00
mu[2]           16665      2.0       57  16576  16666  16762      840    42023     1.00

This would suggest it looks good?

Diagnose looks like this (seed=12345, init=1):

2.27.0:

TEST GRADIENT MODE

 Log probability=-189994

 param idx           value           model     finite diff           error
         0        0.992205          6.9997          6.9997    -1.71696e-06
         1         -0.1117         5.00003         5.00002     9.98942e-06

2.28.0:

TEST GRADIENT MODE

 Log probability=-190004

 param idx           value           model     finite diff           error
         0       -0.955454         7.00029         7.00029     5.04136e-07
         1        0.528053         4.99984         4.99985    -7.31341e-06

Gradients seems to match, the values do not though, that seems suspect?

@rok-cesnovar
Copy link
Member

Ok, this seems to be a real issue. When this was first reported I was focused mostly on the gradient which looks fine but missed the value completely.

Will move this issue to Math once I have a reproducible example for it.

@rok-cesnovar
Copy link
Member

False alarm on this being a bug in Math. Sorry. This is an issue because the default id was changed to 1 in #987

It was previously 0.

Running variational or diagnose with specified seed=12345, init = 1 AND id = 0 I get the exact same results (the entire CSV is the same).

@rok-cesnovar
Copy link
Member

So the question now is: is this an issue at all? Can we change default argument values between versions (without a major bump)? Do we avoid changing the default and limit this change only for the case of running multi chain with a single executable?

Given that we are doing a 2.28.1 anyways this is the best time to discuss this.

@WardBrian
Copy link
Member Author

Why was it changed? If there wasn't a strong motivating reason I would consider this a bug that should be reverted

@wds15
Copy link
Contributor

wds15 commented Oct 6, 2021

No bug... fully intentional. Under multi-threaded, multi-chain mode in Stan, we wanted to have the first chain to be chain "1" instead of "0". This aligns it with things counting from 1 onwards in Stan world. You are still allowed to run explicitly with chain 0, but the default starts at 1 now. It's more confusing to users to see chain 0 to 3 when sampling 4 chains...that's what we (including @SteveBronder ) thought.

@SteveBronder
Copy link
Contributor

SteveBronder commented Oct 6, 2021

Dug into this and found the convo here between me and @wds15 when we added multi-chain. Essentially this was done for pretty printing purposes so that when we did multichain the iteration printing and csv files would start indexed from 1 i.e.

Chain [1]:
Chain [2]:
Chain [3]:

and output_1.csv, output_2.csv, output_3.csv

I think it matters what we consider reproducible. idt in the past we've guaranteed high precision reproducibility over versions. But usually the result was pretty close as long as you used the same seed. Now users would need to set the same seed and initial chain id (if they are comparing 2.28+ to 2.28-).

@WardBrian
Copy link
Member Author

I think it will be more confusing to have the outputs of something with a fixed seed change than it is to count from 0. If we want to keep the behavior we need to be much clearer about the change and the fact that it means all fixed-seed results can change without setting specifically an id.

@rok-cesnovar
Copy link
Member

Sorry for closing, fat fingers…

Another alternative is to have 1 be a default id whenever we set num_chains > 1. Since that is a new feature there is nothing to compare it with in previous version so there will be no confusion?

@wds15
Copy link
Contributor

wds15 commented Oct 6, 2021

I recall that we did change the default chain_id in the past already without anyone complaining. Adding more if's with a different default for a different case... not sure...

@WardBrian
Copy link
Member Author

I agree consistency is good. Changing it to 1 is fine, we just need to say something like this in release notes/all announcements:

The default chain id is now 1 instead of 0. This will change the output of any program that uses a set seed for the random number generator unless it is updated to set id=0

@rok-cesnovar
Copy link
Member

I am good with that.

@SteveBronder
Copy link
Contributor

I recall that we did change the default chain_id in the past already without anyone complaining. Adding more if's with a different default for a different case... not sure...

idt we did, at least looking at the history of the hpp file

https://github.com/stan-dev/cmdstan/commits/develop/src/cmdstan/arguments/arg_id.hpp

Though if I remember right chain id has changed in downstream packages.

I agree with Brian that changing it to 1 is fine as long as we doc it in the release. But let's wait to do the 2.28.1 release till after this Thursday's Stan meeting as I'd like to run this by people so it doesn't get missed and make sure we are not doing a big goof

@mitzimorris
Copy link
Member

mitzimorris commented Oct 6, 2021

cmdstanpy assembles calls to to optimize and variational but since it only runs 1 chain, doesn't specify chain id.
so you're saying that the reason the estimate changed by a factor of 10 is because it started from a different place in the PRNG?

ok, we really need pathfinder!

@SteveBronder
Copy link
Contributor

From Rok's comment (#1049 (comment)) that seems to be the case. Do other seeds give other odd answers?

@rok-cesnovar
Copy link
Member

so you're saying that the estimate changed by a factor of 10 makes sense because it started from a different place in the PRNG?

if I specify id=0 with 2.26.1,2.27.0 and 2.28.0 you get the exact same results (at least up to the 6th decimal point). Same with id=1 and 2.

the results change significantly for advi with a different seed yes. I would says this is because of the model?

@mitzimorris
Copy link
Member

not sure, but CmdStanPy is running the 'eta_should_be_big.stan' model same as used in the core Stan tests. how is it being used there? is this a perticularly unstable thing the model is trying to estimate?

@WardBrian
Copy link
Member Author

This has also made me realize that cmdstanpy doesn't even let you specify the chain id for variational/optimize because it has never mattered before

@mitzimorris
Copy link
Member

cmdstanpy def needs more logic to take advantage of 2.28. that's up next.

@wds15
Copy link
Contributor

wds15 commented Oct 6, 2021

Also: If people do not fix a chain id...how can they expect reproducibility? Some doc in the release notes would be goo, of course.

@WardBrian
Copy link
Member Author

I think it's reasonable to assume a value who's default had never changed before wasn't needed for reproducibility. If our own internal tests made such an assumption, some users likely will have as well

@mitzimorris
Copy link
Member

If people do not fix a chain id...how can they expect reproducibility?

for CmdStan, running a single chain with fixed seed and no id would be reproducible because offset would always be 0.

@rok-cesnovar
Copy link
Member

how is it being used there? is this a perticularly unstable thing the model is trying to estimate?

the Stan test only checks if eta is big (100 to be exact). It does not check parameter values. It does specify ids and seeds though: https://github.com/stan-dev/stan/blob/develop/src/test/unit/variational/eta_adapt_big_test.cpp

@WardBrian
Copy link
Member Author

offset

another good point, are we now skipping some huge portion at the start of the PRNG by default?

@rok-cesnovar
Copy link
Member

rok-cesnovar commented Oct 6, 2021

@WardBrian
Copy link
Member Author

I assume (hope) that rng.discard is O(1)?

@rok-cesnovar
Copy link
Member

rok-cesnovar commented Oct 6, 2021

Its O(log2(N))

We are using an rng that combines two multiplicative linear congruential rngs which can discard with O(log2(N)): https://github.com/boostorg/random/blob/a2740d4b30178cb187fabca163e5be7803a577b9/include/boost/random/linear_congruential.hpp#L193

@bob-carpenter
Copy link
Contributor

@rok-cesnovar: Thanks for the link. I had thought it was constant, but log2(2^50) is still only 50 simple arithmetic steps, so nothing we'd notice done once per chain.

@wds15 and @mitzimorris: We can't guarantee exact numerical reproducibility from release to release. Little things like improving numerics for a function will lead to differences.

@bob-carpenter
Copy link
Contributor

I verified that this model is very stable under MCMC (multiple runs with different seeds agree to 3+ decimal places) but completely unstable for ADVI (multiple runs can be off by more than a factor of 10).

It's known that the ADVI algorithm doesn't work well for non-unit-scale posteriors.

What was the intent of the test and can it be made with something more stable? This doesn't seem like a good model to use to evaluate ADVI.

P.S. I rewrote using declare/define and array/vector/matrix constructors:

transformed data {
  vector[2]   y[2] = { [40000, 30000]', [30000, 20000]' };
  matrix[2, 2] sigma = [ [10000, 0], [0, 10000] ];
  vector[2] mu0 = [0, 0]';
}
parameters {
  vector[2] mu;
}
model {
  mu ~ multi_normal(mu0, sigma);
  y ~ multi_normal(mu, sigma);
}

NUTS run 1

 variable   mean median sd mad     q5    q95 rhat ess_bulk ess_tail
    mu[1]  23333  23333 58  58  23238  23427    1     3352     2791

NUTS run 2

 variable   mean median sd mad     q5    q95 rhat ess_bulk ess_tail
    mu[1]  23333  23334 57  56  23237  23427    1     3765     2467

ADVI run 1

    variable    mean  median  sd mad      q5     q95
 mu[1]          7190    7192  26  25    7147    7232

ADVI run 2

    variable    mean  median   sd  mad      q5     q95
 mu[1]           685     685    1    1     683     687

For run 1, ADVI reported convergence, whereas for run 2 it reported there might be a divergence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants