Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/3299 diagnostics chainset #3310

Closed
wants to merge 32 commits into from

Conversation

mitzimorris
Copy link
Member

Submission Checklist

  • Run unit tests: ./runTests.py src/test/unit
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: see below

Summary

This PR uses the improved R-hat diagnostics (#3266) and extends rank-normalization to the computation of ESS-bulk and ESS-tail. These diagnostics are computed on samples from one or more chains, which entails parsing the sample out of a stan-csv file and assembling it into an Eigen MatrixXd object.

The work done for this PR is:

Intended Effect

The consumer of this functionality is CmdStan's stansummary utility, which, given a list of Stan CSV filenames, assembles them into a chainset object and then calls functions on the chainset object to get summary stats and diagnostics.

How to Verify

unit tests in this repo; unit tests in CmdStan repo test end-to-end parsing and reporting

Side Effects

N/A

Documentation

Documentation for CmdStan bin/stansummary to be done in docs repo, and CmdStanPy docs (CmdStanPy's summary function wraps CmdStan's stansummary).

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Columbia University

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@mitzimorris
Copy link
Member Author

Created a new PR because accidentally included some version of Stan math_lib in #3305 and don't have the GitHub chops to undo its fubarness.

@stan-buildbot
Copy link
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
arma/arma.stan 0.37 0.41 0.9 -10.81% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.01 0.01 1.08 7.16% faster
gp_regr/gen_gp_data.stan 0.03 0.03 1.15 12.77% faster
gp_regr/gp_regr.stan 0.13 0.1 1.32 24.28% faster
sir/sir.stan 71.91 70.15 1.03 2.44% faster
irt_2pl/irt_2pl.stan 4.38 4.07 1.08 7.14% faster
eight_schools/eight_schools.stan 0.06 0.05 1.17 14.52% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.28 0.24 1.16 13.79% faster
pkpd/one_comp_mm_elim_abs.stan 20.33 18.96 1.07 6.75% faster
garch/garch.stan 0.46 0.41 1.14 11.94% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.81 2.69 1.04 4.29% faster
arK/arK.stan 1.88 1.76 1.06 6.06% faster
gp_pois_regr/gp_pois_regr.stan 2.93 2.82 1.04 3.66% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 9.4 8.63 1.09 8.15% faster
performance.compilation 184.16 197.69 0.93 -7.34% slower
Mean result: 1.0835976351849117

Jenkins Console Log
Blue Ocean
Commit hash: 2c8bf6d3d4fe497e1d8ff310ce5e5f26b1925faa


Machine information No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 20.04.3 LTS Release: 20.04 Codename: focal

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping: 4
CPU MHz: 2400.000
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 1.3 MiB
L1i cache: 1.3 MiB
L2 cache: 40 MiB
L3 cache: 55 MiB
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d arch_capabilities

G++:
g++ (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Clang:
clang version 10.0.0-4ubuntu1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

@mitzimorris
Copy link
Member Author

CmdStan tests failed - correctly!
offending test is here: https://github.com/stan-dev/cmdstan/blob/33e27a4f7b72c7df558d982a33bfe33ce0b14211/src/test/interface/variational_output_test.cpp#L53

ADVI output format is 1st line is the ADVI estimate, followed by 1000 samples.
summary of samples should only look at the sample, not the ADVI mean.

@WardBrian
Copy link
Member

Created a new PR because accidentally included some version of Stan math_lib in #3305 and don't have the GitHub chops to undo its fubarness.

Seems like a good move. Only downside is we lost the review history, which makes it difficult to track what has changed

@mitzimorris
Copy link
Member Author

agreed - will continue to add in changes suggested/requested in #3305

@WardBrian
Copy link
Member

Sounds good! Could you ping me when you think all of the 'old' review is resolved and I can take a look at the branch then?

@mitzimorris mitzimorris mentioned this pull request Sep 26, 2024
3 tasks
src/stan/io/stan_csv_reader.hpp Outdated Show resolved Hide resolved
src/stan/mcmc/chainset.hpp Outdated Show resolved Hide resolved
src/stan/mcmc/chainset.hpp Outdated Show resolved Hide resolved
src/stan/mcmc/chainset.hpp Outdated Show resolved Hide resolved
src/stan/mcmc/chainset.hpp Outdated Show resolved Hide resolved
src/stan/mcmc/chainset.hpp Outdated Show resolved Hide resolved
src/test/unit/io/bernoulli_corrupt.csv Outdated Show resolved Hide resolved
src/test/unit/io/test_csv_files/missing_draws.csv Outdated Show resolved Hide resolved
@WardBrian
Copy link
Member

I used the current develop branch

I’m not saying we’re obligated to report the same numbers from now until the end of days, but I just think if we do make a change we should have a reason why the new number is better/more correct, not just that we like the code that calculates it more

@mitzimorris
Copy link
Member Author

I have checked the existing unit tests and the behavior on real output files. I think this is OK.

@bob-carpenter
Copy link
Contributor

bob-carpenter commented Sep 30, 2024

I’m not saying we’re obligated to report the same numbers from now until the end of days, but I just think if we do make a change we should have a reason why the new number is better/more correct, not just that we like the code that calculates it more

The quantiles shouldn't change. [edit: this is wrong---see below] But the R-hat and ESS will. That's one of the motivations for the PR---to switch over to our improved versions of these as described in this paper:

Vehtari, Gelman, Simpson, Carpenter, Bürkner. 2019.
Rank-normalization, folding, and localization: An improved Rˆ for assessing convergence of MCMC
https://arxiv.org/abs/1903.08008

I believe @avehtari has already switched over RStan and PyStan and we're just waiting for this PR to switch over CmdStan and hence CmdStanPy and CmdStanR. I believe arViz also does these new calculations.

We could keep our old definitions and report two versions of R-hat (old, old per second, new) and three versions of ESS (old, new bulk, new tail), but I think it'd be confusing for users.

@bob-carpenter
Copy link
Contributor

As another comment, I don't think we should be reporting ESS for sampler parameters other than lp__.

@avehtari
Copy link
Collaborator

avehtari commented Oct 1, 2024

I believe @avehtari has already switched over RStan and PyStan and we're just waiting for this PR to switch over CmdStan and hence CmdStanPy and CmdStanR. I believe arViz also does these new calculations.

CmdStanR can call CmdStan diagnostics, but CmdStanR is also using posterior package, which has new Rhat and bulk- and tail-ESS, and it is more handy to use posterio package diagnostics than CmdStan diagnostics. ArviZ, which is used by Python and Julia users, has these new functions, too. RStan is not yet using posterior package.

We could keep our old definitions and report two versions of R-hat (old, old per second, new) and three versions of ESS (old, new bulk, new tail), but I think it'd be confusing for users.

The old ESS which (called ess_basic() in posterior package) is still needed for computing MCSE for mean estimate. I think the issue is that reporting everything by default would be too much, but CmdStan doesn't have a nice way for the users to define which diagnostic quantities they want to see, and thus we need to make some choices.

As another comment, I don't think we should be reporting ESS for sampler parameters other than lp__.

I agree. Same for Rhat.

@WardBrian
Copy link
Member

The quantiles shouldn't change.

My point is, they do. I believe the test cases we are using are not specific enough to see that, but doing a bit of testing on the side:

  • Running the bernoulli model
  • Running the current stansummary and this branch's version, with some extra decimals to be able to actually tell the difference
  • On basically every run I've observed so far, they disagree on the 50% quantile by the 4th decimal place

Here's a CSV file: https://gist.github.com/WardBrian/41bbf1b57829efa32e6aa58c5c493f57

Current stansummary (with --sig_figs=4)

                     Mean       MCSE     StdDev         5%        50%        95%  N_Eff    N_Eff/s      R_hat
theta           2.593e-01  7.266e-03  1.251e-01  8.379e-02  2.429e-01  4.830e-01  296.7      59330      1.003

new stansummary (with --sig_figs=4)

                  Mean       MCSE  StdDev      MAD       5%     50%     95%  N_Eff_bulk  N_Eff_tail   R_hat
theta           0.2593  7.263e-03  0.1251   0.1343  0.08379  0.2425  0.4830       296.9       446.6   1.008

0.2429 != 0.2425

CmdStanPy uses --sig_figs=6 by default, so end users will see different numbers.

Again, we can decide we want to make this change, but it just happening incidentally doesn't sit right with me, hence me not merging as-is.

@avehtari
Copy link
Collaborator

avehtari commented Oct 1, 2024

N_Eff_bulk N_Eff_tail

It would be better to use terms ESS_bulk and ESS_tail, as 1) N has different meaning in the paper describing these quantities, and 2) N_eff makes people to misread it as "Number of EFFective samples" which is wrong, and the correct term is "Effective Sample Size", 3) CmdStanR using posterior is showing ESS_bulk and ESS_tail.

@mitzimorris
Copy link
Member Author

The old ESS which (called ess_basic() in posterior package) is still needed for computing MCSE for mean estimate.

@avehtari - could you check the implementations of mcse and mcse_sd?

/**
* Computes the mean Monte Carlo error estimate for the central 90% interval.
* See https://arxiv.org/abs/1903.08008, section 4.4.
* Follows implementation in the R posterior package
*
* @param index parameter index
* @return pair (bulk_ess, tail_ess)
*/
double mcse_mean(const int index) const {
double ess_bulk = analyze::split_rank_normalized_ess(samples(index)).first;
return sd(index) / std::sqrt(ess_bulk);
}
/**
* Computes the mean Monte Carlo error estimate for the central 90% interval.
* See https://arxiv.org/abs/1903.08008, section 4.4.
* Follows implementation in the R posterior package.
*
* @param name parameter name
* @return pair (bulk_ess, tail_ess)
*/
double mcse_mean(const std::string& name) const {
return mcse_mean(index(name));
}
/**
* Computes the standard deviation of the Monte Carlo error estimate
* https://arxiv.org/abs/1903.08008, section 4.4.
* Follows implementation in the R posterior package.
*
* @param index parameter index
* @return pair (bulk_ess, tail_ess)
*/
double mcse_sd(const int index) const {
Eigen::MatrixXd s = samples(index);
Eigen::MatrixXd s2 = s.array().square();
double ess_s = analyze::split_rank_normalized_ess(s).first;
double ess_s2 = analyze::split_rank_normalized_ess(s2).first;
double ess_sd = std::min(ess_s, ess_s2);
return sd(index)
* std::sqrt(stan::math::e() * std::pow(1 - 1 / ess_sd, ess_sd - 1)
- 1);
}

@WardBrian
Copy link
Member

It would be better to use terms ESS_bulk and ESS_tail

Yes, there is an existing cmdstan issue for this stan-dev/cmdstan#916, which @mitzimorris will address as part of the cmdstan pr that accompanies this

@mitzimorris
Copy link
Member Author

@WardBrian regarding the implementation of quantiles, we need a better understanding how the boost accumulators library computes quantiles - it is returning the value of a the nth draw or interpolating? and which is appropriate?

I spent yesterday looking at the boost code and don't understand it well enough to make that call. changing to use C++ std::nth_element would make the code easier to understand and remove dependencies, but as you say, that's not a good reason to change behavoirs.

@WardBrian
Copy link
Member

I suspect the difference may just be as simple as the existing code prioritizes counting from the left if you're requesting the 50% or higher quantile, but I agree the boost reference is pretty opaque in this area

@WardBrian
Copy link
Member

There is also stan::math::quantile -- calling this would probably be the ideal thing to do from a code standpoint, but it yields yet a third set of numbers...

@@ -117,7 +117,6 @@ class chains {
using boost::accumulators::tag::tail;
using boost::accumulators::tag::tail_quantile;
double M = x.rows();
// size_t cache_size = std::min(prob, 1-prob)*M + 2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to change anything in chains.hpp?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this commented out line after going through the blame - this code has always been there in its commented out form. why is it here? maybe to show one alternative - which would use less memory, run faster, and provide approximate as opposed to exact estimates. I don't know. so I got rid of this comment, and a few other comments that were clearly outdated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to change anything in chains.hpp?

the one thing that needed to be changed was to remove the rank-normalized r-hat functions, which weren't being called by anything, since this PR changed the function arguments.

@bob-carpenter
Copy link
Contributor

The quantiles shouldn't change.

I spoke too soon. The problem with quantiles is that reasonable definitions can differ. When you don't have an exact item, a lot of algorithms are going to interpolate, and how you do that can differ. For example if you only have two numbers [1, 3] and ask for a 50% quantile (i.e., a median), what is the answer? You could return 1 or 3, which is what you get if you turn 50% and the size of 2 into an index. Or you could interpolate between 1 and 3 and return 2. The problem's the exact same thing at an edge, but it might not be 50%. What do you do if I ask you for a 40% quantile of [1, 3]? There you might be tempted to linearly interpolate and say 1.8. That interpolation can often be made more stable if you use more elements around an element, so if we have four elements [2, 3, 5, 10], we might return a different answer than [3, 5] for the 50% quantile. It's a mess!

The bottom line, though, is that in these situations, we really don't care. The sampling algorithm didn't have the granularity to cleanly resolve the quantile so if we use one of the neightboring values or interpolate, it should be fine for our use.

There is also stan::math::quantile -- calling this would probably be the ideal thing to do from a code standpoint, but it yields yet a third set of numbers.

See above. That should also be fine. Using a quantile function that works on a sorted list will be faster if we do a lot of quantiles. Otherwise, using an accumulator boost style will be faster. With an accumulator, if you are looking for the rank K element in a list, there's an O(N * log2(K)) algorithm (N for considering each element, log2(K) to insert it into position in the accumulator) that only consumes log2(K) memory. It can get even more clever than that, see, e.g., the variants available here:

https://www.boost.org/doc/libs/1_86_0/doc/html/accumulators/user_s_guide.html#accumulators.user_s_guide.the_statistical_accumulators_library

@WardBrian
Copy link
Member

Calling stan::math::quantile seems like the best choice then. The code can also be made quite simple:

    Eigen::MatrixXd draws = samples(index);
    Eigen::Map<Eigen::VectorXd> map(draws.data(), draws.size());

    return stan::math::quantile(map, prob);

There is an overload for if prob is a single number or if it is a vector

@mitzimorris
Copy link
Member Author

plugged most recent version of this branch into CmdStan and comparing CmdStanR's summary (via fit$print) against CmdStan's bin/stansummary on the test CSV files in cmdstan/src/test/interface/example_outputs and median and quantile values, as well as mean, sd, mad, ess, and rhat all match out to 9 digits.

Copy link
Member

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I had a few small things I noticed going through but otherwise I think this is good

Comment on lines +6 to +7
#include <stan/math/prim/fun/constants.hpp>
#include <stan/math/prim/fun/quantile.hpp>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: These two includes shouldn't be necessary if we're also including math/prim above

*/
double variance(const int index) const {
Eigen::MatrixXd draws = samples(index);
return (draws.array() - draws.mean()).square().sum() / (draws.size() - 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is stan::math::variance

*/
double quantile(const int index, const double prob) const {
// Ensure the probability is within [0, 1]
if (prob <= 0.0 || prob >= 1.0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can allow the endpoints using math's quantile. It also throws its own error if necessary, so we could just skip this entirely now.

Comment on lines +337 to +339
if (probs.minCoeff() <= 0.0 || probs.maxCoeff() >= 1.0) {
throw std::out_of_range("Probabilities must be between 0 and 1.");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note as above

Comment on lines +423 to +424
double ess_bulk = analyze::split_rank_normalized_ess(samples(index)).first;
return sd(index) / std::sqrt(ess_bulk);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question for @avehtari from above about which ESS this should use:

The old ESS which (called ess_basic() in posterior package) is still needed for computing MCSE for mean estimate.

?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mcse_mean should use ess_mean, not ess_bulk
in the same way, mcse_sd should not use the rank normalized ess

* @param index parameter index
* @return pair (bulk_rhat, tail_rhat)
*/
std::pair<double, double> split_rank_normalized_rhat(const int index) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I still think it is worth adding functions for the "old" rhat and ess to this class, if they'd be this same size)

@mitzimorris
Copy link
Member Author

comments look great - thanks! per discussion with @SteveBronder, I think the way to proceed here is with 3 PRs:

  1. fix issues with stan_csv_reader
  2. refactor code added to stan::analyze::mcmc in Improved rhat diagnostic #3266 to add ESS_bulk, ESS_tail and MCSE calculations (to chains.hpp)
  3. add chainset.hpp to code base; a replacement for chains.hpp.

@WardBrian
Copy link
Member

I agree that would have been a better (or at least probably faster) way to go originally, but this being as far along as it is it would also be fine by me to proceed with this one big PR. Whichever you feel like, at this point

@mitzimorris
Copy link
Member Author

I think 3 PRs would be better - a lot of work, but the grown-up thing to do here. (especially since I've been advocating pragmatic programming lately).

@avehtari
Copy link
Collaborator

avehtari commented Oct 2, 2024

@avehtari - could you check the implementations of mcse and mcse_sd?

These should use ess_mean without rank normalization. The rank normalized versions are generic diagnostics that work also with infinite variance, but by definition of MCSE mcse_mean and mcse_sd should not use rank-normalization.

@mitzimorris
Copy link
Member Author

closing this per above comment - #3310 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants