During-sampling diagnostics (feature request & design discussion) #425

Open
mike-lawrence opened this issue Jan 2, 2021 · 2 comments
Labels
feature New feature or request

@mike-lawrence
Collaborator

mike-lawrence commented Jan 2, 2021

I propose to add optional computation of diagnostics during sampling.

To achieve this, I propose reading the CSV files to:
(1) track the proportion of iterations that hit the maximum treedepth
(2) track whether any post-warmup divergences were encountered
(3) track the Bulk & Tail ESS of parameters (with an option to specify which to include/exclude)
(4) track the Rhat of parameters (with an option to specify which to include/exclude)
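
As a rough illustration of items (1) and (2), here is a minimal Python sketch (the real thing would live in R inside cmdstanr). It assumes the post-warmup draws have already been parsed into rows of floats alongside the CSV header, and that the sampler columns are named `treedepth__` and `divergent__` as in Stan's CSV output. The function name `sampler_diagnostics` and the returned keys are hypothetical.

```python
def sampler_diagnostics(header, draws, max_treedepth=10):
    """Summarise treedepth hits and divergences for post-warmup draws.

    header: list of column names from the Stan CSV file.
    draws:  list of rows, each a list of floats aligned with header.
    max_treedepth: the limit the sampler was run with (assumed 10 here).
    """
    td = header.index("treedepth__")
    dv = header.index("divergent__")
    n = len(draws)
    # A draw "hits" the limit when its tree depth reaches max_treedepth.
    n_hits = sum(1 for row in draws if row[td] >= max_treedepth)
    n_div = sum(1 for row in draws if row[dv] != 0)
    return {
        "n_draws": n,
        "prop_max_treedepth": n_hits / n if n else 0.0,
        "n_divergent": n_div,
        "any_divergent": n_div > 0,
    }
```

Both quantities are simple counts, so they could also be updated per new draw rather than recomputed over everything read so far.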

To enable efficient incremental parsing of the CSV files, I propose keeping track of how many lines have been read so far and skipping that many lines the next time a read is triggered, storing new samples together with prior samples in an object kept in memory.
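
The line-skipping idea could look roughly like the following Python sketch. It is only an illustration under some assumptions: comment lines start with `#`, the first non-comment line is the header, and the sampler may have written only part of the last line when a read is triggered. The class name `IncrementalStanCsvReader` and its attributes are hypothetical.

```python
class IncrementalStanCsvReader:
    """Re-reads a growing Stan CSV file, skipping lines already parsed."""

    def __init__(self, path):
        self.path = path
        self.lines_read = 0   # number of complete lines consumed so far
        self.header = None    # column names, set once the header line is seen
        self.draws = []       # every row parsed so far, as lists of floats

    def poll(self):
        """Parse complete lines added since the last call; return the number of new draws."""
        new_draws = 0
        with open(self.path, "r", newline="") as f:
            for i, line in enumerate(f):
                if i < self.lines_read:
                    continue          # already parsed on a previous poll
                if not line.endswith("\n"):
                    break             # partially written line; retry on the next poll
                self.lines_read += 1
                text = line.strip()
                if not text or text.startswith("#"):
                    continue          # comment or blank line
                fields = text.split(",")
                if self.header is None:
                    self.header = fields
                else:
                    self.draws.append([float(x) for x in fields])
                    new_draws += 1
        return new_draws
```

Skipping by line count still scans the file from the top on every poll; remembering a byte offset and seeking to it would avoid that, at the cost of slightly more bookkeeping.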

To enable resuming this monitoring across R sessions, we could either restart the CSV parsing from scratch or write the contents to a faster binary format (I'm thinking NetCDF) from the outset. The latter has the benefit of leaving the Stan output in a much better format than CSV. If we opted for this, I propose storing both the CSV and NetCDF files in a stan_scratch folder (n.b. this folder also features in the proposed implementations of these FRs: Background/asynchronous sampling, Recompile only on changes to output of stanc3 auto-formatter)

@avehtari
Contributor

  • This looks more like a CmdStan issue.
  • As ESS and Rhat are computed per scalar variable, the computational cost during sampling can be significant for models with a large number of parameters. The issue is further complicated if ESS and Rhat are also computed for generated quantities. One possibility would be to examine only lp__ by default, with an option to examine other quantities.
  • Bulk-ESS, Tail-ESS, and rank-normalized Rhat have no sequential update rule, so they must be recomputed from all draws each time, and computation time grows with the number of iterations. For non-rank-normalized Rhat, a sequential estimate would be possible.
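
To make the last point concrete, here is a hedged Python sketch of one way a sequential estimate could work for the classic (non-split, non-rank-normalized) Rhat: keep a running mean and sum of squared deviations per chain (Welford's update), so each new draw costs O(1) and Rhat can be read off at any time. This is an assumption about how it might be done, not something taken from CmdStan or cmdstanr; the class name `RunningRhat` is hypothetical, and the sketch returns NaN unless all chains have the same number of draws.

```python
import math


class RunningRhat:
    """Classic Rhat for one scalar, updated one draw at a time."""

    def __init__(self, n_chains):
        self.n = [0] * n_chains       # draws seen per chain
        self.mean = [0.0] * n_chains  # running mean per chain
        self.m2 = [0.0] * n_chains    # running sum of squared deviations per chain

    def update(self, chain, x):
        # Welford's online update for mean and variance.
        self.n[chain] += 1
        delta = x - self.mean[chain]
        self.mean[chain] += delta / self.n[chain]
        self.m2[chain] += delta * (x - self.mean[chain])

    def rhat(self):
        m = len(self.n)
        n = min(self.n)
        if m < 2 or n < 2 or any(k != n for k in self.n):
            return math.nan
        # W: mean within-chain variance; b_over_n: variance of the chain means.
        w = sum(s / (n - 1) for s in self.m2) / m
        grand = sum(self.mean) / m
        b_over_n = sum((mu - grand) ** 2 for mu in self.mean) / (m - 1)
        if w == 0.0:
            return math.nan
        var_plus = (n - 1) / n * w + b_over_n
        return math.sqrt(var_plus / w)
```

The rank-normalized version cannot be maintained this way because every new draw can change the ranks of all earlier draws.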

@mike-lawrence
Collaborator Author

Good call that the compute may be expected to get unwieldy, so it should certainly be something that the user opts in to rather than being on by default.

I think the question of whether this should be in cmdstanr versus cmdstan is an interesting orthogonal topic. My bias to have it in cmdstanr rather than cmdstan comes purely from the consideration that this is something I have the skill to implement in the former but not the latter.

rok-cesnovar added this to the future milestone on Mar 17, 2021