Error regarding dimensions after sampling. #1050

Closed
tillahoffmann opened this issue Dec 12, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@tillahoffmann

Describe the bug

After the model completes sampling, the following error is raised.

Error in dim(x) <- c(dim(x), 1) : 
  dims [product 150] do not match the length of object [3]

To Reproduce

# Problematic Stan model.
parameters {
    vector [150] f;
}

model {
    f ~ normal(0, 1);
}
> # Calling R code.
> library(cmdstanr)
> library(gptoolsStan)
> 
> cmdstan_model(
+   stan_file = "debug.stan"
+ )$sample(
+   data = list(n = 100, sigma = 1, length_scale = 0.1, period = 1),
+   chains = 1,
+   iter_warmup = 500,
+   iter_sampling = 50
+ )
Compiling Stan program...

[C++ compiler output]

Running MCMC with 1 chain...

Chain 1 Iteration:   1 / 550 [  0%]  (Warmup) 
Chain 1 Iteration: 100 / 550 [ 18%]  (Warmup) 
Chain 1 Iteration: 200 / 550 [ 36%]  (Warmup) 
Chain 1 Iteration: 300 / 550 [ 54%]  (Warmup) 
Chain 1 Iteration: 400 / 550 [ 72%]  (Warmup) 
Chain 1 Iteration: 500 / 550 [ 90%]  (Warmup) 
Chain 1 Iteration: 501 / 550 [ 91%]  (Sampling) 
Chain 1 Iteration: 550 / 550 [100%]  (Sampling) 
Chain 1 finished in 0.0 seconds.
Error in dim(x) <- c(dim(x), 1) : 
  dims [product 150] do not match the length of object [3]

Expected behavior

The R code above returns a fit.

Operating system

$ uname -a
Linux 4dcbafc21a34 6.10.14-linuxkit #1 SMP Thu Oct 24 19:28:55 UTC 2024 aarch64 GNU/Linux
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ R --version
R version 4.2.2 Patched (2022-11-10 r83330) -- "Innocent and Trusting"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: aarch64-unknown-linux-gnu (64-bit)
$ gcc --version
gcc (Debian 12.2.0-14) 12.2.0

I am running this code in a Docker container with the above operating system.

I do not get an error if I run the same code on the host machine with the following configuration.

$ uname -a
Darwin dhcp-10-250-31-164.harvard.edu 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:05:14 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T8103 arm64
$ gcc --version
Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin24.1.0

The output on the host machine is as follows.

Compiling Stan program...
Running MCMC with 1 chain...

Chain 1 Iteration:   1 / 550 [  0%]  (Warmup) 
Chain 1 Iteration: 100 / 550 [ 18%]  (Warmup) 
Chain 1 Iteration: 200 / 550 [ 36%]  (Warmup) 
Chain 1 Iteration: 300 / 550 [ 54%]  (Warmup) 
Chain 1 Iteration: 400 / 550 [ 72%]  (Warmup) 
Chain 1 Iteration: 500 / 550 [ 90%]  (Warmup) 
Chain 1 Iteration: 501 / 550 [ 91%]  (Sampling) 
Chain 1 Iteration: 550 / 550 [100%]  (Sampling) 
Chain 1 finished in 0.0 seconds.
 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
     lp__ -75.81 -75.93 6.44 4.80 -88.77 -65.62 1.01       26       33
     f[1]   0.02   0.00 1.04 1.04  -1.53   1.48 1.25       71       31
     f[2]  -0.02  -0.09 0.91 0.81  -1.47   1.63 1.00       84       46
     f[3]   0.02  -0.10 1.02 1.11  -1.49   1.65 1.08       84       41
     f[4]  -0.09  -0.27 0.99 0.88  -1.45   1.61 1.02       84       62
     f[5]   0.10   0.34 1.13 1.01  -1.78   1.79 1.00       84       41
     f[6]  -0.13  -0.25 1.02 1.20  -1.61   1.64 1.00       84       37
     f[7]   0.01  -0.02 1.19 1.39  -1.88   1.86 1.00       84       46
     f[8]  -0.10  -0.25 1.12 1.27  -1.74   1.54 1.02       56       33
     f[9]  -0.21  -0.21 1.01 0.80  -2.02   1.90 1.00       84       40

 # showing 10 of 151 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)
[...]

CmdStanR version number (same in container and on host)

> packageVersion("cmdstanr")
[1] ‘0.8.1’
> cmdstan_version()
[1] "2.36.0"

Additional context

This problem arose in running the reproduction materials for our Gaussian process inference library after upgrading cmdstan and cmdstanr (cf. onnela-lab/gptools-reproduction-material#4).

I searched GitHub for the code in the error message and found this section. Maybe it's relevant.

https://github.com/stan-dev/posterior/blob/79d4521b943e44f4ac31636c4488d9e2cfeac3ec/R/as_draws_array.R#L227-L239

@tillahoffmann tillahoffmann added the bug Something isn't working label Dec 12, 2024
@jgabry
Member

jgabry commented Dec 12, 2024

Wow, that's very strange! I'm not able to reproduce this on either of my computers (although they're both Macs, just running different OS versions), which is going to make this tricky to debug. Does this happen with all models or just specific ones?

I searched GitHub for the code in the error message and found this section. Maybe it's relevant.

There's a decent chance that line in the posterior package is where the error is coming from, but, if so, I'm not sure why.

Are you able to generate a traceback() so we can see more about where the error is happening?

@tillahoffmann
Author

Thanks for the fast reply! Here's the output of traceback. I have to admit I don't quite know how to interpret it.

12: as_array_matrix_list(x)
11: fun(x, ...)
10: as_draws.default(x)
9: as_draws(x)
8: as_draws_array.default(list(structure(list(treedepth__ = c(4L, 
   4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 3L, 4L, 
   4L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 4L, 
   4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 4L, 3L, 4L, 4L, 
   4L), divergent__ = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
   0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
   0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
   0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), energy__ = c(160.001, 164.589, 
   149.329, 144.494, 140.444, 154.648, 147.947, 149.639, 141.597, 
   152.719, 143.787, 132.576, 148.491, 160.705, 165.513, 157.754, 
   146.246, 160.25, 160.585, 147.17, 146.789, 170.271, 161.011, 
   144.471, 151.956, 161.701, 165.189, 165.872, 154.899, 175.701, 
   156.912, 153.286, 132.044, 140.624, 136.904, 142.348, 130.973, 
   145.396, 143.553, 155.442, 150.118, 156.114, 156.571, 145.717, 
   146.662, 168.475, 166.723, 159.978, 158.845, 148.17)), row.names = c(NA, 
   -50L), class = "data.frame")))
7: (function (x, ...) 
   {
       UseMethod("as_draws_array")
   })(list(structure(list(treedepth__ = c(4L, 4L, 4L, 4L, 4L, 4L, 
   4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 3L, 4L, 
   4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
   4L, 4L, 4L, 4L, 3L, 4L, 4L, 4L, 3L, 4L, 4L, 4L), divergent__ = c(0L, 
   0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
   0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
   0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
   0L), energy__ = c(160.001, 164.589, 149.329, 144.494, 140.444, 
   154.648, 147.947, 149.639, 141.597, 152.719, 143.787, 132.576, 
   148.491, 160.705, 165.513, 157.754, 146.246, 160.25, 160.585, 
   147.17, 146.789, 170.271, 161.011, 144.471, 151.956, 161.701, 
   165.189, 165.872, 154.899, 175.701, 156.912, 153.286, 132.044, 
   140.624, 136.904, 142.348, 130.973, 145.396, 143.553, 155.442, 
   150.118, 156.114, 156.571, 145.717, 146.662, 168.475, 166.723, 
   159.978, 158.845, 148.17)), row.names = c(NA, -50L), class = "data.frame")))
6: do.call(as_draws_format, list(post_warmup_sampler_diagnostics))
5: read_cmdstan_csv(files = self$output_files(include_failed = FALSE), 
       variables = variables, sampler_diagnostics = sampler_diagnostics, 
       format = format)
4: private$read_csv_(variables = "", sampler_diagnostics = convert_hmc_diagnostic_names(diagnostics))
3: initialize(...)
2: CmdStanMCMC$new(runset)
1: cmdstanr::cmdstan_model(stan_file = "debug.stan")$sample(data = list(n = 100, 
       sigma = 1, length_scale = 0.1, period = 1), chains = 1, iter_warmup = 500, 
       iter_sampling = 50)

@tillahoffmann
Author

Was able to put together a reproducible example in Docker here: https://gist.github.com/tillahoffmann/ada92a970706c772c6ad2a477ec95fb2

@jgabry
Member

jgabry commented Dec 12, 2024

Thanks, that's great. I'll build the image now, but unfortunately the rest of today and next week are insanely busy for me, so I'm not sure when I'll have time to dig into this if it doesn't turn out to be really simple. I'll try to make some time though!

I think the traceback confirms your suspicion that the error is happening in those lines from posterior, but I'm not sure why yet.

@jgabry
Member

jgabry commented Dec 12, 2024

Does this happen with all models or just certain ones?

@tillahoffmann
Author

tillahoffmann commented Dec 12, 2024

Sorry, forgot to address that earlier. I don't know if it happens for all models, but it seems to be an issue even for very simple ones like this one.

parameters {
    real f;
}

model {
    f ~ normal(0, 1);
}

The error message is as follows.

Running MCMC with 1 chain...

Chain 1 Iteration:   1 / 550 [  0%]  (Warmup) 
Chain 1 Iteration: 100 / 550 [ 18%]  (Warmup) 
Chain 1 Iteration: 200 / 550 [ 36%]  (Warmup) 
Chain 1 Iteration: 300 / 550 [ 54%]  (Warmup) 
Chain 1 Iteration: 400 / 550 [ 72%]  (Warmup) 
Chain 1 Iteration: 500 / 550 [ 90%]  (Warmup) 
Chain 1 Iteration: 501 / 550 [ 91%]  (Sampling) 
Chain 1 Iteration: 550 / 550 [100%]  (Sampling) 
Chain 1 finished in 0.0 seconds.
Error in dim(x) <- c(dim(x), 1) : 
  dims [product 150] do not match the length of object [3]

I've just played around with this a bit more, and it only seems to happen if a single chain is run. It works fine for multiple chains.

Edit: It looks like the number 150 is not related to the size of the vector but the number of samples. Specifically, the reported number is 3 * iter_sampling in some brief experiments.

tillahoffmann added a commit to onnela-lab/gptools-reproduction-material that referenced this issue Dec 12, 2024
@jgabry
Member

jgabry commented Dec 13, 2024

Thanks for the extra details. This is indeed quite strange.

It looks like the number 150 is not related to the size of the vector but the number of samples. Specifically, the reported number is 3 * iter_sampling in some brief experiments.

The error seems to be happening when processing the data frame of sampler diagnostics, which your docker image helped me figure out. The data frame has iter_sampling rows and 3 columns (treedepth__, divergent__, energy__). If you set diagnostics = NULL when calling the sample method there's no error, but then if you call fit$sampler_diagnostics() after sampling you'll still get the error. But that's as far as I've gotten so far. What I really don't understand is why this would only be reproducible on Docker (as far as we know). Or have you been able to reproduce it outside of Docker?
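The arithmetic behind the two numbers in the error message can be sketched in base R. This is only an illustration under assumptions: the `diagnostics` data frame below is a made-up stand-in with the same shape as the one in the traceback (50 rows, 3 columns), and the reshape is copied from the error message rather than from the posterior source. A data frame is a list of its columns, so its length is 3, while the product of its dimensions is 50 * 3 = 150.

```r
# Hypothetical stand-in for one chain's sampler diagnostics:
# 50 post-warmup draws (rows) and 3 diagnostics (columns).
diagnostics <- data.frame(
  treedepth__ = rep(4L, 50),
  divergent__ = rep(0L, 50),
  energy__ = seq(130, 170, length.out = 50)
)

n_cells <- prod(dim(diagnostics))   # rows * columns
n_elements <- length(diagnostics)   # a data frame is a list of its columns

# Attempt the reshape shown in the error message and capture the message
# R raises instead of stopping the script.
reshape_message <- tryCatch(
  {
    x <- diagnostics
    dim(x) <- c(dim(x), 1)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(c(cells = n_cells, elements = n_elements))
print(reshape_message)
```

With `iter_sampling = 50` this should give 150 cells against 3 list elements, which is consistent with the observation that the reported product scales as 3 * iter_sampling.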

@tillahoffmann
Author

tillahoffmann commented Dec 13, 2024

It's weird with the Docker image. Maybe it's related to being on Linux rather than macOS? R behaving differently on Linux and macOS would be surprising, but I'm no R expert.

@jgabry
Member

jgabry commented Dec 13, 2024

This keeps getting stranger. So the function in the posterior package that you found (https://github.com/stan-dev/posterior/blob/79d4521b943e44f4ac31636c4488d9e2cfeac3ec/R/as_draws_array.R#L227-L239) is indeed where the error is happening. On my computer the input to that function is a list of matrices (one per chain). In the Docker image the input to that function is a list of data frames (one per chain, if chains > 1) and a single data frame if chains = 1.
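Why the input type matters can be sketched as follows. This is a minimal illustration, not how posterior handles the conversion: the `diagnostics` object is a hypothetical stand-in with the shape seen in the traceback, and `as.matrix()` is used here only to show the contrast between the two input types.

```r
# Hypothetical diagnostics for a single chain (50 draws, 3 columns).
diagnostics <- data.frame(
  treedepth__ = rep(4L, 50),
  divergent__ = rep(0L, 50),
  energy__ = seq(130, 170, length.out = 50)
)

# Matrix input: an atomic vector of 150 values, so a trailing dimension
# of size 1 (the single chain) can be appended directly.
as_matrix <- as.matrix(diagnostics)
reshaped <- as_matrix
dim(reshaped) <- c(dim(reshaped), 1)
print(dim(reshaped))

# Data frame input: still a list of 3 columns, so the same assignment
# has only 3 elements to spread over 150 cells.
print(is.list(diagnostics))
print(length(as_matrix))
print(length(diagnostics))
```

The matrix reshapes cleanly into a draws-by-variables-by-chains array, whereas the data frame does not, which matches the difference between the two machines described above.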

@jgabry
Member

jgabry commented Dec 13, 2024

@tillahoffmann @paul-buerkner I think the problem is this commit to posterior: stan-dev/posterior@79d4521. I finally noticed that it was using the very latest development version of posterior. When I first tested that version I couldn't reproduce the error deterministically because I hadn't restarted my R session after installing different versions of posterior (I thought I had, but I hadn't). Once I did, I could reproduce the error.

@jgabry
Member

jgabry commented Dec 13, 2024

Ok @tillahoffmann check out stan-dev/posterior#386. Can you replicate that?

This should also mean that if you use the posterior package that’s on CRAN, not the latest GitHub version, the error should go away (I hope).
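To check which build of posterior an R session would actually load, something like the sketch below may help. `describe_install()` is a hypothetical helper written for this illustration, not part of cmdstanr or posterior. It relies on the assumption that CRAN builds carry a `Repository` field in their DESCRIPTION and that installers such as remotes, devtools or pak record a `RemoteSha` field for GitHub installs; both fields may be missing depending on how the package was installed.

```r
# Hypothetical helper: summarise where an installed package came from,
# based on the fields of its DESCRIPTION file.
describe_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(list(package = pkg, installed = FALSE))
  }
  desc <- utils::packageDescription(pkg)
  list(
    package = pkg,
    installed = TRUE,
    version = as.character(utils::packageVersion(pkg)),
    # Usually "CRAN" for CRAN builds; assumed absent for GitHub installs.
    repository = if (is.null(desc$Repository)) NA_character_ else desc$Repository,
    # Assumed to be written by remotes/devtools/pak when installing from GitHub.
    remote_sha = if (is.null(desc$RemoteSha)) NA_character_ else desc$RemoteSha
  )
}

str(describe_install("posterior"))
```

After switching between versions, restart the R session before re-running the model, since an already loaded namespace keeps the old code in memory.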

@tillahoffmann
Author

Thank you for the thorough investigation. Yes, I can replicate the behavior from stan-dev/posterior#386.

@jgabry
Member

jgabry commented Dec 13, 2024

Ok great, thanks for checking and for reporting this.

@jgabry
Member

jgabry commented Dec 17, 2024

We reverted the problematic commit in posterior, so I'm going to close this.

@jgabry jgabry closed this as completed Dec 17, 2024
@jgabry
Member

jgabry commented Dec 17, 2024

Thanks again for reporting this and creating the reproducible example in Docker, that was really helpful.

@tillahoffmann
Author

Great, thank you for digging into it and fixing it so quickly! I probably shouldn't be running single chains anyway. 😬
