Running multiple chains in one Stan program #987
Conversation
Also running heap analysis now with
and will report back once I have results from both non-parallel and parallel.
Working on cross-chain warmup tells me that as soon as chains start talking to each other there's a lot of information to utilize. Another thing that I came to believe is that massively parallel chains, in the sense of multiple (many more than the standard 4) communicating chains plus within-chain parallelization, play a key role in bringing the large-scale problems I'm interested in reasonably within reach.
Yeah that's what I was kind of thinking with the part here where we can pretty easily take N warmup samples, do stuff, then run the chains again in an outer for loop around the

There's def a lot of neat things possible with this pattern, but on its own it's a pretty heavy cost. If we can justify it with some fancy new parallel chain algorithm then I'm 100% for making this look pretty and staring into the deep abyss of perf records to figure out where the bottlenecks are.
Agreed. Other than giving us a chance to explore, which is important in its own regard, from a performance perspective maybe we need to think about what's the best bang for the buck for a limited # of cores when choosing between parallel-across-chains and parallel-within-chains: if a 4-core within-chain parallel run gives a speedup of 2.0, using those 4 cores for 4 parallel chains had better do some real magic on the warmup/sampling.
Yes 100%, would you and/or @bbbales2 have time to use this as a base for the

```
mymodel sample num_warmup=auto num_samples=500 data file='blah.json' chains n_chains=8 max_chains=16
```

where after the auto warmup we spin off (max_chains - n_chains) additional chains? I think if we can spin those up nicely + doing auto we can get the ESS up to a point that this becomes a very good idea. Going farther we could probably even have something like

```
mymodel sample num_warmup=auto num_samples_ess=300 data file='blah.json' chains n_chains=8 max_chains=16
```

where we run sampling until the minimum ESS for a parameter is more than
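For what it's worth, here's a rough standalone sketch of what that kind of ESS-based stopping loop could look like. It's purely illustrative: `run_sampling_block` and `min_ess` are hypothetical stand-ins, not Stan services, and a real version would call the sampler and an actual ESS estimator.

```cpp
#include <iostream>
#include <vector>

// Purely illustrative stand-ins -- these are NOT Stan services. A real
// version would call the sampler and an ESS routine instead.
std::vector<double> run_sampling_block(int n_chains, int block_size) {
  return std::vector<double>(n_chains * block_size, 0.0);  // fake draws
}
double min_ess(const std::vector<double>& draws) {
  return 0.5 * draws.size();  // placeholder: pretend ESS is half the draws
}

int main() {
  const int n_chains = 8, block_size = 100, max_blocks = 50;
  const double target_ess = 300.0;  // e.g. num_samples_ess=300 from above
  std::vector<double> all_draws;
  for (int b = 0; b < max_blocks; ++b) {
    std::vector<double> block = run_sampling_block(n_chains, block_size);
    all_draws.insert(all_draws.end(), block.begin(), block.end());
    if (min_ess(all_draws) >= target_ess) {
      std::cout << "stopping after " << (b + 1) << " blocks\n";
      break;  // minimum ESS across parameters has reached the target
    }
  }
}
```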
Also tagging @LuZhangstat as this may be of interest to her for the adaptation stuff.
Short term no, long term yes. Once we're closer to the other side of some subset of ocl/varmat/tuples/closures/adjoint/laplace. (Edit: and I guess the go time is a function of which of the things on the wishlist above are ready to go -- there's a lot of leverage here to motivate this change, I think.) It's definitely very interesting and an encouragingly small amount of code. (Edit: And feeds into a lot of different things.)
That's another idea to add to the list.
@SteveBronder I'm booked until May; after that I would love to dig into this, hopefully sooner.
Looks great! Thanks for sharing. I think the implementation of Pathfinder needs GPU parallel computing and clever memory management. Implementing Pathfinder with low-level parallel programming is currently low priority on my to-do list, but I plan to work on it in the near future. By the way, do you happen to know any good tutorials for GPU programming? Many thanks!
Word! If you want GPU stuff I think you can use Stan math's OpenCL backend, which has a pretty high-level API for most stuff. What kinds of operations are you looking to do in particular? If your matrix sizes are not very large it may be better to do something with the TBB so you don't pay the transfer cost of moving from the host to the GPU.

http://mc-stan.org/math/d5/de5/group__opencl.html

@rok-cesnovar we should go through and tag all the GPU functions so they show up there. We should probably do the same thing for the new matrix type, then we'd just have an automatic list.

For books, OpenCL Programming By Example is pretty good. The OpenCL Guide is pretty tried and true. I'll try to look around and find what book I used for my GPU course.
@SteveBronder Thank you so much for sharing the GPU stuff! I will read them later.
I will need matrix decompositions like the thin QR decomposition and the Cholesky decomposition. The matrix size depends on the number of parameters, so it could be large.
icic. We have the Cholesky implemented for GPUs in the math library, but no thin QR.
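Roughly, calling it from C++ could look like the sketch below. Treat this as an assumption rather than a reference: the umbrella header, `to_matrix_cl`/`from_matrix_cl`, and the OpenCL overload of `cholesky_decompose` are from memory and their exact names/paths may differ from the current math API.

```cpp
#include <stan/math.hpp>
#include <stan/math/opencl/opencl.hpp>  // assumed umbrella header for the OpenCL backend
#include <Eigen/Dense>

int main() {
  // Build a small symmetric positive-definite matrix on the host.
  Eigen::MatrixXd a = Eigen::MatrixXd::Random(100, 100);
  Eigen::MatrixXd spd
      = a * a.transpose() + 100 * Eigen::MatrixXd::Identity(100, 100);

  // Copy to the device, factor there, copy the result back.
  // Assumes to_matrix_cl / from_matrix_cl and an OpenCL overload of
  // cholesky_decompose exist under these names.
  stan::math::matrix_cl<double> spd_cl = stan::math::to_matrix_cl(spd);
  stan::math::matrix_cl<double> l_cl = stan::math::cholesky_decompose(spd_cl);
  Eigen::MatrixXd l = stan::math::from_matrix_cl(l_cl);
  (void)l;  // a real program would use the factor here
  return 0;
}
```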
I see, thanks a lot!
So, after fixing #842 in stanc3 I went back and tried these out again:

```
# This PR
time -p examples/dawid/dawid sample data file=examples/dawid/caries_dump.R chains n_chains=8
# real 200.60
# user 1452.17
# sys 2.09

# cmdstan develop with gnu parallel
time -p seq -w 0 8 | parallel "examples/dawid/dawid sample data file=examples/dawid/caries_dump.R output file=outputs{}.csv"
# real 246.03
# user 1824.17
# sys 1.52
```

LoL, so that's cool. This went from being "good idea with buts" to just a good idea! I guess it makes sense that

Anyway, this seems nice but I have no idea how to proceed. My main Qs are around how this should be impl'd in the stan repo. Like should there be a

tagging parties that might be interested @yizhang-yiz @bbbales2 @wds15
Wowowow that is a big boost. So now the multithreading is just flat faster than the multiple chains (by a big margin)?
If multithreading is just faster then there is no need to tap into the algorithms above. Otherwise, for justification, cross-chain warmup is probably the most ready of those. As far as coding challenges, I think you're right, it'd be a matter of working out how to support the interfaces. Seems like figuring out what potential cmdstan/cmdstanr/cmdstanpy support looks like would be the easiest, and then we could work back to what would need to change in services from there.

If we went the cross-chain warmup route, then we'd need a way to compute cross-chain Rhat and ESS (this code would get used by the variational stuff too). Yi said he'd be busy until May but it sounds like he'd like to work on it. I like the team projects, so it seems fun to me and I'm happy to wait. I think it would be easier to develop this once varmat is in the bag too, since this would be a big thing.
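To make the cross-chain quantity concrete, here's a minimal standalone sketch of the classic R-hat computed from per-chain draws of one parameter. It follows the usual between/within-chain variance formula (no split chains or rank normalization) and is not meant to mirror any particular Stan implementation.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Classic cross-chain R-hat from M chains of N draws each (no split chains
// or rank normalization; purely illustrative).
double cross_chain_rhat(const std::vector<std::vector<double>>& chains) {
  const double m = chains.size();
  const double n = chains[0].size();
  std::vector<double> means, vars;
  for (const auto& c : chains) {
    double mu = std::accumulate(c.begin(), c.end(), 0.0) / n;
    double v = 0.0;
    for (double x : c) v += (x - mu) * (x - mu);
    means.push_back(mu);
    vars.push_back(v / (n - 1));
  }
  double grand = std::accumulate(means.begin(), means.end(), 0.0) / m;
  double b = 0.0;  // between-chain variance estimate B
  for (double mu : means) b += (mu - grand) * (mu - grand);
  b *= n / (m - 1);
  double w = std::accumulate(vars.begin(), vars.end(), 0.0) / m;  // within W
  double var_plus = (n - 1) / n * w + b / n;  // pooled variance estimate
  return std::sqrt(var_plus / w);
}
```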
This is really awesome work! Given that this already shows benefits - maybe there is a way to get this in quickly in a first version? For example adding an argument

Now, there is one thing which I am worried about, and that is the nested parallelism pattern which we will end up having when using this with programs based on

https://www.threadingbuildingblocks.org/docs/help/tbb_userguide/work_isolation.html

I do think that we are affected by this and need to be careful, as we do use a thread-local resource (the AD stack). From my understanding we need to ensure that anything within stan-math which fires off sub-processes (like a

Makes sense?

EDIT: An additional point for headaches is MPI... maybe that should just not be allowed with multiple chains. The current MPI design would throw up on this (it would continue working, but only a single chain would ever get to use the workers at any given time).
Ooof yeah, if one thread can start writing to another thread's stack allocator that could be very bad. But I'm also confused about how that work sharing interacts with thread-local storage objects. Will the thread stealing work use the original thread's thread-local object, or will it just not care and use its own? If the latter then yeah we def need to program against this.

Can we put the
I've never looked at the MPI stuff but I'll take your word for it. Though how do we program against that so users don't goof? Would it literally just crash or would it just work poorly? I wonder how we can check, before we run the chains, that the user is not utilizing MPI.
The TBB handles tasks and abstracts away threads entirely. The only guarantee you have about threads is that they are assigned to task_arenas. No thread can be part of two task_arenas at the same time, but threads can be moved between arenas. One solution is to run different chains in different arenas... but that will not take advantage of resource re-allocation as smoothly, I think (so early-finishing chains giving their resources to still-running ones). So this is why we need to isolate any nested parallelism as outlined in the link above. This will give us the right guarantees for the thread-local resources. The right action is to do so in reduce_sum and map_rect. At least this is how I understand it.
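A minimal sketch of that isolation pattern, assuming the standard tbb::this_task_arena::isolate API; the reduce_sum/map_rect wiring is only hinted at in comments and the function itself is just an illustrative parallel sum.

```cpp
#include <functional>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <tbb/task_arena.h>

// Sketch of the work-isolation pattern described above: the nested parallel
// region runs inside this_task_arena::isolate, so a thread that blocks here
// cannot steal a task belonging to a different chain (which would otherwise
// touch the wrong thread-local AD stack).
double isolated_nested_sum(const std::vector<double>& x) {
  double total = 0.0;
  tbb::this_task_arena::isolate([&] {
    // In Stan math this would be the parallel region inside reduce_sum /
    // map_rect; here it is just a parallel sum over x.
    total = tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, x.size()), 0.0,
        [&](const tbb::blocked_range<std::size_t>& r, double partial) {
          for (std::size_t i = r.begin(); i != r.end(); ++i)
            partial += x[i];
          return partial;
        },
        std::plus<double>());
  });
  return total;
}
```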
@bbbales2 latest changes should have everything in here. I think since I sorted all this out I'm going to add the dense and other samplers to the Stan PR.
@wds15 we should be good to go! Two little things here
I think this is ready for review! |
If I run

```
./bernoulli-2 sample num_chains=4 id=100 output refresh=1000 data file=bernoulli.data.R num_threads=4
```

I'd expect to get output_100.csv, output_101.csv and so on... but I do get output_1.csv, output_2.csv and so on... I think we want the chain id to be the number, not 1...num_chains.
+ " sample num_warmup=200 num_samples=1 num_chains=2 random seed=1234" | ||
+ " output file=" + cmdstan::test::convert_model_path(model_path) | ||
+ ".csv"; | ||
|
This test implicitly tests that the default init chain id is 1 (as we test that the filename ends with _1.csv)...do we want that? Or is it better to specify chain_id=1 as an argument to cmdstan? I don't know the answer, but some thought and doc would be nice here.
Yes, this was wrong before, good catch! I set this to id=10 in the test so that we then look for files ending in 10.csv and 11.csv. I think I should add docs to the output file argument so that folks know that for id=N and num_chains=j the output will be output_N.csv, output_{N + 1}.csv, ..., output_{N + j - 1}.csv.
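Just to make that naming rule concrete, a tiny standalone illustration (not the actual cmdstan code):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Illustration of the naming rule described above (not the cmdstan
// implementation): chain i of num_chains gets the suffix id + i.
std::vector<std::string> output_file_names(const std::string& base,
                                           unsigned int id,
                                           unsigned int num_chains) {
  std::vector<std::string> names;
  for (unsigned int i = 0; i < num_chains; ++i) {
    names.push_back(base + "_" + std::to_string(id + i) + ".csv");
  }
  return names;
}

int main() {
  for (const auto& f : output_file_names("output", 10, 2)) {
    std::cout << f << "\n";  // output_10.csv, output_11.csv
  }
}
```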
```cpp
        model, *(init_contexts[0]), random_seed, id, init_radius,
        grad_samples, elbo_samples, max_iterations, tol_rel_obj, eta,
        adapt_engaged, adapt_iterations, eval_elbo, output_samples, interrupt,
        logger, init_writers[0], sample_writers[0], diagnostic_writers[0]);
  }
}
stan::math::profile_map &profile_data = get_stan_profile_data();
```
The interaction of the multi-chain stuff is probably not great with the profiling, right? At the moment you get a single profile.csv where multiple thread ids are listed... or is this exactly what one wants? Maybe @rok-cesnovar has some thoughts? I'd guess we can go with this as it is, presumably.
I think a single profile CSV file is what I would want to see. It's fewer files to deal with and all the info is in one place.
As I said in previous discussions, profiling is something one would typically run with a single chain, as there is no added information gained when running multiple chains (at least that I can think of). If anything, it just adds more noise to the per-gradient numbers.
So I would keep this simple.
This is failing because of a recent change in stanc3. I opened up a revert for that so once that goes in we can rerun the tests and all should be good.
@wds15 ready for another look!
This error msg is a bit confusing (not sure if it can be improved easily):
Valid values: All is obviously not true... |
And I think here is a bug:
gives these files:
The "les" included in the diagnostic filename seems wrong? EDIT: We want a test for this being correct, I guess? |
I fixed up the error message so now we have
and fixed up the error message for
When I do
The error msg has an extra space, "STAN_NUM_THREADS= 4" and "num_threads= 1" ... which is just a cosmetic issue...
LGTM
Hooray!
Submission Checklist
./runCmdStanTests.py src/test
Summary:
This is a prototype that can utilize a chains and n_chains argument to cmdstan so that a single Stan program can run multiple chains in parallel like the below

The Good
It works and gives back correct results (comparing the results from the multi-chain program to single-chain programs)! This also allows the data in the model to be shared across chains, so it should be nice on memory consumption.
This also seems to be a bit faster, which is nice (see the dawid test here).

The Bad
I'm not totally sure how to implement this at the stan level. Should we have _par() methods for all the services?

With multiple chains do we also need to think about parallel PRNGs?
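On the parallel PRNG question, one simple illustrative option is to derive a distinct seed per chain from (base seed, chain id). This is just a sketch of the idea, not what Stan's services actually do for per-chain RNGs.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Derive one RNG per chain by mixing the base seed with the chain id via
// seed_seq. Illustrative only; Stan's services construct per-chain RNGs
// differently.
std::vector<std::mt19937_64> make_chain_rngs(std::uint32_t seed,
                                             unsigned int num_chains) {
  std::vector<std::mt19937_64> rngs;
  for (unsigned int chain = 0; chain < num_chains; ++chain) {
    std::seed_seq sseq{seed, static_cast<std::uint32_t>(chain)};
    rngs.emplace_back(sseq);  // each chain gets its own stream
  }
  return rngs;
}
```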
My main Q is whether there's enough clever stuff we can do once we have multiple chains in one Stan program to make up for any performance loss. A couple ideas are
The Ugly
This is a rough prototype. While the results are correct, there's probably some bad stuff happening with the RNG. I also wonder about writing to disk across multiple threads; it feels like that could thrash the I/O.
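On the disk-thrashing worry, a generic sketch of one mitigation: buffer each chain's output in memory and serialize only the final writes behind a mutex. This is not how the cmdstan writers are actually structured, just an illustration of the pattern.

```cpp
#include <fstream>
#include <mutex>
#include <sstream>
#include <string>

// Each chain formats its draws into a thread-private buffer; only the final
// flush to disk is serialized, so threads never interleave writes.
class buffered_writer {
 public:
  explicit buffered_writer(std::string path) : path_(std::move(path)) {}

  void write_line(const std::string& line) {
    buffer_ << line << '\n';  // thread-private, no locking needed
  }

  void flush() {
    static std::mutex io_mutex;  // shared across all writers
    std::lock_guard<std::mutex> lock(io_mutex);
    std::ofstream out(path_, std::ios::app);
    out << buffer_.str();
    buffer_.str("");
  }

 private:
  std::string path_;
  std::ostringstream buffer_;
};
```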
How to Verify:
You can run the examples I've put in the examples folder like

Or use perf to check it out like
dawid seems to be making a ton of calls to malloc ints. I still need to dive deeper into the perf to figure out where / why that's happening.
one_comp is spending most of its time in boost::math::lanczos::lanczos_sum, which is called from lgamma.

Documentation:
Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Steve Bronder
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: