Add design doc for parallel services #40
Tagging folks who are interested in other threads: @wds15 @bbbales2 @mitzimorris @yizhang-yiz, and @bgoodri @riddell-stan for rstan and pystan. I tried keeping the scope tight to just the services layer, though it's hard to talk about the services layer without also talking about how it is used.
What about controlling threads per chain and threads total? I think we should allow controlling both, or otherwise do some benchmarks to decide on that. Sensible defaults should usually let the user specify just one of these controls and have something reasonable happen, so even for the default we would need a small benchmark to decide. I hope that makes sense in the context of this PR, but I would expect that we settle this question here. I will need to dive more into the doc. Thanks for writing it!
In my opinion the most important criterion for this API is maintaining compatible behavior with the current single-chain routes. In particular, if n_chain = 1 then the new route should behave exactly the same as the current route, and if n_chain > 1 then the behavior should be exactly the same as running the current route multiple times with appropriate chain indices.
Reproducibility with n_chain = 1 would just require looking for init files without any chain_id modification unless n_chain > 1, no?
Reproducibility between parallel and serial routes with multiple chains would depend on the behavior of `chain`. Does `chain` become an initial chain index in the parallel context with the other chains given incremental indices, chain + 1, chain + 2, etc? If so then we should probably rename it to something more precise such as `init_chain_id` and the init file lookup should use these chain ids and not indices starting from 1. An alternative would be to accept a vector of chain_id values, although that puts a nontrivial burden on the routes to validate the chain_ids and how the PRNG generator is strided.
In any case with the introduction of `n_chain` the old `chain` argument should probably be made more explicit, for example `chain_id` or even `init_chain_id` as above.
None of the callbacks are called often enough for the vtable hit to be significant, and templating _everything_ seems a bit aggressive, but that's a minor point.
I'm not persuaded that one needs to add a function to services for this. Doing parallel sampling "manually" doesn't seem too hard (e.g., using …).

Nits:
This is something that should be implemented once instead of allowing users to roll their own. When it comes to threading, these things need to be done carefully.
I ran some benchmarks with the redcard model. idt I like the threads-per-chain thing, because then how would a user (or us choosing a default) do something like 8 chains when there are a total of 32 cores available? Since the TBB seems to be Just Working™ I think we should just have a …

CmdStan parallel chains (redcard):
- 4 chains, 32 threads: real 18.22
- 4 chains, 24 threads: real 19.14
- 8 chains, 32 threads: real 33.83

GNU Parallel (redcard):
- 4 chains, 32 threads: real 17.95
- 4 chains, 24 threads: real 19.38
- 8 chains, 32 threads: real 34.26
+1 to @mitzimorris. There are a few gotchas here in setting up the parallel things, and I think having a standardized API (and docs the services can use to make sure everything is lined up) makes it worth it. I thought about just leaving this as something the users of the services could implement on their own (you can do this by just wrapping the service layer call in a parallel for), though I think there are enough little annoyances to have something standardized.
Yep! I have it hard-coded that if
Yep!
Yep!
I like having the init file lookup based on the starting id number, so that if a user is running multiple stan programs with multiple chains per program they can have one program start at 1 with 4 chains, another at 5 with 4 chains, etc.
I really thought we had a
So actually kind of nice for the machinations here! I was worried about ambiguity between having a
Let's stick with the easier version for now, I think at some point we may want this but the easy peasy version works for what we want right now.
Yeah so the templates here are not for performance but for flexibility for developers using the service layer. Since now the …
The results for threads total vs. threads per chain look promising. Is it a possibility to lay out the code such that the threads-per-chain thing could be added later on? Given how things look right now I agree with you that we do not need the threads-per-chain thing.... and in fact, a user who requires it could simply fire up different chains in different processes and control the threads for each of them.... which may lead us to conclude that threads per chain is not needed? The other thing which is really important in this context is the ability to do cross-chain warmup. That's hopefully not an afterthought, right? (Sorry for not having studied the text in detail yet.) It was raised that we change the meaning of some current command line options, as I understood... it would be good to avoid that as much as possible unless we say we bump major version numbers.
We could add it, it would just require another signature. Because for threads per chain we are going to want a
Yeah, a lot of this is to set up the tooling so that a lot of the API work is done for adding a cross-chain warmup scheme, because then it's just targeting a different multi-chain service layer which should have very similar input/output schemes.
Yeah I agree.
One more question: despite the templates being introduced, we can still pre-compile the Stan services ahead of time, right? Assuming a yes here, which I would guess, then this design is sound for me.
Yep! I can double check but I believe that's totally fine.
Word! Let me read this over and fix some things for what @betanalpha posted and then I'll call for a vote.
Reproducibility between parallel and serial routes with multiple chains would depend on the behavior of chain. Does chain become an initial chain index in the parallel context with the other chains given incremental indices, chain + 1, chain + 2, etc?
Yep!
I agree that this is the right behavior but it has to be thoroughly doc'd. In particular I think that we need some discussion of how to emulate the same behavior with multiple calls versus one call (including not just multiple one-chain calls with chain=1, chain=2, …, chain=n but also multiple multi-chain calls with properly strided chain ids, for example chain=1, chain=2 * num_chains, chain=…n * num_chains).
If so then we should probably rename it to something more precise such as init_chain_id and the init file lookup should use these chain ids and not indices starting from 1.
I like having the init file lookup based on the starting id number, so that if a user is running multiple stan programs with multiple chains per program they can have one program start at 1 with 4 chains, another at 5 with 4 chains, etc.
I think we mean the same thing!
If chain (or whatever we call it) is, say, 5, and num_chain is 3 then the function will look for the files
name.init.5.R
name.init.6.R
name.init.7.R
yeah?
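For concreteness, a minimal sketch of that lookup rule (the helper name and the `.R` suffix here are just illustrative, not part of the proposal):

```
#include <string>

// Hypothetical helper: given the user-supplied starting id and the 0-based
// index of the chain within this call, build the per-chain init file name,
// e.g. base = "name.init", init_chain_id = 5, chain_idx = 0..2
//   -> "name.init.5.R", "name.init.6.R", "name.init.7.R"
std::string init_file_for_chain(const std::string& base, unsigned int init_chain_id,
                                unsigned int chain_idx) {
  return base + "." + std::to_string(init_chain_id + chain_idx) + ".R";
}
```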
In any case with the introduction of n_chain the old chain argument should probably be made more explicit, for example chain_id or even init_chain_id as above.
I really thought we had a chain argument in cmdstan, but we don't? We actually only have an id argument
id=<int>
Unique process identifier
Valid values: id >= 0
Defaults to 0
So actually kind of nice for the machinations here! I was worried about ambiguity between having a chain argument and chains. I think I was confused because in the services layer we call id -> chain. Or if you are talking about the chain argument at the services layer then yeah I'm fine with calling the new API's chain argument init_chain_id (or maybe just process_id)?
Yeah, I’m just thinking about the naming for the services API. The interface naming is another matter altogether.
I personally prefer something like `init_chain_id` because it emphasizes that the individual chains being bundled together will be given related ids. `process_id` on the other hand seems to label the entire bundle and not any individual chain.
An alternative would be to accept a vector of chain_id values, although that puts a nontrivial burden on the routes to validate the chain_ids and how the PRNG generator is strided.
Let's stick with the easier version for now, I think at some point we may want this but the easy peasy version works for what we want right now.
I agree; I was just throwing out alternatives in case I was misunderstanding the intent of the new API.
None of the callbacks are called often enough for the vtable hit to be significant and templating everything seems a bit aggressive, but that's a minor point.
Yeah, so the templates here are not for performance but for flexibility for developers using the service layer. Since the std::vector<> holding those inits and readers can now be anything with a get_underlying() overload, it's easier to have them as templates. This lets rstan use Rcpp::Xptr as the inner value in the vector, and that sort of thing. We could just have the API be std::vector<var_context*>, but then we lose the ability to use smart pointers like std::shared_ptr<> etc.
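As a rough sketch of what that could look like (the names here, including the get_underlying overload and the stand-in var_context type, are placeholders rather than the actual proposed API):

```
#include <memory>
#include <vector>

struct var_context { virtual ~var_context() = default; };  // stand-in for stan::io::var_context
struct dummy_context : var_context {};

// One possible get_underlying overload; an interface could add its own,
// e.g. one that unwraps an Rcpp::Xptr.
inline var_context& get_underlying(const std::shared_ptr<var_context>& p) { return *p; }

template <typename InitContext>
void init_all_chains(std::vector<InitContext>& inits) {
  for (auto& init : inits) {
    var_context& ctx = get_underlying(init);  // resolved per element type
    (void)ctx;  // ... hand off to the existing single-chain initialization ...
  }
}

int main() {
  std::vector<std::shared_ptr<var_context>> inits{std::make_shared<dummy_context>()};
  init_all_chains(inits);
}
```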
You mean that you just want to avoid having to make the interfaces wrap things in container classes that derive from var_context or the like? I don’t have any huge objections here, just want to understand the motivation.
Thanks!
Yes, so I think the way I have it written should work to be reproducible. In the scheme I have now each PRNG is set by the seed + chain_id (the initial …). Though the output file name will be different in a multi-chain program vs. a multi-program scheme. Changing the behavior of the multi-program scheme would require breaking backwards compatibility, since we would tag each output with a …
Yes we are! Should have affirmed that, it's good and cool!
icic yeah if we are talking about the service layer I think that makes a lot of sense, I'll update this.
Yeah, so the main thing is just letting upstream services be flexible in what they pass to the API. So a user of the service layer could just pass an … Another alternative here is just to specify that the inner element of a vector must have a valid …
Alrighty, I just updated with fixes! I think I'd like to hold a vote starting from April 21st to April 28th (Wednesday to Wednesday) if that is alright with everyone.
designs/0020-parallel-chain-api.md
std::vector<InitWriter>& init_writer,
std::vector<SampleWriter>& sample_writer,
std::vector<DiagnosticWriter>& diagnostic_writer,
size_t n_chain)
I would welcome some consistency with the parameter name. Having `n_chain` in C++ alongside `num_warmup`, `num_samples`, and `num_thin` is odd. And then in CmdStan it's `chains` (plural, no `n_` or `num_` prefix). Seems like `num_chains` would be consistent with the current C++ API.
Reasonable!
Alrighty, let's do the vote starting today. Tagging everyone who has commented so far: @riddell-stan @wds15 @betanalpha @mitzimorris @bbbales2. Let's have the vote as either a comment or as a review.
I thought design docs just needed one endorsement by someone other than the author(s).
So in the past we had it that if there were no disagreements among anyone who has commented so far (full quorum) we would just merge the design doc. Folks seem to want to move to a voting scheme lately, so I'm fine with also doing a vote here. Though if no one has an objection, or 50% of the people tagged above vote yes, then I think we are good to just approve and merge it.
I'm not aware of any proposal to move to a voting scheme. The adjoint ODE doc vote was due to disagreement.
Oh, in that case if folks are fine with it let's approve and merge this. I can update the parallel branch with the above changes today.
lgtm.
For now I was going to just do sampling, though I think this scheme makes a lot of sense for VB (and I think @hyunjimoon had some stuff for VB where multi-chain would be very useful). Does it make sense to add this for optimization? If so then yeah, I'm also fine with doing it for optimization as well.
Motivation:
We’ve had this discussion before, but I think it’s important enough to mention again. In my opinion the main scope of this design doc is setting up the API for multiple chains to be managed and run by the services, instead of requiring the interfaces to implement this all independently. This not only reduces the work required of an interface implementation but also unifies the UX across interfaces.
The ability to run the chains in parallel with the TBB and to consider sharing information across chains is an additional benefit that follows from the API. In my opinion the only reason to even mention these right now is the thread safety requirement of some of the inputs.
Guide-level explanation:
I think that the `num_chains` argument should be up higher in the signature. Honestly it should probably go right after `model`.
To avoid confusion I think that we should probably also change `num_warmup` to `num_warmup_per_chain` and `num_samples` to `num_samples_per_chain`.
`seed + init_chain_id + chain_num` is not a good initialization. For the L’ecuyer PRNG each total chain_id needs to be scaled by a long enough stride, `chain_seed = seed + stride_size * (init_chain_id + chain_num)` or more readably
```
chain_id = init_chain_id + chain_idx
chain_seed = base_seed + stride_size * chain_id
```
Upon further reflection I’m not a fan of templating out the var_context. The problem is that this puts the responsibility of supporting different `get_underlying` implementations on the services library instead of the user. Sure wrapping a smart pointer in var_context derived class is a bit more work for an interface but then any interface can implement any var_context strategy they want without any work/approval from the services layer. In my opinion the object oriented approach provides a much more flexible compartmentalization here.
Buffering output into chain-specific stringstreams is not ideal. Yes, it avoids threading issues, but it also imposes a significant memory burden for each chain (one of the reasons why CmdStan is so much more scalable than RStan and PyStan is its streaming output) and prevents realtime updates. std::cout-based loggers shouldn’t need buffering at all, which leaves thread-safe interrupt implementations as the only nontrivial problem, no?
Recommended Upstream Initialization:
It would help to write out a specific example of the init file logic here to avoid any confusion.
Open Questions:
“Model” here is being used in a non-standard way statistically. When a `log_prob` object gets initialized it partially evaluates the target density defined by the Stan program, which immediately defines an unnormalized posterior density function. That’s the object that then gets consumed by the algorithms for each “fit”.
The behavior of the `transformed data` block isn’t an open question — for this design doc the only issue is reproducing the current behavior. That said clarifying the current behavior (correct me if I’m wrong but the `transformed data` PRNG gets seeded by `seed` alone and the `generated quantity` PRNG gets seeded by `seed + stride * chain_id`) would be a welcome addition to the doc.
This is correct.
I have two issues with that. The first one is how this is written and the second one is what is written. Let's start with the first one (you are probably thinking about the right thing, but it is not written correctly). Assuming we want to keep behavior the same, we need to seed all generators with the same seed and then jump ahead some number of steps. In the current code this is (with seed being the same as base_seed in your suggestion):

static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
boost::ecuyer1988 rng(seed);
rng.discard(DISCARD_STRIDE * chain);

The second issue is by how many strides we should jump ahead (what is the …
Thanks! I think I see what you are saying, but just to check, are you saying something like

static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
boost::ecuyer1988 rng(seed);
// Just changing this part
rng.discard(DISCARD_STRIDE * init_chain_id + id_of_thread_in_process);

Though the RNG is not currently constructed within the thread, is there anything wrong with just using the chain_id?
Exactly.
Well, how does it work now if multiple threads are computing one chain? Do they share a single PRNG? I think that would make the threaded part nondeterministic if it uses the PRNG.
Though for me I think we can see some good speedups, not in the fact that we are running multiple things at once, but in the reduced memory usage we get from sharing data across chains.
Totally agree about the eventual benefit, but I think the priority of the design doc here is the design of the API route itself and not its implementation (again, modulo details like thread safety that have to be enforced in the route).
So idt we can get rid of the actual templates for the var_context inputs while supporting multiple types of smart pointers. For instance if the signature had std::vector<var_context*> then we could only bring in raw pointers to var_context derived types. What we can do is get rid of get_underlying() and say that we accept any input vector whose elements have a valid operator* which returns a reference to a class derived from var_context. I think that's flexible enough for any doo-dads that downstream devs may need while being concrete in the service layer.
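A small sketch of that alternative requirement, assuming only that `operator*` on each element yields a reference to a `var_context`-derived class (the names below are illustrative stand-ins, not the actual signatures):

```
#include <memory>
#include <vector>

struct var_context { virtual ~var_context() = default; };  // stand-in for stan::io::var_context
struct dummy_context : var_context {};

template <typename ContextPtr>
void consume_inits(const std::vector<ContextPtr>& inits) {
  for (const auto& init : inits) {
    var_context& ctx = *init;  // only requirement: operator* yields a var_context&
    (void)ctx;                 // ... feed into the per-chain setup ...
  }
}

int main() {
  dummy_context on_stack;
  std::vector<var_context*> raw{&on_stack};  // raw pointers satisfy the requirement
  std::vector<std::shared_ptr<var_context>> smart{std::make_shared<dummy_context>()};
  consume_inits(raw);    // ... and so do smart pointers
  consume_inits(smart);
}
```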
Is the issue here that RCPP wants to wrap all references in smart pointers? So the problem isn’t that RStan couldn’t wrap local memory in a `var_context` derivation but rather that it couldn’t expose it outside of a smart pointer reference?
So the stringstreams for this are very short-lived and only exist because we use multiple << to construct the output of each line. How std::cout etc. work is that they have a buffer that we write to that is then printed out to wherever. When we do multiple << like the below
std::cout << "Yo" << "oo" << "oo" << "\n";
// Prints out Yooooo \n on a single thread
in a multithreaded context std::cout is thread-safe, but it can interleave output at any of those << calls, so you can end up with valid but silly-looking output like
YoooYooo
oooo
YoooYooo
oo
(and by valid here I just mean it won't cause the computer to crash from UB)
When we want an individual line to be printed all in one go, we put that whole line in a string and then tell cout to print that whole string at once:
std::stringstream my_yo;
my_yo << "Yo" << "oo" << "oo" << "\n";
std::cout << my_yo.str();
// multi-thread prints out
/* Yooooo
* Yooooo
* Yooooo
*/
That makes sense, but I don’t think it was clear from the doc. I think it would help to emphasize buffering local, contiguous outputs like numerical output for a single iteration or messages but not the entire chain output.
This was supposed to be done with voting tomorrow. What I'd like to do is, after tomorrow, as long as a bunch of people don't suddenly vote no, we'll make this a "Tentative Yes" pending sorting out the stride issue and changing the things above, which I think are good and reasonable.
As was pointed out earlier in the thread I don’t think there’s a need for voting here. The design is still evolving, for the better, once the design equilibrates we can consider voting if there’s not consensus.
I have two issues with that. The first one is how this is written and the second one is what is written. Let's start with the first one (you are probably thinking about the right thing, but it is not written correctly). Assuming we want to keep behavior the same, we need to seed all generators with the same seed and then jump ahead some number of steps. In current code this is (with seed being the same as base_seed in your suggestion):
static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
boost::ecuyer1988 rng(seed);
rng.discard(DISCARD_STRIDE * chain);
Yes, apologies, I was being way too sloppy and thinking of seed as the state initialized by the variable `seed` and `+` as the seek operator on the internal PRNG state.
I agree that this is the correct implementation of seeking for Boost’s L’ecuyer generator.
AFAIK we can run multiple processes at the same time, multiple chains in the same process, and multiple threads for each chain. Each thread needs its own PRNG instance. So this needs to be equal to the global thread id, which is init_chain_id + id_of_thread_in_process.
Though the RNG is not currently constructed within the thread, is there anything wrong with just using the chain_id?
Well, how does it work now if multiple threads are computing one chain? Do they share a single PRNG? I think that would make the threaded part nondeterministic if it uses the PRNG.
I’m not sure exactly how precisely constrained this is in the compiler but my understanding is that the functors that expose threading don’t accept functions that utilize any of the `_rng` functions, in which case there is no need to parallelize the PRNG within each chain.
static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
boost::ecuyer1988 rng(seed);
// Just changing this part
rng.discard(DISCARD_STRIDE * init_chain_id + id_of_thread_in_process);
Though the RNG is not currently constructed within the thread, is there anything wrong with just using the chain_id? i.e. if I have 4 chains starting at id of 0 they would be strided by
DISCARD_STRIDE * 0
DISCARD_STRIDE * 1
DISCARD_STRIDE * 2
DISCARD_STRIDE * 3
We still need the init_chain_id in there as well so that if someone runs multiple chains split into multiple calls to the service layer with the same seed but a different chain_id then the PRNGs will still be roughly independent.
```
// Initialize L’ecuyer generator
boost::ecuyer1988 rng(seed);
// Seek generator to disjoint region for each chain
static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
rng.discard(DISCARD_STRIDE * (init_chain_id + chain_num));
```
All that’s needed is for `DISCARD_STRIDE` to be long enough that each chain has enough space to consume as many PRNG states as it needs while not being so long that `DISCARD_STRIDE * num_chains` gets close to the period of the generator.
Then ignoring threads working on the same chain is fine, although the restriction seems somewhat arbitrary. It makes more sense for …

We could also evenly space the generator period between streams, doing DISCARD_STRIDE = generator_period / total_n_chains. Although the way it is now is also fine as long as each chain needs less than 1<<50 randomly generated numbers and we run less than generator_period / (1<<50) streams, which is probably a reasonable assumption.
With the TBB the concept of a thread is abstracted away. You could have thread-local streams for each chain, but that is not super performant. Threaded out-of-order access to an RNG is not a nice thing to support; if that can be avoided, that would be good.
For instance, this new low-rank ADVI is being tested with a robust variational inference metric which uses multiple chains to compute Rhat. I think it would be useful.
That's fair, though if you check out the most recent update I think I touch more on the motivation for this.
I think it's that I want to wrap pointers into smart pointers. We could make this a std::vector<var_context*>, but then that means all the upstream interfaces need to manage the memory for their var_context pointers. Just to make sure I understand, is the alternative interface you are thinking about just having a std::vector<var_context*>, or is there something else you had in mind?

wrt the vector of raw pointers, it's just more manual memory management that I think we can let users avoid/hide by making these templates that take in a class with an operator* that returns a reference to a class derived from var_context.
That's reasonable. It feels like we are talking more in depth about implementation details; the stride etc. for the PRNG are not user-facing. At this point I'd prefer approving this and moving the PRNG conversation over to the implementation (that should also make the conversation easier, as we can look at how I currently have things coded up).
ADVI is stochastic (initialization point and stochastic gradients for the ELBO), but it's not an MCMC method and does not have chains. The draws it produces are just independent draws from the normal approximation. You can treat them like a Markov chain, but we know its stationary distribution is normal(mu, Sigma) and we know the draws are independent. This means there's no point in splitting, as in split R-hat. If we stuck to the unconstrained scale, all the sample quantities required for R-hat would be available analytically. We lose that if we convert back to the constrained scale.
I'm late to this discussion, but why not just have
They may not be user-controllable, but they can impact users. But it's probably not an issue if the skip-ahead strides are big enough. I think we used values for CmdStan MCMC that were larger than the number of random numbers we could sample in any reasonable amount of time.
I think we can talk about ADVI another time, as this design doc is just for the sampler's service API. It's definitely something we should think about soon; I think a lot of the API layout here will be applicable to ADVI as well.
https://godbolt.org/z/nK4dhMcMo
Yeah I agree, though I think that is something we would put in the Stan docs about parallel processes, which we would add once we sort out the striding thing in the impl.
Sorry if the point was not clear. The quote below from this paper suggests running stochastic optimization with multiple chains. It is true split-Rhat is used, but the algorithm is designed based on J chains. Viewing the stochastic optimization as MCMC and therefore borrowing MCMC diagnostics is the novelty of the paper, which motivated us (@Dashadower @bbbales2 and I) to include khat and Rhat in the VI diagnostics; related to this issue. "We use the split-Rhat version, where all chains are split into two before carrying out the …
We could also evenly space the generator period between streams, doing DISCARD_STRIDE = generator_period / total_n_chains. Although the way it is now is also fine as long as each chain needs less than 1<<50 randomly generated numbers and we run less than generator_period / (1<<50) streams, which is probably a reasonable assumption.
I think technically we’d want to do something like (generator_period / 2) / num_chains to avoid any issues from equidistribution near the end of the period, but that’s a small issue and seeking ahead 1 << 50 states per chain should be good enough. Using a fixed stride much less than period / num_chains also allows for more chains to be run in the future; running first with seed=S, init_chain_id=1, num_chains=4 and then afterwards with seed=S, init_chain_id=5, num_chains=4 would ensure pseudo-independent sequences for each of the 8 chains.
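For a rough sense of scale (assuming boost's ecuyer1988 period of roughly 2^61, about 2.3e18): a fixed 2^50 stride leaves room for on the order of 2^11 ≈ 2000 pseudo-independent chains, each with up to 2^50 draws. A back-of-envelope check:

```
#include <cstdint>
#include <iostream>

int main() {
  // Assumed: boost::ecuyer1988 has a period of roughly 2^61 (about 2.3e18).
  const long double period = 2.3e18L;
  const long double stride = static_cast<long double>(std::uintmax_t{1} << 50);
  // Roughly 2000 pseudo-independent streams of up to 2^50 draws each.
  std::cout << "approx. number of streams: " << period / stride << "\n";
}
```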
I think it's that I want to wrap pointers into smart pointers. We could make this a std::vector<var_context*> but then that means all the upstream interfaces need to manage the memory for their var context pointers. Just to make sure I understand, is the alternative interface you are thinking about just having a std::vector<var_context*> or is there something else you had in mind?
Yeah `std::vector<var_context*>` matching the current implementations.
wrt the vector of raw pointers, it's just more manual memory management that I think we can let users just avoid/hide by making these templates that take in a class with an operator* that returns a reference to a class derived from var_context. Then cmdstan can use shared_ptr<var_context>, rstan can use Rcpp::Xptr<var_context>, python can use the smart pointer it wants to use if it has one, etc.
Sorry for being dense here. I didn’t think that the memory management was all that burdensome and, if anything, it gave the interfaces the flexibility to, for example, let the local REPL environment manage the memory. I’m not caught up on the latest RCPP but it sounds like Xptr achieves that for R using a smart pointer pattern? @riddell-stan Is there a similarly straightforward pattern that httpstan can exploit?
In any case, is the choice of smart pointers over regular pointers orthogonal to the parallelization issues? If so, wouldn’t it be better to add new smart pointer routes for all the current routes in one go in a separate PR?
It feels like we are talking more in depth on implementation details. Like the stride etc. for the PRNG are not user facing.
I just want to make sure that the signature and automated chain_id will be flexible enough for all the use cases we might encounter. Once we’re agreed on that (which I think we largely are) then I agree the further implementation details go beyond the scope of the design doc.
No worries at all! It's not like it's "oof this is going to take me days", but it's just another thing I have to think about that using a smart pointer lets me not think about. Like for cmdstan it would be nice to not have to do a bunch of new and delete's over the elements of the vector where I could just have a std::shared_ptr<>.
idk if I'd call it orthogonal. Sort of? Like right now the templated version of the signature in this proposal allows someone to pass in a raw pointer, since the only requirement is that a valid operator* exists for the object that returns a reference to a class derived from var_context. That allows for both raw and smart pointers.

The current routes just take in a reference to an object derived from var_context, and idt there's any need to change those.
Is the Q here whether we can ensure pseudo-independent sequences for each of the chains given init_chain_id, seed, and num_chains?
Wanted to bump this, as I think if we can get this merged this week then we will have enough time to add it to cmdstan before the next release. @betanalpha does the above answer your Qs about the var_context? imo I'd like to move the rest of the PRNG things to the implementation discussion.
No worries at all! It's not like it's "oof this is going to take me days" but it's just another thing I have to think about where using a smart pointer lets me not think about. Like for cmdstan it would be nice to not have to do a bunch of new and delete's over the elements of the vector where I could just have a std::shared_ptr<>. For R using raw pointers instead of Rcpp::Xptr just involves calling PROTECT() / UNPROTECT() / R_SetExternalPtrProtected() etc. that you don't have to think about if you use Rcpp::Xptr
In any case, is the choice of smart pointers over regular pointers orthogonal to the parallelization issues? If so, wouldn’t it be better to add new smart pointer routes for all the current routes in one go in a separate PR?
idk if I'd call it orthogonal. Sort of? Like right now the templated version of the signature in this proposal allows someone to pass in a raw pointer since the only requirement is that a valid operator* exists for the object that returns a reference to a class derived from var_context. That allows for both raw and smart pointers.
The current routes just take in a reference to an object derived from var_context and idt there's any need to change those. Those ones just call operator* on the var context derived object when calling the service layer API
But, and correct me if I’m wrong, all the smart pointers would do is call delete on the collection of var_context instances once the std::vector goes out of scope in the interface code, right? That doesn’t seem like a tremendous amount of memory management to me, especially when care is already going to be required in constructing the separate var_context for each initialization anyways. In particular, why couldn’t an interface just construct the var_context instances on the stack, as CmdStan currently does, and then throw the local stack addresses into a vector? They should be valid through the lifetime of the function call, no?
What’s bothering me is the asymmetry this would create in the multi-chain and single-chain routes, and the different work the interfaces would need to do to use those routes. To me a BLAH and std::vector<BLAH*> pattern is much more natural, although a consistent shared_ptr<BLAH> and std::vector<shared_ptr<BLAH>> pattern wouldn’t be much worse.
Is the Q here whether we can ensure pseudo-independent sequences for each of the chains given init_chain_id, seed, and num_chains? imo it kind of feels like our answer is yes at this point and now we are mostly focusing on how exactly we are going to setup the stride. Or am I missing some part of the convo?
Yes, I think we’re largely in agreement but at the same time I don’t think this can be fully pushed to implementation. Because this is behavior that we want the API to guarantee I think it has to be fully fleshed out in the design doc (unless a `stride` argument is added to the route or something). Like the filename lookup logic I think we want to have the precise PRNG logic specified.
I think we’re all on the same page with
```
// Initialize L’ecuyer generator
boost::ecuyer1988 rng(seed);
// Seek generator to disjoint region for each chain
static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
rng.discard(DISCARD_STRIDE * (init_chain_id + chain_num - 1));
```
Personally I also think it would make sense to have the parallel filenames also match this logic, using `init_chain_id + chain_num - 1` as the file index instead of just `chain_num`.
It looks as if things are settled enough from my view... but I don't want to jump ahead here. Given that @SteveBronder obviously is willing to invest a lot of time on it in the next days, I think it would be good to make an attempt to get this over the line and into 2.27. So how about @betanalpha and @SteveBronder have a quick call to sort out the remaining bits? From my end I can review stuff from the PRs as it relates to threading and all that, or other things I have worked on in the past. There is still some time, so it's not entirely out of scope for 2.27 if people put some time into it now.
It's a lot less trivial than it first appears. Like here's an example. Say with the init var_context we want to follow the guidelines above and the user only specifies …
It's not clear to me what's left to be discussed. It feels like at this point we are talking about the stride behavior, but imo that's an impl detail and I don't like the idea of specifying the exact implementation inside of the design doc. Like say we build it out and we are like, "Oh X is much better we should be doing that with the PRNG", then any convo we had here would end up being tossed and we also have to go back and make another PR to change that. I think what we want in the design doc is to specify what sort of user behavior allows for reproducibility (which I think is what we have right now). Any technical details would be documented in the doxygen and explaining to users how to run an analysis reproducibly would go in the Stan documentation.
I spoke to @betanalpha and he wants to keep it here to make sure all communication about the design process is public (which I think is fair and reasonable).
Much appreciated!
That's the other thing: the impl is pretty much done based on the current design doc, and I think talking about the PRNG things on the impl side is going to make the conversation a lot easier since there we can actually look at and comment on code. Talking about technical things like the PRNG here in the design doc is very difficult for me, and to be honest I'm pretty lost on what needs to be done wrt the PRNG in order to move forward.
@betanalpha if you're cool with that I think it would be easier. There's a couple places where I'm confused rn
It would be nice to avoid a pointer here but since we use abstract base classes idt that works, unless I'm doing something wrong in the godbolt example below https://godbolt.org/z/vvMhhKGre
Yeah that's very fair, I think I was saying that because I'd like to have it in for this next release. Though to be clear I'm not saying, "How the impl is doing it should be what should be done" but that the impl that I have now does satisfy almost everything in this design doc and I'm mostly waiting to make any final changes in the impl till this is approved.
My thought process here is that this is an implementation detail with respect to the users of the service layer. imo this design doc is centered around the service layer APIs and so it should have explanations at that level. That reflects what I have in the design doc right now describing how users of the service layer will allow for reproducibility. Also I think my other confusion here is that I don't understand what the specific changes necessary to the current design doc are with respect to the PRNG. I'm fine with copy-pasting the below into the design doc, but it feels odd to me. Like what if we ever decide to change anything here? For instance what if we decide to change this to use equal spaced strides? Or swap out to a PRNG with a longer cycle like taus88?

inline boost::ecuyer1988 create_rng(unsigned int seed, unsigned int init_chain_id, unsigned int chain_num) {
  // Initialize L’ecuyer generator
  boost::ecuyer1988 rng(seed);
  // Seek generator to disjoint region for each chain
  static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
  rng.discard(DISCARD_STRIDE * (init_chain_id + chain_num - 1));
  return rng;
}

What if I just included a link to the branches …
@betanalpha if you're cool with that I think it would be easier. There's a couple places where I'm confused rn
Sure, I can do a quick call tomorrow before 4.
I'm jumping into this fairly late, but do we really need pointers here? Why not just use vector<var_context&> to make sure the result is a reference? (Yes, I know they're the same underlying, but we've avoided pointers as much as possible, and using references simplifies the use of elements, replacing -> with the standard `.`.)
It would be nice to avoid a pointer here but since we use abstract base classes idt that works, unless I'm doing something wrong in the godbolt example below
https://godbolt.org/z/vvMhhKGre
Let’s back up a bit.
Currently the interfaces allocate a var_context instance for the initializations on the stack and pass them to the API by reference.
The current design in the doc changes this so that the var_context instances have to be allocated on the heap, in which case the memory has to be allocated and managed somewhere. I appreciate how smart pointers can be useful if some or all of the heap instances are replicated, but I would just prefer to avoid heap allocation in the first place if at all possible. As Bob said, we’ve tried to avoid that where possible in the code base so far.
I think we should ignore statements like "the impl is pretty much done based on the current design doc" when considering designs. I don't want sunk implementation cost to affect design the way it has in the past.
Yeah that's very fair, I think I was saying that because I'd like to have it in for this next release. Though to be clear I'm not saying, "How the impl is doing it should be what should be done" but that the impl that I have now does satisfy almost everything in this design doc and I'm mostly waiting to make any final changes in the impl till this is approved.
This has come up a few times already and respectfully I think it’s the wrong way to approach design docs. An implementation can be useful for prototyping out a design but I am hesitant when that preliminary implementation starts to force design decisions that may or may not be appropriate.
Stride behavior is not an implementation detail for whatever function is taking a PRNG reference. It affects how that reference is modified by the function call
These proposed routes don’t take references to existing PRNG instances but instead create their own PRNG instances internally (specifically, one for each chain), so we don’t have to worry about messing up some external PRNG state.
My thought process here is that this is an implementation detail with respect to the users of the service layer. imo this design doc is centered around the service layer APIs and so it should have explanations at that level. That reflects what I have in the design doc right now describing how users of the service layer will allow for reproducibility.
The subtlety here is that we have to consider reproducibility with multiple calls versus a single call. In particular, is running
```
seed=848383, init_chain_id=1, num_chains=N
```
always equivalent to
```
seed=848383, init_chain_id=1, num_chains=n1
seed=848383, init_chain_id=1 + n1, num_chains=n2
seed=848383, init_chain_id=1 + n1 + n2, num_chains=n3
seed=848383, init_chain_id=1 + n1 + n2 + n3, num_chains=n4
```
so long as n1 + n2 + n3 + n4 = N?
If we set the stride by dividing the PRNG period, or some subset thereof, by the number of chains in each call, for example
```
// Seek generator to disjoint region for each chain
static uintmax_t DISCARD_STRIDE = (PERIOD / 2) / num_chains;
rng.discard(DISCARD_STRIDE * (init_chain_id + chain_num - 1));
```
then we would not have this consistency.
If we use a constant stride that’s long enough for any single chain, as we currently do and I think we’ve largely agreed to do here, then we can guarantee this behavior.
In this case in my opinion what we need to document is not the particular implementation but rather the consistency guarantee whenever the total number of chains is less than PERIOD / (static_cast<uintmax_t>(1) << 50).
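To make that guarantee concrete, here is a tiny check (a sketch only, assuming the constant-stride rule above; `chain_offset` is an illustrative helper, not part of the proposal) that the discard offset for a given absolute chain id is the same whether the chains come from one call or from split calls:

```
#include <cassert>
#include <cstdint>

// Offset into the PRNG stream under the constant-stride rule:
// DISCARD_STRIDE * (init_chain_id + chain_num - 1), with chain_num starting at 1.
std::uintmax_t chain_offset(unsigned int init_chain_id, unsigned int chain_num) {
  static const std::uintmax_t DISCARD_STRIDE = static_cast<std::uintmax_t>(1) << 50;
  return DISCARD_STRIDE * (init_chain_id + chain_num - 1);
}

int main() {
  // Overall chain 6: chain_num = 6 of one 8-chain call (init_chain_id = 1),
  // or chain_num = 2 of the second of two 4-chain calls (init_chain_id = 5).
  assert(chain_offset(1, 6) == chain_offset(5, 2));
}
```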
For instance what if we decide to change this to use equal spaced strides?
Then the API behavior fundamentally changes, which is why I think we need to doc it carefully.
Or swap out to a PRNG with a longer cycle like taus88?
The period does affect how many chains can be run in parallel (if we’re going with the constant stride strategy). That number should be pretty huge so it won’t affect most use cases but technically the PRNG details do leak out of the API a bit here. In any case changing the PRNG can have all kinds of unexpected side effects so I really don’t think that choice should be considered an implementation detail anyways.
We can't. C++ forbids pointers to references, which essentially prevents us from putting references into any container (except struct and tuple, I think). It is possible to wrap references in a struct before putting them in a vector, but in that case it gets more complicated than simply using a pointer.
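For reference, the standard "wrap the reference in a struct" workaround is std::reference_wrapper; a minimal sketch (with var_context as a stand-in for the real stan::io::var_context):

```
#include <functional>
#include <vector>

struct var_context { virtual ~var_context() = default; };  // stand-in
struct dummy_context : var_context {};

int main() {
  dummy_context c1, c2;
  // std::vector<var_context&> is ill-formed, but reference_wrapper works:
  std::vector<std::reference_wrapper<var_context>> ctxs;
  ctxs.push_back(c1);  // implicitly wraps the reference
  ctxs.push_back(c2);
  var_context& first = ctxs[0].get();  // element access goes through .get()
  (void)first;
}
```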
Using … And I think I understand what you are looking for with the PRNG, and I just updated the docs to have the constant stride in there.
Alrighty, @betanalpha and I spoke over the phone and I think with the changes to the template context names and the updated PRNG docs we should be good to go!
I think with the above + the original vote this is good to go, can someone click the approve button?
Looks good to me!
This is a design doc of the API for providing a service layer to handle multiple chains executing in parallel.
I've tried to write out the proposal spec as clearly as I could, but if there are unanswered questions lmk and I can add them!
rendered version