Add design doc for parallel services #40
Tagging folks who are interested in other threads: @wds15 @bbbales2 @mitzimorris @yizhang-yiz, and @bgoodri @riddell-stan for rstan and pystan. I tried keeping the scope tight to just the services layer, though it's hard to talk about the services layer without also talking about how it is used.
What about controlling threads per chain and threads total? I think we should allow controlling both, or otherwise do some benchmarks to decide on that. Sensible defaults should usually let the user specify just one of these controls and have something reasonable happen, so even for the default we would need a small benchmark to decide. I hope that makes sense in the context of this PR, but I would expect that we settle this question here. I will need to dive more into the doc. Thanks for writing it!
In my opinion the most important criterion for this API is maintaining compatible behavior with the current single-chain routes. In particular, if n_chain = 1 then the new route should behave exactly the same as the current route, and if n_chain > 1 then the behavior should be exactly the same as running the current route multiple times with appropriate chain indices.
Reproducibility with n_chain = 1 would just require looking for init files without any chain_id modification unless n_chain > 1, no?
Reproducibility between parallel and serial routes with multiple chains would depend on the behavior of `chain`. Does `chain` become an initial chain index in the parallel context with the other chains given incremental indices, chain + 1, chain + 2, etc? If so then we should probably rename it to something more precise such as `init_chain_id` and the init file lookup should use these chain ids and not indices starting from 1. An alternative would be to accept a vector of chain_id values, although that puts a nontrivial burden on the routes to validate the chain_ids and how the PRNG generator is strided.
In any case with the introduction of `n_chain` the old `chain` argument should probably be made more explicit, for example `chain_id` or even `init_chain_id` as above.
None of the callbacks are called often enough for the vtable hit to be significant, and templating _everything_ seems a bit aggressive, but that's a minor point.
I'm not persuaded that one needs to add a function to services for this. Doing parallel sampling "manually" doesn't seem too hard (e.g., using …).

Nits:
This is something that should be implemented once instead of allowing users to roll their own. When it comes to threading, these things need to be done carefully.
I ran some benchmarks with the redcard model. idt I like the threads-per-chain thing, because then how would a user (or us choosing a default) do something like 8 chains when there are a total of 32 cores available? Since the TBB seems to be Just Working™ I think we should just have a …

CmdStan parallel chains (redcard):
- 4 chains, 32 threads: real 18.22
- 4 chains, 24 threads: real 19.14
- 8 chains, 32 threads: real 33.83

GNU Parallel (redcard):
- 4 chains, 32 threads: real 17.95
- 4 chains, 24 threads: real 19.38
- 8 chains, 32 threads: real 34.26
+1 to @mitzimorris. There are a few gotchas here in setting up the parallel things, and I think having a standardized API (and docs the services can use to make sure everything is lined up) makes it worth it. I thought about just leaving this as something the users of the services could implement on their own (you can do this by just wrapping the service layer call in a parallel for), though I think there are enough little annoyances to have something standardized.
Yep! I have it hard-coded that if
Yep!
Yep!
I like having the init file lookup based on the starting id number, so that if a user is running multiple stan programs with multiple chains per program they can have one program start at 1 with 4 chains, another at 5 with 4 chains, etc.
I really thought we had a
So actually kind of nice for the machinations here! I was worried about ambiguity between having a
Let's stick with the easier version for now, I think at some point we may want this but the easy peasy version works for what we want right now.
Yeah so the templates here are not for performance but for flexibility for developers using the service layer. Since now the …
The results for threads total vs. threads per chain look promising. Is it a possibility to lay out the code such that the threads-per-chain thing could be added later on? Given how things look right now I agree with you that we do not need the threads-per-chain thing.... and in fact, a user who requires it could simply fire up different chains in different processes and control the threads for each of them.... which may lead us to conclude that threads per chain is not needed? The other thing which is really important in this context is the ability to do cross-chain warmup. That's hopefully not an afterthought, right? (Sorry for not having studied the text in detail yet.) It was raised that we change the meaning of some current command line options, as I understood... it would be good to avoid that as much as possible unless we say we bump major version numbers.
We could add it, it would just require another signature. Because for threads per chain we are going to want a
Yeah, a lot of this is to set up the tooling so that a lot of the API work is done for adding a cross-chain warmup scheme, because then it's just targeting a different multi-chain service layer which should have very similar input/output schemes.
Yeah I agree.
One more question: despite the templates being introduced, we can still pre-compile the Stan services ahead of time, right? Assuming a yes here, which I would guess, then this design is sound for me.
Yep! I can double check but I believe that's totally fine.
Word! Let me read this over and fix some things for what @betanalpha posted and then I'll call for a vote.
Reproducibility between parallel and serial routes with multiple chains would depend on the behavior of chain. Does chain become an initial chain index in the parallel context with the other chains given incremental indices, chain + 1, chain + 2, etc?
Yep!
I agree that this is the right behavior but it has to be thoroughly doc'd. In particular I think that we need some discussion of how to emulate the same behavior with multiple calls versus one call (including not just multiple one-chain calls with chain=1, chain=2, …, chain=n but also multiple multi-chain calls with properly strided chain ids, for example chain=1, chain=2 * num_chains, chain=…n * num_chains).
If so then we should probably rename it to something more precise such as init_chain_id and the init file lookup should use these chain ids and not indices starting from 1.
I like having the init file lookup based on the starting id number, so that if a user is running multiple stan programs with multiple chains per program they can have one program start at 1 with 4 chains, another at 5 with 4 chains, etc.
I think we mean the same thing!
If chain (or whatever we call it) is, say, 5, and num_chain is 3 then the function will look for the files
name.init.5.R
name.init.6.R
name.init.7.R
yeah?
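For concreteness, a minimal sketch of that lookup rule (the helper name and the `.R` suffix here are just illustrative, not part of the proposal):

```
#include <string>

// Hypothetical helper: given the user-supplied starting id and the 0-based
// index of the chain within this call, build the per-chain init file name,
// e.g. base = "name.init", init_chain_id = 5, chain_idx = 0..2
//   -> "name.init.5.R", "name.init.6.R", "name.init.7.R"
std::string init_file_for_chain(const std::string& base, unsigned int init_chain_id,
                                unsigned int chain_idx) {
  return base + "." + std::to_string(init_chain_id + chain_idx) + ".R";
}
```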
In any case with the introduction of n_chain the old chain argument should probably be made more explicit, for example chain_id or even init_chain_id as above.
I really thought we had a chain argument in cmdstan, but we don't? We actually only have an id argument
id=<int>
Unique process identifier
Valid values: id >= 0
Defaults to 0
So actually kind of nice for the machinations here! I was worried about ambiguity between having a chain argument and chains. I think I was confused because in the services layer we call id -> chain. Or if you are talking about the chain argument at the services layer then yeah I'm fine with calling the new API's chain argument init_chain_id (or maybe just process_id)?
Yeah, I’m just thinking about the naming for the services API. The interface naming is another matter altogether.
I personally prefer something like `init_chain_id` because it emphasizes that the individual chains being bundled together will be given related ids. `process_id` on the other hand seems to label the entire bundle and not any individual chain.
An alternative would be to accept a vector of chain_id values, although that puts a nontrivial burden on the routes to validate the chain_ids and how the PRNG generator is strided.
Let's stick with the easier version for now, I think at some point we may want this but the easy peasy version works for what we want right now.
I agree; I was just throwing out alternatives in case I was misunderstanding the intent of the new API.
None of the callbacks are called often enough for the vtable hit to be significant and templating everything seems a bit aggressive, but that's a minor point.
Yeah, so the templates here are not for performance but for flexibility for developers using the service layer. Since the std::vector<> holding those inits and readers can now be anything with a get_underlying() overload, it's easier to have them as templates. This lets rstan use Rcpp::Xptr as the inner value in the vector, and that sort of thing. We could just have the API be std::vector<var_context*>, but then we lose the ability to use smart pointers like std::shared_ptr<> etc.
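As a rough sketch of what that could look like (the names here, including the get_underlying overload and the stand-in var_context type, are placeholders rather than the actual proposed API):

```
#include <memory>
#include <vector>

struct var_context { virtual ~var_context() = default; };  // stand-in for stan::io::var_context
struct dummy_context : var_context {};

// One possible get_underlying overload; an interface could add its own,
// e.g. one that unwraps an Rcpp::Xptr.
inline var_context& get_underlying(const std::shared_ptr<var_context>& p) { return *p; }

template <typename InitContext>
void init_all_chains(std::vector<InitContext>& inits) {
  for (auto& init : inits) {
    var_context& ctx = get_underlying(init);  // resolved per element type
    (void)ctx;  // ... hand off to the existing single-chain initialization ...
  }
}

int main() {
  std::vector<std::shared_ptr<var_context>> inits{std::make_shared<dummy_context>()};
  init_all_chains(inits);
}
```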
You mean that you just want to avoid having to make the interfaces wrap things in container classes that derive from var_context or the like? I don’t have any huge objections here, just want to understand the motivation.
Thanks!
Yes, so I think the way I have it written should work to be reproducible. In the scheme I have now each PRNG is set by the seed + chain_id (the initial …). Though the output file name will be different in a multi-chain program vs. a multi-program scheme. Changing the behavior of the multi-program scheme would require breaking backwards compatibility, since we would tag each output with a …
Yes we are! Should have affirmed that, it's good and cool!
icic yeah if we are talking about the service layer I think that makes a lot of sense, I'll update this.
Yeah, so the main thing is just letting upstream services be flexible in what they pass to the API. So a user of the service layer could just pass an … Another alternative here is just to specify that the inner element of a vector must have a valid …
Alrighty, I just updated with fixes! I think I'd like to hold a vote starting from April 21st to April 28th (Wednesday to Wednesday) if that is alright with everyone.
designs/0020-parallel-chain-api.md
std::vector<InitWriter>& init_writer,
std::vector<SampleWriter>& sample_writer,
std::vector<DiagnosticWriter>& diagnostic_writer,
size_t n_chain)
I would welcome some consistency with the parameter name. Having `n_chain` in C++ alongside `num_warmup`, `num_samples`, and `num_thin` is odd. And then in CmdStan it's `chains` (plural, no `n_` or `num_` prefix). Seems like `num_chains` would be consistent with the current C++ API.
Reasonable!
Alrighty, let's do the vote starting today. Tagging everyone who has commented so far: @riddell-stan @wds15 @betanalpha @mitzimorris @bbbales2. Let's have the vote as either a comment or as a review.
I thought design docs just needed one endorsement by someone other than the author(s).
So in the past we had it that if there were no disagreements among anyone who has commented so far (full quorum) we would just merge the design doc. Folks seem to want to move to a voting scheme lately, so I'm fine with also doing a vote here. Though if no one has an objection, or 50% of the people tagged above vote yes, then I think we are good to just approve and merge it.
I'm not aware of any proposal to move to a voting scheme. The adjoint ODE doc vote was due to disagreement.
Oh, in that case if folks are fine with it let's approve and merge this. I can update the parallel branch with the above changes today.
lgtm.
For now I was going to just do sampling, though I think this scheme makes a lot of sense for VB (and I think @hyunjimoon had some stuff for VB where multi-chain would be very useful). Does it make sense to add this for optimization? If so then yeah, I'm also fine with doing it for optimization as well.
Motivation:
We’ve had this discussion before, but I think it’s important enough to mention again. In my opinion the main scope of this design doc is setting up the API for multiple chains to be managed and run by the services, instead of requiring the interfaces to implement this all independently. This not only reduces the work required of an interface implementation but also unifies the UX across interfaces.
The ability to run the chains in parallel with the TBB and to consider sharing information across chains is an additional benefit that follows from the API. In my opinion the only reason to even mention these right now is the thread safety requirement of some of the inputs.
Guide-level explanation:
I think that the `num_chains` argument should be up higher in the signature. Honestly it should probably go right after `model`.
To avoid confusion I think that we should probably also change `num_warmup` to `num_warmup_per_chain` and `num_samples` to `num_samples_per_chain`.
`seed + init_chain_id + chain_num` is not a good initialization. For the L’ecuyer PRNG each total chain_id needs to be scaled by a long enough stride, `chain_seed = seed + stride_size * (init_chain_id + chain_num)` or more readably
```
chain_id = init_chain_id + chain_idx
chain_seed = base_seed + stride_size * chain_id
```
Upon further reflection I’m not a fan of templating out the var_context. The problem is that this puts the responsibility of supporting different `get_underlying` implementations on the services library instead of the user. Sure wrapping a smart pointer in var_context derived class is a bit more work for an interface but then any interface can implement any var_context strategy they want without any work/approval from the services layer. In my opinion the object oriented approach provides a much more flexible compartmentalization here.
Buffering output into chain-specific stringstreams is not ideal. Yes, it avoids threading issues, but it also imposes a significant memory burden for each chain (one of the reasons why CmdStan is so much more scalable than RStan and PyStan is its streaming output) and prevents realtime updates. std::cout-based loggers shouldn’t need buffering at all, which leaves thread-safe interrupt implementations as the only nontrivial problem, no?
Recommended Upstream Initialization:
It would help to write out a specific example of the init file logic here to avoid any confusion.
Open Questions:
“Model” here is being used in a non-standard way statistically. When a `log_prob` object gets initialized it partially evaluates the target density defined by the Stan program, which immediately defines an unnormalized posterior density function. That’s the object that then gets consumed by the algorithms for each “fit”.
The behavior of the `transformed data` block isn’t an open question — for this design doc the only issue is reproducing the current behavior. That said clarifying the current behavior (correct me if I’m wrong but the `transformed data` PRNG gets seeded by `seed` alone and the `generated quantity` PRNG gets seeded by `seed + stride * chain_id`) would be a welcome addition to the doc.
This is correct.
I have two issues with that. The first one is how this is written and the second one is what is written. Let's start with the first one (you are probably thinking about the right thing, but it is not written correctly). Assuming we want to keep behavior the same, we need to seed all generators with the same seed and then jump ahead some number of steps. In the current code this is (with seed being the same as base_seed in your suggestion):

static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
boost::ecuyer1988 rng(seed);
rng.discard(DISCARD_STRIDE * chain);

The second issue is by how many strides we should jump ahead (what is the …
Thanks! I think I see what you are saying, but just to check, are you saying something like

static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
boost::ecuyer1988 rng(seed);
// Just changing this part
rng.discard(DISCARD_STRIDE * init_chain_id + id_of_thread_in_process);

Though the RNG is not currently constructed within the thread, is there anything wrong with just using the chain_id?
Exactly.
Well, how does it work now if multiple threads are computing one chain? Do they share a single PRNG? I think that would make the threaded part nondeterministic if it uses the PRNG.
Though for me I think we can see some good speedups, not in the fact that we are running multiple things at once, but in the reduced memory usage we get from sharing data across chains.
Totally agree about the eventual benefit, but I think the priority of the design doc here is the design of the API route itself and not its implementation (again, modulo details like thread safety that have to be enforced in the route).
So idt we can get rid of the actual templates for the var_context inputs while supporting multiple types of smart pointers. For instance if the signature had std::vector<var_context*> then we could only bring in raw pointers to var_context derived types. What we can do is get rid of get_underlying() and say that we accept any input vector whose elements have a valid operator* which returns a reference to a class derived from var_context. I think that's flexible enough for any doo-dads that downstream devs may need while being concrete in the service layer.
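A small sketch of that alternative requirement, assuming only that `operator*` on each element yields a reference to a `var_context`-derived class (the names below are illustrative stand-ins, not the actual signatures):

```
#include <memory>
#include <vector>

struct var_context { virtual ~var_context() = default; };  // stand-in for stan::io::var_context
struct dummy_context : var_context {};

template <typename ContextPtr>
void consume_inits(const std::vector<ContextPtr>& inits) {
  for (const auto& init : inits) {
    var_context& ctx = *init;  // only requirement: operator* yields a var_context&
    (void)ctx;                 // ... feed into the per-chain setup ...
  }
}

int main() {
  dummy_context on_stack;
  std::vector<var_context*> raw{&on_stack};  // raw pointers satisfy the requirement
  std::vector<std::shared_ptr<var_context>> smart{std::make_shared<dummy_context>()};
  consume_inits(raw);    // ... and so do smart pointers
  consume_inits(smart);
}
```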
Is the issue here that RCPP wants to wrap all references in smart pointers? So the problem isn’t that RStan couldn’t wrap local memory in a `var_context` derivation but rather that it couldn’t expose it outside of a smart pointer reference?
So the stringstreams for this are very short-lived and only exist because we use multiple << to construct the output of each line. How std::cout etc. work is that they have a buffer that we write to that is then printed out to wherever. When we do multiple << like the below
std::cout << "Yo" << "oo" << "oo" << "\n";
// Prints out Yooooo \n on a single thread
in a multithreaded context std::cout is thread-safe, but it can interleave output at any of those << calls, so you can end up with valid but silly-looking output like
YoooYooo
oooo
YoooYooo
oo
(and by valid here I just mean it won't cause the computer to crash from UB)
When we want an individual line to be printed all in one go, we put that whole line in a string and then tell cout to print that whole string at once:
std::stringstream my_yo;
my_yo << "Yo" << "oo" << "oo" << "\n";
std::cout << my_yo.str();
// multi-thread prints out
/* Yooooo
* Yooooo
* Yooooo
*/
That makes sense, but I don’t think it was clear from the doc. I think it would help to emphasize buffering local, contiguous outputs like numerical output for a single iteration or messages but not the entire chain output.
This was supposed to be done with voting tomorrow. What I'd like to do is, after tomorrow, as long as a bunch of people don't suddenly vote no, we'll make this a "Tentative Yes" pending sorting out the stride issue and changing the things above, which I think are good and reasonable.
As was pointed out earlier in the thread I don’t think there’s a need for voting here. The design is still evolving, for the better, once the design equilibrates we can consider voting if there’s not consensus.
I have two issues with that. The first one is how this is written and the second one is what is written. Let's start with the first one (you are probably thinking about the right thing, but it is not written correctly). Assuming we want to keep behavior the same, we need to seed all generators with the same seed and then jump ahead some number of steps. In current code this is (with seed being the same as base_seed in your suggestion):
static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
boost::ecuyer1988 rng(seed);
rng.discard(DISCARD_STRIDE * chain);
Yes, apologies, I was being way too sloppy and thinking of seed as the state initialized by the variable `seed` and `+` as the seek operator on the internal PRNG state.
I agree that this is the correct implementation of seeking for Boost’s L’ecuyer generator.
AFAIK we can run multiple processes at the same time, multiple chains in the same process, and multiple threads for each chain. Each thread needs its own PRNG instance. So this needs to be equal to the global thread id, which is init_chain_id + id_of_thread_in_process.
Though the RNG is not currently constructed within the thread, is there anything wrong with just using the chain_id?
Well, how does it work now if multiple threads are computing one chain? Do they share a single PRNG? I think that would make the threaded part nondeterministic if it uses the PRNG.
I’m not sure exactly how precisely constrained this is in the compiler but my understanding is that the functors that expose threading don’t accept functions that utilize any of the `_rng` functions, in which case there is no need to parallelize the PRNG within each chain.
static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
boost::ecuyer1988 rng(seed);
// Just changing this part
rng.discard(DISCARD_STRIDE * init_chain_id + id_of_thread_in_process);
Though the RNG is not currently constructed within the thread, is there anything wrong with just using the chain_id? i.e. if I have 4 chains starting at id of 0 they would be strided by
DISCARD_STRIDE * 0
DISCARD_STRIDE * 1
DISCARD_STRIDE * 2
DISCARD_STRIDE * 3
We still need the init_chain_id in there as well so that if someone runs multiple chains split into multiple calls to the service layer with the same seed but a different chain_id then the PRNGs will still be roughly independent.
```
// Initialize L’ecuyer generator
boost::ecuyer1988 rng(seed);
// Seek generator to disjoint region for each chain
static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
rng.discard(DISCARD_STRIDE * (init_chain_id + chain_num));
```
All that’s needed is for `DISCARD_STRIDE` to be long enough that each chain has enough space to consume as many PRNG states as it needs while not being so long that `DISCARD_STRIDE * num_chains` gets close to the period of the generator.
Then ignoring threads working on the same chain is fine, although the restriction seems somewhat arbitrary. It makes more sense for …

We could also evenly space the generator period between streams, doing DISCARD_STRIDE = generator_period / total_n_chains. Although the way it is now is also fine as long as each chain needs less than 1<<50 randomly generated numbers and we run less than generator_period / (1<<50) streams, which is probably a reasonable assumption.
With the TBB the concept of a thread is abstracted away. You could have thread-local streams for each chain, but that is not super performant. Threaded out-of-order access to an RNG is not a nice thing to support; if that can be avoided, that would be good.
For instance, this new low-rank ADVI is being tested with a robust variational inference metric which uses multiple chains to compute Rhat. I think it would be useful.
That's fair, though if you check out the most recent update I think I touch more on the motivation for this.
I think it's that I want to wrap pointers into smart pointers. We could make this a std::vector<var_context*>, but then that means all the upstream interfaces need to manage the memory for their var_context pointers. Just to make sure I understand, is the alternative interface you are thinking about just having a std::vector<var_context*>, or is there something else you had in mind?

wrt the vector of raw pointers, it's just more manual memory management that I think we can let users avoid/hide by making these templates that take in a class with an operator* that returns a reference to a class derived from var_context.
That's reasonable. It feels like we are talking more in depth about implementation details; the stride etc. for the PRNG are not user-facing. At this point I'd prefer approving this and moving the PRNG conversation over to the implementation (that should also make the conversation easier, as we can look at how I currently have things coded up).
ADVI is stochastic (initialization point and stochastic gradients for the ELBO), but it's not an MCMC method and does not have chains. The draws it produces are just independent draws from the normal approximation. You can treat them like a Markov chain, but we know its stationary distribution is normal(mu, Sigma) and we know the draws are independent. This means there's no point in splitting, as in split R-hat. If we stuck to the unconstrained scale, all the sample quantities required for R-hat would be available analytically. We lose that if we convert back to the constrained scale.
I'm late to this discussion, but why not just have
They may not be user-controllable, but they can impact users. But it's probably not an issue if the skip-ahead strides are big enough. I think we used values for CmdStan MCMC that were larger than the number of random numbers we could sample in any reasonable amount of time.
I think we can talk about ADVI another time, as this design doc is just for the sampler's service API. It's definitely something we should think about soon; I think a lot of the API layout here will be applicable to ADVI as well.
https://godbolt.org/z/nK4dhMcMo
Yeah I agree, though I think that is something we would put in the Stan docs about parallel processes, which we would add once we sort out the striding thing in the impl.
Sorry if the point was not clear. The quote below from this paper suggests running stochastic optimization with multiple chains. It is true split-Rhat is used, but the algorithm is designed based on J chains. Viewing the stochastic optimization as MCMC and therefore borrowing MCMC diagnostics is the novelty of the paper, which motivated us (@Dashadower @bbbales2 and I) to include khat and Rhat in the VI diagnostics; related to this issue. "We use the split-Rhat version, where all chains are split into two before carrying out the …
We could also evenly space the generator period between streams, doing DISCARD_STRIDE = generator_period / total_n_chains. Although the way it is now is also fine as long as each chain needs less than 1<<50 randomly generated numbers and we run less than generator_period / (1<<50) streams, which is probably a reasonable assumption.
I think technically we’d want to do something like (generator_period / 2) / num_chains to avoid any issues from equidistribution near the end of the period, but that’s a small issue and seeking ahead 1 << 50 states per chain should be good enough. Using a fixed stride much less than period / num_chains also allows for more chains to be run in the future; running first with seed=S, init_chain_id=1, num_chains=4 and then afterwards with seed=S, init_chain_id=5, num_chains=4 would ensure pseudo-independent sequences for each of the 8 chains.
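For a rough sense of scale (assuming boost's ecuyer1988 period of roughly 2^61, about 2.3e18): a fixed 2^50 stride leaves room for on the order of 2^11 ≈ 2000 pseudo-independent chains, each with up to 2^50 draws. A back-of-envelope check:

```
#include <cstdint>
#include <iostream>

int main() {
  // Assumed: boost::ecuyer1988 has a period of roughly 2^61 (about 2.3e18).
  const long double period = 2.3e18L;
  const long double stride = static_cast<long double>(std::uintmax_t{1} << 50);
  // Roughly 2000 pseudo-independent streams of up to 2^50 draws each.
  std::cout << "approx. number of streams: " << period / stride << "\n";
}
```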
I think it's that I want to wrap pointers into smart pointers. We could make this a std::vector<var_context*> but then that means all the upstream interfaces need to manage the memory for their var context pointers. Just to make sure I understand, is the alternative interface you are thinking about just having a std::vector<var_context*> or is there something else you had in mind?
Yeah `std::vector<var_context*>` matching the current implementations.
wrt the vector of raw pointers, it's just more manual memory management that I think we can let users just avoid/hide by making these templates that take in a class with an operator* that returns a reference to a class derived from var_context. Then cmdstan can use shared_ptr<var_context>, rstan can use Rcpp::Xptr<var_context>, python can use the smart pointer it wants to use if it has one, etc.
Sorry for being dense here. I didn’t think that the memory management was all that burdensome and, if anything, it gave the interfaces the flexibility to, for example, let the local REPL environment manage the memory. I’m not caught up on the latest RCPP but it sounds like Xptr achieves that for R using a smart pointer pattern? @riddell-stan Is there a similarly straightforward pattern that httpstan can exploit?
In any case, is the choice of smart pointers over regular pointers orthogonal to the parallelization issues? If so, wouldn’t it be better to add new smart pointer routes for all the current routes in one go in a separate PR?
It feels like we are talking more in depth on implementation details. Like the stride etc. for the PRNG are not user facing.
I just want to make sure that the signature and automated chain_id will be flexible enough for all the use cases we might encounter. Once we’re agreed on that (which I think we largely are) then I agree the further implementation details go beyond the scope of the design doc.
No worries at all! It's not like it's "oof this is going to take me days", but it's just another thing I have to think about that using a smart pointer lets me not think about. Like for cmdstan it would be nice to not have to do a bunch of new and delete's over the elements of the vector where I could just have a std::shared_ptr<>.
idk if I'd call it orthogonal. Sort of? Like right now the templated version of the signature in this proposal allows someone to pass in a raw pointer, since the only requirement is that a valid operator* exists for the object that returns a reference to a class derived from var_context. That allows for both raw and smart pointers.

The current routes just take in a reference to an object derived from var_context, and idt there's any need to change those.
Is the Q here whether we can ensure pseudo-independent sequences for each of the chains given init_chain_id, seed, and num_chains?
Wanted to bump this, as I think if we can get this merged this week then we will have enough time to add it to cmdstan before the next release. @betanalpha does the above answer your Qs about the var_context? imo I'd like to move the rest of the PRNG things to the implementation discussion.
No worries at all! It's not like it's "oof this is going to take me days" but it's just another thing I have to think about where using a smart pointer lets me not think about. Like for cmdstan it would be nice to not have to do a bunch of new and delete's over the elements of the vector where I could just have a std::shared_ptr<>. For R using raw pointers instead of Rcpp::Xptr just involves calling PROTECT() / UNPROTECT() / R_SetExternalPtrProtected() etc. that you don't have to think about if you use Rcpp::Xptr
In any case, is the choice of smart pointers over regular pointers orthogonal to the parallelization issues? If so, wouldn’t it be better to add new smart pointer routes for all the current routes in one go in a separate PR?
idk if I'd call it orthogonal. Sort of? Like right now the templated version of the signature in this proposal allows someone to pass in a raw pointer since the only requirement is that a valid operator* exists for the object that returns a reference to a class derived from var_context. That allows for both raw and smart pointers.
The current routes just take in a reference to an object derived from var_context and idt there's any need to change those. Those ones just call operator* on the var context derived object when calling the service layer API
But, and correct me if I’m wrong, all the smart pointers would do is call delete on the collection of var_context instances once the std::vector goes out of scope in the interface code, right? That doesn’t seem like a tremendous amount of memory management to me, especially when care is already going to be required in constructing the separate var_context for each initialization anyways. In particular, why couldn’t an interface just construct the var_context instances on the stack, as CmdStan currently does, and then throw the local stack addresses into a vector? They should be valid through the lifetime of the function call, no?
What’s bothering me is the asymmetry this would create in the multi-chain and single-chain routes, and the different work the interfaces would need to do to use those routes. To me a BLAH and std::vector<BLAH*> pattern is much more natural, although a consistent shared_ptr<BLAH> and std::vector<shared_ptr<BLAH>> pattern wouldn’t be much worse.
Is the Q here whether we can ensure pseudo-independent sequences for each of the chains given init_chain_id, seed, and num_chains? imo it kind of feels like our answer is yes at this point and now we are mostly focusing on how exactly we are going to setup the stride. Or am I missing some part of the convo?
Yes, I think we’re largely in agreement but at the same time I don’t think this can be fully pushed to implementation. Because this is behavior that we want the API to guarantee I think it has to be fully fleshed out in the design doc (unless a `stride` argument is added to the route or something). Like the filename lookup logic I think we want to have the precise PRNG logic specified.
I think we’re all on the same page with
```
// Initialize L’ecuyer generator
boost::ecuyer1988 rng(seed);
// Seek generator to disjoint region for each chain
static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
rng.discard(DISCARD_STRIDE * (init_chain_id + chain_num - 1));
```
Personally I also think it would make sense to have the parallel filenames also match this logic, using `init_chain_id + chain_num - 1` as the file index instead of just `chain_num`.
It looks as if things are settled enough from my view... but I don't want to jump ahead here. Given that @SteveBronder obviously is willing to invest a lot of time on it in the next days, I think it would be good to make an attempt to get this over the line and into 2.27. So how about @betanalpha and @SteveBronder have a quick call to sort out the remaining bits? From my end I can review stuff from the PRs as it relates to threading and all that, or other things I have worked on in the past. There is still some time, so it's not entirely out of scope for 2.27 if people put some time into it now.
It's a lot less trivial than it first appears. Like here's an example. Say with the init var_context we want to follow the guidelines above and the user only specifies …
It's not clear to me what's left to be discussed. It feels like at this point we are talking about the stride behavior, but imo that's an impl detail and I don't like the idea of specifying the exact implementation inside of the design doc. Like say we build it out and we are like, "Oh X is much better we should be doing that with the PRNG", then any convo we had here would end up being tossed and we also have to go back and make another PR to change that. I think what we want in the design doc is to specify what sort of user behavior allows for reproducibility (which I think is what we have right now). Any technical details would be documented in the doxygen and explaining to users how to run an analysis reproducibly would go in the Stan documentation.
I spoke to @betanalpha and he wants to keep it here to make sure all communication about the design process is public (which I think is fair and reasonable).
Much appreciated!
That's the other thing: the impl is pretty much done based on the current design doc, and I think talking about the PRNG things on the impl side is going to make the conversation a lot easier since there we can actually look at and comment on code. Talking about technical things like the PRNG here in the design doc is very difficult for me, and to be honest I'm pretty lost on what needs to be done wrt the PRNG in order to move forward.
@betanalpha if you're cool with that I think it would be easier. There's a couple places where I'm confused rn
It would be nice to avoid a pointer here but since we use abstract base classes idt that works, unless I'm doing something wrong in the godbolt example below https://godbolt.org/z/vvMhhKGre
Yeah that's very fair, I think I was saying that because I'd like to have it in for this next release. Though to be clear I'm not saying, "How the impl is doing it should be what should be done" but that the impl that I have now does satisfy almost everything in this design doc and I'm mostly waiting to make any final changes in the impl till this is approved.
My thought process here is that this is an implementation detail with respect to the users of the service layer. imo this design doc is centered around the service layer APIs and so it should have explanations at that level. That reflects what I have in the design doc right now describing how users of the service layer will allow for reproducibility. Also I think my other confusion here is that I don't understand what the specific changes necessary to the current design doc are with respect to the PRNG. I'm fine with copy-pasting the below into the design doc, but it feels odd to me. Like what if we ever decide to change anything here? For instance what if we decide to change this to use equal spaced strides? Or swap out to a PRNG with a longer cycle like taus88?

inline boost::ecuyer1988 create_rng(unsigned int seed, unsigned int init_chain_id, unsigned int chain_num) {
  // Initialize L’ecuyer generator
  boost::ecuyer1988 rng(seed);
  // Seek generator to disjoint region for each chain
  static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
  rng.discard(DISCARD_STRIDE * (init_chain_id + chain_num - 1));
  return rng;
}

What if I just included a link to the branches …
@betanalpha if you're cool with that I think it would be easier. There's a couple places where I'm confused rn
Sure, I can do a quick call tomorrow before 4.
I'm jumping into this fairly late, but do we really need pointers here? Why not just use vector<var_context&> to make sure the result is a reference? (Yes, I know they're the same underlying, but we've avoided pointers as much as possible, and using references simplifies the use of elements, replacing -> with the standard `.`.)
It would be nice to avoid a pointer here but since we use abstract base classes idt that works, unless I'm doing something wrong in the godbolt example below
https://godbolt.org/z/vvMhhKGre
Let’s back up a bit.
Currently the interfaces allocate a var_context instance for the initializations on the stack and pass them to the API by reference.
The current design in the doc changes this so that the var_context instances have to be allocated on the heap, in which case the memory has to be allocated and managed somewhere. I appreciate how smart pointers can be useful if some or all of the heap instances are replicated, but I would just prefer to avoid heap allocation in the first place if at all possible. As Bob said, we’ve tried to avoid that where possible in the code base so far.
I think we should ignore statements like "the impl is pretty much done based on the current design doc" when considering designs. I don't want sunk implementation cost to affect design the way it has in the past.
Yeah that's very fair, I think I was saying that because I'd like to have it in for this next release. Though to be clear I'm not saying, "How the impl is doing it should be what should be done" but that the impl that I have now does satisfy almost everything in this design doc and I'm mostly waiting to make any final changes in the impl till this is approved.
This has come up a few times already and respectfully I think it’s the wrong way to approach design docs. An implementation can be useful for prototyping out a design but I am hesitant when that preliminary implementation starts to force design decisions that may or may not be appropriate.
Stride behavior is not an implementation detail for whatever function is taking a PRNG reference. It affects how that reference is modified by the function call
These proposed routes don’t take references to existing PRNG instances but instead create their own PRNG instances internally (specifically, one for each chain), so we don’t have to worry about messing up some external PRNG state.
My thought process here is that this is an implementation detail with respect to the users of the service layer. imo this design doc is centered around the service layer APIs and so it should have explanations at that level. That reflects what I have in the design doc right now describing how users of the service layer will allow for reproducibility.
The subtlety here is that we have to consider reproducibility with multiple calls versus a single call. In particular, is running
```
seed=848383, init_chain_id=1, num_chains=N
```
always equivalent to
```
seed=848383, init_chain_id=1, num_chains=n1
seed=848383, init_chain_id=1 + n1, num_chains=n2
seed=848383, init_chain_id=1 + n1 + n2, num_chains=n3
seed=848383, init_chain_id=1 + n1 + n2 + n3, num_chains=n4
```
so long as n1 + n2 + n3 + n4 = N?
If we set the stride by dividing the PRNG period, or some subset thereof, by the number of chains in each call, for example
```
// Seek generator to disjoint region for each chain
static uintmax_t DISCARD_STRIDE = (PERIOD / 2) / num_chains;
rng.discard(DISCARD_STRIDE * (init_chain_id + chain_num - 1));
```
then we would not have this consistency.
If we use a constant stride that’s long enough for any single chain, as we currently do and I think we’ve largely agreed to do here, then we can guarantee this behavior.
In this case in my opinion what we need to document is not the particular implementation but rather the consistency guarantee whenever the total number of chains is less than PERIOD / (static_cast<uintmax_t>(1) << 50).
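To make that guarantee concrete, here is a tiny check (a sketch only, assuming the constant-stride rule above; `chain_offset` is an illustrative helper, not part of the proposal) that the discard offset for a given absolute chain id is the same whether the chains come from one call or from split calls:

```
#include <cassert>
#include <cstdint>

// Offset into the PRNG stream under the constant-stride rule:
// DISCARD_STRIDE * (init_chain_id + chain_num - 1), with chain_num starting at 1.
std::uintmax_t chain_offset(unsigned int init_chain_id, unsigned int chain_num) {
  static const std::uintmax_t DISCARD_STRIDE = static_cast<std::uintmax_t>(1) << 50;
  return DISCARD_STRIDE * (init_chain_id + chain_num - 1);
}

int main() {
  // Overall chain 6: chain_num = 6 of one 8-chain call (init_chain_id = 1),
  // or chain_num = 2 of the second of two 4-chain calls (init_chain_id = 5).
  assert(chain_offset(1, 6) == chain_offset(5, 2));
}
```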
For instance what if we decide to change this to use equal spaced strides?
Then the API behavior fundamentally changes, which is why I think we need to doc it carefully.
Or swap out to a PRNG with a longer cycle like taus88?
The period does affect how many chains can be run in parallel (if we’re going with the constant stride strategy). That number should be pretty huge so it won’t affect most use cases but technically the PRNG details do leak out of the API a bit here. In any case changing the PRNG can have all kinds of unexpected side effects so I really don’t think that choice should be considered an implementation detail anyways.
We can't. C++ forbids pointers to references, which essentially prevents us from putting references into any container (except struct and tuple, I think). It is possible to wrap references in a struct before putting them in a vector, but in that case it gets more complicated than simply using a pointer.
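For reference, the standard "wrap the reference in a struct" workaround is std::reference_wrapper; a minimal sketch (with var_context as a stand-in for the real stan::io::var_context):

```
#include <functional>
#include <vector>

struct var_context { virtual ~var_context() = default; };  // stand-in
struct dummy_context : var_context {};

int main() {
  dummy_context c1, c2;
  // std::vector<var_context&> is ill-formed, but reference_wrapper works:
  std::vector<std::reference_wrapper<var_context>> ctxs;
  ctxs.push_back(c1);  // implicitly wraps the reference
  ctxs.push_back(c2);
  var_context& first = ctxs[0].get();  // element access goes through .get()
  (void)first;
}
```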
Using … And I think I understand what you are looking for with the PRNG, and I just updated the docs to have the constant stride in there.
Alrighty, @betanalpha and I spoke over the phone and I think with the changes to the template context names and the updated PRNG docs we should be good to go!
I think with the above + the original vote this is good to go, can someone click the approve button?
Looks good to me!
This is a design doc of the API for providing a service layer to handle multiple chains executing in parallel.
I've tried to write out the proposal spec as clearly as I could, but if there are unanswered questions lmk and I can add them!
rendered version