
Add design doc for parallel services #40

Merged (7 commits) on May 12, 2021

Conversation

SteveBronder
Collaborator

@SteveBronder SteveBronder commented Apr 6, 2021

This is a design doc of the API for providing a service layer to handle multiple chains executing in parallel.

I've tried to write out the proposal spec as clearly as I could, but if there are unanswered questions let me know and I can add them!

rendered version

@SteveBronder
Collaborator Author

Tagging folks who are interested in other threads @wds15 @bbbales2 @mitzimorris @yizhang-yiz and @bgoodri @riddell-stan for rstan and pystan.

I tried keeping the scope tight to just the services layer, though it's hard to talk about the services layer without also talking about how it's used.

@wds15
Contributor

wds15 commented Apr 11, 2021

What about controlling threads per chain and threads total? I think we should allow controlling both, or otherwise do some benchmarks to decide. Sensible defaults should let the user specify just one of these controls and have something reasonable happen, so even for that default we would need a small benchmark to decide. I hope that makes sense in the context of this PR, but I would expect us to settle this question here. I'll need to dive more into the doc. Thanks for writing it!

@betanalpha

betanalpha commented Apr 12, 2021 via email

@riddell-stan
Collaborator

I'm not persuaded that one needs to add a function to services for this. Doing parallel sampling "manually" doesn't seem too hard (e.g., using std::thread or tbb::parallel_for() to launch each chain separately). Anyone who can call hmc_nuts_dense_e_adapt can call it n times in n different threads.

Nits:

  • num_chains (to match num_samples, etc)

@mitzimorris
Member

I'm not persuaded that one needs to add a function to services for this.

this is something that should be implemented once instead of allowing users to roll their own. when it comes to threading, these things need to be done carefully.

@SteveBronder
Collaborator Author

SteveBronder commented Apr 13, 2021

What about controlling threads per chain and threads total? I think we should allow controlling both, or otherwise do some benchmarks to decide. Sensible defaults should let the user specify just one of these controls and have something reasonable happen, so even for that default we would need a small benchmark to decide. I hope that makes sense in the context of this PR, but I would expect us to settle this question here. I'll need to dive more into the doc. Thanks for writing it!

I ran some benchmarks with the redcards reduce_sum() model, and my opinion now is that we should just let the user set the total number of threads, not threads per chain. The results below are with the version of parallel cmdstan that only lets you set the number of threads. It looks like the TBB is doing the right thing, and the timings match up between the manual parallel and parallel cmdstan versions.

I don't think I like the threads-per-chain option, because then how would a user (or us choosing a default) handle, say, 8 chains when there are a total of 32 cores available? Since the TBB seems to be Just Working™, I think we should just have a threads argument.


Cmdstan Parallel Chains Redcard

4 chains, 32 threads

time -p examples/redcard/redcard sample num_samples=1000 num_warmup=1000 data file=examples/redcard/redcard.json parallel chains=4 threads=32

real 18.22
user 550.52
sys 5.14

4 chains, 24 threads

time -p examples/redcard/redcard sample num_samples=1000 num_warmup=1000 data file=examples/redcard/redcard.json parallel chains=4 threads=24

real 19.14
user 457.70
sys 0.36

8 chains, 32 threads

time -p examples/redcard/redcard sample num_samples=1000 num_warmup=1000 data file=examples/redcard/redcard.json parallel chains=8 threads=32

real 33.83
user 1071.50
sys 1.28

GNU Parallel

4 chains, 32 threads

time -p seq -w 1 4 | parallel "STAN_NUM_THREADS=8 examples/redcard/redcard sample num_samples=1000 num_warmup=1000 data file=examples/redcard/redcard.json output file=outputs{}.csv"

real 17.95
user 568.26
sys 0.74

4 chains, 24 threads

time -p seq -w 1 4 | parallel "STAN_NUM_THREADS=6 examples/redcard/redcard sample num_samples=1000 num_warmup=1000 data file=examples/redcard/redcard.json output file=outputs{}.csv"

real 19.38
user 444.35
sys 3.39

8 chains, 32 threads

time -p seq -w 1 8 | parallel "STAN_NUM_THREADS=4 examples/redcard/redcard sample num_samples=1000 num_warmup=1000 data file=examples/redcard/redcard.json output file=outputs{}.csv"

real 34.26
user 1026.32
sys 3.48


@SteveBronder
Collaborator Author

I'm not persuaded that one needs to add a function to services for this. Doing parallel sampling "manually" doesn't seem too hard (e.g., using std::thread or tbb::parallel_for() to launch each chain separately). Anyone who can call hmc_nuts_dense_e_adapt can call it n times in n different threads.

I'm not persuaded that one needs to add a function to services for this.

this is something that should be implemented once instead of allowing users to roll their own. when it comes to threading, these things need to be done carefully.

+1 to @mitzimorris. There are a few gotchas here in setting up the parallel pieces, and I think having a standardized API (and docs the services can use to make sure everything is lined up) makes it worth it. I thought about leaving this as something the users of the services could implement on their own (you can do this by just wrapping the service-layer call in a parallel for), but I think there are enough little annoyances to justify something standardized.

@SteveBronder
Collaborator Author

In my opinion the most important criterion for this API is maintaining compatible behavior with the current single-chain routes. In particular, if n_chain = 1 then the new route should behave exactly the same as the current route, and if n_chain > 1 then the behavior should be exactly the same as running the current route multiple times with appropriate chain indices.

Yep! I have it hard-coded that if n_chain=1 is used when calling the service layer we default to what happened previously.

Reproducibility with n_chain = 1 would just require looking for init files without any chain_id modification unless n_chain > 1, no?

Yep!

Reproducibility between parallel and serial routes with multiple chains would depend on the behavior of chain. Does chain become an initial chain index in the parallel context with the other chains given incremental indices, chain + 1, chain + 2, etc?

Yep!

If so then we should probably rename it to something more precise such as init_chain_id and the init file lookup should use these chain ids and not indices starting from 1.

I like having the init file lookup based on the starting id number, so that if a user is running multiple stan programs with multiple chains per program they can have one program start at 1 with 4 chains, another at 5 with 4 chains, etc.

In any case with the introduction of n_chain the old chain argument should probably be made more explicit, for example chain_id or even init_chain_id as above.

I really thought we had a chain argument in cmdstan, but we don't? We actually only have an id argument

  id=<int>
    Unique process identifier
    Valid values: id >= 0
    Defaults to 0

So that's actually kind of nice for the machinations here! I was worried about ambiguity between having a chain argument and chains. I think I was confused because in the services layer we call id -> chain. Or, if you are talking about the chain argument at the services layer, then yeah, I'm fine with calling the new API's chain argument init_chain_id (or maybe just process_id)?

An alternative would be to accept a vector of chain_id values, although that puts a nontrivial burden on the routes to validate the chain_ids and how the PRNG generator is strided.

Let's stick with the easier version for now; I think at some point we may want this, but the simple version works for what we want right now.

None of the callbacks are called often enough for the vtable hit to be significant, and templating everything seems a bit aggressive, but that's a minor point.

Yeah, so the templates here are not for performance but for flexibility for developers using the service layer. Since the std::vector<> holding those inits and readers can now be anything with a get_underlying() overload, it's easier to have them as templates. This lets rstan use Rcpp::Xptr as the inner value in the vector and that sort of thing. We could just have the API be std::vector<var_context*>, but then we lose the ability to use smart pointers like std::shared_ptr<> etc.

@wds15
Contributor

wds15 commented Apr 13, 2021

The results for threads total vs. threads per chain look promising. Is it a possibility to lay out the code such that the threads-per-chain control could be added later on? Given how things look right now, I agree with you that we do not need threads per chain... and in fact, a user who requires threads per chain could simply fire up different chains in different processes and control the threads for each of them... which may lead us to conclude that threads per chain is not needed?

The other thing which is really important in this context is the ability to do cross-chain warmup stuff. That's hopefully not an after-thought, right? (Sorry for not having studied the text in detail yet).

It was raised that we change the meaning of some current command line options as I understood... it would be good to avoid that as much as possible unless we say we bump major version numbers.

@SteveBronder
Collaborator Author

SteveBronder commented Apr 13, 2021

Is it a possibility to lay out the code such that the threads per chain thing could be added later on?

We could add it; it would just require another signature, because for threads per chain we are going to want a task_group for the parallel_for() telling it how many threads each chain is allowed to use at that level. Nothing here blocks that from happening, but I don't think we need it in this particular design.

The other thing which is really important in this context is the ability to do cross-chain warmup stuff. That's hopefully not an after-thought, right? (Sorry for not having studied the text in detail yet).

Yeah, a lot of this is to set up the tooling so that the API work is done for adding a cross-chain warmup scheme, because then it's just targeting a different multi-chain service layer, which should have very similar input/output schemes.

It was raised that we change the meaning of some current command line options as I understood... it would be good to avoid that as much as possible unless we say we bump major version numbers.

Yeah I agree

@wds15
Contributor

wds15 commented Apr 13, 2021

One more question: despite the templates being introduced, we can still precompile the Stan services ahead of time, right?

Assuming a yes here, which I would guess, this design is sound for me.

@SteveBronder
Collaborator Author

One more question: despite the templates being introduced, we can still precompile the Stan services ahead of time, right?

Yep! I can double check but I believe that's totally fine.

Assuming a yes here, which I would guess, this design is sound for me.

Word! Let me read this over, fix some things based on what @betanalpha posted, and then I'll call for a vote.

@betanalpha

betanalpha commented Apr 15, 2021 via email

@SteveBronder
Collaborator Author

SteveBronder commented Apr 19, 2021

I agree that this is the right behavior, but it has to be thoroughly documented. In particular I think that we need some discussion of how to emulate the same behavior with multiple calls versus one call (including not just multiple one-chain calls with chain=1, chain=2, …, chain=n but also multiple multi-chain calls with properly strided chain ids, for example chain=1, chain=2 * num_chains, chain=…n * num_chains).

Yes, I think the way I have it written should be reproducible. In the scheme I have now each PRNG is set by the seed + chain_id (the initial id up to N chains). So chain 1 is seed + 1, chain 2 is seed + 2, etc. So if a user wants to reproduce a multi-chain process with individual programs they would just use the same seed and then iterate the id value for each process. I can add a note in the design doc about how users can replicate multi-chain processes with single chains and how we will add this info to the documentation.

Though the output file names will be different in a multi-chain program vs. a multi-program scheme. Changing the behavior of the multi-program scheme would require breaking backwards compatibility, since we would tag each output with a _{id}.csv. I think I'd rather not break backwards compatibility here and just have docs that specify this output behavior.

I think we mean the same thing!

Yes we are! Should have affirmed that, it's good and cool!

Yeah, I’m just thinking about the naming for the services API. The interface naming is another matter altogether.

I personally prefer something like init_chain_id because it emphasizes that the individual chains being bundled together will be given related ids. process_id on the other hand seems to label the entire bundle and not any individual chain.

I see; yeah, if we are talking about the service layer I think that makes a lot of sense, I'll update this.

You mean that you just want avoid having to make the interfaces wrap things in container classes that derive from var_context or the like? I don’t have any huge objections here, just want to understand the motivation.

Yeah, so the main thing is just letting upstream services be flexible in what they pass to the API. So a user of the service layer could pass an std::vector<SomeInitType> or alternatively a std::vector<some_smart_ptr<SomeInitType>>. I just realized I wrote "This is for performance" in the design doc and then wondered why people keep thinking this is for performance, lol, so I need to change that.

Another alternative here is to specify that the inner element of the vector must have a valid operator* that returns a reference to the underlying object. Then a developer's class would only need that operator* defined to work with the API, but that also means that if the dev wanted to pass in a std::vector<NotAPointer> they would need an operator* defined for that class that returns a reference to itself. That's not the worst, though I think the normal use of operator* is to return a reference to an underlying object, so it seems a bit non-standard; I'd rather just have the get_underlying() function folks can override if they need to.

@SteveBronder
Collaborator Author

Alrighty I just updated with fixes! I think I'd like to hold a vote starting from April 21st to April 28th (Wednesday to Wednesday) if that is alright with everyone

std::vector<InitWriter>& init_writer,
std::vector<SampleWriter>& sample_writer,
std::vector<DiagnosticWriter>& diagnostic_writer,
size_t n_chain)
Collaborator


I would welcome some consistency with the parameter name. Having n_chain in C++ alongside num_warmup, num_samples, and num_thin is odd. And then in CmdStan it's chains (plural, no n_ or num_ prefix).

Seems like num_chains would be consistent with the current C++ API.

Collaborator Author


Reasonable!

@SteveBronder
Collaborator Author

Alright, let's do the vote starting today, tagging everyone who has commented so far.

@riddell-stan @wds15 @betanalpha @mitzimorris @bbbales2

let's have the vote as either a comment or as a review

@riddell-stan
Collaborator

riddell-stan commented Apr 21, 2021 via email

@SteveBronder
Collaborator Author

So in the past we had it that if there were no disagreements among anyone who's commented so far (full quorum) we would just merge the design doc. Folks seem to want to move to a voting scheme lately, so I'm fine with also doing a vote here. Though if no one has an objection, or 50% of the people tagged above vote yes, then I think we are good to just approve and merge it.

@yizhang-yiz

Folks seem to want to move to a voting scheme lately so I'm fine with also doing a vote here.

I'm not aware of any proposal to move to a voting scheme. The adjoint ODE doc vote was due to disagreement.

@SteveBronder
Collaborator Author

Oh in that case if folks are fine with it let's approve and merge this. I can update the parallel branch with the above changes today

@mitzimorris
Member

lgtm.
is this going to be added to all methods in stan::services or just to methods in stan::services::sample?

@SteveBronder
Collaborator Author

is this going to be added to all methods in stan::services or just to methods in stan::services::sample?

For now I was going to just do sampling, though I think this scheme makes a lot of sense for vb (and I think @hyunjimoon had some stuff for vb where multi-chain would be very useful). Does it make sense to add this for optimization? If so, then yeah, I'm also fine with doing it for optimization as well.

@betanalpha

betanalpha commented Apr 26, 2021 via email

@rok-cesnovar
Member

(correct me if I’m wrong but the transformed data PRNG gets seeded by seed alone and the generated quantity PRNG gets seeded by seed + stride * chain_id)

This is correct.

@t4c1
Contributor

t4c1 commented Apr 28, 2021

chain_id = init_chain_id + chain_idx
stride_size = num_warmup + num_samples
chain_seed = base_seed + stride_size * chain_id

I have two issues with that: the first is how this is written, and the second is what is written. Let's start with the first (you are probably thinking about the right thing, but it is not written correctly). Assuming we want to keep behavior the same, we need to seed all generators with the same seed and then jump some number of steps ahead. In the current code this is (with seed here being the same as base_seed in your suggestion):

  static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
  boost::ecuyer1988 rng(seed);
  rng.discard(DISCARD_STRIDE * chain);

The second issue is by how many strides we should jump ahead (what the chain variable is in the C++). This does NOT depend on the number of samples or warmup iterations; it just needs to be the index of the PRNG being seeded. AFAIK we can run multiple processes at the same time, multiple chains in the same process, and multiple threads for each chain. Each thread needs its own PRNG instance, so this needs to be equal to the global thread id, which is init_chain_id + id_of_thread_in_process.

@SteveBronder
Collaborator Author

Thanks! I think I see what you are saying, but just to check are you saying something like

  static uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
  boost::ecuyer1988 rng(seed);
  // Just changing this part
  rng.discard(DISCARD_STRIDE * init_chain_id + id_of_thread_in_process);

Though the RNG is not currently constructed within the thread, is there anything wrong with just using the chain_id? E.g., if I have 4 chains starting at an id of 0 they would be strided by

DISCARD_STRIDE * 0
DISCARD_STRIDE * 1
DISCARD_STRIDE * 2
DISCARD_STRIDE * 3

@t4c1
Contributor

t4c1 commented Apr 29, 2021

but just to check are you saying something like

Exactly.

Though the RNG is not currently constructed within the thread, is there anything wrong with just using the chain_id?

Well, how does it work now if multiple threads are computing one chain? Do they share a single PRNG? I think that would make the threaded part non-deterministic if it uses the PRNG.

@betanalpha

betanalpha commented Apr 29, 2021 via email

@t4c1
Contributor

t4c1 commented Apr 29, 2021

I’m not sure exactly how precisely constrained this is in the compiler but my understanding is that the functors that expose threading don’t accept functions that utilize any of the _rng functions, in which case there is no need to parallelize the PRNG within each chain.

Then ignoring threads working on the same chain is fine, although the restriction seems somewhat arbitrary. It makes more sense for reduce_sum, but I could imagine someone using RNGs in the functor passed to map_rect. Anyway, even if we ignore multiple threads per chain, this should be (with parentheses around the addition):

rng.discard(DISCARD_STRIDE * (init_chain_id + id_of_thread_in_process));

All that's needed is for DISCARD_STRIDE to be long enough that each chain has enough space to consume as many PRNG states as it needs while not being so long that DISCARD_STRIDE * num_chains gets close to the period of the generator.

We could also evenly space the generator period between streams, setting DISCARD_STRIDE = generator_period / total_n_chains. Although the way it is now is also fine as long as each chain needs fewer than 1 << 50 randomly generated numbers and we run fewer than generator_period / (1 << 50) streams, which is probably a reasonable assumption.

@wds15
Contributor

wds15 commented Apr 29, 2021

With the TBB the concept of a thread is abstracted away. You could have thread-local streams for each chain, but that is not super performant. Threaded out-of-order access to an RNG is not a nice thing to support; if that can be avoided, that would be good.

@hyunjimoon

hyunjimoon commented Apr 30, 2021

is this going to be added to all methods in stan::services or just to methods in stan::services::sample?

For now I was going to just do sampling, though I think this scheme makes a lot of sense for vb (and I think @hyunjimoon had some stuff for vb where multi-chain would be very useful). Does it make sense to add this for optimization? If so, then yeah, I'm also fine with doing it for optimization as well.

For instance, this new low-rank ADVI is being tested with a robust variational inference metric which uses multiple chains to compute Rhat. I think it would be useful.

@SteveBronder
Collaborator Author

Totally agree about the eventual benefit, but I think the priority of the design doc here is the design of the API route itself and not its implementation (again, modulo details like thread safety that have to be enforced in the route).

That's fair, though if you check out the most recent update I think I touch more on the motivation for this.

So I don't think we can get rid of the actual templates for the var_context inputs while supporting multiple types of smart pointers. For instance, if the signature had std::vector<var_context*> then we could only bring in raw pointers to var_context-derived types. What we can do is get rid of get_underlying() and say that we accept any input vector whose elements have a valid operator* which returns a reference to a class derived from var_context. I think that's flexible enough for any doo-dads that downstream devs may need while being concrete in the service layer.

Is the issue here that RCPP wants to wrap all references in smart pointers? So the problem isn’t that RStan couldn’t wrap local memory in a var_context derivation but rather that it couldn’t expose it outside of a smart pointer reference?

I think it's that I want to wrap pointers into smart pointers. We could make this a std::vector<var_context*>, but then all the upstream interfaces would need to manage the memory for their var_context pointers. Just to make sure I understand, is the alternative interface you are thinking about just having a std::vector<var_context*>, or is there something else you had in mind?

Regarding the vector of raw pointers: it's just more manual memory management that I think we can let users avoid/hide by making these templates take in a class with an operator* that returns a reference to a class derived from var_context. Then cmdstan can use shared_ptr<var_context>, rstan can use Rcpp::Xptr<var_context>, Python can use whatever smart pointer it wants if it has one, etc.

As was pointed out earlier in the thread I don’t think there’s a need for voting here. The design is still evolving, for the better, once the design equilibrates we can consider voting if there’s not consensus.

That's reasonable. It feels like we are now talking more about implementation details; the stride etc. for the PRNG are not user facing. At this point I'd prefer approving this and moving the PRNG conversation over to the implementation. (That should also make that conversation easier, as we can look at how I currently have things coded up.)

@bob-carpenter
Collaborator

this new low-rank ADVI is being tested with a robust variational inference metric which uses multiple chains to compute Rhat. I think it would be useful.

ADVI is stochastic (initialization point and stochastic gradients for the ELBO), but it's not an MCMC method and does not have chains. The draws it produces are just independent draws from the normal approximation. You can treat them like a Markov chain, but we know its stationary distribution is normal(mu, Sigma) and we know the draws are independent. This means there's no point in splitting, as in split R-hat.

If we stuck to the unconstrained scale, all the sample quantities required for R-hat are available analytically. We lose that if we convert back to the constrained scale.

Just to make sure I understand, is the alternative interface you are thinking about just having a std::vector<var_context*>?

I'm late to this discussion, but why not just have std::vector<var_context> and let the vector manage the memory? It's just as efficient as constructing independently if you use emplace_back, for example.

the stride etc. for the PRNG are not user facing

They may not be user controllable, but they can impact users. It's probably not an issue if the skip-ahead strides are big enough, though; I think we used values for CmdStan MCMC that were larger than the number of draws we could generate in any reasonable amount of time.

@SteveBronder
Collaborator Author

this new low-rank ADVI is being tested with a robust variational inference metric which uses multiple chains to compute Rhat. I think it would be useful.

ADVI is stochastic (initialization point and stochastic gradients for the ELBO), but it's not an MCMC method and does not have chains. The draws it produces are just independent draws from the normal approximation. You can treat them like a Markov chain, but we know its stationary distribution is normal(mu, Sigma) and we know the draws are independent. This means there's no point in splitting, as in split R-hat.

I think we can talk about ADVI another time, as the design doc is just for the sampler's service API. It's definitely something we should think about soon; I think a lot of the API layout here will be applicable to ADVI as well.

I'm late to this discussion, but why not just have std::vector<var_context> and let the vector manage the memory? It's just as efficient as constructing independently if you use emplace_back, for example.

std::vector<var_context> does not work since var_context is an abstract base class :(. Also, I think what we want is to allow either one init shared across all the chains or a separate init for each chain. This is nice and easy to do in cmdstan if we have a std::vector<shared_ptr<var_context>>.

https://godbolt.org/z/nK4dhMcMo

the stride etc. for the PRNG are not user facing

They may not be user controllable, but they can impact users. It's probably not an issue if the skip-ahead strides are big enough, though; I think we used values for CmdStan MCMC that were larger than the number of draws we could generate in any reasonable amount of time.

Yeah, I agree, though I think that is something we would put in the Stan docs about parallel processes, which we would add once we sort out the striding in the implementation.

@hyunjimoon

this new low-rank ADVI is being tested with a robust variational inference metric which uses multiple chains to compute Rhat. I think it would be useful.

ADVI is stochastic (initialization point and stochastic gradients for the ELBO), but it's not an MCMC method and does not have chains. The draws it produces are just independent draws from the normal approximation. You can treat them like a Markov chain, but we know its stationary distribution is normal(mu, Sigma) and we know the draws are independent. This means there's no point in splitting, as in split R-hat.

Sorry if the point was not clear. The quote below from this paper suggests running stochastic optimization with multiple chains. It is true that split-Rhat is used, but the algorithm is designed around J chains. Viewing the stochastic optimization as MCMC, and therefore borrowing MCMC diagnostics, is the novelty of the paper, which motivated us (@Dashadower @bbbales2 and I) to include khat and Rhat in the VI diagnostics; related to this issue.

"We use the split-Rb version, where all chains are split into two before carrying out the
computation above, which helps with detecting non-stationarity [17, 54] and allows us to use it even
when J = 1."

@betanalpha

betanalpha commented May 3, 2021 via email

@SteveBronder
Collaborator Author

Regarding the vector of raw pointers: it's just more manual memory management that I think we can let users avoid/hide by making these templates take in a class with an operator* that returns a reference to a class derived from var_context. Then cmdstan can use shared_ptr<var_context>, rstan can use Rcpp::Xptr<var_context>, Python can use whatever smart pointer it wants if it has one, etc.

Sorry for being dense here. I didn't think that the memory management was all that burdensome and, if anything, it gave the interfaces the flexibility to, for example, let the local REPL environment manage the memory. I'm not caught up on the latest Rcpp, but it sounds like Xptr achieves that for R using a smart pointer pattern? @riddell-stan, is there a similarly straightforward pattern that httpstan can exploit?

No worries at all! It's not that it would take days; it's just another thing I'd have to think about that using a smart pointer lets me not think about. For cmdstan it would be nice not to have to do a bunch of news and deletes over the elements of the vector when I could just have a std::shared_ptr<>. For R, using raw pointers instead of Rcpp::Xptr involves calling PROTECT() / UNPROTECT() / R_SetExternalPtrProtected(), etc., which you don't have to think about if you use Rcpp::Xptr.

In any case, is the choice of smart pointers over regular pointers orthogonal to the parallelization issues? If so, wouldn't it be better to add new smart-pointer routes for all the current routes in one go in a separate PR?

I don't know if I'd call it orthogonal. Sort of? Right now the templated version of the signature in this proposal allows someone to pass in a raw pointer, since the only requirement is that a valid operator* exists for the object that returns a reference to a class derived from var_context. That allows for both raw and smart pointers.

The current routes just take in a reference to an object derived from var_context, and I don't think there's any need to change those. They just call operator* on the var_context-derived object when calling the service-layer API.

I just want to make sure that the signature and automated chain_id will be flexible enough for all the use cases we might encounter. Once we’re agreed on that (which I think we largely are) then I agree the further implementation details go beyond the scope of the design doc.

Is the Q here whether we can ensure pseudo-independent sequences for each of the chains given init_chain_id, seed, and num_chains? imo it feels like our answer is yes at this point, and now we are mostly focusing on exactly how we are going to set up the stride. Or am I missing some part of the convo?

@SteveBronder
Collaborator Author

Wanted to bump this, as I think if we can get this merged this week then we will have enough time to add it to cmdstan before the next release. @betanalpha does the above answer your Qs about the var_context? imo I'd like to move the rest of the PRNG discussion to the implementation.

@betanalpha

betanalpha commented May 6, 2021 via email

@wds15
Contributor

wds15 commented May 6, 2021

It looks as if things are settled enough from my view... but I don't want to jump ahead here. Given that @SteveBronder obviously is willing to invest a lot of time the next days on it, I think it would be good to make an attempt to get this over the line and into 2.27. So how about @betanalpha and @SteveBronder have a quick call to sort out the remaining bits?

From my end I can review stuff from the PRs as they relate to threading and all that; or other things I have worked on in the past.

There is still some time so it's not entirely out of scope for 2.27 if people put some time into it now.

@SteveBronder
Collaborator Author

But, and correct me if I’m wrong, all the smart pointers would do is call delete on the collection of var_context instances once the std::vector goes out of scope in the interface code, right? That doesn’t seem like a tremendous amount of memory management to me, especially when care is already going to be required in constructing the separate var_context for each initialization anyways. In particular why couldn’t an interface just construct the var_context instances on the stack, as CmdStan currently does, and then throw the local stack addresses into a vector? They should be valid through the lifetime of the function call, no?

It's a lot less trivial than it first appears. Here's an example. Say with the init var_context we want to follow the guidelines above: if the user only specifies my_init.data.R, we want one pointer to a context that all of the chains use; otherwise we have N inits. Well, now I have two cases, one where I have N inits and need to call N deletes, and another where I have 1 init and need to call 1 delete (else I double free). Sure, I can code around that, but I can also just make them std::shared_ptr<var_context> and not have to care about managing that. Now suppose in the future we want to allow users to supply some modulo of num_chains inits. Then I need to keep track of the total number of deletes I need to do when I clean up (aka pretty much rewriting shared_ptr<>). This all gets complicated, and a smart pointer lets me be dumb and safe for a small bit of overhead that we barely pay since we pass the vector as a reference. To be clear, it would take an extreme case where having a template parameter here is bad enough for me to change this piece of the API.

Is the Q here whether we can ensure pseudo-independent sequences for each of the chains given init_chain_id, seed, and num_chains? imo it feels like our answer is yes at this point, and now we are mostly focusing on exactly how we are going to set up the stride. Or am I missing some part of the convo?

Yes, I think we’re largely in agreement, but at the same time I don’t think this can be fully pushed to implementation. Because this is behavior that we want the API to guarantee, I think it has to be fully fleshed out in the design doc (unless a stride argument is added to the route or something). Like the filename lookup logic, I think we want the precise PRNG logic specified.

It's not clear to me what's left to be discussed. It feels like at this point we are talking about the stride behavior, but imo that's an impl detail, and I don't like the idea of specifying the exact implementation inside the design doc. Say we build it out and realize, "Oh, X is much better, we should be doing that with the PRNG"; then any convo we had here would end up being tossed, and we'd also have to go back and make another PR to change it. I think what we want in the design doc is to specify what sort of user behavior allows for reproducibility (which I think is what we have right now). Any technical details would be documented in the doxygen, and explaining to users how to run an analysis reproducibly would go in the Stan documentation.

It looks as if things are settled enough from my view... but I don't want to jump ahead here. Given that @SteveBronder obviously is willing to invest a lot of time the next days on it, I think it would be good to make an attempt to get this over the line and into 2.27. So how about @betanalpha and @SteveBronder have a quick call to sort out the remaining bits?

I spoke to @betanalpha and he wants to keep it here to make sure all communication about the design process is public (which I think is fair and reasonable).

From my end I can review stuff from the PRs as they relate to threading and all that; or other things I have worked on in the past.

Much appreciated!

There is still some time so it's not entirely out of scope for 2.27 if people put some time into it now.

That's the other thing: the impl is pretty much done based on the current design doc, and I think talking about the PRNG on the impl side will make the conversation a lot easier, since there we can actually look at and comment on code. Talking about technical things like the PRNG here in the design doc is very difficult for me, and to be honest I'm pretty lost on what needs to be done wrt the PRNG in order to move forward.

@bob-carpenter
Collaborator

  1. No reason people can't have calls then report back to the lists. Communication via GitHub message is slow.

  2. I agree that we should split functional specs (what the API does) from how it does it. You can write a technical spec for the latter if needed to justify the functional spec's ability to be coded.

  3. I'm jumping into this fairly late, but do we really need pointers here? Why not just use vector<var_context&> to make sure the result is a reference? (Yes, I know they're the same underlying, but we've avoided pointers as much as possible and using references simplifies the use of elements (replacing -> with the standard .).

  4. I think we should ignore statements like "the impl is pretty much done based on the current design doc" when considering designs. I don't want sunk implementation cost to affect design the way it has in the past.

  5. Stride behavior is not an implementation detail for whatever function is taking a PRNG reference. It affects how that reference is modified by the function call.

@SteveBronder
Collaborator Author

No reason people can't have calls then report back to the lists. Communication via GitHub message is slow.

@betanalpha if you're cool with that, I think it would be easier. There are a couple of places where I'm confused rn.

I'm jumping into this fairly late, but do we really need pointers here? Why not just use vector<var_context&> to make sure the result is a reference? (Yes, I know they're the same underlying, but we've avoided pointers as much as possible and using references simplifies the use of elements (replacing -> with the standard .).

It would be nice to avoid a pointer here, but since we use abstract base classes I don't think that works, unless I'm doing something wrong in the godbolt example below

https://godbolt.org/z/vvMhhKGre

I think we should ignore statements like "the impl is pretty much done based on the current design doc" when considering designs. I don't want sunk implementation cost to affect design the way it has in the past.

Yeah, that's very fair. I think I said that because I'd like to have it in for this next release. To be clear, I'm not saying "how the impl does it is how it should be done," but that the impl I have now satisfies almost everything in this design doc, and I'm mostly waiting to make any final changes in the impl until this is approved.

Stride behavior is not an implementation detail for whatever function is taking a PRNG reference. It affects how that reference is modified by the function call.

My thought process here is that this is an implementation detail with respect to the users of the service layer. imo this design doc is centered around the service layer APIs and so it should have explanations at that level. That reflects what I have in the design doc right now describing how users of the service layer will allow for reproducibility.

Also, I think my other confusion here is that I don't understand what specific changes to the current design doc are necessary with respect to the PRNG. I'm fine with copy-pasting the below into the design doc, but it feels odd to me. What if we ever decide to change anything here? For instance, what if we decide to use equally spaced strides? Or swap to a PRNG with a longer cycle like taus88? It feels like the code in the design doc becomes a little reminder, so if anything changes in create_rng() we also need to update the information here. imo anyone who wants to know exactly how we are generating the PRNG can go to the code on GitHub and look at it. I also think documentation about this sort of thing should go in the Stan docs once we've sorted it out in the impl.

#include <boost/random/additive_combine.hpp>  // boost::ecuyer1988
#include <cstdint>

inline boost::ecuyer1988 create_rng(unsigned int seed, unsigned int init_chain_id,
                                    unsigned int chain_num) {
  // Initialize the L'Ecuyer generator
  boost::ecuyer1988 rng(seed);

  // Seek the generator to a disjoint subsequence for each chain
  static constexpr std::uintmax_t DISCARD_STRIDE
      = static_cast<std::uintmax_t>(1) << 50;
  rng.discard(DISCARD_STRIDE * (init_chain_id + chain_num - 1));
  return rng;
}

What if I just included a link to the branch's create_rng() function? Then anyone who wants to know how the PRNG is built can go there

@betanalpha

betanalpha commented May 11, 2021 via email

@t4c1
Contributor

t4c1 commented May 11, 2021

I'm jumping into this fairly late, but do we really need pointers here? Why not just use vector<var_context&> to make sure the result is a reference? (Yes, I know they're the same underlying, but we've avoided pointers as much as possible and using references simplifies the use of elements (replacing -> with the standard .).

It would be nice to avoid a pointer here but since we use abstract base classes idt that works, unless I'm doing something wrong in the godbolt example below
https://godbolt.org/z/vvMhhKGre

We can't. C++ forbids pointers to references, which essentially prevents us from putting references into any standard container (except as members of a struct or tuple, I think). It is possible to wrap references in a struct before putting them in a vector, but then it gets more complicated than simply using a pointer.

@SteveBronder
Collaborator Author

Let’s back up a bit.

Currently the interfaces allocate a var_context instance for the initializations on the stack and pass them to the API by reference.

The current design in the doc changes this so that the var_context instances have to be allocated on the heap, in which case the memory has to be allocated and managed somewhere. I appreciate how smart pointers can be useful if some or all of the heap instances are replicated; I would just prefer to avoid heap allocation in the first place if at all possible. As Bob said, we’ve tried to avoid that where possible in the code base so far.

Using new anywhere puts the pointer on the stack, but the object the pointer references goes on the heap.

And cmdstan currently uses a shared pointer

I think I understand what you are looking for with the PRNG, and I just updated the docs to include the constant stride.

@SteveBronder
Collaborator Author

Alrighty @betanalpha and I spoke over the phone and I think with the changes to the template context names and the updated PRNG docs we should be good to go!

@SteveBronder
Collaborator Author

I think with the above + the original vote this is good to go, can someone click the approve button?

@betanalpha betanalpha left a comment

Looks good to me!

@SteveBronder SteveBronder merged commit 5d2812d into stan-dev:master May 12, 2021