Running multiple chains in one Stan program #987
Conversation
Also running heap analysis now with
and will report back once I have results from both non-parallel and parallel.
Working on cross-chain warmup tells me that as soon as chains start talking to each other there's a lot of information to utilize. Another thing that I came to believe is that massively parallel chains, in the sense of multiple (many more than the standard 4) communicating chains plus within-chain parallelization, play a key role in bringing the large-scale problems I'm interested in reasonably within reach.
Yeah that's what I was kind of thinking with the part here where we can pretty easily take N warmup samples, do stuff, then run the chains again in an outer for loop around the

There's def a lot of neat things possible with this pattern, but on its own it's a pretty heavy cost. If we can justify it with some fancy new parallel chain algorithm then I'm 100% for making this look pretty and staring into the deep abyss of perf records to figure out where the bottlenecks are.
Agreed. Other than giving us a chance to explore, which is important in its own regard, from a performance perspective maybe we need to think about what's the best bang for the buck for a limited # of cores when choosing between parallel-across-chains and parallel-within-chains: if a 4-core within-chain parallel run gives a speedup of 2.0, using those 4 cores for 4 parallel chains had better do some real magic on the warmup/sampling.
Yes 100%, would you and/or @bbbales2 have time to use this as a base for the

```
mymodel sample num_warmup=auto num_samples=500 data file='blah.json' chains n_chains=8 max_chains=16
```

where after the auto warmup we spin off (max_chains - n_chains) additional chains? I think if we can spin those up nicely + doing auto we can get the ESS up to a point that this becomes a very good idea. Going farther we could probably even have something like

```
mymodel sample num_warmup=auto num_samples_ess=300 data file='blah.json' chains n_chains=8 max_chains=16
```

where we run sampling until the minimum ESS for a parameter is more than
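For what it's worth, here's a rough standalone sketch of what that kind of ESS-based stopping loop could look like. It's purely illustrative: `run_sampling_block` and `min_ess` are hypothetical stand-ins, not Stan services, and a real version would call the sampler and an actual ESS estimator.

```cpp
#include <iostream>
#include <vector>

// Purely illustrative stand-ins -- these are NOT Stan services. A real
// version would call the sampler and an ESS routine instead.
std::vector<double> run_sampling_block(int n_chains, int block_size) {
  return std::vector<double>(n_chains * block_size, 0.0);  // fake draws
}
double min_ess(const std::vector<double>& draws) {
  return 0.5 * draws.size();  // placeholder: pretend ESS is half the draws
}

int main() {
  const int n_chains = 8, block_size = 100, max_blocks = 50;
  const double target_ess = 300.0;  // e.g. num_samples_ess=300 from above
  std::vector<double> all_draws;
  for (int b = 0; b < max_blocks; ++b) {
    std::vector<double> block = run_sampling_block(n_chains, block_size);
    all_draws.insert(all_draws.end(), block.begin(), block.end());
    if (min_ess(all_draws) >= target_ess) {
      std::cout << "stopping after " << (b + 1) << " blocks\n";
      break;  // minimum ESS across parameters has reached the target
    }
  }
}
```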
Also tagging @LuZhangstat as this may be of interest to her for the adaptation stuff.
Short term no, long term yes. Once we're closer to the other side of some subset of ocl/varmat/tuples/closures/adjoint/laplace. (Edit: and I guess the go time is a function of which of the things on the wishlist above are ready to go -- there's a lot of leverage here to motivate this change, I think.) It's definitely very interesting and an encouragingly small amount of code. (Edit: And feeds into a lot of different things.)
That's another idea to add to the list.
@SteveBronder I'm booked until May; after that I would love to dig into this, hopefully sooner.
Looks great! Thanks for sharing. I think the implementation of Pathfinder needs GPU parallel computing and clever memory management. Implementing Pathfinder with low-level parallel programming is currently low priority on my to-do list, but I plan to work on it in the near future. By the way, do you happen to know any good tutorials for GPU programming? Many thanks!
Word! If you want GPU stuff I think you can use Stan math's OpenCL backend, which has a pretty high-level API for most stuff. What kinds of operations are you looking to do in particular? If your matrix sizes are not very large it may be better to do something with the TBB so you don't pay the transfer cost of moving from the host to the GPU.

http://mc-stan.org/math/d5/de5/group__opencl.html

@rok-cesnovar we should go through and tag all the GPU functions so they show up there. We should probably do the same thing for the new matrix type, then we'd just have an automatic list.

For books, OpenCL Programming By Example is pretty good. The OpenCL Guide is pretty tried and true. I'll try to look around and find what book I used for my GPU course.
@SteveBronder Thank you so much for sharing the GPU stuff! I will read them later.
I will need matrix decompositions like the thin QR decomposition and the Cholesky decomposition. The matrix size depends on the number of parameters, so it could be large.
icic. We have the Cholesky implemented for GPUs in the math library, but no thin QR.
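Roughly, calling it from C++ could look like the sketch below. Treat this as an assumption rather than a reference: the umbrella header, `to_matrix_cl`/`from_matrix_cl`, and the OpenCL overload of `cholesky_decompose` are from memory and their exact names/paths may differ from the current math API.

```cpp
#include <stan/math.hpp>
#include <stan/math/opencl/opencl.hpp>  // assumed umbrella header for the OpenCL backend
#include <Eigen/Dense>

int main() {
  // Build a small symmetric positive-definite matrix on the host.
  Eigen::MatrixXd a = Eigen::MatrixXd::Random(100, 100);
  Eigen::MatrixXd spd
      = a * a.transpose() + 100 * Eigen::MatrixXd::Identity(100, 100);

  // Copy to the device, factor there, copy the result back.
  // Assumes to_matrix_cl / from_matrix_cl and an OpenCL overload of
  // cholesky_decompose exist under these names.
  stan::math::matrix_cl<double> spd_cl = stan::math::to_matrix_cl(spd);
  stan::math::matrix_cl<double> l_cl = stan::math::cholesky_decompose(spd_cl);
  Eigen::MatrixXd l = stan::math::from_matrix_cl(l_cl);
  (void)l;  // a real program would use the factor here
  return 0;
}
```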
I see, thanks a lot!
So, after fixing #842 in stanc3 I went back and tried these out again:

```
# This PR
time -p examples/dawid/dawid sample data file=examples/dawid/caries_dump.R chains n_chains=8
# real 200.60
# user 1452.17
# sys 2.09

# cmdstan develop with gnu parallel
time -p seq -w 0 8 | parallel "examples/dawid/dawid sample data file=examples/dawid/caries_dump.R output file=outputs{}.csv"
# real 246.03
# user 1824.17
# sys 1.52
```

LoL, so that's cool. This went from being "good idea with buts" to just a good idea! I guess it makes sense that

Anyway, this seems nice but I have no idea how to proceed. My main Qs are around how this should be impl'd in the stan repo. Like should there be a

tagging parties that might be interested @yizhang-yiz @bbbales2 @wds15
Wowowow that is a big boost. So now the multithreading is just flat faster than the multiple chains (by a big margin)?
If multithreading is just faster then there is no need to tap into the algorithms above. Otherwise, for justification, cross-chain warmup is probably the most ready of those. As far as coding challenges, I think you're right, it'd be a matter of working out how to support the interfaces. Seems like figuring out what potential cmdstan/cmdstanr/cmdstanpy support looks like would be the easiest, and then we could work back to what would need to change in services from there.

If we went the cross-chain warmup route, then we'd need a way to compute cross-chain Rhat and ESS (this code would get used by the variational stuff too). Yi said he'd be busy until May but it sounds like he'd like to work on it. I like the team projects, so it seems fun to me and I'm happy to wait. I think it would be easier to develop this once varmat is in the bag too, since this would be a big thing.
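To make the cross-chain quantity concrete, here's a minimal standalone sketch of the classic R-hat computed from per-chain draws of one parameter. It follows the usual between/within-chain variance formula (no split chains or rank normalization) and is not meant to mirror any particular Stan implementation.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Classic cross-chain R-hat from M chains of N draws each (no split chains
// or rank normalization; purely illustrative).
double cross_chain_rhat(const std::vector<std::vector<double>>& chains) {
  const double m = chains.size();
  const double n = chains[0].size();
  std::vector<double> means, vars;
  for (const auto& c : chains) {
    double mu = std::accumulate(c.begin(), c.end(), 0.0) / n;
    double v = 0.0;
    for (double x : c) v += (x - mu) * (x - mu);
    means.push_back(mu);
    vars.push_back(v / (n - 1));
  }
  double grand = std::accumulate(means.begin(), means.end(), 0.0) / m;
  double b = 0.0;  // between-chain variance estimate B
  for (double mu : means) b += (mu - grand) * (mu - grand);
  b *= n / (m - 1);
  double w = std::accumulate(vars.begin(), vars.end(), 0.0) / m;  // within W
  double var_plus = (n - 1) / n * w + b / n;  // pooled variance estimate
  return std::sqrt(var_plus / w);
}
```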
This is really awesome work! Given that this already shows benefits - maybe there is a way to get this in quickly in a first version? For example adding an argument

Now, there is one thing which I am worried about, and that is the nested parallelism pattern which we will end up having when using this with programs based on

https://www.threadingbuildingblocks.org/docs/help/tbb_userguide/work_isolation.html

I do think that we are affected by this and need to be careful, as we do use a thread-local resource (the AD stack). From my understanding we need to ensure that anything within stan-math which fires off sub-processes (like a

Makes sense?

EDIT: An additional point for headaches is MPI... maybe that should just not be allowed with multiple chains. The current MPI design would throw up on this (it would continue working, but only a single chain would ever get to use the workers at any given time).
Ooof yeah, if one thread can start writing to another thread's stack allocator that could be very bad. But I'm also confused about how that work sharing interacts with thread-local storage objects. Will the thread stealing work use the original thread's thread-local object, or will it just not care and use its own? If the latter then yeah we def need to program against this.

Can we put the
I've never looked at the MPI stuff but I'll take your word for it. Though how do we program against that so users don't goof? Would it literally just crash or would it just work poorly? I wonder how we can check, before we run the chains, that the user is not utilizing MPI.
The TBB handles tasks and abstracts away threads entirely. The only guarantee you have about threads is that they are assigned to task_arenas. No thread can be part of two task_arenas at the same time, but threads can be moved between arenas. One solution is to run different chains in different arenas... but that will not take advantage of resource re-allocation as smoothly, I think (so early-finishing chains giving their resources to still-running ones). So this is why we need to isolate any nested parallelism as outlined in the link above. This will give us the right guarantees for the thread-local resources. The right action is to do so in reduce_sum and map_rect. At least this is how I understand it.
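A minimal sketch of that isolation pattern, assuming the standard tbb::this_task_arena::isolate API; the reduce_sum/map_rect wiring is only hinted at in comments and the function itself is just an illustrative parallel sum.

```cpp
#include <functional>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <tbb/task_arena.h>

// Sketch of the work-isolation pattern described above: the nested parallel
// region runs inside this_task_arena::isolate, so a thread that blocks here
// cannot steal a task belonging to a different chain (which would otherwise
// touch the wrong thread-local AD stack).
double isolated_nested_sum(const std::vector<double>& x) {
  double total = 0.0;
  tbb::this_task_arena::isolate([&] {
    // In Stan math this would be the parallel region inside reduce_sum /
    // map_rect; here it is just a parallel sum over x.
    total = tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, x.size()), 0.0,
        [&](const tbb::blocked_range<std::size_t>& r, double partial) {
          for (std::size_t i = r.begin(); i != r.end(); ++i)
            partial += x[i];
          return partial;
        },
        std::plus<double>());
  });
  return total;
}
```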
@bbbales2 latest changes should have everything in here. I think since I sorted all this out I'm going to add the dense and other samplers to the Stan PR.
@wds15 we should be good to go! Two little things here
I think this is ready for review! |
If I run

```
./bernoulli-2 sample num_chains=4 id=100 output refresh=1000 data file=bernoulli.data.R num_threads=4
```

I'd expect to get output_100.csv, output_101.csv and so on... but I do get output_1.csv, output_2.csv and so on... I think we want the chain id to be the number, not 1...num_chains.
+ " sample num_warmup=200 num_samples=1 num_chains=2 random seed=1234" | ||
+ " output file=" + cmdstan::test::convert_model_path(model_path) | ||
+ ".csv"; | ||
|
This test implicitly tests that the default init chain id is 1 (as we test that the filename ends with _1.csv)...do we want that? Or is it better to specify chain_id=1 as an argument to cmdstan? I don't know the answer, but some thought and doc would be nice here.
Yes, this was wrong before, good catch! I set this to id=10 in the test so that we then look for files ending in 10.csv and 11.csv. I think I should add docs to the output file argument so that folks know that for id=N and num_chains=j the output will be output_N.csv, output_{N + 1}.csv, ..., output_{N + j - 1}.csv.
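Just to make that naming rule concrete, a tiny standalone illustration (not the actual cmdstan code):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Illustration of the naming rule described above (not the cmdstan
// implementation): chain i of num_chains gets the suffix id + i.
std::vector<std::string> output_file_names(const std::string& base,
                                           unsigned int id,
                                           unsigned int num_chains) {
  std::vector<std::string> names;
  for (unsigned int i = 0; i < num_chains; ++i) {
    names.push_back(base + "_" + std::to_string(id + i) + ".csv");
  }
  return names;
}

int main() {
  for (const auto& f : output_file_names("output", 10, 2)) {
    std::cout << f << "\n";  // output_10.csv, output_11.csv
  }
}
```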
```cpp
        model, *(init_contexts[0]), random_seed, id, init_radius,
        grad_samples, elbo_samples, max_iterations, tol_rel_obj, eta,
        adapt_engaged, adapt_iterations, eval_elbo, output_samples, interrupt,
        logger, init_writers[0], sample_writers[0], diagnostic_writers[0]);
  }
}
stan::math::profile_map &profile_data = get_stan_profile_data();
```
The interaction of the multi-chain stuff is probably not great with the profiling, right? At the moment you get a single profile.csv where multiple thread ids are listed... or is this exactly what one wants? Maybe @rok-cesnovar has some thoughts? I'd guess we can go with this as it is, presumably.
I think a single profile CSV file is what I would want to see. It's fewer files to deal with and all the info is in one place.
As I said in previous discussions, profiling is something one would typically run with a single chain, as there is no added information gained when running multiple chains (at least that I can think of). If anything, it just adds more noise to the per-gradient numbers.
So I would keep this simple.
This is failing because of a recent change in stanc3. I opened up a revert for that so once that goes in we can rerun the tests and all should be good.
@wds15 ready for another look!
This error msg is a bit confusing (not sure if it can be improved easily):
Valid values: All is obviously not true... |
And I think here is a bug:
gives these files:
The "les" included in the diagnostic filename seems wrong? EDIT: We want a test for this being correct, I guess? |
I fixed up the error message so now we have
and fixed up the error message for
When I do
The error msg has an extra space, "STAN_NUM_THREADS= 4" and "num_threads= 1" ... which is just a cosmetic issue...
LGTM
Hooray!
Submission Checklist
./runCmdStanTests.py src/test
Summary:
This is a prototype that can utilize a chains and n_chains argument to cmdstan so that a single Stan program can run multiple chains in parallel like the below

The Good
It works and gives back correct results (comparing the results from the multi-chain program to single-chain programs)! This also allows the data in the model to be shared across chains, so it should be nice on memory consumption.
This also seems to be a bit faster, which is nice (see the dawid test here).

The Bad
I'm not totally sure how to implement this at the stan level. Should we have _par() methods for all the services?

With multiple chains do we also need to think about parallel PRNGs?
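On the parallel PRNG question, one simple illustrative option is to derive a distinct seed per chain from (base seed, chain id). This is just a sketch of the idea, not what Stan's services actually do for per-chain RNGs.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Derive one RNG per chain by mixing the base seed with the chain id via
// seed_seq. Illustrative only; Stan's services construct per-chain RNGs
// differently.
std::vector<std::mt19937_64> make_chain_rngs(std::uint32_t seed,
                                             unsigned int num_chains) {
  std::vector<std::mt19937_64> rngs;
  for (unsigned int chain = 0; chain < num_chains; ++chain) {
    std::seed_seq sseq{seed, static_cast<std::uint32_t>(chain)};
    rngs.emplace_back(sseq);  // each chain gets its own stream
  }
  return rngs;
}
```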
My main Q is whether there's enough clever stuff we can do once we have multiple chains in one Stan program to make up for any performance loss. A couple ideas are
The Ugly
This is a rough prototype. While the results are correct, there's probably some bad stuff happening with the RNG. I also wonder about writing to disk across multiple threads; it feels like that could thrash the I/O.
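On the disk-thrashing worry, a generic sketch of one mitigation: buffer each chain's output in memory and serialize only the final writes behind a mutex. This is not how the cmdstan writers are actually structured, just an illustration of the pattern.

```cpp
#include <fstream>
#include <mutex>
#include <sstream>
#include <string>

// Each chain formats its draws into a thread-private buffer; only the final
// flush to disk is serialized, so threads never interleave writes.
class buffered_writer {
 public:
  explicit buffered_writer(std::string path) : path_(std::move(path)) {}

  void write_line(const std::string& line) {
    buffer_ << line << '\n';  // thread-private, no locking needed
  }

  void flush() {
    static std::mutex io_mutex;  // shared across all writers
    std::lock_guard<std::mutex> lock(io_mutex);
    std::ofstream out(path_, std::ios::app);
    out << buffer_.str();
    buffer_.str("");
  }

 private:
  std::string path_;
  std::ostringstream buffer_;
};
```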
How to Verify:
You can run the examples I've put in the examples folder like

Or use perf to check it out like
dawid seems to be making a ton of calls to malloc ints. I still need to dive deeper into the perf to figure out where / why that's happening.
one_comp is spending most of its time in boost::math::lanczos::lanczos_sum, which is called from lgamma.

Documentation:
Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Steve Bronder
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: