
Running multiple chains in one Stan program #987

Merged: 45 commits, Aug 25, 2021

Conversation

@SteveBronder (Contributor) commented Mar 6, 2021

Submission Checklist

  • Run tests: ./runCmdStanTests.py src/test
  • Declare copyright holder and open-source license: see below

Summary:

This is a prototype that adds a chains argument with an n_chains subargument to cmdstan so that a single Stan program can run multiple chains in parallel, like the example below:

one_comp sample data file=one_comp.data.R chains n_chains=8

The Good

It works and gives back correct results, comparing the results from the multi-chain program to single-chain programs! This also allows the data in the model to be shared across chains, so it should be nice on memory consumption.

This also seems to be a bit faster, which is nice (see the dawid test here).

The Bad

I'm not totally sure how to implement this at the stan level. Should we have _par() methods for all the services?

With multiple chains do we also need to think about parallel PRNGs?
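For context, a common answer here (and roughly what stan::services::util::create_rng in the stan repo already does, though treat the exact stride and types below as assumptions rather than this PR's code) is to construct every chain's RNG from the same seed and then fast-forward each one by a large chain-specific stride so the per-chain streams don't overlap:

#include <boost/random/additive_combine.hpp>
#include <cstdint>

// Hypothetical per-chain RNG factory: same seed, disjoint substreams per chain.
boost::ecuyer1988 make_chain_rng(unsigned int seed, unsigned int chain) {
  boost::ecuyer1988 rng(seed);
  constexpr std::uintmax_t kStride = std::uintmax_t(1) << 50;  // assumed jump size
  rng.discard(kStride * chain);  // advance to this chain's substream
  return rng;
}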

My main Q is whether there's enough clever stuff we can do once we have multiple chains in one Stan program to make up for any performance loss. A couple ideas are

  1. @wds15 's parallel idea for tree building
  2. @bbbales2 and @yizhang-yiz Cross chain warmup
  3. The new variational inference algorithms Akash/Aki/JHuggins/Hyunji/Shin and crew are working on (I think this is also a multi-chain thing)
  4. Maybe useful for the adaptation algorithms Lu is working on?
  5. Allow OpenCL to share data on the GPU across multiple chains (and generally reduce memory consumption)

The Ugly

This is a rough prototype. While the results are correct, there are probably some bad things happening with the RNG. I also wonder about writing to disk across multiple threads; it feels like that could thrash the I/O.

How to Verify:

You can run the examples I've put in the examples folder like

# make sure STAN_THREADS=true is defined in make/local
# and 
# TBB_LIBRARIES=tbb tbbmalloc tbbmalloc_proxy
# for tbb_malloc 
export STAN_NUM_THREADS=8

make examples/dawid/dawid
examples/dawid/dawid sample num_samples=150 num_warmup=150 data file=examples/dawid/caries_dump.R chains n_chains=8

make examples/one_comp/one_comp
examples/one_comp/one_comp sample num_samples=150 num_warmup=150 data file=examples/one_comp/one_comp.data.R chains n_chains=8

Or use perf to check it out like

export STAN_NUM_THREADS=8
# If you get a warning about lost chunks turn down freq
perf record -s --per-thread --freq=2800 -g -o test_perf.data examples/dawid/dawid sample num_samples=150 num_warmup=150 data file=examples/dawid/caries_dump.R chains n_chains=8
# -g prints caller graph, -T by thread
perf report -g -T -i ./test_perf.data

perf record -s --per-thread --freq=2800 -g -o test_perf2.data examples/one_comp/one_comp sample num_samples=150 num_warmup=150 data file=examples/one_comp/one_comp.data.R chains n_chains=8

perf report -g -T -i ./test_perf2.data

dawid seems to be making a ton of calls to malloc for ints. I still need to dive deeper into the perf data to figure out where / why that's happening.

one_comp is spending most of its time in boost::math::lanczos::lanczos_sum, which is called from lgamma.

Documentation:

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Steve Bronder

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@SteveBronder (Contributor Author)

Also running heap analysis now with

STAN_NUM_THREADS=8 heaptrack examples/dawid/dawid sample num_samples=150 num_warmup=150 data file=examples/dawid/caries_dump.R chains n_chains=8

and will report back once I have results from both non-parallel and parallel

@yizhang-yiz

My main Q is whether there's enough clever stuff we can do once we have multiple chains in one Stan program to make up for any performance loss.

Working on cross-chain warmup tells me that as soon as chains start talking to each other there's a lot of information to utilize. Another thing I've come to believe is that massively parallel chains, in the sense of many (far more than the standard 4) communicating chains plus within-chain parallelization, play a key role in bringing the large-scale problems I'm interested in reasonably within reach.

@SteveBronder (Contributor Author)

Yeah, that's what I was kind of thinking with the part here where we can pretty easily take N warmup samples, do stuff, then run the chains again in an outer for loop around the parallel_for(). Once we think we've hit some level of convergence we could even spawn a bunch of new chains based on those points.

There's def a lot of neat things possible with this pattern, but on its own it's a pretty heavy cost. If we can justify it with some fancy new parallel chain algorithm then I'm 100% for making this look pretty and staring into the deep abyss of perf records to figure out where the bottlenecks are.
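A minimal sketch of that outer-loop pattern (all type and helper names here are hypothetical stand-ins, not this PR's code):

#include <tbb/parallel_for.h>
#include <cstddef>
#include <vector>

struct ChainState {            // stand-in for per-chain sampler state
  void warmup_block() {}       // run one block of warmup iterations
};
bool converged(const std::vector<ChainState>&) { return false; }  // e.g. a cross-chain Rhat/ESS check
void share_adaptation(std::vector<ChainState>&) {}                // pool metric/step-size info

void adaptive_warmup(std::size_t n_chains, std::size_t n_rounds) {
  std::vector<ChainState> chains(n_chains);
  for (std::size_t round = 0; round < n_rounds; ++round) {
    // One parallel sweep: every chain advances by one warmup block.
    tbb::parallel_for(std::size_t{0}, n_chains,
                      [&](std::size_t k) { chains[k].warmup_block(); });
    // Between sweeps the chains can talk: check convergence, share
    // adaptation info, or spawn fresh chains from good points.
    if (converged(chains)) break;
    share_adaptation(chains);
  }
}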

@yizhang-yiz

yizhang-yiz commented Mar 6, 2021

There's def a lot of neat things possible with this pattern, but on its own it's a pretty heavy cost.

Agreed. Other than giving us a chance to explore, which is important in its own right, from a performance perspective maybe we need to think about the best bang for the buck for a limited number of cores when choosing between parallelism across chains and parallelism within chains: if a 4-core within-chain parallel run gives a speedup of 2.0, using those 4 cores for 4 parallel chains had better do some real magic on the warmup/sampling.

@SteveBronder (Contributor Author)

Yes, 100%. Would you and/or @bbbales2 have time to use this as a base for the auto warmup? In my brain I think that's the minimum for this to be reasonable. Though we could also take it a step further and have the user call something like

mymodel sample num_warmup=auto num_samples=500 data file='blah.json' chains n_chains=8 max_chains=16

Where after the auto warmup we spin off (max_chains - n_chains) additional chains. I think if we can spin those up nicely + do the auto warmup, we can get the ESS up to a point where this becomes a very good idea. Going further, we could probably even have something like

mymodel sample num_warmup=auto num_samples_ess=300 data file='blah.json' chains n_chains=8 max_chains=16

Where we run sampling until the minimum ESS across parameters is more than num_samples_ess.
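Sketched out, num_samples_ess is just a stopping rule wrapped around the parallel sampling sweep (everything named here is hypothetical, with stubs standing in for real samplers and ESS computation):

#include <tbb/parallel_for.h>
#include <algorithm>
#include <cstddef>
#include <vector>

struct Chain { void sample_block(std::size_t) {} };  // stand-in for a sampler
// Stand-in: would compute per-parameter ESS pooled across all chains.
std::vector<double> cross_chain_ess(const std::vector<Chain>&) { return {0.0}; }

void sample_until_ess(std::vector<Chain>& chains, double num_samples_ess) {
  const std::size_t block = 100;  // draws per sweep
  for (;;) {
    tbb::parallel_for(std::size_t{0}, chains.size(),
                      [&](std::size_t k) { chains[k].sample_block(block); });
    const std::vector<double> ess = cross_chain_ess(chains);
    if (*std::min_element(ess.begin(), ess.end()) >= num_samples_ess)
      break;  // every parameter's ESS has reached the target
  }
}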

@SteveBronder (Contributor Author)

Also tagging @LuZhangstat as this may be of interest to her for the adaptation stuff

@bbbales2 (Member)

bbbales2 commented Mar 6, 2021

have time to use this as a base for the auto warmup?

Short term no, long term yes. Once we're closer to the other side of some subset of ocl/varmat/tuples/closures/adjoint/laplace. (Edit: and I guess the go time is a function of which of the things on the wishlist above are ready to go -- there's a lot of leverage here to motivate this change, I think.)

It's definitely very interesting and an encouragingly small amount of code. (Edit: And feeds into a lot of different things)

Where we run sampling until the minimum ESS

That's another idea to add to the list.

@yizhang-yiz

Yes 100%, would you and/or @bbbales2 have time to use this as a base for the auto warmup?

@SteveBronder I'm booked until May; after that I would love to dig into this, hopefully sooner.

@LuZhangstat

LuZhangstat commented Mar 8, 2021

Maybe useful for the adaptation algorithms Lu is working on?

Looks great! Thanks for sharing. I think the implementation of Pathfinder needs GPU parallel computing and clever memory management. Implementing Pathfinder with low-level parallel programming is low priority on my to-do list right now, but I plan to work on it in the near future. By the way, do you happen to know any good tutorials for GPU programming? Many thanks!

@SteveBronder (Contributor Author)

Looks great! Thanks for sharing. I think the implementation of Pathfinder needs GPU parallel computing and clever memory management. Implementing Pathfinder with low-level parallel programming is low priority on my to-do list right now, but I plan to work on it in the near future. By the way, do you happen to know any good tutorials for GPU programming? Many thanks!

Word! If you want GPU stuff, I think you can use Stan math's OpenCL backend, which has a pretty high-level API for most things.

What kinds of operations are you looking to do in particular? If your matrix sizes are not very large, it may be better to do something with the TBB so you don't pay the transfer cost of moving from the host to the GPU.

http://mc-stan.org/math/d5/de5/group__opencl.html

@rok-cesnovar we should go through and tag all the GPU functions so they show up there. We should probably do the same thing for the new matrix type; then we'd just have an automatic list.

For books OpenCL Programming By Example is pretty good. The OpenCL Guide is pretty tried and true. I'll try to look around and find what book I used for my GPU course.

@LuZhangstat

@SteveBronder Thank you so much for sharing the GPU stuff! I will read them later.

What kinds of operations are you looking to do in particular? If your matrix sizes are not very large it may be better to do something with TBB so you don't pay the transfer cost of moving from the host->GPU

I will need matrix decomposition like thin QR decomposition and Cholesky decomposition. The matrix size depends on the number of parameters so it could be large.

@SteveBronder (Contributor Author)

icic. We have the Cholesky implemented for GPUs in the math library, but no thin QR.
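For a sense of what that looks like from the Math side, here is a rough sketch of a GPU Cholesky call through the OpenCL backend. This assumes the matrix_cl / to_matrix_cl / from_matrix_cl / cholesky_decompose API and a build with STAN_OPENCL defined; check the docs linked above for the current headers and signatures:

#include <stan/math.hpp>   // assumes STAN_OPENCL is defined at compile time
#include <Eigen/Dense>

Eigen::MatrixXd gpu_cholesky(const Eigen::MatrixXd& spd) {
  stan::math::matrix_cl<double> spd_cl = stan::math::to_matrix_cl(spd);  // host -> device copy
  auto L_cl = stan::math::cholesky_decompose(spd_cl);                    // factorization on the GPU
  return stan::math::from_matrix_cl(L_cl);                               // device -> host copy
}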

@LuZhangstat

We have the Cholesky implemented for GPUs in the math library, but no thin QR

I see, thanks a lot!

@SteveBronder (Contributor Author)

So, after fixing #842 in stanc3 I went back and tried these out again

# This PR
time -p examples/dawid/dawid sample data file=examples/dawid/caries_dump.R chains n_chains=8
# real 200.60
# user 1452.17
# sys 2.09

# cmdstan develop with gnu parallel
time -p seq -w 0 8 | parallel "examples/dawid/dawid sample data file=examples/dawid/caries_dump.R output file=outputs{}.csv"
# real 246.03
# user 1824.17
# sys 1.52

LoL, so that's cool. This went from being "a good idea with buts" to just a good idea! I guess it makes sense that current_statement__ constantly changing in the global scope was causing false sharing to happen. This will probably go even faster if I can do #736 in stanc3, since that will heavily cut down the program's overall memory footprint.

Anyway, this seems nice but I have no idea how to proceed. My main Qs are around how this should be implemented in the stan repo. Like, should there be a _par() equivalent for each of the functions in services? I'm not against writing a design doc for this but would like to have some sort of discussion beforehand.

tagging parties that might be interested @yizhang-yiz @bbbales2 @wds15

@bbbales2 (Member)

current_statement__ constantly changing in the global scope

Wowowow that is a big boost. So now the multithreading is just flat faster than the multiple chains (by a big margin)?

Anyway, this seems nice but I have no idea how to proceed.

If multithreading is just faster then there is no need to tap into the algorithms above. Otherwise, for justification cross-chain warmup is probably the most ready of those.

As far as coding challenges, I think you're right, it'd be a matter of working out how to support the interfaces. Seems like figuring out what potential cmdstan/cmdstanr/cmdstanpy support looks like would be the easiest, and then we could work backward from there to what would need to change in services.

If we went the cross-chain warmup route, then we'd need a way to compute cross-chain Rhat and ESS (this code would get used by the variational stuff too).

Yi said he'd be busy until May, but it sounds like he'd like to work on it. I like team projects, so this seems fun to me and I'm happy to wait. I think it would be easier to develop this once varmat is in the bag too, since this would be a big thing.
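For reference, the classic (non-split) cross-chain Rhat that such a utility would start from is small; Stan's actual implementation is fancier (rank-normalized split-Rhat), so treat this purely as an illustration:

#include <cmath>
#include <vector>

// Gelman-Rubin Rhat for one parameter; draws[chain][iteration].
double basic_rhat(const std::vector<std::vector<double>>& draws) {
  const double m = draws.size(), n = draws[0].size();
  std::vector<double> means;
  double grand_mean = 0, W = 0, B = 0;
  for (const auto& chain : draws) {
    double mu = 0;
    for (double d : chain) mu += d;
    mu /= n;
    means.push_back(mu);
    grand_mean += mu / m;
    double s2 = 0;  // within-chain sum of squared deviations
    for (double d : chain) s2 += (d - mu) * (d - mu);
    W += s2 / (n - 1) / m;  // average within-chain variance
  }
  for (double mu : means)
    B += n * (mu - grand_mean) * (mu - grand_mean) / (m - 1);  // between-chain variance
  const double var_hat = (n - 1) / n * W + B / n;  // pooled variance estimate
  return std::sqrt(var_hat / W);
}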

@wds15 (Contributor)

wds15 commented Mar 20, 2021

This is really awesome work! Given that this already shows benefits, maybe there is a way to get a first version in quickly? For example, adding an argument int n_chains = 1 to the current functions so that the default value keeps things working as they do now?

Now, there is one thing I am worried about, and that is the nested parallelism pattern we will end up with when using this with programs based on reduce_sum or map_rect. The issue is that the TBB runs things unsequenced. Have a look here:

https://www.threadingbuildingblocks.org/docs/help/tbb_userguide/work_isolation.html

I do think that we are affected by this and need to be careful, as we do use a thread-local resource (the AD stack). From my understanding, we need to ensure that anything within stan-math which fires off sub-tasks (like a reduce_sum call) must use work isolation. My suggestion is to wrap reduce calls (and anything else in stan-math which runs parallel things) in an isolated task group as demonstrated in the link above.
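A minimal sketch of that pattern, using a toy parallel_reduce in place of the real reduce_sum internals (illustration only):

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <tbb/task_arena.h>
#include <cstddef>
#include <functional>
#include <vector>

double isolated_sum(const std::vector<double>& x) {
  double result = 0;
  // isolate() keeps a thread that blocks inside this region from stealing
  // unrelated outer tasks, which is what protects thread-local state like
  // the AD stack from being mutated mid-computation.
  tbb::this_task_arena::isolate([&] {
    result = tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, x.size()), 0.0,
        [&](const tbb::blocked_range<std::size_t>& r, double acc) {
          for (std::size_t i = r.begin(); i < r.end(); ++i)
            acc += x[i];
          return acc;
        },
        std::plus<double>());
  });
  return result;
}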

Makes sense?

EDIT: An additional point for headaches is MPI... maybe that should just not be allowed with multiple chains. The current MPI design would throw up on this (it would continue working, but only a single chain would ever get to use the workers at any given time).

@SteveBronder (Contributor Author)

However, in some cases such unsequenced execution may result in errors. For example, a thread-local variable might unexpectedly change its value after a nested parallel construct:

Ooof, yeah, if one thread can start writing to another thread's stack allocator that could be very bad. But I'm also confused about how that work sharing interacts with thread-local storage objects. Will the thread stealing work use the original thread's thread-local object, or will it just not care and use its own? If the latter then yeah, we def need to program against this.

Can we put the tbb::enumerable_thread_specific<int> ets; inside of the ChainableStack, or where would that go? Looking at this, it feels like we could just do tbb::this_task_arena::isolate inside of reduce_sum when we call the parallel reducer.

EDIT: An additional point for headaches is MPI... maybe that should just be not allowed with multiple chains. The current MPI design would throw up on this (it would continue working, but only a single chain would ever get to use the workers at any given time).

I've never looked at the MPI stuff but I'll take your word for it. Though how do we program against that so users don't goof? Would it literally just crash, or would it just work poorly? I wonder how we can check, before we run the chains, that the user is not utilizing MPI.

@wds15 (Contributor)

wds15 commented Mar 20, 2021

The TBB handles tasks and abstracts away threads entirely. The only guarantee you have about threads is that they are assigned to task arenas. No thread can be part of two task arenas at the same time, but threads can be moved between arenas. One solution is to run different chains in different arenas... but I think that will not take advantage of resource re-allocation as smoothly (early-finishing chains giving their resources to still-running ones).

So this is why we need to isolate any nested parallelism as outlined in the link above. This will give us the right guarantees for the thread-local resources. The right action is to do so in reduce_sum and map_rect. At least this is how I understand it.

@SteveBronder (Contributor Author)

@bbbales2 latest changes should have everything in here. I think since I sorted all this out I'm going to add the dense and other samplers to the Stan PR

@stan-buildbot (Contributor)


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
| --- | --- | --- | --- | --- |
| gp_pois_regr/gp_pois_regr.stan | 3.0 | 3.64 | 0.83 | -21.06% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 1.15 | 13.35% faster |
| eight_schools/eight_schools.stan | 0.1 | 0.09 | 1.12 | 11.04% faster |
| gp_regr/gp_regr.stan | 0.16 | 0.14 | 1.11 | 9.68% faster |
| irt_2pl/irt_2pl.stan | 5.85 | 5.15 | 1.13 | 11.86% faster |
| performance.compilation | 89.57 | 89.52 | 1.0 | 0.05% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.59 | 8.24 | 1.04 | 4.01% faster |
| pkpd/one_comp_mm_elim_abs.stan | 30.59 | 29.97 | 1.02 | 2.01% faster |
| sir/sir.stan | 131.63 | 127.94 | 1.03 | 2.8% faster |
| gp_regr/gen_gp_data.stan | 0.03 | 0.03 | 1.08 | 7.19% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.99 | 3.63 | 0.82 | -21.64% slower |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.38 | 0.42 | 0.91 | -9.6% slower |
| arK/arK.stan | 1.9 | 2.04 | 0.93 | -7.02% slower |
| arma/arma.stan | 0.83 | 0.23 | 3.55 | 71.84% faster |
| garch/garch.stan | 0.53 | 0.6 | 0.88 | -14.02% slower |

Mean result: 1.17415320059

Jenkins Console Log
Blue Ocean
Commit hash: c03addb


Machine information:
ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@SteveBronder (Contributor Author)

@wds15 we should be good to go! Two little things here

  1. I had to modify the makefiles because, when I turned threading on, cmdstan was having a hard time compiling stansummary, diagnose, etc.
  2. I added a little test that just runs a simple test model with 2 chains. I think setting this up to exercise threading would be a bit much for one test, and we do that testing at the stan repo level, so I think just having num_chains go off here is reasonable.

I think this is ready for review!

@stan-buildbot (Contributor)


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
| --- | --- | --- | --- | --- |
| gp_pois_regr/gp_pois_regr.stan | 3.08 | 3.48 | 0.88 | -13.27% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 1.13 | 11.49% faster |
| eight_schools/eight_schools.stan | 0.11 | 0.09 | 1.15 | 13.08% faster |
| gp_regr/gp_regr.stan | 0.16 | 0.14 | 1.13 | 11.86% faster |
| irt_2pl/irt_2pl.stan | 5.82 | 5.28 | 1.1 | 9.27% faster |
| performance.compilation | 89.65 | 89.43 | 1.0 | 0.24% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.66 | 8.16 | 1.06 | 5.8% faster |
| pkpd/one_comp_mm_elim_abs.stan | 29.55 | 29.6 | 1.0 | -0.16% slower |
| sir/sir.stan | 129.65 | 125.17 | 1.04 | 3.46% faster |
| gp_regr/gen_gp_data.stan | 0.03 | 0.03 | 1.06 | 5.97% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.0 | 3.01 | 0.99 | -0.65% slower |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.39 | 0.38 | 1.04 | 3.67% faster |
| arK/arK.stan | 1.88 | 2.05 | 0.92 | -8.8% slower |
| arma/arma.stan | 0.84 | 0.23 | 3.58 | 72.06% faster |
| garch/garch.stan | 0.53 | 0.6 | 0.89 | -12.37% slower |

Mean result: 1.1987580311

Jenkins Console Log
Blue Ocean
Commit hash: b4b9953



@wds15 (Contributor) left a comment

If I run

./bernoulli-2 sample num_chains=4 id=100 output refresh=1000 data file=bernoulli.data.R num_threads=4

I'd expect to get output_100.csv, output_101.csv and so on...but I do get output_1.csv, output_2.csv and so on... I think we want the chain id to be the number, not 1...num_chains.

+ " sample num_warmup=200 num_samples=1 num_chains=2 random seed=1234"
+ " output file=" + cmdstan::test::convert_model_path(model_path)
+ ".csv";


This test implicitly tests that the default init chain id is 1 (as we test that the filename ends with _1.csv)...do we want that? Or is it better to specify chain_id=1 as an argument to cmdstan? I don't know the answer, but some thought and doc would be nice here.

@SteveBronder (Contributor Author) Aug 16, 2021

Yes, this was wrong before, good catch! I set this to id=10 in the test so that we then look for files ending in 10.csv and 11.csv. I think I should add docs to the output file argument so that folks know that for id=N and num_chains=j the output will be output_{N}.csv, output_{N + 1}.csv, ..., output_{N + j - 1}.csv.
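Purely to illustrate that naming rule (hypothetical helper, not the PR's code): for a base name, id=N, and num_chains=j, the suffixes are N, N+1, ..., N+j-1.

#include <string>
#include <vector>

std::vector<std::string> chain_output_names(const std::string& base,
                                            unsigned int id,
                                            unsigned int num_chains) {
  std::vector<std::string> names;
  for (unsigned int k = 0; k < num_chains; ++k)
    names.push_back(base + "_" + std::to_string(id + k) + ".csv");
  return names;  // base="output", id=10, num_chains=2 -> output_10.csv, output_11.csv
}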

model, *(init_contexts[0]), random_seed, id, init_radius,
grad_samples, elbo_samples, max_iterations, tol_rel_obj, eta,
adapt_engaged, adapt_iterations, eval_elbo, output_samples, interrupt,
logger, init_writers[0], sample_writers[0], diagnostic_writers[0]);
}
}
stan::math::profile_map &profile_data = get_stan_profile_data();

The interaction of the multi-chain stuff with the profiling is probably not great, right? At the moment you get a single profile.csv where multiple thread ids are listed... or is this exactly what one wants? Maybe @rok-cesnovar has some thoughts? I'd guess we can presumably go with this as it is.

@rok-cesnovar (Member) Aug 16, 2021

I think a single profile CSV file is what I would want to see. It's fewer files to deal with, and all the info is in one place.

As I said in previous discussions, profiling is something one would typically run with a single chain, as there is no added information gained when running multiple chains (at least none that I can think of). If anything, it just adds more noise to the per-gradient numbers.

So I would keep this simple.

@stan-buildbot (Contributor)


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
| --- | --- | --- | --- | --- |
| gp_pois_regr/gp_pois_regr.stan | 2.89 | 3.51 | 0.82 | -21.58% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 1.11 | 10.04% faster |
| eight_schools/eight_schools.stan | 0.11 | 0.09 | 1.14 | 12.4% faster |
| gp_regr/gp_regr.stan | 0.16 | 0.14 | 1.13 | 11.81% faster |
| irt_2pl/irt_2pl.stan | 5.89 | 5.13 | 1.15 | 12.82% faster |
| performance.compilation | 89.08 | 88.79 | 1.0 | 0.32% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.71 | 8.22 | 1.06 | 5.65% faster |
| pkpd/one_comp_mm_elim_abs.stan | 29.18 | 32.15 | 0.91 | -10.16% slower |
| sir/sir.stan | 125.22 | 121.41 | 1.03 | 3.04% faster |
| gp_regr/gen_gp_data.stan | 0.04 | 0.03 | 1.09 | 8.25% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.03 | 3.09 | 0.98 | -1.97% slower |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.4 | 0.38 | 1.07 | 6.93% faster |
| arK/arK.stan | 2.51 | 2.03 | 1.24 | 19.05% faster |
| arma/arma.stan | 0.76 | 0.25 | 3.06 | 67.29% faster |
| garch/garch.stan | 0.66 | 0.69 | 0.97 | -3.47% slower |

Mean result: 1.18420764211

Jenkins Console Log
Blue Ocean
Commit hash: e80e5eb



@SteveBronder (Contributor Author)

This is failing because of a recent change in stanc3. I opened up a revert for that so once that goes in we can rerun the tests and all should be good

stan-dev/stanc3#945

@stan-buildbot (Contributor)


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
| --- | --- | --- | --- | --- |
| gp_pois_regr/gp_pois_regr.stan | 2.84 | 3.44 | 0.82 | -21.28% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 1.11 | 9.82% faster |
| eight_schools/eight_schools.stan | 0.1 | 0.09 | 1.14 | 12.15% faster |
| gp_regr/gp_regr.stan | 0.15 | 0.14 | 1.11 | 9.52% faster |
| irt_2pl/irt_2pl.stan | 5.9 | 5.13 | 1.15 | 13.05% faster |
| performance.compilation | 88.75 | 88.6 | 1.0 | 0.16% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.57 | 8.24 | 1.04 | 3.83% faster |
| pkpd/one_comp_mm_elim_abs.stan | 31.14 | 32.34 | 0.96 | -3.85% slower |
| sir/sir.stan | 142.09 | 129.84 | 1.09 | 8.62% faster |
| gp_regr/gen_gp_data.stan | 0.03 | 0.03 | 1.06 | 5.5% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.0 | 3.05 | 0.98 | -1.55% slower |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.39 | 0.37 | 1.05 | 5.08% faster |
| arK/arK.stan | 2.52 | 2.02 | 1.25 | 19.85% faster |
| arma/arma.stan | 0.76 | 0.25 | 3.08 | 67.5% faster |
| garch/garch.stan | 0.67 | 0.68 | 0.98 | -2.22% slower |

Mean result: 1.18834145433

Jenkins Console Log
Blue Ocean
Commit hash: e80e5eb



@SteveBronder (Contributor Author)

@wds15 ready for another look!

@wds15 (Contributor)

wds15 commented Aug 23, 2021

This error msg is a bit confusing (not sure if it can be improved easily):

[20:55:19][sebi@sebastians-macbook-pro:~/work/cmdstan/examples/bernoulli]$ ./bernoulli sample num_chains=4 id=10 data file=bernoulli.data.R output file=samples.csv num_threads=0
0 is not a valid value for "num_threads"
  Valid values: All
Failed to parse command arguments, cannot run model.

Valid values: All is obviously not true...

@wds15 (Contributor)

wds15 commented Aug 23, 2021

And I think here is a bug:

./bernoulli sample num_chains=2 id=200 data file=bernoulli.data.R output file=samples.csv diagnostic_file=diag.csv num_threads=1

gives these files:

-rw-r--r--   1 sebi  staff    67483 23 Aug 21:03 diag_201les.csv
-rw-r--r--   1 sebi  staff    67531 23 Aug 21:03 diag_200les.csv
-rw-r--r--   1 sebi  staff    49336 23 Aug 21:03 samples_201.csv
-rw-r--r--   1 sebi  staff    49435 23 Aug 21:03 samples_200.csv

The "les" included in the diagnostic filename seems wrong?

EDIT: We want a test for this being correct, I guess?

@stan-buildbot (Contributor)


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
| --- | --- | --- | --- | --- |
| gp_pois_regr/gp_pois_regr.stan | 2.92 | 3.44 | 0.85 | -17.6% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 1.05 | 5.08% faster |
| eight_schools/eight_schools.stan | 0.11 | 0.09 | 1.15 | 13.3% faster |
| gp_regr/gp_regr.stan | 0.16 | 0.14 | 1.14 | 11.97% faster |
| irt_2pl/irt_2pl.stan | 5.82 | 5.05 | 1.15 | 13.24% faster |
| performance.compilation | 89.14 | 88.85 | 1.0 | 0.33% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.63 | 8.09 | 1.07 | 6.26% faster |
| pkpd/one_comp_mm_elim_abs.stan | 30.3 | 31.03 | 0.98 | -2.4% slower |
| sir/sir.stan | 125.37 | 136.13 | 0.92 | -8.58% slower |
| gp_regr/gen_gp_data.stan | 0.04 | 0.03 | 1.1 | 9.43% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.01 | 2.98 | 1.01 | 1.04% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.39 | 0.38 | 1.02 | 2.23% faster |
| arK/arK.stan | 1.88 | 2.07 | 0.91 | -9.99% slower |
| arma/arma.stan | 0.82 | 0.27 | 3.0 | 66.66% faster |
| garch/garch.stan | 0.7 | 0.69 | 1.02 | 1.9% faster |

Mean result: 1.15861289082

Jenkins Console Log
Blue Ocean
Commit hash: 422b26f



@SteveBronder (Contributor Author)

I fixed up the error message so now we have

0 is not a valid value for "num_threads"
  Valid values: num_threads > 0 || num_threads == -1

and fixed up the error message for num_chains as well to be more helpful. I also added a test to make sure the diagnostic file name is written correctly. Before, I was doing a goof and appending the end of the sample file name to the diagnostic file name.

@stan-buildbot (Contributor)


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
| --- | --- | --- | --- | --- |
| gp_pois_regr/gp_pois_regr.stan | 3.02 | 3.51 | 0.86 | -16.22% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 1.12 | 10.97% faster |
| eight_schools/eight_schools.stan | 0.1 | 0.09 | 1.12 | 10.33% faster |
| gp_regr/gp_regr.stan | 0.15 | 0.14 | 1.12 | 10.94% faster |
| irt_2pl/irt_2pl.stan | 5.83 | 5.13 | 1.14 | 12.01% faster |
| performance.compilation | 89.17 | 88.76 | 1.0 | 0.45% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.74 | 8.07 | 1.08 | 7.64% faster |
| pkpd/one_comp_mm_elim_abs.stan | 30.46 | 29.89 | 1.02 | 1.88% faster |
| sir/sir.stan | 123.38 | 120.93 | 1.02 | 1.98% faster |
| gp_regr/gen_gp_data.stan | 0.03 | 0.03 | 1.06 | 5.96% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.02 | 3.0 | 1.01 | 0.6% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.39 | 0.36 | 1.07 | 6.38% faster |
| arK/arK.stan | 1.89 | 2.8 | 0.67 | -48.67% slower |
| arma/arma.stan | 0.83 | 0.28 | 3.01 | 66.78% faster |
| garch/garch.stan | 0.71 | 0.68 | 1.04 | 3.49% faster |

Mean result: 1.15611155001

Jenkins Console Log
Blue Ocean
Commit hash: bc4bc94



@wds15 (Contributor)

wds15 commented Aug 25, 2021

When I do

[22:35:34][sebi@sebastians-macbook-pro:~/work/cmdstan/examples/bernoulli]$ STAN_NUM_THREADS=4 ./bernoulli sample num_chains=2 id=30 data file=bernoulli.data.R output file=samples.csv diagnostic_file=diag.csv num_threads=1
STAN_NUM_THREADS= 4 but argument num_threads= 1. Please either only set one or make sure they are equal.

The error msg has an extra space in "STAN_NUM_THREADS= 4" and "num_threads= 1", which is just a cosmetic issue...

@wds15 (Contributor) left a comment

LGTM

Hooray!
