Missing support for precompiled header #151

aseyboldt · 2023-07-05T18:35:44Z

If I compile models using cmdstan, it seems to consistently finish about twice as fast as if I compile a model with bridgestan:

// justatest.stan
parameters {
    real x;
}
model {
    x ~ normal(0, 1);
}

Then

%%time
compiled = bridgestan.compile_model("justatest.stan")
# Wall time: 15.1 s

%%time
model = cmdstanpy.CmdStanModel("foo", "justatest.stan")
# Wall time: 7.18 s

I would have expected pretty much no difference, or even a bit faster comiles for bridgestan, because it doesn't have to also compile the sampler and the command line tool?

Update This seems to be because cmdstan precompiles the header

The text was updated successfully, but these errors were encountered:

bob-carpenter · 2023-07-05T18:47:23Z

To test compilation time, I'd put copies of the model in different places to make sure CmdStan isn't just reusing the BridgeStan compiled model. Or try to clean out the compiled versions in between. But if that's the difference, I'd expect CmdStanModel to be a split second.

In CmdStan, we only have to translate the Stan program to a C++ class. The sampler and command-line tools are compiled in separate translation units and then linked. The linking does need to happen, though, so that takes a bit of time.

Do you know if BridgeStan and CmdStanPy compile at the same optimization level? I hope CmdStanPy isn't default to O = 0, as that would be very slow. @mitzimorris should know.

WardBrian · 2023-07-05T18:50:41Z

I wonder if it’s slower to compile position independent code? It’s also possible some of the precompiled header logic in CmdStan didn’t make it into our makefile, though that stuff is very brittle in my experience

aseyboldt · 2023-07-05T18:59:11Z

I don't think the difference is due to caching, just to make sure I gave it slightly different models, and that didn't change the timings. I'm also seeing the pretty consistently in by benchmarks with posteriordb.

In CmdStan, we only have to translate the Stan program to a C++ class. The sampler and command-line tools are compiled in separate translation units and then linked. The linking does need to happen, though, so that takes a bit of time.

Ah, that does make sense. Still doesn't explain the difference I think though...

Do you know if BridgeStan and CmdStanPy compile at the same optimization level? I hope CmdStanPy isn't default to O = 0, as that would be very slow. @mitzimorris should know.

I don't see a big difference in time for a gradient evaluation between bridgestan and cmdstan, so I don't think that's it either. A small difference might be lost in all the other things that differ, but I'm pretty sure something like O0 would be very noticeable.

I guess it could also be because we also compile the hessian? Or something in a makefile isn't right and it recompiles something it shouldn't...

bob-carpenter · 2023-07-05T19:05:29Z

Hessians will add overhead. But it shouldn't be that much because the Hessian code itself is fast and there's nothing else to compile in forward mode other than the functions being used in the code. But then that's also true for the reverse mode functions, so maybe it is just the Hessians. The only way I know how to test that would be modifying the code, but if it really is just the Hessians slowing things down, then we probably want to be able to conditionally compile w/o Hessians.

WardBrian · 2023-07-05T19:07:37Z

If you compile with autodiff hessians you’re also including all of stan/math/fwd, which definitely adds a lot of files for the preprocessor to handle

mitzimorris · 2023-07-05T19:20:10Z

CmdStanPy doesn't change the default optimization level - it just runs the CmdStan makefile.

aseyboldt · 2023-07-05T19:22:34Z

Seems the hessian code is behind a compile time flag right now, and that was off in the code above. To make sure I just deleted all references to the hessian in model_rng.cpp in the .bridgestan/.../src/ dir, and that didn't change the compile (after compiling one model once, the first time compiling any model was ~40s). Assuming I didn't miss anything, I think that should exclude the hessian as an explanation.

WardBrian · 2023-07-05T19:38:29Z

I can take a closer look tomorrow, with that ruled out my questions would be:
Are there any additional commands used in our compilation compared to CmdStan (I think not, but good to check)
Which commands invoked by make specifically are slower, assuming it isn’t just “all of them”

aseyboldt · 2023-07-05T22:50:25Z

I was just running into missing symbol issues with models involving sundials, and that made be check the variables in the makefile. Things there don't seem right to me, and that might be the source of this issue as well. It looks like some setup code never ended up being executed. For instance the sundials libs aren't precompiled (no sundials shared libs anywhere in the ~.bridgestan directory), and the linker flags for those are empty. I think the ode example in the test modules doesn't seem to cover sundials either, as that is just using a runge kutta solver.

The output of make in the .bridgestan/bridgestan*/ dir is for me:

  Current configuration:
  - OS (Operating System):       Linux
  - CXX (Compiler):              g++
  - CXX_TYPE                     gcc
  - Compiler version:            12.3
  - O (Optimization Level):      3
  Library configuration:
  - EIGEN                        ./stan/lib/stan_math/lib/eigen_3.4.0
  - BOOST                        ./stan/lib/stan_math/lib/boost_1.78.0
  - SUNDIALS                     ./stan/lib/stan_math/lib/sundials_6.1.1
  - TBB                          ./stan/lib/stan_math/lib/tbb_2020.3
  - GTEST                        ./stan/lib/stan_math/lib/benchmark_1.5.1/googletest/googletest
  - STAN_THREADS                
  - STAN_OPENCL                 
  - STAN_MPI                    
  Compiler flags (each can be overriden separately):
  - CXXFLAGS_LANG                -std=c++1y
  - CXXFLAGS_WARNINGS            -Wno-sign-compare -Wno-ignored-attributes
  - CXXFLAGS_OPTIM              
  - CXXFLAGS_FLTO               
  - CXXFLAGS_BOOST              
  - CXXFLAGS_EIGEN              
  - CXXFLAGS_OS                  -pthread -D_REENTRANT
  - CXXFLAGS_GTEST              
  - CXXFLAGS_THREADS            
  - CXXFLAGS_OPENCL             
  - CXXFLAGS_TBB                 -I ./stan/lib/stan_math/lib/tbb_2020.3/include
  - CXXFLAGS_OPTIM_TBB          
  - CXXFLAGS_MPI                
  - CFLAGS_SUNDIALS             
  - CXXFLAGS_SUNDIALS            -pipe
  - CXXFLAGS_OPTIM_SUNDIALS     
  LDLIBS:
  - LDLIBS_LANG                 
  - LDLIBS_WARNINGS             
  - LDLIBS_EIGEN                
  - LDLIBS_BOOST                
  - LDLIBS_SUNDIALS             
  - LDLIBS_OS                   
  - LDLIBS_GTEST                
  - LDLIBS_OPENCL               
  - LDLIBS_TBB                   -Wl,-L,/home/adr/.bridgestan/bridgestan-2.1.1/stan/lib/stan_math/lib/tbb -Wl,-rpath,/home/adr/.bridgestan/bridgestan-2.1.1/stan/lib/stan_math/lib/tbb
  - LDLIBS_MPI                  
  LDFLAGS:
  - LDFLAGS_LANG                
  - LDFLAGS_WARNINGS            
  - LDFLAGS_EIGEN               
  - LDFLAGS_BOOST               
  - LDFLAGS_SUNDIALS            
  - LDFLAGS_OS                  
  - LDFLAGS_GTEST               
  - LDFLAGS_OPENCL              
  - LDFLAGS_TBB                  -Wl,-L,/home/adr/.bridgestan/bridgestan-2.1.1/stan/lib/stan_math/lib/tbb -Wl,-rpath,/home/adr/.bridgestan/bridgestan-2.1.1/stan/lib/stan_math/lib/tbb
  - LDFLAGS_MPI

WardBrian · 2023-07-06T14:12:35Z

I'm not able to re-create this with a warmed-up (meaning I've built at least one model previously) cmdstan and bridgestan:

CmdStan output

$ /usr/bin/time make examples/bernoulli/bernoulli

--- Translating Stan model to C++ code ---
bin/stanc  --o=examples/bernoulli/bernoulli.hpp examples/bernoulli/bernoulli.stan

--- Compiling, linking C++ code ---
g++ -std=c++1y -pthread -D_REENTRANT -Wno-sign-compare -Wno-ignored-attributes      -I stan/lib/stan_math/lib/tbb_2020.3/include    -O3 -I src -I stan/src -I stan/lib/rapidjson_1.1.0/ -I lib/CLI11-1.9.1/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.4.0 -I stan/lib/stan_math/lib/boost_1.78.0 -I stan/lib/stan_math/lib/sundials_6.1.1/include -I stan/lib/stan_math/lib/sundials_6.1.1/src/sundials    -DBOOST_DISABLE_ASSERTS          -c -Wno-ignored-attributes   -x c++ -o examples/bernoulli/bernoulli.o examples/bernoulli/bernoulli.hpp
g++ -std=c++1y -pthread -D_REENTRANT -Wno-sign-compare -Wno-ignored-attributes      -I stan/lib/stan_math/lib/tbb_2020.3/include    -O3 -I src -I stan/src -I stan/lib/rapidjson_1.1.0/ -I lib/CLI11-1.9.1/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.4.0 -I stan/lib/stan_math/lib/boost_1.78.0 -I stan/lib/stan_math/lib/sundials_6.1.1/include -I stan/lib/stan_math/lib/sundials_6.1.1/src/sundials    -DBOOST_DISABLE_ASSERTS               -Wl,-L,"/home/brian/Dev/cpp/cmdstan/stan/lib/stan_math/lib/tbb" -Wl,-rpath,"/home/brian/Dev/cpp/cmdstan/stan/lib/stan_math/lib/tbb"        examples/bernoulli/bernoulli.o src/cmdstan/main.o       -Wl,-L,"/home/brian/Dev/cpp/cmdstan/stan/lib/stan_math/lib/tbb" -Wl,-rpath,"/home/brian/Dev/cpp/cmdstan/stan/lib/stan_math/lib/tbb"     stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_nvecserial.a stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_cvodes.a stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_idas.a stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_kinsol.a  stan/lib/stan_math/lib/tbb/libtbb.so.2 -o examples/bernoulli/bernoulli
rm -f examples/bernoulli/bernoulli.o
18.67user 0.68system 0:19.37elapsed 99%CPU (0avgtext+0avgdata 1322664maxresident)k
0inputs+7608outputs (0major+440066minor)pagefaults 0swaps

BridgeStan output

$ /usr/bin/time make test_models/bernoulli/bernoulli_model.so

--- Translating Stan model to C++ code ---
./bin/stanc  --o=test_models/bernoulli/bernoulli.hpp test_models/bernoulli/bernoulli.stan

--- Compiling C++ code ---
g++ -std=c++1y -pthread -D_REENTRANT -Wno-sign-compare -Wno-ignored-attributes      -I ./stan/lib/stan_math/lib/tbb_2020.3/include    -O3 -I ./stan/src -I ./stan/lib/rapidjson_1.1.0/ -I ./stan/lib/stan_math/ -I ./stan/lib/stan_math/lib/eigen_3.4.0 -I ./stan/lib/stan_math/lib/boost_1.78.0 -I ./stan/lib/stan_math/lib/sundials_6.1.1/include -I ./stan/lib/stan_math/lib/sundials_6.1.1/src/sundials -fPIC    -DBOOST_DISABLE_ASSERTS          -c -x c++ -o test_models/bernoulli/bernoulli.o test_models/bernoulli/bernoulli.hpp
--- Linking C++ code ---
g++ -std=c++1y -pthread -D_REENTRANT -Wno-sign-compare -Wno-ignored-attributes      -I ./stan/lib/stan_math/lib/tbb_2020.3/include    -O3 -I ./stan/src -I ./stan/lib/rapidjson_1.1.0/ -I ./stan/lib/stan_math/ -I ./stan/lib/stan_math/lib/eigen_3.4.0 -I ./stan/lib/stan_math/lib/boost_1.78.0 -I ./stan/lib/stan_math/lib/sundials_6.1.1/include -I ./stan/lib/stan_math/lib/sundials_6.1.1/src/sundials -fPIC    -DBOOST_DISABLE_ASSERTS               -Wl,-L,"/home/brian/Dev/cpp/bridgestan/stan/lib/stan_math/lib/tbb" -Wl,-rpath,"/home/brian/Dev/cpp/bridgestan/stan/lib/stan_math/lib/tbb"        -shared -lm -o  test_models/bernoulli/bernoulli_model.so test_models/bernoulli/bernoulli.o ./src/bridgestan.o       -Wl,-L,"/home/brian/Dev/cpp/bridgestan/stan/lib/stan_math/lib/tbb" -Wl,-rpath,"/home/brian/Dev/cpp/bridgestan/stan/lib/stan_math/lib/tbb"     ./stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_nvecserial.a ./stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_cvodes.a ./stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_idas.a ./stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_kinsol.a  ./stan/lib/stan_math/lib/tbb/libtbb.so.2
rm -f test_models/bernoulli/bernoulli.o
18.10user 0.76system 0:18.87elapsed 99%CPU (0avgtext+0avgdata 1339500maxresident)k
0inputs+5304outputs (0major+439114minor)pagefaults 0swaps

Both roughly 18s on my machine. If I set BRIDGESTAN_AD_HESSIAN=true the time doesn't change, which I actually should have expected since the new include doesn't happen in the model file itself but in the built-once bridgestan.cpp

aseyboldt · 2023-07-06T15:46:29Z

It seems the difference I'm observing is because there is a model_header.hpp.gch header cache file in the cmdstan stan directory, but not in the corresponding directory of bridgestan. If I move it, I also observed the same timings in both cases, and manually running the g++ commands shows the timing differences with and without the cache file pretty clearly. That file is ~800MB, so I guess if the compiler has to get that same information out of small individual files instead, I can see why that would take longer.

I'm not sure why the cache file was created in one case but not the other. I've been changing the g++ version quite a bit, maybe that had something to do with it (and if the cache file is not there I get deprecation warnings from boost with g++ 12 and 13).

Anyway, I don't think this really is a bridgestan issue after all, so I'm closing this

WardBrian · 2023-07-06T15:53:20Z

Yeah, that is the “precompiled header” file. I’m not sure why it would be one place but not the other, since the math library is the same in both

aseyboldt · 2023-07-06T16:28:23Z

Looking at it again, I think the code that creates the precompiled header is only in the cmdstan repo, not in the stan or stan_math repo, so it just isn't available in bridgestan. Given that it seems to give a pretty good speedup (at least for me, no idea why @WardBrian didn't see the same thing).

The code in the makefile is here. I guess it could be included in the bridegestan makefile as well? Or maybe the stan repo would be a better place for it, so that both projects can share the makefile rules?

WardBrian · 2023-07-06T16:39:01Z

I think I have the precompiled header disabled in my cmdstan installation.

It's a decent speedup, but if we do support it I would make the opposite choice of CmdStan and disable it by default. It's a pretty constant source of cmdstanpy bug reports due to things like xcode updating and the old file becoming invalid

aseyboldt · 2023-07-06T18:27:13Z

Ah, I see your point...

The file size of the precompiled header made me wonder what it could be that makes it so big, so I had a quick look at what headers get included when building the model. Wasn't all that that conclusive, and the prepocessor output is only(?) 18MB. I guess the most interesting part is that various parts of boost make up almost half of the 300_000 lines after the prepocessor runs.

Not sure it's useful to anyone, but just in case, here's the link: https://gist.github.com/aseyboldt/6841525395e37e8ba2fceeaa5021b547

WardBrian · 2023-07-06T18:40:49Z

I believe @rok-cesnovar knows more about pre-compiled headers/helped set them up for CmdStan. The doc for them always has lots of scary warnings, like

Some other command-line options starting with -f, -p, or -O must be defined in the same way as when the precompiled header was generated. At present, it’s not clear which options are safe to change and which are not

(emphasis mine)

Any macros defined before the precompiled header is included must either be defined in the same way as when the precompiled header was generated, or must not affect the precompiled header, which usually means that they don’t appear in the precompiled header at all.

This seems to me like it would be a huge problem with STAN_THREADS, which affects quite a bit of stuff in the Math library. However, it seems like it is not a problem in practice for CmdStan - for now, anyway

bob-carpenter · 2023-07-06T19:18:07Z

I'd like to see how much precompiled headers save in the way of time. They are a huge pain in the context of CmdStanPy because the PCH is constantly breaking and having to be regenerated, but the suggested fix (which takes a couple minutes to run) gets lost in a huge scrolling error message we've never been able to tamp down.

WardBrian · 2023-07-06T19:28:57Z

For me the precompiled header takes compiling a model from 17s to 7s

rok-cesnovar · 2023-07-06T19:57:59Z

The precompiled headers are huge because they include the entire math library plus all the header-only dependencies (boost, eigen, ...). The precomp. headers are much smaller with clang (see stan-dev/cmdstan#872), at least they were a few years back.

clang is particularly annoying with precompiled headers because it breaks for any minor or patch version bump. And its seems that macs just update clang automatically.

This seems to me like it would be a huge problem with STAN_THREADS

CmdStan handles this by having separate precompiled headers for threads. So when you compile the first model with threads it builds separate precompiled headers with _threads suffix.

For me the precompiled header takes compiling a model from 17s to 7s

Similar numbers for me. It has a huge effect. If it didn't I think we would have never made it the default in CmdStan.

WardBrian · 2023-07-06T20:02:18Z

Could we do something similar to the _threads trick and hack the compiler version number into the filename?

How does the compiler know to use model_header_threads.hpp.gch when the C++ file is still using #include <stan/model/model_header.hpp>?

rok-cesnovar · 2023-07-06T20:05:21Z

Could we do something similar to the _threads trick and hack the compiler version number into the filename?

hmmmm, that seems like a great idea at first sight!

How does the compiler know to use model_header_threads.hpp.gch when the C++ file is still using #include <stan/model/model_header.hpp>?

This makefile trick: stan-dev/cmdstan#882

aseyboldt · 2023-07-06T20:21:37Z

If I understand this correctly, the reason that the precompiled header helps so much is because without it the compiler has to parse ~300_000 lines of code in ~3000 different files each time a model is compiled. This would also explain why reducing the optimization level helps so little with compile times. The optimization step in general is just shorter than the parsing step?

Then maybe a cleaner solution would be to reduce the number of includes that aren't actually needed in a model. The special functions code from boost for instance is already 26_000 lines of code, but I'd guess only a very small subset of models actually uses more than one or two of those.

I guess that is easier said than done though, as that would require stanc to pick and choose includes depending on which functions are used in the model, instead of pulling everything in each time?

WardBrian · 2023-07-06T20:32:10Z

Not "including the world" is a project we've been talking about in the math library since I joined the project (and probably before). A fair number of things in math end up including really broad headers, which causes this problem and makes conditional testing really difficult (if you ask a test file "did any of your included headers change" the answer is "yes" with probability approaching 1, given the current state of things).

It's worthwhile to do, but it's a big undertaking and is one of those things that only starts showing real benefits when you've moved a very large fraction of things over to the different style

aseyboldt · 2023-07-06T20:39:54Z

Not "including the world" is a project we've been talking about in the math library since I joined the project (and probably before).

Ah, those projects are always great :D

But even if the math library itself does this perfectly, as long as stanc always includes everything anyway (I think that's what's going on at least...) this wouldn't help. If so, then maybe fixing the includes created by stanc first would make it easier to actually see at least some incremental improvement? I get your point about this being a larger project though :-)

WardBrian · 2023-07-06T20:51:28Z

Yep, the Stan repo and stanc3 would both also need changes to take advantage of it.

For now I'm curious if @rok-cesnovar and I can make the PCH experience sufficiently "nice". I didn't realize we could stuff arbitrary information into the filenames for these (like we do for .o files), so I think we can avoid the worst of the version incompatibilities we currently see

aseyboldt mentioned this issue Jul 6, 2023

Fix Sundials make config #152

Merged

aseyboldt closed this as completed Jul 6, 2023

aseyboldt reopened this Jul 6, 2023

aseyboldt changed the title ~~Twice the compile time compared to cmdstan~~ Missing support for precompiled header Jul 6, 2023

WardBrian mentioned this issue Jul 6, 2023

Add CXX version to precompiled header name stan-dev/cmdstan#1171

Merged

2 tasks

WardBrian mentioned this issue Sep 21, 2023

PCH breaking between calls, rebuild_cmdstan() errors out stan-dev/cmdstanpy#702

Closed

WardBrian added enhancement New feature or request cpp labels Jan 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Missing support for precompiled header #151

Missing support for precompiled header #151

aseyboldt commented Jul 5, 2023 •

edited

Loading

bob-carpenter commented Jul 5, 2023

WardBrian commented Jul 5, 2023

aseyboldt commented Jul 5, 2023

bob-carpenter commented Jul 5, 2023

WardBrian commented Jul 5, 2023

mitzimorris commented Jul 5, 2023 •

edited

Loading

aseyboldt commented Jul 5, 2023

WardBrian commented Jul 5, 2023

aseyboldt commented Jul 5, 2023

WardBrian commented Jul 6, 2023

aseyboldt commented Jul 6, 2023

WardBrian commented Jul 6, 2023

aseyboldt commented Jul 6, 2023

WardBrian commented Jul 6, 2023

aseyboldt commented Jul 6, 2023

WardBrian commented Jul 6, 2023

bob-carpenter commented Jul 6, 2023

WardBrian commented Jul 6, 2023

rok-cesnovar commented Jul 6, 2023

WardBrian commented Jul 6, 2023

rok-cesnovar commented Jul 6, 2023

aseyboldt commented Jul 6, 2023 •

edited

Loading

WardBrian commented Jul 6, 2023

aseyboldt commented Jul 6, 2023

WardBrian commented Jul 6, 2023

Missing support for precompiled header #151

Missing support for precompiled header #151

Comments

aseyboldt commented Jul 5, 2023 • edited Loading

bob-carpenter commented Jul 5, 2023

WardBrian commented Jul 5, 2023

aseyboldt commented Jul 5, 2023

bob-carpenter commented Jul 5, 2023

WardBrian commented Jul 5, 2023

mitzimorris commented Jul 5, 2023 • edited Loading

aseyboldt commented Jul 5, 2023

WardBrian commented Jul 5, 2023

aseyboldt commented Jul 5, 2023

WardBrian commented Jul 6, 2023

aseyboldt commented Jul 6, 2023

WardBrian commented Jul 6, 2023

aseyboldt commented Jul 6, 2023

WardBrian commented Jul 6, 2023

aseyboldt commented Jul 6, 2023

WardBrian commented Jul 6, 2023

bob-carpenter commented Jul 6, 2023

WardBrian commented Jul 6, 2023

rok-cesnovar commented Jul 6, 2023

WardBrian commented Jul 6, 2023

rok-cesnovar commented Jul 6, 2023

aseyboldt commented Jul 6, 2023 • edited Loading

WardBrian commented Jul 6, 2023

aseyboldt commented Jul 6, 2023

WardBrian commented Jul 6, 2023

aseyboldt commented Jul 5, 2023 •

edited

Loading

mitzimorris commented Jul 5, 2023 •

edited

Loading

aseyboldt commented Jul 6, 2023 •

edited

Loading