Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing support for precompiled header #151

Open
aseyboldt opened this issue Jul 5, 2023 · 25 comments
Open

Missing support for precompiled header #151

aseyboldt opened this issue Jul 5, 2023 · 25 comments
Labels
cpp enhancement New feature or request

Comments

@aseyboldt
Copy link
Contributor

aseyboldt commented Jul 5, 2023

If I compile models using cmdstan, it seems to consistently finish about twice as fast as if I compile a model with bridgestan:

// justatest.stan
parameters {
    real x;
}
model {
    x ~ normal(0, 1);
}

Then

%%time
compiled = bridgestan.compile_model("justatest.stan")
# Wall time: 15.1 s

%%time
model = cmdstanpy.CmdStanModel("foo", "justatest.stan")
# Wall time: 7.18 s

I would have expected pretty much no difference, or even a bit faster comiles for bridgestan, because it doesn't have to also compile the sampler and the command line tool?

Update This seems to be because cmdstan precompiles the header

@bob-carpenter
Copy link
Collaborator

To test compilation time, I'd put copies of the model in different places to make sure CmdStan isn't just reusing the BridgeStan compiled model. Or try to clean out the compiled versions in between. But if that's the difference, I'd expect CmdStanModel to be a split second.

In CmdStan, we only have to translate the Stan program to a C++ class. The sampler and command-line tools are compiled in separate translation units and then linked. The linking does need to happen, though, so that takes a bit of time.

Do you know if BridgeStan and CmdStanPy compile at the same optimization level? I hope CmdStanPy isn't default to O = 0, as that would be very slow. @mitzimorris should know.

@WardBrian
Copy link
Collaborator

I wonder if it’s slower to compile position independent code? It’s also possible some of the precompiled header logic in CmdStan didn’t make it into our makefile, though that stuff is very brittle in my experience

@aseyboldt
Copy link
Contributor Author

I don't think the difference is due to caching, just to make sure I gave it slightly different models, and that didn't change the timings. I'm also seeing the pretty consistently in by benchmarks with posteriordb.

In CmdStan, we only have to translate the Stan program to a C++ class. The sampler and command-line tools are compiled in separate translation units and then linked. The linking does need to happen, though, so that takes a bit of time.

Ah, that does make sense. Still doesn't explain the difference I think though...

Do you know if BridgeStan and CmdStanPy compile at the same optimization level? I hope CmdStanPy isn't default to O = 0, as that would be very slow. @mitzimorris should know.

I don't see a big difference in time for a gradient evaluation between bridgestan and cmdstan, so I don't think that's it either. A small difference might be lost in all the other things that differ, but I'm pretty sure something like O0 would be very noticeable.

I guess it could also be because we also compile the hessian? Or something in a makefile isn't right and it recompiles something it shouldn't...

@bob-carpenter
Copy link
Collaborator

Hessians will add overhead. But it shouldn't be that much because the Hessian code itself is fast and there's nothing else to compile in forward mode other than the functions being used in the code. But then that's also true for the reverse mode functions, so maybe it is just the Hessians. The only way I know how to test that would be modifying the code, but if it really is just the Hessians slowing things down, then we probably want to be able to conditionally compile w/o Hessians.

@WardBrian
Copy link
Collaborator

If you compile with autodiff hessians you’re also including all of stan/math/fwd, which definitely adds a lot of files for the preprocessor to handle

@mitzimorris
Copy link

mitzimorris commented Jul 5, 2023

CmdStanPy doesn't change the default optimization level - it just runs the CmdStan makefile.

@aseyboldt
Copy link
Contributor Author

Seems the hessian code is behind a compile time flag right now, and that was off in the code above. To make sure I just deleted all references to the hessian in model_rng.cpp in the .bridgestan/.../src/ dir, and that didn't change the compile (after compiling one model once, the first time compiling any model was ~40s). Assuming I didn't miss anything, I think that should exclude the hessian as an explanation.

@WardBrian
Copy link
Collaborator

I can take a closer look tomorrow, with that ruled out my questions would be:
Are there any additional commands used in our compilation compared to CmdStan (I think not, but good to check)
Which commands invoked by make specifically are slower, assuming it isn’t just “all of them”

@aseyboldt
Copy link
Contributor Author

I was just running into missing symbol issues with models involving sundials, and that made be check the variables in the makefile. Things there don't seem right to me, and that might be the source of this issue as well. It looks like some setup code never ended up being executed. For instance the sundials libs aren't precompiled (no sundials shared libs anywhere in the ~.bridgestan directory), and the linker flags for those are empty. I think the ode example in the test modules doesn't seem to cover sundials either, as that is just using a runge kutta solver.

The output of make in the .bridgestan/bridgestan*/ dir is for me:

  Current configuration:
  - OS (Operating System):       Linux
  - CXX (Compiler):              g++
  - CXX_TYPE                     gcc
  - Compiler version:            12.3
  - O (Optimization Level):      3
  Library configuration:
  - EIGEN                        ./stan/lib/stan_math/lib/eigen_3.4.0
  - BOOST                        ./stan/lib/stan_math/lib/boost_1.78.0
  - SUNDIALS                     ./stan/lib/stan_math/lib/sundials_6.1.1
  - TBB                          ./stan/lib/stan_math/lib/tbb_2020.3
  - GTEST                        ./stan/lib/stan_math/lib/benchmark_1.5.1/googletest/googletest
  - STAN_THREADS                
  - STAN_OPENCL                 
  - STAN_MPI                    
  Compiler flags (each can be overriden separately):
  - CXXFLAGS_LANG                -std=c++1y
  - CXXFLAGS_WARNINGS            -Wno-sign-compare -Wno-ignored-attributes
  - CXXFLAGS_OPTIM              
  - CXXFLAGS_FLTO               
  - CXXFLAGS_BOOST              
  - CXXFLAGS_EIGEN              
  - CXXFLAGS_OS                  -pthread -D_REENTRANT
  - CXXFLAGS_GTEST              
  - CXXFLAGS_THREADS            
  - CXXFLAGS_OPENCL             
  - CXXFLAGS_TBB                 -I ./stan/lib/stan_math/lib/tbb_2020.3/include
  - CXXFLAGS_OPTIM_TBB          
  - CXXFLAGS_MPI                
  - CFLAGS_SUNDIALS             
  - CXXFLAGS_SUNDIALS            -pipe
  - CXXFLAGS_OPTIM_SUNDIALS     
  LDLIBS:
  - LDLIBS_LANG                 
  - LDLIBS_WARNINGS             
  - LDLIBS_EIGEN                
  - LDLIBS_BOOST                
  - LDLIBS_SUNDIALS             
  - LDLIBS_OS                   
  - LDLIBS_GTEST                
  - LDLIBS_OPENCL               
  - LDLIBS_TBB                   -Wl,-L,/home/adr/.bridgestan/bridgestan-2.1.1/stan/lib/stan_math/lib/tbb -Wl,-rpath,/home/adr/.bridgestan/bridgestan-2.1.1/stan/lib/stan_math/lib/tbb
  - LDLIBS_MPI                  
  LDFLAGS:
  - LDFLAGS_LANG                
  - LDFLAGS_WARNINGS            
  - LDFLAGS_EIGEN               
  - LDFLAGS_BOOST               
  - LDFLAGS_SUNDIALS            
  - LDFLAGS_OS                  
  - LDFLAGS_GTEST               
  - LDFLAGS_OPENCL              
  - LDFLAGS_TBB                  -Wl,-L,/home/adr/.bridgestan/bridgestan-2.1.1/stan/lib/stan_math/lib/tbb -Wl,-rpath,/home/adr/.bridgestan/bridgestan-2.1.1/stan/lib/stan_math/lib/tbb
  - LDFLAGS_MPI                 

@WardBrian
Copy link
Collaborator

I'm not able to re-create this with a warmed-up (meaning I've built at least one model previously) cmdstan and bridgestan:

CmdStan output

$ /usr/bin/time make examples/bernoulli/bernoulli

--- Translating Stan model to C++ code ---
bin/stanc  --o=examples/bernoulli/bernoulli.hpp examples/bernoulli/bernoulli.stan

--- Compiling, linking C++ code ---
g++ -std=c++1y -pthread -D_REENTRANT -Wno-sign-compare -Wno-ignored-attributes      -I stan/lib/stan_math/lib/tbb_2020.3/include    -O3 -I src -I stan/src -I stan/lib/rapidjson_1.1.0/ -I lib/CLI11-1.9.1/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.4.0 -I stan/lib/stan_math/lib/boost_1.78.0 -I stan/lib/stan_math/lib/sundials_6.1.1/include -I stan/lib/stan_math/lib/sundials_6.1.1/src/sundials    -DBOOST_DISABLE_ASSERTS          -c -Wno-ignored-attributes   -x c++ -o examples/bernoulli/bernoulli.o examples/bernoulli/bernoulli.hpp
g++ -std=c++1y -pthread -D_REENTRANT -Wno-sign-compare -Wno-ignored-attributes      -I stan/lib/stan_math/lib/tbb_2020.3/include    -O3 -I src -I stan/src -I stan/lib/rapidjson_1.1.0/ -I lib/CLI11-1.9.1/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.4.0 -I stan/lib/stan_math/lib/boost_1.78.0 -I stan/lib/stan_math/lib/sundials_6.1.1/include -I stan/lib/stan_math/lib/sundials_6.1.1/src/sundials    -DBOOST_DISABLE_ASSERTS               -Wl,-L,"/home/brian/Dev/cpp/cmdstan/stan/lib/stan_math/lib/tbb" -Wl,-rpath,"/home/brian/Dev/cpp/cmdstan/stan/lib/stan_math/lib/tbb"        examples/bernoulli/bernoulli.o src/cmdstan/main.o       -Wl,-L,"/home/brian/Dev/cpp/cmdstan/stan/lib/stan_math/lib/tbb" -Wl,-rpath,"/home/brian/Dev/cpp/cmdstan/stan/lib/stan_math/lib/tbb"     stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_nvecserial.a stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_cvodes.a stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_idas.a stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_kinsol.a  stan/lib/stan_math/lib/tbb/libtbb.so.2 -o examples/bernoulli/bernoulli
rm -f examples/bernoulli/bernoulli.o
18.67user 0.68system 0:19.37elapsed 99%CPU (0avgtext+0avgdata 1322664maxresident)k
0inputs+7608outputs (0major+440066minor)pagefaults 0swaps

BridgeStan output

$ /usr/bin/time make test_models/bernoulli/bernoulli_model.so

--- Translating Stan model to C++ code ---
./bin/stanc  --o=test_models/bernoulli/bernoulli.hpp test_models/bernoulli/bernoulli.stan

--- Compiling C++ code ---
g++ -std=c++1y -pthread -D_REENTRANT -Wno-sign-compare -Wno-ignored-attributes      -I ./stan/lib/stan_math/lib/tbb_2020.3/include    -O3 -I ./stan/src -I ./stan/lib/rapidjson_1.1.0/ -I ./stan/lib/stan_math/ -I ./stan/lib/stan_math/lib/eigen_3.4.0 -I ./stan/lib/stan_math/lib/boost_1.78.0 -I ./stan/lib/stan_math/lib/sundials_6.1.1/include -I ./stan/lib/stan_math/lib/sundials_6.1.1/src/sundials -fPIC    -DBOOST_DISABLE_ASSERTS          -c -x c++ -o test_models/bernoulli/bernoulli.o test_models/bernoulli/bernoulli.hpp
--- Linking C++ code ---
g++ -std=c++1y -pthread -D_REENTRANT -Wno-sign-compare -Wno-ignored-attributes      -I ./stan/lib/stan_math/lib/tbb_2020.3/include    -O3 -I ./stan/src -I ./stan/lib/rapidjson_1.1.0/ -I ./stan/lib/stan_math/ -I ./stan/lib/stan_math/lib/eigen_3.4.0 -I ./stan/lib/stan_math/lib/boost_1.78.0 -I ./stan/lib/stan_math/lib/sundials_6.1.1/include -I ./stan/lib/stan_math/lib/sundials_6.1.1/src/sundials -fPIC    -DBOOST_DISABLE_ASSERTS               -Wl,-L,"/home/brian/Dev/cpp/bridgestan/stan/lib/stan_math/lib/tbb" -Wl,-rpath,"/home/brian/Dev/cpp/bridgestan/stan/lib/stan_math/lib/tbb"        -shared -lm -o  test_models/bernoulli/bernoulli_model.so test_models/bernoulli/bernoulli.o ./src/bridgestan.o       -Wl,-L,"/home/brian/Dev/cpp/bridgestan/stan/lib/stan_math/lib/tbb" -Wl,-rpath,"/home/brian/Dev/cpp/bridgestan/stan/lib/stan_math/lib/tbb"     ./stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_nvecserial.a ./stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_cvodes.a ./stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_idas.a ./stan/lib/stan_math/lib/sundials_6.1.1/lib/libsundials_kinsol.a  ./stan/lib/stan_math/lib/tbb/libtbb.so.2
rm -f test_models/bernoulli/bernoulli.o
18.10user 0.76system 0:18.87elapsed 99%CPU (0avgtext+0avgdata 1339500maxresident)k
0inputs+5304outputs (0major+439114minor)pagefaults 0swaps

Both roughly 18s on my machine. If I set BRIDGESTAN_AD_HESSIAN=true the time doesn't change, which I actually should have expected since the new include doesn't happen in the model file itself but in the built-once bridgestan.cpp

@aseyboldt
Copy link
Contributor Author

It seems the difference I'm observing is because there is a model_header.hpp.gch header cache file in the cmdstan stan directory, but not in the corresponding directory of bridgestan. If I move it, I also observed the same timings in both cases, and manually running the g++ commands shows the timing differences with and without the cache file pretty clearly. That file is ~800MB, so I guess if the compiler has to get that same information out of small individual files instead, I can see why that would take longer.

I'm not sure why the cache file was created in one case but not the other. I've been changing the g++ version quite a bit, maybe that had something to do with it (and if the cache file is not there I get deprecation warnings from boost with g++ 12 and 13).

Anyway, I don't think this really is a bridgestan issue after all, so I'm closing this

@WardBrian
Copy link
Collaborator

Yeah, that is the “precompiled header” file. I’m not sure why it would be one place but not the other, since the math library is the same in both

@aseyboldt
Copy link
Contributor Author

Looking at it again, I think the code that creates the precompiled header is only in the cmdstan repo, not in the stan or stan_math repo, so it just isn't available in bridgestan. Given that it seems to give a pretty good speedup (at least for me, no idea why @WardBrian didn't see the same thing).

The code in the makefile is here. I guess it could be included in the bridegestan makefile as well? Or maybe the stan repo would be a better place for it, so that both projects can share the makefile rules?

@aseyboldt aseyboldt reopened this Jul 6, 2023
@aseyboldt aseyboldt changed the title Twice the compile time compared to cmdstan Missing support for precompiled header Jul 6, 2023
@WardBrian
Copy link
Collaborator

I think I have the precompiled header disabled in my cmdstan installation.

It's a decent speedup, but if we do support it I would make the opposite choice of CmdStan and disable it by default. It's a pretty constant source of cmdstanpy bug reports due to things like xcode updating and the old file becoming invalid

@aseyboldt
Copy link
Contributor Author

Ah, I see your point...

The file size of the precompiled header made me wonder what it could be that makes it so big, so I had a quick look at what headers get included when building the model. Wasn't all that that conclusive, and the prepocessor output is only(?) 18MB. I guess the most interesting part is that various parts of boost make up almost half of the 300_000 lines after the prepocessor runs.

Not sure it's useful to anyone, but just in case, here's the link: https://gist.github.com/aseyboldt/6841525395e37e8ba2fceeaa5021b547

@WardBrian
Copy link
Collaborator

I believe @rok-cesnovar knows more about pre-compiled headers/helped set them up for CmdStan. The doc for them always has lots of scary warnings, like

Some other command-line options starting with -f, -p, or -O must be defined in the same way as when the precompiled header was generated. At present, it’s not clear which options are safe to change and which are not

(emphasis mine)

Any macros defined before the precompiled header is included must either be defined in the same way as when the precompiled header was generated, or must not affect the precompiled header, which usually means that they don’t appear in the precompiled header at all.

This seems to me like it would be a huge problem with STAN_THREADS, which affects quite a bit of stuff in the Math library. However, it seems like it is not a problem in practice for CmdStan - for now, anyway

@bob-carpenter
Copy link
Collaborator

I'd like to see how much precompiled headers save in the way of time. They are a huge pain in the context of CmdStanPy because the PCH is constantly breaking and having to be regenerated, but the suggested fix (which takes a couple minutes to run) gets lost in a huge scrolling error message we've never been able to tamp down.

@WardBrian
Copy link
Collaborator

For me the precompiled header takes compiling a model from 17s to 7s

@rok-cesnovar
Copy link
Contributor

The precompiled headers are huge because they include the entire math library plus all the header-only dependencies (boost, eigen, ...). The precomp. headers are much smaller with clang (see stan-dev/cmdstan#872), at least they were a few years back.

clang is particularly annoying with precompiled headers because it breaks for any minor or patch version bump. And its seems that macs just update clang automatically.

This seems to me like it would be a huge problem with STAN_THREADS

CmdStan handles this by having separate precompiled headers for threads. So when you compile the first model with threads it builds separate precompiled headers with _threads suffix.

For me the precompiled header takes compiling a model from 17s to 7s

Similar numbers for me. It has a huge effect. If it didn't I think we would have never made it the default in CmdStan.

@WardBrian
Copy link
Collaborator

Could we do something similar to the _threads trick and hack the compiler version number into the filename?

How does the compiler know to use model_header_threads.hpp.gch when the C++ file is still using #include <stan/model/model_header.hpp>?

@rok-cesnovar
Copy link
Contributor

Could we do something similar to the _threads trick and hack the compiler version number into the filename?

hmmmm, that seems like a great idea at first sight!

How does the compiler know to use model_header_threads.hpp.gch when the C++ file is still using #include <stan/model/model_header.hpp>?

This makefile trick: stan-dev/cmdstan#882

@aseyboldt
Copy link
Contributor Author

aseyboldt commented Jul 6, 2023

If I understand this correctly, the reason that the precompiled header helps so much is because without it the compiler has to parse ~300_000 lines of code in ~3000 different files each time a model is compiled. This would also explain why reducing the optimization level helps so little with compile times. The optimization step in general is just shorter than the parsing step?

Then maybe a cleaner solution would be to reduce the number of includes that aren't actually needed in a model. The special functions code from boost for instance is already 26_000 lines of code, but I'd guess only a very small subset of models actually uses more than one or two of those.

I guess that is easier said than done though, as that would require stanc to pick and choose includes depending on which functions are used in the model, instead of pulling everything in each time?

@WardBrian
Copy link
Collaborator

Not "including the world" is a project we've been talking about in the math library since I joined the project (and probably before). A fair number of things in math end up including really broad headers, which causes this problem and makes conditional testing really difficult (if you ask a test file "did any of your included headers change" the answer is "yes" with probability approaching 1, given the current state of things).

It's worthwhile to do, but it's a big undertaking and is one of those things that only starts showing real benefits when you've moved a very large fraction of things over to the different style

@aseyboldt
Copy link
Contributor Author

Not "including the world" is a project we've been talking about in the math library since I joined the project (and probably before).

Ah, those projects are always great :D

But even if the math library itself does this perfectly, as long as stanc always includes everything anyway (I think that's what's going on at least...) this wouldn't help. If so, then maybe fixing the includes created by stanc first would make it easier to actually see at least some incremental improvement? I get your point about this being a larger project though :-)

@WardBrian
Copy link
Collaborator

Yep, the Stan repo and stanc3 would both also need changes to take advantage of it.

For now I'm curious if @rok-cesnovar and I can make the PCH experience sufficiently "nice". I didn't realize we could stuff arbitrary information into the filenames for these (like we do for .o files), so I think we can avoid the worst of the version incompatibilities we currently see

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cpp enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants