Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to cache compiled stan-models and re-use them across sessions #304

Closed
mikekaminsky opened this issue Oct 1, 2020 · 23 comments
Closed
Labels
feature New feature or request

Comments

@mikekaminsky
Copy link

Is your feature request related to a problem? Please describe.
I'd like to be able to save the compilation time for models that we run over-and-over-again across R sessions.

Describe the solution you'd like
Ideally, we could cache those executables somewhere and skip compilation if the executable has always been compiled once. I can see the paths to the executables and header files via model$exe_file() and model$hpp_file(), but what I don't see is a way to create a CmdStanModel object using an existing executable / header file. If this is already possible and I just missed it in the docs, I apologize!

Describe alternatives you've considered
I'm basically trying to do a small part of what rstantools does for rstan -- for a variety of reasons, our project doesn't work with rstan and so we are using cmdstanr instead -- I've got the shell of the code working for doing the caching, just need a way to plug the saved executables back into the code we're using (creating a CmdStanModel object).

@mikekaminsky mikekaminsky added the feature New feature or request label Oct 1, 2020
@mitzimorris
Copy link
Member

cmdstanpy lets you instantiate a CmdStanModel from just an exe file - also has logic to check timestamps on stan and exe file and if the latter is newer, doesn't compile. CmdStanR should have similar behaviors.

@rok-cesnovar
Copy link
Member

Hi, thanks for the request.

We currently do not have a way to create a CmdStanModel just from an executable. It might be reasonable to add that.

However, the executable should not recompile every session. Can you maybe provide an example of how you call cmdstan_model?

If I run

library(cmdstanr)


model_path <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
mod <- cmdstan_model(model_path)

close the session, restart RStudio or the PC, it does not recompile every time.

@rok-cesnovar
Copy link
Member

We compare the modified time of the .stan file and the executable and only recompile if the stan file was modified after the executable was created.

@jgabry
Copy link
Member

jgabry commented Oct 1, 2020

Yeah like @rok-cesnovar said I'm not sure that it's necessary to be able to create a CmdStanModel object from an executable if cmdstan_model() can be used without recompiling. That seems like it's just as convenient if not more convenient (because the user doesn't need to worry themselves about the executable or its path and whatnot). Although maybe there's something I'm overlooking.

@mikekaminsky is that not working properly for you? If it's not then we definitely need to fix something.
One possible issue is that if you're using a Stan program that it's a different directory than the existing executable you would need to specify the dir argument when compiling so it finds the executable. Otherwise it defaults to where the Stan program is located.

@mikekaminsky
Copy link
Author

Ah, fascinating! We are doing something hacky where we're combining different text files in R to make the Stan program on the fly. I'm not sure if there's a reason we did it that way instead of using #includes so this is a good reason to try that out.

However, given you're looking at file modified time, I think we might run into issues when we re-install the package. It's an internally-used package where the R code changes pretty frequently, although the underlying Stan models do not. I haven't tested that yet, but that will be next on my list.

@jgabry
Copy link
Member

jgabry commented Oct 1, 2020

However, given you're looking at file modified time, I think we might run into issues when we re-install the package. It's an internally-used package where the R code changes pretty frequently, although the underlying Stan models do not. I haven't tested that yet, but that will be next on my list.

Oh ok, yeah that could be an issue. Definitely let us know if you run into trouble. I'm not sure, but it might be that we will just need to do what @mitzimorris suggested all along and provide a direct way to create the CmdStan model object instead of relying on when the file was modified. We can definitely do that if necessary.

@mikekaminsky
Copy link
Author

Let me play with this some more today and tomorrow and I will report back. Thanks for the quick turnaround on the questions here!

@mike-lawrence
Copy link
Collaborator

Couldn't the content of the file be hashed and we just compare hashes? Seems at least better than modification times.

@jgabry
Copy link
Member

jgabry commented Oct 1, 2020

@mike-lawrence Yeah I like that idea. @rok-cesnovar or @mitzimorris any reason you can think of why we shouldn't do this by comparing hashes?

@rok-cesnovar
Copy link
Member

That is a great idea. but for that to work seamlessly, the executable should be able to return the hash of the Stan model. Otherwise we need to store it and thar becomes a mess with sessions etc.

That could be easily added as part of stan-dev/cmdstan#887
This issues has staled a bit, awaiting reviews on a Stan PR. Once that is in, will push that forward.

@mike-lawrence
Copy link
Collaborator

mike-lawrence commented Oct 1, 2020

Can the hash be in the name of the executable for now? (I don't know where that's stored across sessions)

@rok-cesnovar
Copy link
Member

That is also a temporary possibility.

One slight issue there is how we clean up executables, because 10 changes and compilations would create 10 executables in the output folder. This cab be worked around, we just need to be careful.

@mike-lawrence
Copy link
Collaborator

I presume cmdstanr currently assumes a standard filename since as described earlier here it's able to check the modification times of previously compiled files. Now it would use that same standardized filename as a basename and so long as appending the hash is standardized, it should find the old file basename_oldhash, split off the hash bit, see it's different, delete basename_oldhash and create basename_newhash. Thus no left-over files; or am I missing something?

@mikekaminsky
Copy link
Author

FWIW mike-lawrence's suggestion is exactly what I have pseudo-implemented in my package for doing this, so seems reasonable to me!

@mike-lawrence
Copy link
Collaborator

I implemented the hash-suffixed executables and made a pull request, but see now that the test coverage is so good that there are new failing tests with new scheme. Neat! (I'm new to tests, PRs, etc) I have to shift gears for tonight, but if anyone wants to take a look at what I did in the PR (esp. my regex stuff before deleting old exes; very important to get that right and not commit unintended deletions!) or work on updating the tests, go for it. I'll notify in the PR thread when I return to this (probably on the weekend) so I don't overlap with anyone else working on the tests.

@mitzimorris
Copy link
Member

FWIW, I totally agree that models w/out corresponding source code are a very dangerous thing. it's just the that there's a use case for folks building the kind of system described above - CmdStanPy request came from Prophet folks - stan-dev/cmdstanpy#70

@mikekaminsky
Copy link
Author

Some follow-ups from my side here: I was able to get cmdstanr caching working as expected once I switched to compiling from the stanfiles instead of stan code I was assembling on-the-fly in R (maybe this could be documented better?).

However, as we suspected, the cached stan files do not persist after re-installing the package from a different commit. I think @mike-lawrence's solution will address this as long as we use the dir argument to cmdstan_model to write the executables to a known central location so that they don't get blown away by package re-install.

@jgabry
Copy link
Member

jgabry commented Oct 5, 2020

@mike-lawrence thanks for the PR!

@mikekaminsky Thanks a lot for following up, that's good to know. I think you're right that @mike-lawrence's solution would work in your case.

@rok-cesnovar rok-cesnovar added this to the v1.0.0 - release milestone Oct 17, 2020
@jgabry jgabry linked a pull request Nov 11, 2020 that will close this issue
2 tasks
@mikekaminsky
Copy link
Author

Hello everyone! I found another issue related to caching so figured I'd add it here, but let me know if I should open another issue. This might also be addressed in the PRs mentioned above -- I took a look and I think it maybe is, in which case we can note this when we describe the new caching strategy.

The Issue:

It seems that the caching is wonky when using #include. In particular, if I have two stan programs outer.stan and included.stan where outer.stan has the line #include included.stan, then the cache will only be invalidated if outer.stan changes.

As a user, I would expect the cache to be based on the contents of the entire program inclusive of included files, so if either outer.stan or included.stan change, then we'd invalidate the cache and recompile.

I think we're moving in that direction with #306, but wanted to mention it in case this isn't handled and is easy to add or if it's a surprise benefit :)

> packageVersion("cmdstanr")
[1] ‘0.2.2’

@rok-cesnovar
Copy link
Member

rok-cesnovar commented Dec 12, 2020

Yeah, this is one shortcoming on relying on modified time of only the main stan file. Good call!
This is something that cmdstan suffers from.

cmdstanr does not know which files were included in the Stan model or their locatin. The only way of realistically getting around that is by relying on hashes of the generated C++ code, not auto-format.

There are other options but they are on the hacky-side.

@mitzimorris
Copy link
Member

This is something that cmdstan suffers from.

#includes are handled by the stanc compiler, in order for the GNU-make utility to do the right thing,
there would need to exist the kind of functionality that we're using for the c++ dependencies -
file https://github.com/stan-dev/math/blob/develop/make/dependencies has rules to populate
variables DEPTARGETS and DEPFLAGS.

if the Stanc3 parser has a way to identify the full path to the include files, perhaps we could
create similar makefile rules for compiling Stan program files. that would be elegant and
would make things a lot easier for the user.

@mike-lawrence
Copy link
Collaborator

mike-lawrence commented Jan 1, 2021

One thought I just had was to write the hash of the auto-formatted code as a comment at the end of the stan file itself, which in turn provides a simple mechanism for forcing a re-compile (user can simply delete that line) for the scenario @mikekaminsky describes (where the main stan file is unchanged but an included file does change).

@rok-cesnovar
Copy link
Member

This issue isnt solved but we have a few of them open on this topic. Lets discuss this further in #423 Closing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants