Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recompile only on changes to output of stanc3 auto-formatter (feature request & design discussion) #423

Open
mike-lawrence opened this issue Jan 2, 2021 · 6 comments
Labels
feature New feature or request
Milestone

Comments

@mike-lawrence
Copy link
Collaborator

mike-lawrence commented Jan 2, 2021

(While an issue already exists related to this, as well as a PR, I thought it would be cleaner to start a new issue to discuss things now that I've had time to work on it and have learned about the option space.)

At present, when an exe already exists, cmdstanr will recompile if either:

  1. it's a new session so we have no record of last compile time
  2. it's an old session but the stanfile has a modification time that is later than our record of the last compile time

1 makes for annoyance when one is coming back to a project and starting a new session (but also for packages that bundle exes, as evidenced by the original issue that spawned all this), and 2 is an annoyance when the stanfile has changed but only in non-functional ways (touched, whitespace, comments, etc)

Granted the Stan compiler has become very fast nowadays, so the above annoyances are in turn diminishing in magnitude. However, I think it's similarly trivial to solve them in a way that has little chance of going awry.

Specifically, I propose that any compilation of the model start with passing it through the stanc3 auto-formatter, yielding a character string that is then hashed to a digest. A record of this hash is then kept somewhere permanent across sessions, and only when this record is absent or changes do we recompile (and obvs "re"compile if there's no exe).

There was some back-and-forth on where to store this digest previously; at first I thought it made sense to append it to the name of the exe itself and still think that's a fairly bulletproof approach, but this had the downside of having some potential issues related to detecting and deleting outdated exes, plus aesthetic issues with exes having longer more obscure names. An alternative was suggested whereby the hash is stored in a file located in a directory, and where some other recent FRs suggest implementations that involve use of a stan_scratch folder, we could use this directory for storing this file too. I came up with a third option whereby the hash is written back to the stan file itself as a comment at the end, but it occurs to me that this does not cover the case where the user is writing code for a stan model as a string in an R script. So I think use of the stan_scratch folder would be best after all. (n.b. said folder is involved in the proposed implementations of these FRs as well: Background/asynchronous sampling, During-sampling diagnostics )

Note that it has been reported that it would be desirable for the hash to recompute on changes to any #include files used in the stan file. To achieve this, I imagine it would be pretty straightforward to parse the stan file for includes, confirm they refer to files and hash the contents of those files too. But this might be best left for a separate later feature.

@mike-lawrence
Copy link
Collaborator Author

Oh! just thought I'd note that I just encountered a "bug" in the released version of how we do things. I asked for a model to compile, then immediately made a change to the stan file, saving before the compilation finished. I then tried to recompile and clearly the save time of the exe was still later than the save time of the stanfile, so it claimed that the exe was up to date. Low-probability edge case obviously, and one that an aware user can get around by re-saving or using the force_recompile argument, but I thought I'd mention it as something that wouldn't happen with this feature's proposed implementation.

@gowerc
Copy link
Contributor

gowerc commented Aug 18, 2023

Hey @rok-cesnovar / @mike-lawrence ,

Just to say we have run into this issue today. Our scenario is that we are generating and running stan models as part of our CICD unit tests. In order to get the run times down we were hoping to cache the compiled exes (whilst compiling isn't long it does add up with compiling multiple models). However the way our package works is that we generate our stan code on the fly and then compile it as such:

mod <- cmdstanr::cmdstan_model(
    stan_file = cmdstanr::write_stan_file(stan_code),
    exe_file = file.path(cache_dir, "<arbitrary file name>")
)

Under this current setup the models are forced to be recompiled in every new session which kinda completely defeats the point of using a cache directory for CICD builds.

Are there any potential work arounds ? If not is there anyway I could convince you guys to add some priority to this as it would be a really nice QOL update (at least for developers)

@martinmodrak
Copy link
Contributor

@gowerc For your use case, there is a workaround. Note that write_stan_file already avoids recompilation, if the Stan code didn't change from previous calls (resolved in #495). The problem in your case is that you store the Stan file in a temporary directory, which gets cleared between session. However, you can either set the dir argument to write_stan_file , or change the cmdstanr_write_stan_file_dir option to a directory that does not get cleared. This should be enough to avoid recompiling models that didn't change.

I'll also note that if we agree on implementing stan-dev/cmdstan#1178 then the executable itself would contain some information about the code making implementation of this feature easier (though there is still some discussion to be made about that issues).

@mike-lawrence
Copy link
Collaborator Author

mike-lawrence commented Aug 18, 2023

See how aria uses a local for and hashing of the stanc outputs to avoid unnecessary recompilation.

Edit: I initially had “In the interim” at the beginning of this comment, but had missed that a solution does exist (per Martin’s response). I think however there’s a difference between what made it into cmdstanr and what aria does whereby aria skips recompile when comments and whitespace change (thanks to passing the code through stanc before hashing), while I think cmdstanr will recompile under those conditions.

@gowerc
Copy link
Contributor

gowerc commented Aug 22, 2023

@mike-lawrence & @martinmodrak ,

First off thank you for the suggested workaround; seems to be working as expected !

Apologies though I am confused as to what the outstanding issue is here with regards to this ticket.

  1. it's a new session so we have no record of last compile time
  2. it's an old session but the stanfile has a modification time that is later than our record of the last compile time

Based on (1) I wouldn't have expected the proposed workaround to work as the unit tests will always be run in a new session ?

@martinmodrak
Copy link
Contributor

I think the description is inaccurate - cmdstanr compares the timestamps of the Stan file and the executable, this has nothing to do with the session. If you are using write_stan_file to create your Stan file, the function will avoid overwriting the file if it already exists and has the correct content and thus keep the timestamp and avoid recompilation.

However, this fails if the Stan file was deleted in the meantime (e.g. because it was in the temp directory as in your case) and it also fails if a change is made to the file, but the change is then reverted (which changes timestamp) or if a whitespace only/comment only change is made.

The proposal is to store enough information to be able to detect the need for recompilation without relying on timestamps. This could be external (as Mike suggests) or it could be stored internally in the executable (as stan-dev/cmdstan#1178) suggests. Storing in the executable is IMHO more preferable as you don't have to worry that the storage becomes out of sync with your executables, but the downside is that this requires changes to cmdstan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants