Does this action save Bazel's built output artifacts to a cache? #18

Open
seh opened this issue May 19, 2024 · 18 comments
Labels
question Further information is requested

Comments

@seh

seh commented May 19, 2024

For several years I've seen advice for caching Bazel's built artifacts in GitHub Actions by using actions/cache and including ~/.cache/bazel as one of the captured directory paths. When doing so, with the right cache key and set of hierarchical "restore keys", we can coax Bazel into reusing a lot of what it's already built or tested and avoid it needing to run actions in subsequent workflow runs when the action inputs haven't changed.
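For reference, a minimal version of that actions/cache setup looks something like the following; the key prefix and the set of hashed files are illustrative, and any stable fingerprint of the build inputs would do:

- uses: actions/cache@v4
  with:
    path: ~/.cache/bazel
    key: bazel-${{ runner.os }}-${{ hashFiles('**/BUILD.bazel', '**/*.bzl', 'MODULE.bazel') }}
    restore-keys: |
      bazel-${{ runner.os }}-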

Using this setup-bazel action, I see that we can enable a repository cache and a disk cache, and the action takes care of setting Bazel's output base directory, but it doesn't appear to save the built artifacts to a cache. My reading of the disk cache documentation suggests that it includes "build artifacts", but even when I enable setup-bazel's use of the disk cache and I see my GitHub Actions workflow run restore the disk cache successfully, it still appears that Bazel winds up running many actions for which I expected to find the outputs already available in the cache.
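For context, the disk cache amounts to running Bazel with its --disk_cache flag, which per the documentation stores both action results and output files. In .bazelrc form (the path here is illustrative):

build --disk_cache=~/.cache/bazel-disk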

Do I need to use actions/cache separately to cache more of these action outputs, or should the disk cache configured by setup-bazel already take care of that?

@p0deje
Member

p0deje commented May 19, 2024

No, there should be no need to use actions/cache if you use setup-bazel. Once you enable disk-cache, setup-bazel should be saving your build outputs in the cache so you don't have to worry about re-building everything. If for some reason it doesn't work as you'd expect, please provide more details - maybe there is a bug or a misconfiguration on your side.

A common problem would be using the same disk cache for different jobs/workflows, so the last one that runs overwrites the disk cache of the others. This can be easily solved via https://github.com/bazel-contrib/setup-bazel#separate-disk-caches-between-workflows.
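For example, passing a distinguishing value such as the workflow name to disk-cache keeps the caches separate (version tag elided):

- uses: bazel-contrib/setup-bazel@<version>
  with:
    disk-cache: ${{ github.workflow }}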

@p0deje p0deje added the question Further information is requested label May 19, 2024
@seh
Author

seh commented May 20, 2024

Thank you for explaining that. Now I think I was misinterpreting the problem.

As I understand the current design, the cache name suffix for the disk cache is the hash of all of my BUILD and BUILD.bazel files. If my BUILD.bazel files settle down but I keep on changing, say, my Go files in each set of new Git commits pushed to my pull request's branch, then setup-bazel finds an existing cache for this set of BUILD.bazel files, restores it, and doesn't bother saving a new cache reflecting the built artifacts from the current Go files. If that's true, then I'll keep on starting out with the same restored cache even as my source files drift further from the state they were in when we last changed our BUILD.bazel files.

I recognize that an alternate approach of saving a new cache for each distinct Git commit is also expensive; it winds up taking a very long time to save the cache, and you wind up grinding through your cache quota quickly, evicting older and still-useful caches that are used for different purposes.

Is there an approach between these positions that you have considered? Do you see the current approach as having the liability I've described here, or am I using it incorrectly and suffering unduly?

@p0deje
Member

p0deje commented May 20, 2024

I am honestly not sure what the best approach here would be. We could potentially list out the tree of the disk cache and upload individual pieces (or folders):

$ tree /Users/p0deje/.cache/bazel-disk
/Users/p0deje/.cache/bazel-disk/cas/
├── 00
│   ├── 000020e27f52efd462f08d678435bd2825371906794c804d4f38c8d0d6db7506
│   ├── 0000dce2b2b50b507467ebef705ac2962cb0612d13ffb2a2bd8c8563ffc4594a
│   ├── 000264bb97b837f0da9f3f2c9e89138ddb5857a3c5cafe2c2c3249805306a98d
│   ├── 00033deeb0323f9f6490f75dadbeeec141114710b029e6e175b1c1e146f2618d
│   ├── 0003aa49d5eccb6c2e50b5017f5ac9e82c2335f7feaa202b54b3835f9bc3ae89
│   ├── 0005b569d8449cdae01c26171d5e6bb6fbdc9cd1af7bb575bae6ad8bd4e385d7
│   ├── 00083d81014fe8f6cab9fab6f126983e8ad0466784eed2204a71b3a0a2d37807
│   ├── 00086d30431bf78669d822ab30f718620afb75c25cea798e241adbb38ec175fb
...

However, it would probably be too network-heavy. It's also not clear how such a cache should be cleared to avoid overgrowing it.

Another approach would be to force-overwrite the cache even when it's been hit. It could prove useful when running on the main branch, as PRs to the main branch could fetch it but not save it:

- uses: bazel-contrib/setup-bazel@<version>
  with:
    disk-cache: true
    disk-cache-always-save: ${{ github.ref == 'refs/heads/main' }}

@seh
Author

seh commented May 20, 2024

An approach that I used with a previous project overlapped with these ideas.

Out on topic branches for PRs, I used actions/cache with a cache key as follows:
my-name-${{ runner.os }}-${{github.ref}}-${{github.sha}}

That creates a new cache for every distinct Git commit at the head of the topic branch. For my "restore keys", I used a cascading sequence back down to the cache that might be saved against the repository's base branch ("main"):

  • my-name-${{ runner.os }}-${{github.ref}}-
  • my-name-${{ runner.os }}-

That is, if the head commit on the topic branch changed, look for the most recent cache from this same branch. Failing that, look for the most recent cache from the base branch.

Now, I had a separate GitHub Actions workflow that ran on pushes to the "main" branch, meaning whenever we'd merge a PR against it. That workflow also used actions/cache and built all the same artifacts as the aforementioned workflow for PRs. Its cache key was my-name-${{ runner.os }}-${{ github.sha }}. Its lone "restore key" was my-name-${{ runner.os }}-. Since GitHub only allows restoring from caches along the same branch or a base branch, creating these caches from the "main" branch prepared them to supply later PRs.
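Concretely, the two cache steps looked roughly like this (my-name is a stand-in for whatever prefix distinguishes your caches):

# In the PR workflow:
- uses: actions/cache@v4
  with:
    path: ~/.cache/bazel
    key: my-name-${{ runner.os }}-${{ github.ref }}-${{ github.sha }}
    restore-keys: |
      my-name-${{ runner.os }}-${{ github.ref }}-
      my-name-${{ runner.os }}-

# In the push-to-"main" workflow:
- uses: actions/cache@v4
  with:
    path: ~/.cache/bazel
    key: my-name-${{ runner.os }}-${{ github.sha }}
    restore-keys: |
      my-name-${{ runner.os }}-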

Between the PR-focused workflow and this push-to-"main" workflow, these caches worked nicely to provide mostly fresh Bazel action output and built artifacts to subsequent workflow runs. They came with a few liabilities, though:

  • As mentioned earlier in this thread (#18 (comment)), it takes a long time to save each of these caches, sometimes three minutes.
  • Each cache was around 2.5 GB in size, so we could only afford to keep a few of them around within GitHub's current cache budget.
    If a PR author was changing things frequently and inducing many workflow runs, they'd wind up creating, say, ten caches within a short amount of time, which would then force GitHub to evict most of the older caches—including the very valuable ones built along the "main" branch.
  • There's an extra workflow that runs upon merging each PR that serves only to optimize subsequent PRs' workflow runs.
    Of course, these caches created along the "main" branch also consume a significant portion of our GitHub cache budget.

@p0deje
Member

p0deje commented May 20, 2024

Thank you for explaining your setup. I believe these are all workarounds to a fundamental limitation of the current cache implementation: it's coarse-grained. The best solution to this problem I can think of is to implement a small HTTP server, started by the action, that can be used as a Bazel remote cache (https://bazel.build/remote/caching#http-caching) and is internally backed by the GitHub Actions cache. This would allow storing fine-grained caches in GHA and also avoid over-downloading. A similar approach is taken by BuildKit (https://github.com/moby/buildkit/blob/master/cache/remotecache/gha/gha.go), which translates the Docker build cache into the GHA cache.
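On the Bazel side nothing special would be required; builds would just point at the local server with the standard flags (the port is illustrative):

build --remote_cache=http://localhost:8080
build --remote_upload_local_results=true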

Unfortunately, I don't have time to work on this at the moment, but if anyone is up to implementing this, I would be happy to expand on how it might work.

For now, we can work around the fundamental limitation by saving the cache only on the main branch, or by allowing users to customize the cache-key and restore-keys to support the scenarios described by @seh.

@p0deje
Member

p0deje commented Jun 2, 2024

I had some time to prototype the idea of running an HTTP server as part of the action that is compatible with Bazel remote caching. The server does not store anything itself; it simply translates Bazel remote caching REST API calls into @actions/cache API calls, essentially delegating storing and retrieving cache files to the GitHub Actions cache.
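In sketch form, the translation looks something like this (a simplification of the prototype; the key scheme, staging directory, and port are all illustrative, and a real implementation would need a stable staging path across jobs):

import * as http from "http";
import * as fs from "fs";
import * as os from "os";
import * as path from "path";
import * as cache from "@actions/cache";

// Blobs are staged on disk because @actions/cache saves and restores
// files, not streams.
const staging = fs.mkdtempSync(path.join(os.tmpdir(), "bazel-remote-"));

const server = http.createServer(async (req, res) => {
  // Bazel's HTTP cache protocol issues GET/PUT against /ac/<sha256> and
  // /cas/<sha256>; turn that URL into a GitHub Actions cache key.
  const key = `bazel${(req.url ?? "").replace(/\//g, "-")}`;
  const file = path.join(staging, key);

  if (req.method === "GET") {
    // Each digest is its own cache entry; restore it if present.
    const hit = await cache.restoreCache([file], key);
    if (hit && fs.existsSync(file)) {
      res.writeHead(200);
      fs.createReadStream(file).pipe(res);
    } else {
      res.writeHead(404);
      res.end();
    }
  } else if (req.method === "PUT") {
    // Stage the uploaded blob, then save it as a dedicated cache entry.
    req.pipe(fs.createWriteStream(file)).on("finish", async () => {
      try {
        await cache.saveCache([file], key);
      } catch {
        // Ignore "cache already exists/reserved" races between parallel jobs.
      }
      res.writeHead(201);
      res.end();
    });
  } else {
    res.writeHead(405);
    res.end();
  }
});

// Bazel is then pointed at it with --remote_cache=http://localhost:8080.
server.listen(8080);

One GHA cache entry per digest is what keeps restores fine-grained, and it is also exactly what runs into the API rate limits described below.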

In simple examples it worked great and created hundreds of cache entries matching Bazel outputs. Upon re-running the build, I could see that the remote cache was being used for build/test.

However, when I started testing more complex scenarios with thousands of cache entries, I predictably ran into rate limiting from the GitHub Actions cache API, and what seemed to be unhandled rate limits coming directly from Microsoft Azure Storage. I could still see the caches created and eventually available, but the builds were failing with "Missing digest for cache" errors.
[Screenshot: build output showing "Missing digest for cache" errors]

Unfortunately, I don't have time to dig deeper into this and build something robust enough for general use. I also had issues with the remote cache on Windows, though I am not sure whether the errors are in Bazel itself or in my implementation.

The proof-of-concept can be seen in #21. If anyone wants to pick it up and continue working, I'll be happy to collaborate.

@bentekkie

Would a PR to implement a disk-cache-always-save option like the one below be accepted?

- uses: bazel-contrib/setup-bazel@<version>
  with:
    disk-cache: true
    disk-cache-always-save: ${{ github.ref == 'refs/heads/main' }}

@p0deje
Member

p0deje commented Jun 24, 2024

@bentekkie Yes, but let's ensure the API is correct. We need a way to disable uploading caches from PRs; isn't that what we want?

- uses: bazel-contrib/setup-bazel@<version>
  with:
    disk-cache: true
    disk-cache-save: ${{ github.ref == 'refs/heads/main' }}

@bentekkie

Shouldn't that be a decision left to users? With an option like this, a user can use a condition like the one in the example here to save only for main, but they could also change that condition if they want to restrict based on other variables.

@p0deje
Member

p0deje commented Jun 24, 2024

@bentekkie Yes, I'm just saying that it should be called disk-cache-save rather than disk-cache-always-save. The latter implies that it might still be saved even if the condition is false.

@bentekkie

Ah, missed that. Sounds good, I'll try to send a PR for this.

@classner

Not sure if related, but I have a similar question: I am using setup-bazel in combination with rules_python with pinned PyPI requirements, and I am surprised to find that even when I set the disk-cache option, the cache does not seem to contain the downloaded (and possibly built) dependencies, which severely slows down the GitHub Action.

  1. Is this related?
  2. What can I do about it? Should I open a separate issue for this?

Thanks for the great package!

@p0deje
Member

p0deje commented Jul 31, 2024

Not sure if related, but I have a similar question: I am using setup-bazel in combination with rules_python with pinned PyPI requirements, and I am surprised to find that even when I set the disk-cache option, the cache does not seem to contain the downloaded (and possibly built) dependencies, which severely slows down the GitHub Action.

I think that rules_python keeps PyPI dependencies in external repositories, which means they are not part of the disk cache. You can try enabling the repository cache and see if it helps in your case. If not, consider enabling external caches as well.
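For example (version tag elided; see the README for the exact input names):

- uses: bazel-contrib/setup-bazel@<version>
  with:
    disk-cache: true
    repository-cache: true
    external-cache: true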

@ratnikov

Hey folks, do you have any prototypes for this functionality? I've recently ported Azure Pipelines to GitHub Actions and am running into cache invalidation again, so I'm hoping a "yes, definitely save" lever can help.

@p0deje
Member

p0deje commented Aug 21, 2024

Hey folks, do you have any prototypes for this functionality? I've recently ported Azure Pipelines to GitHub Actions and am running into cache invalidation again, so I'm hoping a "yes, definitely save" lever can help.

What exact functionality are you referring to? This issue mentioned several different problems and approaches to solving those.

@ratnikov

Hey folks, do you have any prototypes for this functionality? I've recently ported Azure Pipelines to GitHub Actions and am running into cache invalidation again, so I'm hoping a "yes, definitely save" lever can help.

What exact functionality are you referring to? This issue mentioned several different problems and approaches to solving those.

The disk-cache-save parameter you folks discussed a few months ago.

@p0deje
Member

p0deje commented Aug 21, 2024

The disk-cache-save parameter you folks discussed a few months ago.

No, unfortunately, there were no prototypes for this. If you are up for making a PR, I'd be happy to merge it.

@ratnikov

No worries, thanks for confirming. Absolutely: if I'm able to nail my problem down and it requires an adjustment within your plugin, I'll ping this thread with a PR.

c16a added a commit to c16a/pouch that referenced this issue Aug 24, 2024
There is an open issue on bazel-contrib/setup-bazel#18 to track the failure to save updated caches.

Signed-off-by: Chaitanya Munukutla <[email protected]>