
CAS: Existence Caching in Intermediate Caches (user experience report) #252

Open
sbunce opened this issue May 12, 2023 · 2 comments

sbunce commented May 12, 2023

I work for Cruise. I was talking to our Google Account Manager JP Guerra a while back about this issue and he thought it'd be useful to share our experience with upstream. I liked the idea, so here I am. 😁

We built an in-house RBE service, and we diverged from the "ContentAddressableStorage" service API to support existence caching in intermediate caches. I wanted to share what we did and why. We prefer not to diverge from the upstream API, but we needed to in this case.

On our internal CAS we register an additional gRPC service called "CruiseContentAddressableStorage". It has a method that is the inverse of the "ContentAddressableStorage" service's "FindMissingBlobs" method. Instead of finding blobs which do not exist, the "FindBlobs" method finds blobs which do exist, so that we can return metadata associated with each object. Specifically, we return a timestamp called "expires_at", which is the wall-clock time until which intermediate existence caches may record that an object exists (we never cache non-existence, because that would cause inconsistency). This enables an intermediate CAS (one that proxies for another CAS) to cache existence.

syntax = "proto3";

import "google/protobuf/timestamp.proto";
// Digest is the standard Remote Execution API digest.
import "build/bazel/remote/execution/v2/remote_execution.proto";

// Extension service registered alongside the standard
// ContentAddressableStorage service.
service CruiseContentAddressableStorage {
  // Inverse of ContentAddressableStorage.FindMissingBlobs: returns the blobs
  // that DO exist, along with per-blob metadata.
  rpc FindBlobs(FindBlobsRequest) returns (FindBlobsResponse) {}
}

message FindBlobsRequest {
  string instance_name = 1;
  repeated build.bazel.remote.execution.v2.Digest blob_digests = 2;
}

message FindBlobsResponse {
  repeated FoundBlobMetadata found_digest_metadata = 1;
}

message FoundBlobMetadata {
  build.bazel.remote.execution.v2.Digest digest = 1;
  // Wall-clock time until which intermediate caches may treat this blob as
  // existing. Non-existence is never cached.
  google.protobuf.Timestamp expires_at = 2;
}

We had considered using gRPC metadata for the "expires_at" timestamps, but we have batches of around 15,000 digests (each with its own timestamp), which would have been hard to pack into gRPC metadata due to HTTP/2 header size limits. So we registered the additional "CruiseContentAddressableStorage" service, and we dynamically fall back to doing no intermediate existence caching if the server returns the gRPC code "Unimplemented" (meaning the service or method doesn't exist). With this fallback we're still compatible with upstream.
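
For illustration, here's a rough Go sketch of that fallback (not our actual client code). "cruisepb" and "repb" are hypothetical aliases for the generated stubs of "CruiseContentAddressableStorage" and the standard Remote Execution API, and "cacheExistence" is a hypothetical local helper; the "codes"/"status" packages are the real gRPC ones.

// Try the extended service first; fall back to standard FindMissingBlobs
// (and no intermediate existence caching) if the server doesn't implement it.
func findBlobsWithFallback(ctx context.Context,
    cruise cruisepb.CruiseContentAddressableStorageClient,
    cas repb.ContentAddressableStorageClient,
    instance string, digests []*repb.Digest) error {

    resp, err := cruise.FindBlobs(ctx, &cruisepb.FindBlobsRequest{
        InstanceName: instance,
        BlobDigests:  digests,
    })
    if err == nil {
        // Record existence locally, but only until each blob's expires_at.
        for _, md := range resp.FoundDigestMetadata {
            cacheExistence(md.Digest, md.ExpiresAt.AsTime())
        }
        return nil
    }
    if status.Code(err) == codes.Unimplemented {
        // Server doesn't know the extension: stay upstream-compatible and
        // skip intermediate existence caching entirely.
        _, err = cas.FindMissingBlobs(ctx, &repb.FindMissingBlobsRequest{
            InstanceName: instance,
            BlobDigests:  digests,
        })
    }
    return err
}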

The underlying reason we're doing existence caching in intermediate caches is that our underlying database, Spanner, cannot handle the required read rate, nor the write rate for updating atime (access time), which we use for expiration. Our CAS that talks to Spanner has an in-memory existence cache, which also cannot scale high enough, so we need to propagate the fact that blobs exist to cache levels closer to Bazel. We also have to jitter atime updates to avoid bursts of writes on Spanner, which necessitates that our "expires_at" times differ for every object so the atime update load on Spanner is spread over time.
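
As a concrete (hypothetical) illustration of the jitter, the server can compute each blob's "expires_at" as now plus a base TTL plus a random offset, so refreshes from intermediate caches arrive spread out rather than in bursts. The TTL and jitter values below are made up for the example.

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// jitteredExpiresAt picks a per-blob deadline for existence caching: a base
// TTL plus random jitter, so atime refreshes don't all land at once.
func jitteredExpiresAt(now time.Time, baseTTL, jitter time.Duration) time.Time {
    return now.Add(baseTTL + time.Duration(rand.Int63n(int64(jitter))))
}

func main() {
    now := time.Now()
    for i := 0; i < 3; i++ {
        fmt.Println(jitteredExpiresAt(now, 4*time.Hour, time.Hour))
    }
}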

I think the key idea is that the API is missing information needed to do existence caching in intermediate caches.

Collaborator

bergsieker commented May 12, 2023 via email

Author

sbunce commented May 12, 2023

Hey Steven,

You're welcome. The API is well designed. I had several "ah hah" moments during implementation when I realized why the API was designed in a specific way.

The subtle part of what we're doing is that we're not returning how long the object will exist; we're returning how long existence of the object may be cached (which is <= how long the object exists). We do this so that existence checks still reach the layer capable of bumping atime in a jittered way, to avoid overloading our database. I think that's a generic piece that doesn't constrain expiration implementations, but it may be too niche for anyone to want. People in this situation can always do what we're doing if they control the whole stack, but if they don't, there'd be no way to do intermediate existence caching without introducing inconsistency.
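
For what it's worth, here's a minimal Go sketch (all names hypothetical, not our actual code) of what an intermediate cache does with that value: it records only positive existence, honors each blob's expires_at, and otherwise lets the check fall through to the layer that bumps atime.

package cascache

import (
    "sync"
    "time"
)

// existenceCache caches only positive existence, each entry valid until the
// expires_at returned by the backing CAS. Non-existence is never cached.
type existenceCache struct {
    mu      sync.Mutex
    expires map[string]time.Time // digest hash -> expires_at
}

func (c *existenceCache) markExists(hash string, expiresAt time.Time) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.expires[hash] = expiresAt
}

// knownToExist reports whether we may answer "exists" locally; past
// expires_at the check falls through to the backing CAS, which bumps atime.
func (c *existenceCache) knownToExist(hash string, now time.Time) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    return now.Before(c.expires[hash])
}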

Anyway, I just wanted to share this little thing we ran into in case it's useful to someone. Feel free to mark the issue appropriately and close it.

Seth
