-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Gateway cache #630
WIP: Gateway cache #630
Conversation
There are even more parsing of functions in other parts of worker-service, after more investigation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a first set of comments. One of them question the whole approach so let's make sure we discuss those before moving on.
Addition to that I think what we also need is to prove this all works with integration tests, which are invoking workers through worker service both directly and via the API gateway (I'm still a bit confused by what do we call what now), with different scenarios
- invoking existing worker first time, then again when things are cached
- invoking new worker first time, then again
- above scenarios but new component versions added
- above scenarios but worker itself gets updated
- ...
this is something that's complicated and we don't have any coverage for on this level
self.get_export_function(parsed) | ||
} | ||
|
||
pub fn get_export_function( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is not a change in this PR, just for the record as now I'm looking at it, it is full of unnecessary clones. (functions()
clones exports then clones the function again, returns a new collection, then here it is cloned again, etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a refactoring of what was existing there, as in, moved the code, but sure I can change those clone bits
version: ComponentVersion, | ||
) -> Result<ComponentDetails, MetadataFetchError>; | ||
|
||
async fn get_active_component_in_worker( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
async fn get_active_component_in_worker( | |
async fn get_worker_component_version( |
// Incase there is a discrepancy between the association of worker-id -> version details | ||
// and the component_elements cache, it will get fixed in the next call. | ||
// ------------------------------------------------------------------------------------------- | ||
// The Race condition of worker-executor getting updated with another version of component-id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is a bit confusing because I think this (get_component_elements_from_cache_latest_version
) will only be used when there is no worker yet. So we cannot say "worker-executor gets updated with another version of component-id" - there is no worker at all at this point. What happens is that here we assume that the new worker will get component version C1 and then finally make the invocation, and there the worker executor will also fetch the latest component version C2 because it is not part of the invocation request and it is a new worker. There is no guarantee that C2 == C1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should have put this comment in the function above. I didn't mean to put this here, will change it straight away
.await | ||
} | ||
|
||
pub(crate) async fn get_component_elements_from_cache_latest_version( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should have some more descriptive name as it is doing something very specific.
How about assume_latest_version_for_new_worker
or something?
&self, | ||
worker_id: &WorkerId, | ||
) -> Result<ComponentElements, MetadataFetchError> { | ||
let latest_component_version_details = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately this still means that for the "dynamic invoke to create new worker" we do two remote calls:
- first to figure out the worker does not exist (calling worker executor)
- then here to figure out the latest component version
I don't see any obvious improvements to this within the context of this PR but let's take a step back and think about do we need all this?
IF we would not need component metadata (exported functions) for the Rib evaluator then we could just forward the invocation to the worker executor, which is a single point to decide what version it takes etc and also easier to do cache invalidation there (for example component service can notify all worker executors if a component is updated).
So why do we need to know the component interface for Rib? Can't it just treat everything that's a function invocation as a potential function invocation, and then worker executor returns with an error if it's not an existing one or not typechecking? It checks these things anyway (even on wasmtime level).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that, previous to this approach, we were simply forwarding everything to worker executor. That is Rib evaluation context hardly knew anything about this.
Having no symbol-table info for evaluating Rib (meaning we simply forward everything to worker-executor) is something I am a bit hesitant to do (that is revert). Even to forward everything to worker-executor and not type checking is a an approach that's always after the fact, meaning, the rib evaluator can still (possibly) make mistakes because it couldn't decide things properly on what to do due to missing type informations and worker-executor will always return error. I think forwarding things as such is also hacky approach, if we think about any future improvements for Rib. If the boundary of Rib is well set (for example: foo(1)
is always a function call and not constructing a variant), then may be forwarding everything to worker may make some more sense though.
Open to discussions there, but it will be more of reverting to where we were before , than changing this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unfortunate if it would be a (partial) revert but I feel like we did not properly think through every aspect of this and that's why we are in this current state. We can continuously learn and reevaluate or idea of what we need to do.
I don't agree that leaving the validation to the worker executor is "hacky" - that's a perfectly valid validation point and done where every necessary information must be available anyway to run the worker.
Could you, to help the discussion, collect a more concrete list of what do we need the above mentioned symbol table info for - and what do we sacrifice if we would not have it? It is not clear to me. Is it just because of the "worker invocation looks like an inline function call" feature that I remember we were arguing about, or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also if we will support the new dynamic dispatch feature, we won't have any metadata for those components.
The "worker proxy" part of worker service shouldn't require anything else than the cached routing table for forwarding invocations. Right now it requires type information but only for the json invocation api because it's ambiguous and cannot be interpreted without knowing the types. But that interpreter can be moved one layer down to the executor.
The worker bridge part requires the metadata for evaluating Rib and I more and more think that it's a mistake
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The worker bridge part requires the metadata for evaluating Rib and I more and more think that it's a mistake
Is this a mistake? Or it is becoming difficult due to dynamic dispatch feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a mistake because it requires a very complex (or even impossible) caching logic and/or performance penalty, as I tried to pointing out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do feel like there is some place where validation must be possible, even though it doesn't have to be Worker Gateway. The reason is tooling and DX: when a user is writing a Rib script, auto-complete should be possible, and type errors, to the extent they can be known in advance, should be eagerly reported. However, these could occur in some service that is separate from Worker Gateway, and which exists to do pre-compilation of Rib, typing, validation, auto-complete, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we would have such a feature it would be tied to a specific component version and not a specific worker. So I could imagine an explicit user triggered validation that contains the targeted component version as a parameter and would be triggered during deployments or live editing in the UI.
We could also permanently associate a component version with a api definition but that would lead to the same problem as we don't have per worker information (what version they use) on the worker service level
pub struct NoopComponentMetadataService; | ||
|
||
#[async_trait] | ||
impl ComponentMetadataService for NoopComponentMetadataService { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way: why do we have a Noop
implementation for all the things in worker/component services?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Carry forward of some history. Hope I can delete.
@@ -418,6 +503,57 @@ where | |||
Ok(get_json_from_typed_value(&typed_value.result)) | |||
} | |||
|
|||
async fn invoke_and_await_parsed_function( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need this, shouldn't all the other variants in this file use this to avoid redundancy?
(And it also feels weird that we need to parse the function name on this level - do we?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean do we need to have it parsed at all instead of just passing it down to worker executor where it is parsed anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rib already parsed a call as a function-call. Obviously, in future it will be more like a PreparedStatement as John mentioned. But logically, yes, it already parsed most of the details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more notes:
There is already a type-checker in the peripherals (component-service/worker-service), before calling worker-executor which we decided last year May. If that's the case, then why is it that having the full parsed-function-name in the same service is a wrong/suboptimal choice now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not answer my question, why do we need a ParsedFunctoinName
for the invocation where the invoke API gets the function name as a string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It implies the worker-executor should be responsible for mapping the Val against the Metadata and return a TypeAnnotatedValue
This is one possibility, and it allows to keep the current JSON response format, but increases the response payload size unnecessarily for cases when it is not needed.
There are (at least) two more possibilities to consider:
- if types for the result are only for the JSON format we can potentially change the JSON format itself, but in this case without any field and constructor names (basically
Val
directly serialized to JSON), it would be a very user experience - We could make
Val
a bit more self-describing - basically having field/enum/flag/constructor names that would make it enough alone to convert to JSON but is probably a bit less information, but then it would be always coupled with this information even where it's not needed.
So probably your suggestion is the best out of these three.
Note that I think it is a good place to do this mapping in the worker-executor because it is natural to have the component metadata available (as it even has the whole component binary downloaded and cached).
This definitely implies, the function-invocation should support accepting type annotated value
I don't understand why it would need that. It currently works by just getting
repeated wasm.rpc.Val input = 3;
so it means we are not sending any type information to it anyway as it already downloads and uses it internally.
If we somehow have a TypeAnnotatedValue
(from Rib evaluation?) we can just convert it to Val
as you wrote.
The problem is if we have a JSON from the REST API, and what I wrote above (and seems like you also writing it) is that the responsibility of understanding that JSON format can be moved to the worker executor as an optimisation, basically the same way as we had a gRPC endpoint in worker service for a while accepting JSON, during the migration from Scala. (It is deleted now).
Yes it feels more correct to have this mapping from JSON to Val in the worker bridge as it is similar a bit (just very simplified) to evaluating a Rib on a request.
But I think it worths it if it can avoid us running into a very difficult cache invalidation / race condition problem which triggered the whole discussion.
If the responsibility of the PR is avoid calling metadata download for Rib, sure I can do, but we are doing that anyway in worker-service. So services that don't use Rib, but use these interfaces end up smashing component service.
I think the original ticket's purpose was not "Rib only" but to solve the repeated metadata download in worker service by caching (no matter which part of worker service). This might have been confusing because we have a confusing naming around worker service (proxy, bridge, api gateway, rib) that we should improve. (We have two "components" merged into one service and some old and new naming and we are quite randomly using them in discussions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is caching metadata easy in worker executor and hard in the worker service?
Component metadata is per component version and which one we need depends on the worker:
- a worker gets an associated component version when it is first created, by the worker executor
- it can be updated later using the hot update feature
In would seem like the first one could be moved into the worker service level, so it would always figure out the component version before calling the worker executor. But it's not good because there are cases where a worker is spawn without touching the worker executor, for example when doing a local RPC call within the executor.
The second is even more tricky because even if all hot update request goes through worker service, it is just an attempt to update the worker, which gets enqueued and later tried, and can potentially fail. So the exact time when a worker starts using a new component version cannot be determined only by polling the worker executor.
This is all complicated by the fact that worker service is horizontally scalable and load balanced so requests for the same worker potentially go through different nodes of worker service which all need to properly deal with these invalidation scenarios etc.
Even if we would ignore the above points about moving the "which component version should I use for a new worker" question and somehow move that logic to the worker service, then we have the issue of invalidating this knowledge in our system. Which would require enumerating all worker service nodes and the problem is that cannot be done atomically, and requests to the same worker are potentially arriving into different worker service nodes because it is load balanced.
Now compare all this with just doing this on the worker executor level:
- it is sharded - if we cache something for a worker that's always in the same executor (until rebalancing)
- it is already caching whole wasmtime components (a result of downloading and compiling a component from component service)
- it knows exactly when an update succeeds and can update its internal caches - and no need to do any cross-node invalidation because of it being sharded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is still missing from the conversation is to know what exactly would we loose by not having component metadata for the Rib evaluation. Could you describe that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It offers a better reliable developer experience, there must be some place where we can do type checking, validation, resolving references; pre-compilation, validating references to functions, etc.
Say this for example:
let x = foo(1)
is my Rib expression. What exactly is foo
here? A variant or a function call? Rib doesn't know it, unless it knows about the type details of the component user is referring to?
Another example
let x = my_exported_function(1, {})
. Is {}
an empty flag or an empty record ? Both flag and record starts and ends with a curly braces. The answer depends on the function parameter types which Rib doesn't know without downloading the metadata.
Anyway this explanation doesn't mean I want this download again.
I think for performance benefits, which is super important, I am removing these downloads of metadata at Rib side.
And as we discussed, we may need to do the same for some other functionalities of worker-service which you explained in detail to my proposal/steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It offers a better reliable developer experience, there must be some place where we can do type checking, validation, resolving references; pre-compilation, validating references to functions, etc.
But all these things doesn't make sense to me in this context of invocation
When you develop the rib expression you don't have a particular worker you are invoking. So this validation at the time of invocation doesn't help with it at all
.iter() | ||
.map(|x| x.clone().into()) | ||
.collect() | ||
} | ||
} | ||
|
||
async fn get_proto_invoke_result( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a weird function name, as under the hood it does invoke a worker. Isn't this the actual implementation of invoking a worker?
@@ -107,7 +105,7 @@ impl ResolvedWorkerBinding { | |||
pub async fn execute_with<R>( | |||
&self, | |||
evaluator: &Arc<dyn Evaluator + Sync + Send>, | |||
worker_metadata_fetcher: &Arc<dyn WorkerMetadataFetcher + Sync + Send>, | |||
symbol_fetch: &Arc<dyn ComponentElementsFetch + Sync + Send>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will rename this in the next commit
Why is it all in integration tests to test these logic? Can you help resolve the following doubts/suggestions?
"Worker exist can be simulated by a test interface that actually returns a worker metadata". With that we invoke first and invoke again, to see if cache is being used
Again, this is the case when worker is not returning me any metadata
If this is the case, where we preloaded the execution context with some versions and metadata and made some runs and later, however, worker is working with a different version can also be simulated, and see if it retries and invalidates the stale metadata. Couldn't it be? I understand the trustability of integration tests, but feeding this complex logic to integration test - I am not that fully convinced. If this is absolutely not reproducible in a unit test is what you think, probably you are right and I am yet to discover :) |
I'm sure you can test this in the ways you described, but for me it feels like it would require writing tons of test code to somehow simulate all these scenarios with test implementations / mocks etc, while with our integration test framework it's straightforward and you just have to write down the tests (with statements like upload component, start worker, invoke worker etc through the test dsl (which matches the component and worker api more or less). Even if it would be tested in the way you describe, I would feel more confident if we also have integration tests showing this all works with the real implementations as well. |
On tests, let's have a discussion first on the following and then I spend time there:
|
I think I am in terms with this explanation |
As agreed, there will be a separate PR, which will delete the metadata. I am closing this PR :) |
This is still WIP due to: https://zivergeteam.slack.com/archives/C057S2E4XT5/p1719840912374849
Fix #586
ParsedFunctionName
(because Rib already parsed the input function toParsedFunctionName
.