Introduce Split operation #33
Conversation
Signed-off-by: Michael X. Grey <[email protected]>
```rust
/// If the chain's response implements the [`Splittable`] trait, then this
/// will insert a split and provide a container for its available outputs.
/// To build connections to these outputs later, use [`SplitOutputs::build`].
```
I think it would help in the docs to indicate that `split_outputs`, `split_as_list`, `split_as_list_outputs`, `split_as_map`, `split_as_map_outputs`, etc. are convenient wrappers and are equivalent to some short snippets of code.
Added in 8d53e85
src/chain/split.rs (Outdated)
```rust
/// Return true if the key is feasible for this type of split, otherwise
/// return false. Returning false will cause the user to receive a
/// [`SplitConnectionError::IndexOutOfBounds`]. This will also cause iterating
```
I believe `SplitConnectionError::IndexOutOfBounds` was refactored to `SplitConnectionError::KeyOutOfBounds`.
Good catch, fixed: 6690273
src/chain/split.rs (Outdated)
```rust
/// value will be sent to the target associated with the key.
///
/// If there is no connection associated with the specified key, the value
/// will be returned as [`Err`].
```
I believe it now returns `None`.
In fact the entire description of the function is out of date. Fixed here: 6690273
```rust
for (index, value) in self.contents.into_iter().enumerate() {
    match dispatcher.outputs_for(&ListSplitKey::Sequential(index)) {
        Some(outputs) => {
            outputs.push((index, value));
        }
        None => {
            if let Some(outputs) = dispatcher.outputs_for(&ListSplitKey::Remaining) {
                outputs.push((index, value));
            }
        }
    }
}
```
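For context, here is a minimal std-only sketch of the dispatch pattern in the loop above, with simplified stand-ins for `ListSplitKey` and the dispatcher's output buckets (the names and signatures are illustrative, not the crate's actual API): each item prefers an exact `Sequential` match and otherwise falls back to a `Remaining` connection.

```rust
use std::collections::HashMap;

// Simplified stand-in for the crate's ListSplitKey.
#[derive(PartialEq, Eq, Hash)]
enum ListSplitKey {
    Sequential(usize),
    Remaining,
}

// Route each element of a list to an output bucket: exact Sequential
// match first, Remaining as the fallback, drop if neither is connected.
fn dispatch(
    contents: Vec<&str>,
    connections: &HashMap<ListSplitKey, usize>,
    outputs: &mut Vec<Vec<(usize, String)>>,
) {
    for (index, value) in contents.into_iter().enumerate() {
        let slot = connections
            .get(&ListSplitKey::Sequential(index))
            .or_else(|| connections.get(&ListSplitKey::Remaining));
        if let Some(&slot) = slot {
            outputs[slot].push((index, value.to_string()));
        }
    }
}

fn main() {
    let mut connections = HashMap::new();
    connections.insert(ListSplitKey::Sequential(0), 0);
    connections.insert(ListSplitKey::Remaining, 1);
    let mut outputs = vec![Vec::new(), Vec::new()];
    dispatch(vec!["a", "b", "c"], &connections, &mut outputs);
    // Bucket 0 receives item 0; the Remaining bucket receives items 1 and 2.
    println!("{:?}", outputs);
}
```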
iiuc we iterate through everything even if some of the indexes are not used. Since `SplitDispatcher` keeps track of the connections, is it possible to use that to avoid iterating?
I don't see an obvious alternative that will achieve better performance in all situations. We either have to iterate through the elements in the collection (as we are currently doing) or we have to iterate through the keys, and there's no guarantee for which will be larger. As long as we support `Sequential`, iterating through the keys will still require us to iterate through the collection, at least up until the highest sequence number in the keys. And if there's a `Remaining` key then we have to iterate through every item in the collection no matter what.
I'd suggest we move forward with this implementation as-is and we can introduce benchmarks in the future when our features are stabilized and maximizing performance becomes the top priority. I believe if we do want to optimize the implementation, we'll be able to do it without breaking the current API.
```rust
pub(crate) connections: &'a HashMap<Key, usize>,
pub(crate) outputs: &'a mut Vec<Vec<Item>>,
```
I don't know the reasoning for this structure; at first glance, it seems that it should be possible to use `HashMap<Key, Vec<Item>>` instead of mapping to an index of a vec?
There are a few reasons for this:
1. `HashMap` can scatter the allocation of its buckets, whereas `Vec` is guaranteed to use contiguous memory, which reduces cache misses
2. We know the exact size of `ForkTargetStorage` in advance, so we can pre-size this `Vec` to match its elements with `ForkTargetStorage`
3. Inside `OperateSplit::execute` we can iterate over this `Vec` zipped with `ForkTargetStorage` to drain the outputs and send them to their appropriate targets
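A std-only sketch of that layout (illustrative names, not the crate's actual types): the connections map resolves a key to an index into a pre-sized, contiguous `Vec` of output buckets, which can later be drained in order by position rather than via a second key lookup.

```rust
use std::collections::HashMap;

// Illustrative stand-in for the dispatcher's storage: keys resolve to an
// index into a pre-sized Vec of output buckets rather than owning the
// buckets themselves, so draining is a single in-order pass.
struct Dispatcher<'a, Key, Item> {
    connections: &'a HashMap<Key, usize>,
    outputs: &'a mut Vec<Vec<Item>>,
}

impl<'a, Key: std::hash::Hash + Eq, Item> Dispatcher<'a, Key, Item> {
    fn outputs_for(&mut self, key: &Key) -> Option<&mut Vec<Item>> {
        let slot = *self.connections.get(key)?;
        self.outputs.get_mut(slot)
    }
}

fn main() {
    let connections: HashMap<&str, usize> = HashMap::from([("first", 0), ("rest", 1)]);
    // Pre-sized to match the number of targets, mirroring ForkTargetStorage.
    let mut outputs: Vec<Vec<i32>> = vec![Vec::new(); 2];
    let mut dispatcher = Dispatcher { connections: &connections, outputs: &mut outputs };
    dispatcher.outputs_for(&"first").unwrap().push(1);
    dispatcher.outputs_for(&"rest").unwrap().push(2);
    // Draining pairs each bucket with its target by position; no second
    // hash lookup is needed at drain time.
    for (slot, bucket) in outputs.iter_mut().enumerate() {
        println!("target {slot}: {:?}", std::mem::take(bucket));
    }
}
```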
I'll also mention that it's very important for us to store the target entities in `ForkTargetStorage` to ensure that the connect operation and reachability calculations work correctly.
> HashMap can scatter the allocation of its buckets whereas Vec is guaranteed to use contiguous memory which reduces cache misses

I think this case is a bit different: `outputs` is a vec of vecs, and each of the inner `Vec<Item>`s would have its own memory allocation.
That's true, but `Vec<Vec<T>>` will still be less scattered than `HashMap<K, Vec<T>>`. I've considered using `Vec<SmallVec<[T; 1]>>` since most of the time I expect only one output per key, but that can be saved for future iterations on performance.
But the main reason to prefer `Vec` here is because `HashMap` would force us to do a second key lookup when it's time to drain the outputs.
Aside from the performance overhead, this would mandate that only serializable responses can be fork cloned, unzipped, split, etc. This affects not just unzip; atm any operations that spawn new entities cannot be chained without the original type information. The solution in #27 is to store function pointers at registration time where we still have the type information, but the downside is that any chain operations will require registering a new function pointer.
I'm personally okay with this limitation, especially for an initial implementation. But I'm also okay with using a function pointer registry (essentially a vtable) if it's feasible for us to do so.
By this I assume you mean if the user wants multiple transformations to happen while passing the output of one node to the input slot of another node, then there's a possibility that some of those intermediate data types won't have their vtables registered and therefore we won't be able to generate the transformations? That's the situation I was afraid of, and that's the biggest reason that I'd suggest we start out only allowing transformation operations to be performed on serializable data, because we're guaranteed to be able to implement all the operations for serializable types.
Signed-off-by: Michael X. Grey <[email protected]>
Depends on the definition of a "transformation". Something like fork clone would in theory never have this issue; unzip would only have the issue when there is an unzip to terminate. Unzipping from one node to another should always work because we don't need to do any serialization, but when we unzip to terminate we need to serialize something which may not be registered. Transformations where we go from one type to another (like `Foo -> Bar`) would depend on the context: if you have a node (a map block at the simplest) that does the transformation, then it should always work because we registered that node. But if we want to transform `Foo -> {unknown}`, where the target type is only known at runtime, then I don't see any way for it to work without serialization or reflection. Even with serialization, we still need a vtable-like registry to deserialize to the target type.
I think that would be very hard to implement: instead of checking if the `TypeId` matches, we need to know the traits `T` implements, and I don't believe that's possible at runtime.
Right, when I'm talking about "transformation operations" I mean non-node operations like unzip, split, join, and eventually CEL transformations once we support those (technically CEL will be implemented as nodes, but we'll be generating those nodes at runtime rather than having a user pre-register them). For non-serializable transformations, I agree with exactly what you said: the user needs to register a precompiled node to do it. Then the serialized/dynamic workflow can just move that non-serializable data directly from the output of one node to the input of their transformation node.
I think the user would need to explicitly register that this trait is implemented while compiling. That's clunky enough that I'm happy to shelf that idea.
This PR introduces a new operation: split.
Split fills in a gap that has existed between unzip and spread:
Split is like an unzip that acts on collections. The split operation supports keys that you can latch downstream connections onto. The nature of these keys can vary based on the specific type of data that is being split. For example, splitting a list-like object lets you make different connections based on the sequence of items, while splitting a map-like object lets you make different connections based on specific keys in the map and/or based on the order that items appear in the map.
Unlike unzip, all items produced in a split must have the same output type. This will be the core differentiator between the two operations.
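To illustrate the distinction with a hedged, self-contained sketch (the functions here are hypothetical illustrations, not the crate's API): unzip distributes the heterogeneously typed elements of a tuple to separately typed outputs, while split distributes same-typed items of a collection and only uses the key to decide which connection receives each item.

```rust
// Unzip: a (String, i32) tuple forks into one String output and one i32
// output; the element types may differ. (Hypothetical illustration.)
fn unzip(pair: (String, i32)) -> (String, i32) {
    pair
}

// Split: every item pulled out of the collection has the same type; the
// key (here the index) only decides which connection receives it.
fn split(list: Vec<i32>) -> Vec<(usize, i32)> {
    list.into_iter().enumerate().collect()
}

fn main() {
    let (name, count) = unzip(("widgets".to_string(), 3));
    println!("{name}: {count}");
    for (key, item) in split(vec![10, 20, 30]) {
        println!("output for key {key} receives {item}");
    }
}
```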
@koonpeng after this is merged into main, I'll merge it into your branch for #27 and then I'll implement `split` for `serde_json::Value`. After that I think we shouldn't need to worry about unzip for serialized workflows: if users want to split a value, they can serialize it, split it, then deserialize it. We can mandate (at least for our initial releases) that values which can't be serialized also can't be split/unzipped by a serialized workflow.