
Introduce Split operation #33

Merged: 8 commits merged into main on Nov 27, 2024
Conversation

@mxgrey (Contributor) commented Nov 25, 2024

This PR introduces a new operation: split.

Split fills in a gap that has existed between unzip and spread:

  • unzip breaks apart a tuple and allows each element to be sent to a different connection
  • spread separates the elements of a collection and feeds them to a single connection one at a time

Split is like an unzip that acts on collections. The split operation supports keys that you can latch downstream connections onto. The nature of these keys can vary based on the specific type of data being split. For example, splitting a list-like object lets you make different connections based on the sequence of items, while splitting a map-like object lets you make different connections based on specific keys in the map and/or the order in which items appear in the map.

Unlike unzip, all items produced in a split have to have the same output type. This will be the core differentiator between the two operations.
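As a standalone sketch of that distinction (this is not the crate's API; the ListSplitKey variants mirror this PR's diff, but the routing logic here is purely illustrative):

#[derive(Debug)]
enum ListSplitKey {
    // A downstream connection latched onto the i-th element.
    Sequential(usize),
    // Catch-all for elements with no sequential connection.
    Remaining,
}

fn main() {
    // unzip: heterogeneous tuple, one differently-typed output per position.
    let (count, label): (u32, &str) = (3, "three");
    println!("unzip -> {count} and {label}");

    // split: homogeneous collection, so every output shares one item type,
    // and downstream connections are chosen by key rather than by position
    // in a type list.
    let connected = [0_usize, 2]; // suppose connections exist for indices 0 and 2
    for (i, value) in ["a", "b", "c", "d"].into_iter().enumerate() {
        let key = if connected.contains(&i) {
            ListSplitKey::Sequential(i)
        } else {
            ListSplitKey::Remaining
        };
        println!("split -> {key:?}: {value}");
    }
}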

@koonpeng after this is merged into main, I'll merge it into your branch for #27 and then I'll implement split for serde_json::Value. After that I think we shouldn't need to worry about unzip for serialized workflows: if users want to split a value, they can serialize it, split it, then deserialize it. We can mandate (at least for our initial releases) that values which can't be serialized also can't be split/unzipped by a serialized workflow.
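For a rough sense of what splitting a serde_json::Value could look like (illustrative sketch only; the actual Splittable impl for serde_json::Value had not been written at this point):

use serde_json::Value;

// Illustrative sketch only; not the eventual Splittable impl.
fn split_value(value: Value) -> Vec<(String, Value)> {
    match value {
        // Map-like: split by the object's keys.
        Value::Object(map) => map.into_iter().collect(),
        // List-like: split by position, rendered here as string keys.
        Value::Array(list) => list
            .into_iter()
            .enumerate()
            .map(|(i, v)| (i.to_string(), v))
            .collect(),
        // Scalars have nothing to split apart; hand back a placeholder entry.
        other => vec![(String::new(), other)],
    }
}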

Signed-off-by: Michael X. Grey <[email protected]>
@mxgrey requested a review from koonpeng November 25, 2024 13:28
Comment on lines +625 to +627
/// If the chain's response implements the [`Splittable`] trait, then this
/// will insert a split and provide a container for its available outputs.
/// To build connections to these outputs later, use [`SplitOutputs::build`].
@koonpeng (Collaborator)

I think it would help in the docs to indicate that split_outputs, split_as_list, split_as_list_outputs, split_as_map, split_as_map_outputs, etc. are convenience wrappers, each equivalent to a short snippet of code.

@mxgrey (Author)

Added in 8d53e85


/// Return true if the key is feasible for this type of split, otherwise
/// return false. Returning false will cause the user to receive a
/// [`SplitConnectionError::IndexOutOfBounds`]. This will also cause iterating
@koonpeng (Collaborator)

I believe SplitConnectionError::IndexOutOfBounds has been refactored to SplitConnectionError::KeyOutOfBounds

@mxgrey (Author)

Good catch, fixed: 6690273

/// value will be sent to the target associated with the key.
///
/// If there is no connection associated with the specified key, the value
/// will be returned as [`Err`].
@koonpeng (Collaborator)

I believe it now returns None

@mxgrey (Author)

In fact the entire description of the function is out of date. Fixed here: 6690273

Comment on lines +468 to +479
for (index, value) in self.contents.into_iter().enumerate() {
    match dispatcher.outputs_for(&ListSplitKey::Sequential(index)) {
        Some(outputs) => {
            // A connection latched onto this exact position.
            outputs.push((index, value));
        }
        None => {
            // No sequential connection; fall back to the catch-all, if any.
            if let Some(outputs) = dispatcher.outputs_for(&ListSplitKey::Remaining) {
                outputs.push((index, value));
            }
        }
    }
}
@koonpeng (Collaborator)

If I understand correctly, we iterate through everything even if some of the indexes are not used. Since SplitDispatcher keeps track of the connections, is it possible to use that to avoid iterating?

@mxgrey (Author)

I don't see an obvious alternative that would achieve better performance in all situations. We either have to iterate through the elements in the collection (as we currently do) or iterate through the keys, and there's no guarantee of which will be larger. As long as we support Sequential, iterating through the keys would still require us to iterate through the collection, at least up to the highest sequence number in the keys. And if there's a Remaining key then we have to iterate through every item in the collection no matter what.

I'd suggest we move forward with this implementation as-is and we can introduce benchmarks in the future when our features are stabilized and maximizing performance becomes the top priority. I believe if we do want to optimize the implementation, we'll be able to do it without breaking the current API.

Comment on lines +351 to +352
pub(crate) connections: &'a HashMap<Key, usize>,
pub(crate) outputs: &'a mut Vec<Vec<Item>>,
@koonpeng (Collaborator)

I don't know the reasoning for this structure. At first glance it seems that it should be possible to use HashMap<Key, Vec<Item>> instead of mapping to an index of a Vec?

@mxgrey (Author)

There are a few reasons for this (see the sketch below):

  1. HashMap can scatter the allocation of its buckets, whereas Vec is guaranteed to use contiguous memory, which reduces cache misses
  2. We know the exact size of ForkTargetStorage in advance, so we can pre-size this Vec to match its elements with ForkTargetStorage
  3. Inside OperateSplit::execute we can iterate over this Vec zipped with ForkTargetStorage to drain the outputs and send them to their appropriate targets
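A standalone sketch of that layout (the field names come from the diff at lines +351 to +352; the lifetimes, bounds, and method body are simplified assumptions):

use std::collections::HashMap;
use std::hash::Hash;

struct SplitDispatcher<'a, Key, Item> {
    // Maps each key to a slot index in `outputs`.
    connections: &'a HashMap<Key, usize>,
    // One pre-sized bucket per target in ForkTargetStorage, so draining can
    // zip this Vec with the target list instead of doing key lookups again.
    outputs: &'a mut Vec<Vec<Item>>,
}

impl<'a, Key: Hash + Eq, Item> SplitDispatcher<'a, Key, Item> {
    fn outputs_for(&mut self, key: &Key) -> Option<&mut Vec<Item>> {
        let index = *self.connections.get(key)?;
        self.outputs.get_mut(index)
    }
}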

@mxgrey (Author)

I'll also mention that it's very important for us to store the target entities in ForkTargetStorage to ensure that the connect operation and reachability calculations work correctly.

@koonpeng (Collaborator)

HashMap can scatter the allocation of its buckets, whereas Vec is guaranteed to use contiguous memory, which reduces cache misses

I think this case is a bit different: outputs is a vec of vecs, so each of the inner Vec<Item> would have its own memory allocation.

@mxgrey (Author)

That's true, but Vec<Vec<T>> will still be less scattered than HashMap<K, Vec<T>>. I've considered using Vec<SmallVec<[T; 1]>> since most of the time I expect only one output per key, but that can be saved for future iterations on performance.

But the main reason to prefer Vec here is because HashMap would force us to do a second key lookup when it's time to drain the outputs.
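A sketch of the drain step being described (the function name and the u64 stand-in for the target entity type are hypothetical):

fn drain_outputs<Item>(
    outputs: &mut Vec<Vec<Item>>,
    targets: &[u64], // u64 is a stand-in for the entities in ForkTargetStorage
    send: impl Fn(u64, Item),
) {
    // `outputs` is ordered to match `targets`, so sending is a straight zip
    // with no per-key HashMap lookup at drain time.
    for (bucket, target) in outputs.iter_mut().zip(targets) {
        for item in bucket.drain(..) {
            send(*target, item);
        }
    }
}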

@koonpeng (Collaborator)

After that I think we shouldn't need to worry about unzip for serialized workflows: If users want to split a value, they can serialize it, split it, then deserialize it.

Aside from the performance overhead, this would mandate that only serializable responses can be fork-cloned, unzipped, split, etc. This affects not just unzip: at the moment, any operations that spawn new entities cannot be chained without the original type information. The solution in #27 is to store function pointers at registration time, while we still have the type information, but the downside is that any chain operations will require registering a new function pointer.

@mxgrey (Author) commented Nov 26, 2024

this would mandate that only serializable responses can be fork cloned, unzipped, split etc

I'm personally okay with this limitation, especially for an initial implementation. But I'm also okay with using a function pointer registry (essentially a vtable) if it's feasible for us to do so.

any chain operations will require registering a new function pointer.

By this I assume you mean if the user wants multiple transformations to happen while passing the output of one node to the input slot of another node, then there's a possibility that some of those intermediate data types won't have their vtables registered and therefore we won't be able to generate the transformations?

That's the situation I was afraid of, and that's the biggest reason that I'd suggest we start out only allowing transformation operations to be performed on serializable data, because we're guaranteed to be able to implement all the operations for serde_json::Value. For non-serializable data, we can only support passing directly from the output of one node to the input of another node (maybe we can support connecting Output<T> to InputSlot<U> if T: Into<U>, but I wouldn't prioritize this for an initial draft). Maybe also support fork_clone for data that implements Clone since that's likely to be an important one. My rationale is this:

  1. I think it would be a confusing user experience if they can perform a single operation on an output and then just hit a brick wall.
  2. A single go / no-go criteria (serializable) to unlock all transformations at once should be easy enough for us to convey to users.
  3. Non-serializable structures are somewhat unusual in the context of dynamically built workflows, and there are two likely reasons for them to appear:
    1. The fields of the data structure represent ownership of a unique resource rather than plain data, in which case it probably should not be transformed
    2. The author of the struct forgot to #[derive(Serialize, Deserialize)], in which case they should be asked to do so
  4. For dynamic workflows, the primary concern is flexibility while performance is a secondary concern. If a user wants to avoid serializing to achieve maximum performance then they can always compile portions of their workflow.

@koonpeng (Collaborator)

By this I assume you mean if the user wants multiple transformations to happen while passing the output of one node to the input slot of another node, then there's a possibility that some of those intermediate data types won't have their vtables registered and therefore we won't be able to generate the transformations?

Depends on the definition of a "transformation": something like fork clone would in theory never have this issue, and unzip would only have the issue when there is an unzip to terminate. Unzipping from one node to another should always work because we don't need to do any serialization, but when we unzip to terminate we need to serialize something which may not be registered.

Transformations where we go from one type to another (like Foo -> Bar) would depend on the context: you could have a node (a map block at the simplest) that does the transformation, and it should always work because we registered that node. But if we want to transform Foo -> {unknown}, where the target type is only known at runtime, then I don't see any way for it to work without serialization or reflection. Even with serialization, we still need a vtable-like registry to deserialize to the target type.

maybe we can support connecting Output<T> to InputSlot<U> if T: Into<U>

I think that would be very hard to implement: instead of checking if the TypeId matches, we would need to know the traits T implements, and I don't believe bevy_reflect can extract that info either. It might be impossible without some very complex build steps; proc macros by design don't have that info.
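A minimal sketch of the "function pointers at registration time" idea mentioned above (the names and shapes here are assumptions for illustration, not the actual #27 design):

use std::any::Any;
use std::collections::HashMap;

// The concrete type is still known at registration time, so we can capture a
// monomorphized deserialize function and look it up later by name.
type DeserializeFn = fn(serde_json::Value) -> Result<Box<dyn Any>, serde_json::Error>;

fn deserialize_into<T: serde::de::DeserializeOwned + 'static>(
    value: serde_json::Value,
) -> Result<Box<dyn Any>, serde_json::Error> {
    serde_json::from_value::<T>(value).map(|v| Box::new(v) as Box<dyn Any>)
}

fn register<T: serde::de::DeserializeOwned + 'static>(
    registry: &mut HashMap<&'static str, DeserializeFn>,
    name: &'static str,
) {
    registry.insert(name, deserialize_into::<T>);
}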

@mxgrey (Author) commented Nov 26, 2024

you could have a node (a map block at the simplest) that does the transformation then it should always work as we registered that node. But if we want to transform Foo -> {unknown}, where the target type is only known at runtime then I don't see any way for it to work without serialization or reflection

Right, when I'm talking about "transformation operations" I mean non-node operations like unzip, split, join, and CEL transformations once we eventually support those (technically CEL will be implemented as nodes, but we'll be generating those nodes at runtime rather than having a user pre-register them).

For non-serializable transformations, I agree with exactly what you said: The user needs to register a precompiled node to do it. Then the serialized/dynamic workflow can just move that non-serializable data directly from the output of one node to the input of their transformation node.

It might be impossible without some very complex build steps, proc macros by design dont have that info.

I think the user would need to explicitly register, at compile time, that the trait is implemented. That's clunky enough that I'm happy to shelve that idea.

@mxgrey merged commit 85501df into main Nov 27, 2024
5 checks passed
@mxgrey deleted the split branch November 27, 2024 04:19