
Introduce Split operation #33

Merged: 8 commits merged into main on Nov 27, 2024
Conversation

@mxgrey (Contributor) commented Nov 25, 2024

This PR introduces a new operation: split.

Split fills in a gap that has existed between unzip and spread:

  • unzip breaks apart a tuple and allows each element to be sent to a different connection
  • spread separates the elements of a collection and feeds them to a single connection one at a time

Split is like an unzip that acts on collections. The split operation supports keys that you can latch downstream connections onto. The nature of these keys can vary based on the specific type of data being split. For example, splitting a list-like object lets you make different connections based on the sequence of items, while splitting a map-like object lets you make different connections based on specific keys in the map and/or the order in which items appear in the map.

Unlike unzip, all items produced in a split have to have the same output type. This will be the core differentiator between the two operations.
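As a standalone sketch of that distinction (this is not the crate's API; the ListSplitKey variants mirror this PR's diff, but the routing logic here is purely illustrative):

#[derive(Debug)]
enum ListSplitKey {
    // A downstream connection latched onto the i-th element.
    Sequential(usize),
    // Catch-all for elements with no sequential connection.
    Remaining,
}

fn main() {
    // unzip: heterogeneous tuple, one differently-typed output per position.
    let (count, label): (u32, &str) = (3, "three");
    println!("unzip -> {count} and {label}");

    // split: homogeneous collection, so every output shares one item type,
    // and downstream connections are chosen by key rather than by position
    // in a type list.
    let connected = [0_usize, 2]; // suppose connections exist for indices 0 and 2
    for (i, value) in ["a", "b", "c", "d"].into_iter().enumerate() {
        let key = if connected.contains(&i) {
            ListSplitKey::Sequential(i)
        } else {
            ListSplitKey::Remaining
        };
        println!("split -> {key:?}: {value}");
    }
}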

@koonpeng after this is merged into main, I'll merge it into your branch for #27 and then I'll implement split for serde_json::Value. After that I think we shouldn't need to worry about unzip for serialized workflows: if users want to split a value, they can serialize it, split it, then deserialize it. We can mandate (at least for our initial releases) that values which can't be serialized also can't be split/unzipped by a serialized workflow.
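For a rough sense of what splitting a serde_json::Value could look like (illustrative sketch only; the actual Splittable impl for serde_json::Value had not been written at this point):

use serde_json::Value;

// Illustrative sketch only; not the eventual Splittable impl.
fn split_value(value: Value) -> Vec<(String, Value)> {
    match value {
        // Map-like: split by the object's keys.
        Value::Object(map) => map.into_iter().collect(),
        // List-like: split by position, rendered here as string keys.
        Value::Array(list) => list
            .into_iter()
            .enumerate()
            .map(|(i, v)| (i.to_string(), v))
            .collect(),
        // Scalars have nothing to split apart; hand back a placeholder entry.
        other => vec![(String::new(), other)],
    }
}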

Signed-off-by: Michael X. Grey <[email protected]>
@mxgrey requested a review from koonpeng November 25, 2024 13:28
Comment on lines +625 to +627
/// If the chain's response implements the [`Splittable`] trait, then this
/// will insert a split and provide a container for its available outputs.
/// To build connections to these outputs later, use [`SplitOutputs::build`].
@koonpeng (Collaborator)

I think it would help in the docs to indicate that split_outputs, split_as_list, split_as_list_outputs, split_as_map, split_as_map_outputs, etc. are convenience wrappers, each equivalent to a short snippet of code.

@mxgrey (Author)

Added in 8d53e85


/// Return true if the key is feasible for this type of split, otherwise
/// return false. Returning false will cause the user to receive a
/// [`SplitConnectionError::IndexOutOfBounds`]. This will also cause iterating
@koonpeng (Collaborator)

I believe SplitConnectionError::IndexOutOfBounds has been refactored to SplitConnectionError::KeyOutOfBounds

@mxgrey (Author)

Good catch, fixed: 6690273

/// value will be sent to the target associated with the key.
///
/// If there is no connection associated with the specified key, the value
/// will be returned as [`Err`].
@koonpeng (Collaborator)

I believe it now returns None

@mxgrey (Author)

In fact the entire description of the function is out of date. Fixed here: 6690273

Comment on lines +468 to +479
for (index, value) in self.contents.into_iter().enumerate() {
    match dispatcher.outputs_for(&ListSplitKey::Sequential(index)) {
        Some(outputs) => {
            // A connection latched onto this exact position.
            outputs.push((index, value));
        }
        None => {
            // No sequential connection; fall back to the catch-all, if any.
            if let Some(outputs) = dispatcher.outputs_for(&ListSplitKey::Remaining) {
                outputs.push((index, value));
            }
        }
    }
}
@koonpeng (Collaborator)

If I understand correctly, we iterate through everything even if some of the indexes are not used. Since SplitDispatcher keeps track of the connections, is it possible to use that to avoid iterating?

@mxgrey (Author)

I don't see an obvious alternative that would achieve better performance in all situations. We either have to iterate through the elements in the collection (as we currently do) or iterate through the keys, and there's no guarantee of which will be larger. As long as we support Sequential, iterating through the keys would still require us to iterate through the collection, at least up to the highest sequence number in the keys. And if there's a Remaining key then we have to iterate through every item in the collection no matter what.

I'd suggest we move forward with this implementation as-is and we can introduce benchmarks in the future when our features are stabilized and maximizing performance becomes the top priority. I believe if we do want to optimize the implementation, we'll be able to do it without breaking the current API.

Comment on lines +351 to +352
pub(crate) connections: &'a HashMap<Key, usize>,
pub(crate) outputs: &'a mut Vec<Vec<Item>>,
@koonpeng (Collaborator)

I don't know the reasoning for this structure. At first glance it seems that it should be possible to use HashMap<Key, Vec<Item>> instead of mapping to an index of a Vec?

@mxgrey (Author)

There are a few reasons for this (see the sketch below):

  1. HashMap can scatter the allocation of its buckets, whereas Vec is guaranteed to use contiguous memory, which reduces cache misses
  2. We know the exact size of ForkTargetStorage in advance, so we can pre-size this Vec to match its elements with ForkTargetStorage
  3. Inside OperateSplit::execute we can iterate over this Vec zipped with ForkTargetStorage to drain the outputs and send them to their appropriate targets
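A standalone sketch of that layout (the field names come from the diff at lines +351 to +352; the lifetimes, bounds, and method body are simplified assumptions):

use std::collections::HashMap;
use std::hash::Hash;

struct SplitDispatcher<'a, Key, Item> {
    // Maps each key to a slot index in `outputs`.
    connections: &'a HashMap<Key, usize>,
    // One pre-sized bucket per target in ForkTargetStorage, so draining can
    // zip this Vec with the target list instead of doing key lookups again.
    outputs: &'a mut Vec<Vec<Item>>,
}

impl<'a, Key: Hash + Eq, Item> SplitDispatcher<'a, Key, Item> {
    fn outputs_for(&mut self, key: &Key) -> Option<&mut Vec<Item>> {
        let index = *self.connections.get(key)?;
        self.outputs.get_mut(index)
    }
}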

@mxgrey (Author)

I'll also mention that it's very important for us to store the target entities in ForkTargetStorage to ensure that the connect operation and reachability calculations work correctly.

@koonpeng (Collaborator)

HashMap can scatter the allocation of its buckets, whereas Vec is guaranteed to use contiguous memory, which reduces cache misses

I think this case is a bit different: outputs is a vec of vecs, so each of the inner Vec<Item> would have its own memory allocation.

@mxgrey (Author)

That's true, but Vec<Vec<T>> will still be less scattered than HashMap<K, Vec<T>>. I've considered using Vec<SmallVec<[T; 1]>> since most of the time I expect only one output per key, but that can be saved for future iterations on performance.

But the main reason to prefer Vec here is because HashMap would force us to do a second key lookup when it's time to drain the outputs.
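A sketch of the drain step being described (the function name and the u64 stand-in for the target entity type are hypothetical):

fn drain_outputs<Item>(
    outputs: &mut Vec<Vec<Item>>,
    targets: &[u64], // u64 is a stand-in for the entities in ForkTargetStorage
    send: impl Fn(u64, Item),
) {
    // `outputs` is ordered to match `targets`, so sending is a straight zip
    // with no per-key HashMap lookup at drain time.
    for (bucket, target) in outputs.iter_mut().zip(targets) {
        for item in bucket.drain(..) {
            send(*target, item);
        }
    }
}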

@koonpeng (Collaborator)

After that I think we shouldn't need to worry about unzip for serialized workflows: If users want to split a value, they can serialize it, split it, then deserialize it.

Aside from the performance overhead, this would mandate that only serializable responses can be fork-cloned, unzipped, split, etc. This affects not just unzip: at the moment, any operations that spawn new entities cannot be chained without the original type information. The solution in #27 is to store function pointers at registration time, while we still have the type information, but the downside is that any chain operations will require registering a new function pointer.

@mxgrey (Author) commented Nov 26, 2024

this would mandate that only serializable responses can be fork cloned, unzipped, split etc

I'm personally okay with this limitation, especially for an initial implementation. But I'm also okay with using a function pointer registry (essentially a vtable) if it's feasible for us to do so.

any chain operations will require registering a new function pointer.

By this I assume you mean if the user wants multiple transformations to happen while passing the output of one node to the input slot of another node, then there's a possibility that some of those intermediate data types won't have their vtables registered and therefore we won't be able to generate the transformations?

That's the situation I was afraid of, and that's the biggest reason that I'd suggest we start out only allowing transformation operations to be performed on serializable data, because we're guaranteed to be able to implement all the operations for serde_json::Value. For non-serializable data, we can only support passing directly from the output of one node to the input of another node (maybe we can support connecting Output<T> to InputSlot<U> if T: Into<U>, but I wouldn't prioritize this for an initial draft). Maybe also support fork_clone for data that implements Clone since that's likely to be an important one. My rationale is this:

  1. I think it would be a confusing user experience if they can perform a single operation on an output and then just hit a brick wall.
  2. A single go / no-go criteria (serializable) to unlock all transformations at once should be easy enough for us to convey to users.
  3. Non-serializable structures are somewhat unusual in the context of dynamically built workflows, and there are two likely reasons for them to appear:
    1. The fields of the data structure represent ownership of a unique resource rather than plain data, in which case it probably should not be transformed
    2. The author of the struct forgot to #[derive(Serialize, Deserialize)], in which case they should be asked to do so
  4. For dynamic workflows, the primary concern is flexibility while performance is a secondary concern. If a user wants to avoid serializing to achieve maximum performance then they can always compile portions of their workflow.

@koonpeng (Collaborator)

By this I assume you mean if the user wants multiple transformations to happen while passing the output of one node to the input slot of another node, then there's a possibility that some of those intermediate data types won't have their vtables registered and therefore we won't be able to generate the transformations?

Depends on the definition of a "transformation": something like fork clone would in theory never have this issue, and unzip would only have the issue when there is an unzip to terminate. Unzipping from one node to another should always work because we don't need to do any serialization, but when we unzip to terminate we need to serialize something which may not be registered.

Transformations where we go from one type to another (like Foo -> Bar) would depend on the context: you could have a node (a map block at the simplest) that does the transformation, and it should always work because we registered that node. But if we want to transform Foo -> {unknown}, where the target type is only known at runtime, then I don't see any way for it to work without serialization or reflection. Even with serialization, we still need a vtable-like registry to deserialize to the target type.

maybe we can support connecting Output<T> to InputSlot<U> if T: Into<U>

I think that would be very hard to implement: instead of checking if the TypeId matches, we would need to know the traits T implements, and I don't believe bevy_reflect can extract that info either. It might be impossible without some very complex build steps; proc macros by design don't have that info.
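A minimal sketch of the "function pointers at registration time" idea mentioned above (the names and shapes here are assumptions for illustration, not the actual #27 design):

use std::any::Any;
use std::collections::HashMap;

// The concrete type is still known at registration time, so we can capture a
// monomorphized deserialize function and look it up later by name.
type DeserializeFn = fn(serde_json::Value) -> Result<Box<dyn Any>, serde_json::Error>;

fn deserialize_into<T: serde::de::DeserializeOwned + 'static>(
    value: serde_json::Value,
) -> Result<Box<dyn Any>, serde_json::Error> {
    serde_json::from_value::<T>(value).map(|v| Box::new(v) as Box<dyn Any>)
}

fn register<T: serde::de::DeserializeOwned + 'static>(
    registry: &mut HashMap<&'static str, DeserializeFn>,
    name: &'static str,
) {
    registry.insert(name, deserialize_into::<T>);
}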

@mxgrey (Author) commented Nov 26, 2024

you could have a node (a map block at the simplest) that does the transformation then it should always work as we registered that node. But if we want to transform Foo -> {unknown}, where the target type is only known at runtime then I don't see any way for it to work without serialization or reflection

Right, when I'm talking about "transformation operations" I mean non-node operations like unzip, split, join, and CEL transformations once we eventually support those (technically CEL will be implemented as nodes, but we'll be generating those nodes at runtime rather than having a user pre-register them).

For non-serializable transformations, I agree with exactly what you said: The user needs to register a precompiled node to do it. Then the serialized/dynamic workflow can just move that non-serializable data directly from the output of one node to the input of their transformation node.

It might be impossible without some very complex build steps, proc macros by design dont have that info.

I think the user would need to explicitly register, at compile time, that the trait is implemented. That's clunky enough that I'm happy to shelve that idea.

@mxgrey merged commit 85501df into main Nov 27, 2024
5 checks passed
@mxgrey deleted the split branch November 27, 2024 04:19