
Depth first node traversal can cause transform node having merge node as input to be re-evaluated several times with transient values #167

TheCoconutChef opened this issue Dec 11, 2022 · 13 comments

@TheCoconutChef (Contributor)

TheCoconutChef commented Dec 11, 2022

Disclaimer: If it turns out that I missed something about the framework and the problem I believe I have identified here isn't actually legitimate, then I apologize.

The problem as I see it

Consider the following situation:

    using model_t = std::pair<int, int>;
    auto state    = lager::make_state(model_t{}, lager::transactional_tag{});
    auto ca       = state[&model_t::first].make();
    auto cb       = state[&model_t::second].make();
    auto tr       = lager::with(ca, cb)
                        .make()
                        .map([](const auto& tuple) {
                            auto [a, b] = tuple;
                            // precondition: the model keeps first <= second
                            assert(a <= b);
                            return std::sqrt(b - a);
                        })
                        .make();
    // values that are consistent with the assert statement
    ca.set(11);
    cb.set(21);
    lager::commit(state); // <-- will trigger an assert failure down the line

which can be examined here.

Under the current node traversal policy -- i.e. depth-first -- the assert contained in the transform lambda will fail, as the transform node created by the call to map will attempt to recompute its value on the basis of a send_down call triggered by the ca cursor node used in lager::with(ca, cb).

This is because send_down on a merge node with two inputs will first propagate a new value once it has detected that the ca node has changed, and then will do so a second time once it has realized that cb has changed as well.

Even if the transform function did not have a failing assert, the ca send_down call would still trigger the computation of a value -- using 11 and 0 as inputs -- that would not be of interest to anyone. Indeed, the only value anyone is ultimately interested in is the one in which 11 and 21 are used as arguments. Any intermediate values other than these will ultimately be discarded. The only thing propagation of such values may cause is further computation of intermediate values downstream of the merge node, which will also ultimately have to be discarded. (That is, if their own preconditions are not fatally violated because they witnessed a transient state.)

Thus, the current node traversal policy might force the user to be overly cautious when using transform nodes downstream of merge nodes: they cannot rely, within their transform logic, on the fact that they mutated the state in a coherent manner.

If a merge node sees several of its inputs change value within the same transaction, it will trigger a series of computations downstream of itself for transient values that will have to be discarded, or may expose some transform nodes to transient states too unsightly for them to handle, i.e. they may throw or fail an assert.

A proof of concept solution

If the depth-first traversal is the problem, then the solution would be to traverse the nodes some other way. This is what I've attempted to do as a proof of concept in the following branch.

It's a variation of breadth-first traversal (edit: I've just realized this is topological traversal) that leverages the fact that we're operating on a DAG. The basic strategy consists in:

  1. Giving reader_node_base a rank method. The rank is: (a) 0 if a node has no parents; (b) the max rank of the parents + 1 for any other node.
  2. Using this node rank to schedule node traversal in such a way that a node is visited (recomputed) only if all of its parents have already been visited (recomputed).

(2) is accomplished through the use of a traversal object injected into a modified send_down method taking it as an argument. Instead of directly calling the send_down of their children, nodes schedule them for visitation at some later point determined by their rank.

We can contrast the traversal sequence of the depth first strategy with the rank based sequence using the example presented above. We use the following naming convention for the various nodes:

S = the root state node
A = the cursor lens node on model_t::first
B = the cursor lens node on model_t::second
M = the merge node created through lager::with(ca, cb)
T = the xform node created through map(...)

Under depth first traversal and changing the model_t::first and model_t::second values within the same transaction, we have the traversal:

S - A - M - T - B - M - T

but under the rank based traversal we have:

S - A - B - M - T

i.e. fewer calls to recompute and send_down, and less witnessing of transient values. We can anticipate that the gains grow as more nodes are appended to the tree downstream of M.

Of course, the tradeoff is that we now have to manipulate a kind of traversal object to schedule the nodes of higher rank whose input nodes have changed, but I believe the benefit may be worthwhile.

From PoC to the real thing?

I wanted to correctly present the problem as I saw it and the sketch of a solution for the sake of completeness. However, before going further in this direction, I wanted to know if:

  1. Is this actually a genuine problem? Perhaps I'm not using the framework as intended, or I missed something that would have made this a non-issue.
  2. Provided that it is a genuine problem, is changing the traversal policy the way to go?

I suspect that the answer to both of these questions is yes, but I wanted to make sure before starting to bother anybody with a PR and fine-tuning it. (For instance, I might want to look at the performance impact of a new traversal policy.)

Cheers!

@TheCoconutChef TheCoconutChef changed the title Depth first node traversal can cause transform node having merge node as input to be re-evaluated several times Depth first node traversal can cause transform node having merge node as input to be re-evaluated several times with transient values Dec 11, 2022
@arximboldi (Owner)

Hi @TheCoconutChef!

Happy holidays and sorry I haven't replied about this earlier, it's been a busy month!

Thanks a lot for finding and reporting this issue, and for spending the time to find a solution. It is indeed a nuanced issue!

First, whether this is a genuine problem: it is somewhat a problem. I must admit I haven't really found it an issue myself. I have been aware of the double-evaluation issue, but never considered it an invariant-breaking issue. In transformations I do not normally rely on invariants like that, and since they have no effects and the values "converge" to the right result, it feels "ok". But it is true that it feels like an issue... it could be a performance issue if a transform is complex, or if it relies on invariants. Ideally, we would not have this issue.

The problem I see with your suggested implementation is that building a multi-map during propagation feels like a performance regression. Normally propagation does not even need to allocate memory... It would be interesting to benchmark some non-trivial networks with this change. I guess things could be tweaked by using a better data structure. A boost::flat_multimap could help, since after an initial propagation the size of the multi-map would not "change". Another option is to use Boost's intrusive containers, since we anyway already have a "node" structure (I would probably suggest using this with this). Or maybe there are other smarter things we could do at the algorithm level... Part of the trickiness resides in handling the case where only some of the branches changed (in your example, what if A changed but not B?).

What do you think?

@TheCoconutChef (Contributor, Author)

TheCoconutChef commented Dec 25, 2022

I'd be very happy to investigate this now that I have more confidence that there is an interest.

I think it would be interesting for me to have some benchmark harness so that I could:

  1. Experiment with different implementations / data structures
  2. Experiment with having the traversal policy be a configuration point

Indeed, if DFS-based traversal has never been a problem for someone, and if topological traversal incurs an unavoidable performance cost, why would this be imposed on the user? The issue is mostly about merge nodes anyway, and some models may not rely on them very much.

I will report back with what I find. (And happy holidays.)

@arximboldi (Owner)

Thank you a lot for the investigation.

Having some benchmarking tools included in the repository is in itself a very valuable contribution that can help inform future decisions on the best propagation implementation alternatives, or apply further optimizations. That code has never really been much optimized, but it also never caused a great deal of issues, even though there are improvements like #132 that we will want to add at some point.

If you want examples: I use the Nonius library for benchmarking in Immer, some examples here.

@TheCoconutChef (Contributor, Author)

Giving a pulse on this.

  1. I've defined 5 traversal strategies for the benchmarks: depth first (DFS, the current one), topological traversal with a custom multimap (T-CMM), topological traversal with std::unordered_multimap (T-SUMM), topological traversal with a Boost unordered multimap (T-BUMM) and topological traversal with a Boost intrusive multiset (T-BIMS).
  2. These strategies traverse two kinds of node graphs. One is a "simple chain" (SC), which is simply a series of nodes, each having a single child, connected "in a line". The other is a "diamond chain" (DC), which is made up of a set of "diamonds", each having 4 nodes. A diamond performs a "split" on an initial value and then a merge.

The tl;dr is that the T-SUMM strategy appears to be faster than any other strategy in both the simple chain and the diamond chain cases. This is a bit surprising since I expected DFS to be optimal for the simple chain case. I don't know what the cause of this is. It's arguably suspect.

The second thing to note is that the DC case was explicitly made to break the DFS strategy and it does. Because of the way the DFS traversal works, the diamond chain with its chaining of split and merge guarantees that the last node of the n-th link in the chain will be visited 2^n times.

This is admittedly a very degenerate case, and I used some weirdness to ensure that the values would always register as changed under == comparison during the node traversal, but nonetheless I feel it illustrates the shortcomings of the strategy.

Sample output of nonius benchmarks:

  N = 15 (strategy / mean / std dev)
  DFS-SC	 2.682	 0.061
  DFS-DC	 765.976	 11.492
  T-CMM-SC	 0.634	 0.09
  T-CMM-DC	 0.604	 0.007
  T-SUMM-SC	 0.61	 0.079
  T-SUMM-DC	 0.598	 0.032
  T-BUMM-SC	 2.44	 0.125
  T-BUMM-DC	 2.787	 0.12
  T-BIMS-SC	 2.458	 0.135
  T-BIMS-DC	 2.879	 0.26
  N = 16384 (strategy / mean / std dev)
  DFS-SC	 113.137	 13.654
  T-CMM-SC	 49.085	 4.934
  T-CMM-DC	 359.964	 27.771
  T-SUMM-SC	 37.387	 2.699
  T-SUMM-DC	 296.357	 21.723
  T-BUMM-SC	 195.854	 20.373
  T-BUMM-DC	 1422.088	 158.536
  T-BIMS-SC	 193.884	 16.102
  T-BIMS-DC	 1522.972	 149.081

DFS-DC stops being practical around N=20. Admittedly these differences are rather marginal in absolute terms, though they can be substantial in relative terms.

And with that, I have some questions.

  1. Am I using Boost intrusive correctly? I basically used this as recommended, using the node rank as the key. The intrusive traversal strategy is defined here.
  2. I thought previously about having the traversal strategy be a configuration point, but if T-SUMM is faster in all cases, I don't see why we would do that. If the benchmarking results prove to be true, then I see no reason to offer anything other than topological traversal based on std::unordered_multimap for now. It both fixes the transient value issue that prompted me to explore this, and it happens to be marginally faster.

@arximboldi (Owner)

Super interesting. Before I dig deeper: what do the numbers in the two columns mean?

@TheCoconutChef (Contributor, Author)

It's the mean time (in ns) to perform the function in the benchmark and the std dev of the collected samples. So, for instance, in the nonius-generated HTML report for the N=16384 case we can see the summary view:

[image: nonius summary view]

and the sample view:

[image: nonius sample view]

@arximboldi (Owner)

Very impressive how thorough you've been with this @TheCoconutChef. Also, the results are looking very promising with regard to solving the issue without inducing any penalty (actually potentially making propagation faster than ever!)

Just before green-lighting this and moving on to polishing T-SUMM, I'd love to review the benchmark code and try to replicate the results locally, in order to verify whether there are any blind spots. Where can I access the code and what would be the replication instructions?

Thank you so much again for all the effort you are putting into this, this is high quality work!

@TheCoconutChef (Contributor, Author)

Thank you very much. I think the lager code base is pretty amazing and I just don't want to degrade the overall quality. I think the effort is worth it.

As to your question, the work is here in the rank-based-node-traversal branch, including the wip commits. The benchmark is in benchmark/detail/traversal.cpp (the detail/ directory is probably superfluous; I was just imitating what's in immer in a naive way).

There's also a little script in a new report/ directory (not really intended for a final version) which will generate the .out and .html files for the nonius benchmark report. It automatically drops DFS-DC after N=17. It must be invoked with the benchmark executable as an argument, i.e. ./benchmarks.sh ../build/benchmark/benchmark-detail-traversal or wherever you end up putting the executable.

The benchmark also executes through the make check target with whatever values one puts in the BENCHMARK_PARAM and BENCHMARK_SAMPLES cmake variables. This will also generate something in the report/ dir.

Cheers!

@arximboldi (Owner)

arximboldi commented Feb 7, 2023

Sadly, there were a couple of issues with the benchmark setup. I have pushed a couple of commits that address these issues here. Additionally, I have fixed the traversal based on intrusive containers.

Once these issues are addressed, the results match what we would intuitively expect.

You can check some results here. In order to generate such a plot, where N increases exponentially, I have passed -p N:*:2:2:10 to the benchmark runner. Here is a screenshot too:

[image: benchmark results plot]

What we see is that DFS is actually the fastest in all cases. Boost intrusive containers follow, and then the other approaches. I wonder whether there could be a hybrid method that we could use, or whether we could dynamically switch between approaches when we detect diamond-like shapes in subtrees...

@TheCoconutChef (Contributor, Author)

Let's go for another pulse.

  1. Very cool nonius feature. Obviously I'm just learning nonius.
  2. Thanks for fixing the Boost intrusive usage. The results didn't seem right at first blush, though it would have been cool for them to be accurate.
  3. Thanks for fixing what nonius was measuring.

Taken together, what I get from this is that the possibility of configuring the traversal policy is probably back on the menu, whether it be tag based or more dynamic.

HOWEVER, before talking more about this, I do believe the modification to the unique_value structure in the benchmark might not produce the kind of behavior I was gunning for.

The formula:

(std::get<0>(tuple).val + std::get<1>(tuple).val) / 2 + 1

will produce the same value for a tuple [n, n] and [n, n+1], as we can see here.

I understand why the definition of unique_value was changed (side effects aren't exactly canonical here), but the fact that the value is actually == for [n, n] and [n, n+1] does prevent the exponential visit trick from working.

If I change the combine function definition to

    const auto& [a, b] = tuple;
    return a != b ? unique_value{std::max(a.val, b.val) + 2}
                  : unique_value{a.val + 1};

then I do get the intuitive result for the DFS-DC case, i.e.

[image: DFS-DC results with the modified combine function]

and this is actually the intuitive result to me, since the DC case was designed such that, on a sufficiently "unstable" value propagation dynamic, the n-th node would be hit 2^n times in the traversal.

Now, hybrid methods. I suppose the problem here is that I'm basically gunning for a topological traversal. Topological traversal is basically only necessary for the merge nodes, i.e. those arising out of a lager::with call. What that means is that node scheduling only ever matters for nodes having more than one parent. Any other node can be visited eagerly in the DFS fashion.

I think it would be interesting to experiment with this, i.e. only merge nodes can be scheduled. Everything else proceeds as before, without any scheduling or calls into a data structure, intrusive or otherwise. I'll try to do something on the weekend.

@TheCoconutChef (Contributor, Author)

TheCoconutChef commented Mar 12, 2023

The code.
Key commits:

  1. detail:nodes: add schedule_or_send_down_method
  2. detail:traversal: make scheduling opportunistic
  3. introduce treap based taversal
  4. benchmark: add a "god chain" benchmark and improve treap

Ok, so I did check out what could be done with scheduling only merge nodes last week, expanded on that this weekend, and I think it's mostly good news. First:

[image: SC benchmark results]

Scheduling only merge nodes and proceeding in DFS otherwise means that pretty much all traversal strategies reduce to the DFS baseline for the SC benchmarks, as would be expected.

Second:
[image: DC benchmark results]

The DC case has already been improved compared to the first version. Only scheduling merge nodes probably helps, but on top of that the intrusive traversal classes were modified to take advantage of some boost::intrusive methods such as erase_and_dispose. On my machine, BIMSRB appears to be ~3.5x faster.

I'd be curious to see what it looks like on your end.

Third:
[image: GC benchmark results]

You may have noticed "TREAP" in the previous graphs. Basically, we can use an intrusive treap in order to visit our scheduled nodes through a priority queue. The implementation ends up being rather simple, doesn't require any bucket information based on the tree size, and it has performed "the best" so far under my admittedly limited case set.

HOWEVER, I'm still trying to challenge the treap traversal. It may perform poorly if it has to schedule merge nodes in a certain way. The god chain benchmark was introduced to try to break the treap traversal, and as a result I ended up visiting a node's children in reverse order.

The branch is here and I'd say there's a fair bit of information contained in the commit messages.

I think I'm (we're?) at the point of honing in on one or two strategies somehow. This might require more sophisticated topologies or a more sophisticated DAG update dynamic on my part.

Edit

I added a random-DAG-based benchmark. The node update function isn't "canonical" (it's not pure), but I still think the benchmark is valid conceptually.

@arximboldi (Owner)

arximboldi commented Jul 31, 2023

Hi @TheCoconutChef !

Sorry for the very long time since my last comment... I wanted to book some time to carefully study your solutions but I have failed to do so.

Before I dig deeper, one question: do the benchmarks show the cumulative effect of all the commits, or each single commit in isolation? (I think the latter?)

Either way, skimming through the commits and benchmark results, it seems to me that we are almost ready to have a workable solution here. The simple linear case works as fast as it used to, and even for arbitrary DAGs it looks like you've managed to get down to very low overheads. Do you want to integrate your favourite approaches (considering both performance and simplicity of implementation) and make a PR that we can polish to merge? This would probably be TREAP + bypass of single-parented nodes?

@TheCoconutChef (Contributor, Author)

Long time since the last comment? Why, I myself would never do that! Hmm, is it November already? Anyway...

First, your question: the benchmarks included all the changes I had made, but I'm not sure that's relevant anymore. More below.

Second, the treap didn't actually produce the best results. This became much more evident once I started running benchmarks on a randomly generated DAG. In retrospect, the ordered multiset produced the best results out of all the topo traversals I tried.

Third, the reason I took such a long time to answer is that I wasn't happy with the performance I was getting out of the topo traversal on a random DAG. The problem was that node scheduling in a treap or a multiset requires log(size)-complexity operations on insertion and on rank visit. That basically has to do with the fact that every insertion is kept sorted in those scenarios.

But that's too much sorting. Nodes having the same rank don't need to be sorted relative to one another. The order in which they're visited doesn't matter at all. If we use an explicit node_schedule object for a specific rank, we can register the nodes that need to be visited within a rank in an intrusive list.

Using this strategy, we have:

[image: random DAG benchmark results]

And the same graph zoomed in at the 0.1-0.5 "entropy" range:

[image: random DAG benchmark, zoomed to the 0.1-0.5 entropy range]

These are results for a DAG having 1024 nodes, 50% of which have two parents and 50% one, and the X-axis represents the % chance that a node will change value within the context of a graph traversal and thus propagate a different value to its children.

The evaluated strategies are DFS (blue), Boost intrusive multiset (green) and node schedule (teal).

The price we pay for this new strategy is that we need to instantiate one node_schedule object per rank, and have every node of a given rank point to the node_schedule that corresponds to its rank. This implies one more shared_ptr allocation when creating a node.

This differs enough from the previous approaches that I placed it in a new branch.

The only problem I have with it now is that the node-init penalty is ~25-30%, which makes me hesitant to open a PR.
