Depth first node traversal can cause transform node having merge node as input to be re-evaluated several times with transient values #167
Hi @TheCoconutChef! Happy holidays and sorry I haven't replied about this earlier, it's been a busy month! Thanks a lot for finding and reporting this issue, and for spending the time finding a solution. It is indeed a nuanced issue!

First, whether this is a genuine problem: it is somewhat a problem. I must admit I haven't really found it an issue myself. I have been aware of the double-evaluation issue, but never considered it an invariant-breaking one. In transformations I do not normally rely on invariants like that, and since they have no effects and the values "converge" to the right result, it feels "ok". But it is true that it feels like an issue... it could be a performance issue if a transform is complex, or worse if it relies on invariants. Ideally, we would not have this issue.

The problem I see with your suggested implementation is that building a multi-map during propagation feels like a performance regression. Normally propagation does not even need to allocate memory... It would be interesting to benchmark some non-trivial networks with this change. I guess things could be tweaked by using a better data structure.

What do you think?
I'd be very happy to investigate this now that I have more confidence that there is an interest. I think it would be interesting for me to have some benchmark harness so that I could:
Indeed, if DFS-based traversal has never been a problem for anyone, and if topological traversal incurs an unavoidable performance cost, why would it be imposed on the user? The issue is mostly about merge nodes anyway, and some models may not rely on them very much. I will report back with what I find. (And happy holidays.)
Thank you a lot for the investigation. Having some benchmarking tools included in the repository is in itself a very valuable contribution, which can help inform future decisions on the best propagation implementation alternatives, or help apply further optimizations. That code has never really been much optimized, but it also never caused a great deal of issues, even though there are improvements like #132 that we will want to add at some point. If you want examples, I use the Nonius library for benchmarking in Immer, some examples here
Giving a pulse on this.
The tl;dr is that the T-SUMM strategy appears to be faster than any other strategy in both the simple chain and the diamond chain cases. This is a bit surprising, since I expected DFS to be optimal for the simple chain case. I don't know what the cause of this is; it's arguably suspect.

The second thing to note is that the DC case was explicitly made to break the DFS strategy, and it does. Because of the way DFS traversal works, the diamond chain, with its chaining of split and merge, guarantees that the last node of the n-th link in the chain will be visited 2^n times. This is admittedly a very degenerate case, and I used some weirdness to ensure that the values would always change.

Sample output of Nonius benchmarks:
DFS-DC stops being practical around N=20. Admittedly these differences are rather marginal in absolute terms, though they can be substantial in relative terms. And with that, I have some questions.
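To make the 2^n claim concrete, here is a self-contained sketch (illustrative names only, not lager's actual node classes) that builds an n-link split/merge chain and counts how often plain depth-first `send_down` visits the last merge when values always change:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Minimal sketch of the diamond-chain (DC) topology: each link splits a
// value into two branches that feed a merge node. Under depth-first
// send_down, each parent of a merge propagates independently, so link n's
// merge is visited 2^n times when values always change.
struct node {
    std::vector<node*> children;
    long visits = 0;
    void send_down() {                  // depth-first propagation
        ++visits;
        for (node* c : children) c->send_down();
    }
};

struct diamond_chain {
    std::vector<std::unique_ptr<node>> nodes;
    node* root;
    node* last_merge;
    explicit diamond_chain(int n) {
        auto make = [&] {
            nodes.push_back(std::make_unique<node>());
            return nodes.back().get();
        };
        root = make();
        node* tail = root;
        for (int i = 0; i < n; ++i) {   // one split/merge diamond per link
            node* a = make();
            node* b = make();
            node* m = make();
            tail->children = {a, b};
            a->children = {m};
            b->children = {m};
            tail = m;
        }
        last_merge = tail;
    }
};

long visits_of_last_merge(int n) {
    diamond_chain dc(n);
    dc.root->send_down();
    return dc.last_merge->visits;
}
```

At n = 20 the last merge is visited 2^20, roughly a million times, which lines up with DFS-DC becoming impractical around N=20.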
Super interesting. Before I dig deeper: what do the numbers in the two columns mean?
Very impressive how thorough you've been with this, @TheCoconutChef. The results are also looking very promising with regard to solving the issue without inducing any penalty (actually potentially making propagation faster than ever!). Just before green-lighting this and moving on to polishing T-SUMM, I'd love to review the benchmark code and try to replicate the results locally, in order to verify whether there are any blind spots. Where can I access the code, and what would be the replication instructions? Thank you so much again for all the effort you are putting into this, this is high quality work!
Thank you very much. I think the lager code base is pretty amazing and I just don't want to degrade the overall quality. I think the effort is worth it.

As to your question, the work is here in the rank-based-node-traversal branch, including the WIP commits. The benchmark is in … There's also a little script in a new … The benchmark also executes through the …

Cheers!
Sadly, there were a couple of issues with the benchmark setup. I have pushed a couple of commits that address these issues here. Additionally, I have fixed the traversal based on intrusive containers.

Once these issues are addressed, the results match what we would intuitively expect. You can check some results here. In order to generate such a plot, where N increases exponentially, I have passed …

What we see is that DFS is actually the fastest in all cases. Boost Intrusive containers follow, and then the other approaches. I wonder whether there could be a hybrid method that we could use, or whether we could dynamically switch between approaches when we detect diamond-like shapes in subtrees...
Let's go for another pulse.
Taken together, what I get from this is that the possibility of configuring the traversal policy is probably back on the menu, whether it be tag-based or more dynamic.

HOWEVER, before talking more about this, I do believe the modification to the benchmark's update function deserves a second look. The formula

```cpp
(std::get<0>(tuple).val + std::get<1>(tuple).val) / 2 + 1
```

will produce the same value for distinct tuples: integer division collapses different input pairs onto the same output, so one input can change without the node's value changing, and propagation stops early. I understand why the definition was changed, but if I change the update function to

```cpp
const auto& [a, b] = tuple;
return a != b ? unique_value{std::max(a.val, b.val) + 2}
              : unique_value{a.val + 1};
```

then I do get the intuitive result for the DFS-DC case, and this is actually the intuitive result to me, since the DC case was designed such that, under a sufficiently "unstable" value propagation dynamic, the n-th node would be hit 2^n times in the traversal.

Now, hybrid methods. I suppose the problem here is that I'm basically gunning for a topological traversal. Topological traversal is basically only necessary for the merge nodes, i.e. those arising out of a `with` expression. I think it would be interesting to experiment with this, i.e. only merge nodes get scheduled; every other node is visited as before, without any scheduling or calls into a scheduling structure, intrusive or otherwise. I'll try to do something on the weekend.
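For illustration, here is the arithmetic issue in isolation; `unique_value`, `avg_merge` and `max_merge` are hypothetical stand-ins for the benchmark's types, not names from the branch:

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for the benchmark's value type.
struct unique_value { int val; };

// The averaging formula discussed above: integer division collapses
// distinct input pairs onto the same output, so one input can change
// without the node's value changing, and propagation stops early.
unique_value avg_merge(unique_value a, unique_value b) {
    return unique_value{(a.val + b.val) / 2 + 1};
}

// The alternative: the output strictly exceeds both inputs, which keeps
// values changing across repeated propagations in the DC benchmark.
unique_value max_merge(unique_value a, unique_value b) {
    return a.val != b.val ? unique_value{std::max(a.val, b.val) + 2}
                          : unique_value{a.val + 1};
}
```

Note how the pairs (4, 6) and (5, 6) both average to 6: the first input's change is invisible downstream.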
The code.
Ok, so I did check out what could be done with scheduling only merge nodes last week, expanded on that this weekend, and I think it's mostly good news.

First: scheduling only merge nodes and proceeding in DFS otherwise means that pretty much all traversal strategies end up reducing to the DFS baseline for the SC benchmarks, as would be expected. The DC case has already improved compared to the first version. Only scheduling merge nodes probably helps, but on top of that the intrusive traversal classes were modified to take advantage of some of the boost::intrusive methods, such as …

I'd be curious to see what it looks like on your end.

You may have noticed "TREAP" in the previous graphs. Basically, we can use an intrusive treap in order to visit our scheduled nodes through a priority queue. The implementation ends up being rather simple, doesn't require any bucket information based on the tree size, and it has performed the best so far under my admittedly limited case set. HOWEVER, I'm still trying to challenge the treap traversal: it's susceptible to performing poorly if it has to schedule merge nodes in a certain way. The god-tree benchmark was introduced to try to break the treap traversal, and as a result I ended up visiting a node's children in reverse order.

The branch is here, and I'd say there's a fair bit of information contained in the commit messages. I think I'm (we're?) at the point of honing in on one or two strategies somehow. This might require more sophisticated topologies or more sophisticated DAG update dynamics on my part.

Edit: I added a random-DAG-based benchmark. The node update function isn't "canonical" (it's not pure), but I still think the benchmark is valid conceptually.
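The hybrid can be sketched as follows (illustrative names, not the code in the branch): single-parent nodes keep the plain depth-first path, while merge nodes are deferred into a rank-ordered schedule and drained afterwards, so each merge runs once per transaction:

```cpp
#include <cassert>
#include <map>
#include <vector>

struct node {
    std::vector<node*> children;
    int parents = 0;      // number of parents; > 1 means a merge node
    int rank = 0;         // 0 for roots, max(parent ranks) + 1 otherwise
    bool scheduled = false;
    long visits = 0;
};

struct hybrid_traversal {
    std::multimap<int, node*> pending;  // keyed by rank; the branch uses
                                        // intrusive containers instead
    void visit(node* n);
    void send_down(node* n) {
        for (node* c : n->children) {
            if (c->parents > 1) {       // merge node: defer until later
                if (!c->scheduled) {
                    c->scheduled = true;
                    pending.emplace(c->rank, c);
                }
            } else {
                visit(c);               // plain node: depth-first as before
            }
        }
    }
    void run(node* root) {
        visit(root);
        while (!pending.empty()) {      // drain merges in rank order
            node* n = pending.begin()->second;
            pending.erase(pending.begin());
            n->scheduled = false;
            visit(n);
        }
    }
};
void hybrid_traversal::visit(node* n) { ++n->visits; send_down(n); }
```

On a single split/merge diamond, the merge is visited once instead of twice, while a plain chain degenerates to exactly the DFS behaviour.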
Hi @TheCoconutChef! Sorry for the very long time since my last comment... I wanted to book some time to carefully study your solutions but have failed to do so. Before I dig deeper, one question: do the benchmarks show the cumulative effect of all the commits, or each single commit in isolation? (I think the latter?) Either way, skimming through the commits and benchmark results, it seems to me that we are almost ready to have a workable solution here. The simple linear case works as fast as it used to, and even for arbitrary DAGs it looks like you've managed to get down to very low overheads. Do you want to integrate your favourite approaches (considering both performance and simplicity of implementation) and make a PR that we can polish and merge? This would probably be TREAP + bypass of single-parented nodes?
Long time since the last comment? Why, I myself would never do that! Hmm, is it November already? Anyway...

First, your question: the benchmarks included all the changes I had made, but I'm not sure that's relevant anymore. More below.

Second, the treap didn't actually produce the best results. This became much more evident once I started running benchmarks on a randomly generated DAG. In retrospect, the ordered multiset produced the best result out of all the topo traversals I tried.

Third, the reason I took such a long time to answer is that I wasn't happy with the performance I was getting out of the topo traversal on a random DAG. The problem was that node scheduling in a treap or a multiset requires log(size)-type operations on insertion and on rank visit. That basically has to do with the fact that every insertion is sorted in those scenarios. But that's too much sorting: nodes having the same rank don't need to be sorted relative to one another; the order in which they're visited doesn't matter at all. If we use an explicit node_schedule object for a specific rank, we can register the nodes that need to be visited within a rank in an intrusive list. Using this strategy, we have:

And the same graph zoomed in on the 0.1-0.5 "entropy" range:

These are results for a DAG of 1024 nodes, 50% of which have two parents and 50% one; the X-axis represents the % chance that a node will change value within the context of a graph traversal and thus propagate different values to its children. The evaluated strategies are DFS (blue), Boost intrusive multiset (green) and node schedule (teal).

The price we pay for this new strategy is that we need to instantiate one node_schedule object per rank, and have every node of a given rank point to the node_schedule that corresponds to its rank. This implies one more shared_ptr allocation when creating a node.
This differs enough from the previous approaches that I placed it in a new branch … The only problem I have with it now is that the node-init penalty is ~25-30%, which makes me hesitant to open a PR.
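The node_schedule idea can be sketched like this (illustrative names; the branch links nodes into an intrusive list owned by a per-rank node_schedule object, approximated here with one vector bucket per rank):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct sched_node {
    int rank = 0;
    bool scheduled = false;
    long visits = 0;
};

// One bucket per rank: nodes of equal rank need no ordering among
// themselves, so scheduling is an O(1) append rather than a sorted
// insertion into a treap or multiset.
struct rank_buckets {
    std::vector<std::vector<sched_node*>> buckets;  // index == rank
    void push(sched_node* n) {
        if (n->scheduled) return;                   // schedule at most once
        n->scheduled = true;
        if (buckets.size() <= static_cast<std::size_t>(n->rank))
            buckets.resize(n->rank + 1);
        buckets[n->rank].push_back(n);
    }
    void drain() {                                  // visit ranks in order
        for (auto& bucket : buckets) {
            for (sched_node* n : bucket) {
                n->scheduled = false;
                ++n->visits;                        // stand-in for send_down
            }
            bucket.clear();
        }
    }
};
```

Insertion is O(1) and nodes within a bucket stay unsorted, which is exactly the property that removes the log(size) cost of the treap and multiset variants.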
Disclaimer: If it turns out that I missed something about the framework and the problem I believe I have identified here isn't actually legitimate, then I apologize.
The problem as I see it
Consider the following situation:
which can be examined here.
Under the current node traversal policy -- i.e. depth first -- the assert contained in the transform lambda will fail, as the transform node created by the call to `map` will attempt to recompute its value on the basis of a `send_down` call triggered by the `auto ca` cursor node used in `lager::with(ca, cb)`. This is because the `send_down` on a merge node having two inputs will first propagate a new value once it has detected that the `ca` node has changed, and then will do so a second time once it has realized that `cb` has changed as well.

Even if the transform function did not have a failing assert, the `ca` `send_down` call would still trigger the computation of a value -- using `11` and `0` as inputs -- that would not be of interest to anyone. Indeed, the only value anyone is ultimately going to be interested in is the one in which `11` and `21` are used as arguments. Any intermediate values other than these will ultimately be discarded. The only thing propagation of such values may cause is further computation of intermediate values downstream from the merge node that will also ultimately have to be discarded. (That is, if they do not themselves see their preconditions violated in a fatal way because they managed to see a transient state.)

Thus, the current node traversal policy might force the user to be overly cautious when using transform nodes downstream of merge nodes: they cannot rely on having mutated the state in a coherent manner within their transform logic.
If a merge node sees several of its inputs change value within the same transaction, it will trigger a series of computations downstream of itself for transient values which will have to be discarded, or it may cause the witnessing of a transient state too unsightly for some transform node to handle, i.e. it may throw or fail an assert.
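A minimal model of the behaviour described above (illustrative names, not lager's API): two inputs of a merge change in one transaction, and depth-first propagation lets the downstream transform observe the transient pair before the final one:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// The scenario from the example: a starts at 1, b at 0, and both are set
// (to 11 and 21) within a single transaction. Under depth-first
// propagation each input notifies the merge independently, so the child
// recomputes twice and first sees the transient pair (11, 0).
struct merge_demo {
    int a = 1, b = 0;
    std::vector<std::pair<int, int>> observed;  // what the child saw

    void child_recompute() { observed.emplace_back(a, b); }

    void commit(int new_a, int new_b) {
        a = new_a; child_recompute();   // send_down triggered by ca
        b = new_b; child_recompute();   // send_down triggered by cb
    }
};
```

An assert inside the child that expects a coherent (a, b) pair would fire on the first recompute.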
A proof of concept solution
If the depth first traversal is the problem, then the solution would be to traverse the nodes some other way. This is what I've attempted to do as a proof-of-concept in the following branch.
It's a variation of breadth-first traversal (edit: I've just realized this is topological traversal) that leverages the fact that we're operating in a DAG. The basic strategy consists in:

(1) giving `reader_node_base` a `rank` method. The rank is: (a) 0 if a node has no parent; (b) the max rank of the parents + 1 for any other node;
(2) visiting the nodes in ascending rank order.

(2) is accomplished through the use of a traversal object injected into a modified `send_down` method taking it as an argument. Instead of directly calling the `send_down` of their children, nodes `schedule` them for visitation at some later point determined by their rank.

We can contrast the traversal sequence of the depth-first strategy with the rank-based sequence using the example presented above. We use the following naming convention for the various nodes:
Under depth-first traversal, changing the `model_t::first` and `model_t::second` values within the same transaction, we have the traversal:

but under the rank-based traversal we have:
i.e. fewer calls to `recompute` and `send_down`, and less witnessing of transient values. We can anticipate that the gains get bigger as more nodes are appended downstream from `M`.

Of course, the tradeoff is that we now have to manipulate a kind of traversal object to schedule the nodes of higher rank whose input nodes have changed, but I believe the benefit may be worthwhile.
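The PoC strategy above can be sketched as follows (illustrative names; a real `send_down` would recompute values and stop when nothing changed, which is omitted here). `rank()` follows the definition given earlier, and the traversal drains scheduled nodes in ascending rank:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

struct rnode {
    std::vector<rnode*> parents, children;
    bool scheduled = false;
    long visits = 0;
    // Rank as defined above: 0 with no parents, max parent rank + 1 otherwise.
    int rank() const {
        int r = 0;
        for (const rnode* p : parents) r = std::max(r, p->rank() + 1);
        return r;
    }
};

struct traversal {
    std::multimap<int, rnode*> by_rank;
    void schedule(rnode* n) {
        if (!n->scheduled) {
            n->scheduled = true;
            by_rank.emplace(n->rank(), n);
        }
    }
    // Drain in ascending rank order: a node runs only after all of its
    // changed ancestors, so a merge node fires once per transaction.
    void run() {
        while (!by_rank.empty()) {
            rnode* n = by_rank.begin()->second;
            by_rank.erase(by_rank.begin());
            n->scheduled = false;
            ++n->visits;                      // recompute would happen here
            for (rnode* c : n->children) schedule(c);
        }
    }
};
```

With a root R, two cursors CA/CB, a merge M and a transform T below it, every node is visited exactly once per transaction, so T never observes a transient pair.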
From PoC to the real thing?
I wanted to correctly present the problem as I saw it and the sketch of a solution for the sake of completeness. However, before going further in this direction, I wanted to know if:
I anticipate that the answer to both of these questions is yes, but I wanted to make sure before starting to bother anybody with a PR and its fine-tuning. (For instance, I might want to look at the performance impact of a new traversal policy.)
Cheers!