Stack overflow when merging. #93

Closed
sageserpent-open opened this issue Oct 7, 2024 · 15 comments
Labels
bug Something isn't working

Comments

@sageserpent-open
Owner

sageserpent-open commented Oct 7, 2024

Seen as part of working on #91.

Reproduced in the same manner, only with the failing invariant check discussed on that ticket disabled. (Also seen while doing WIP on fixing the failing assertion.)

This is a known risk because the merge algorithm isn't tail-recursive; so far that hasn't bitten, as the use of sections derived mostly from matches keeps the size of the merge input sequences down. It seems the day has finally arrived to cut over to something stack-safe...
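For reference, the kind of cutover meant here is trampolining the recursion. A minimal sketch using cats' Eval (illustrative only, not Kinetic Merge's actual merge code):

```scala
import cats.Eval

// Minimal sketch of trampolined recursion with cats.Eval: the recursion is
// driven on the heap rather than the JVM call stack, so depth no longer
// matters. This only illustrates the stack-safety technique, not the
// actual merge algorithm.
def depth(n: Long): Eval[Long] =
  if (n == 0) Eval.now(0L)
  else Eval.defer(depth(n - 1)).map(_ + 1)

// A naive recursive equivalent would overflow the stack at this depth.
val result: Long = depth(1000000L).value
```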

@sageserpent-open sageserpent-open added the bug Something isn't working label Oct 7, 2024
sageserpent-open added a commit that referenced this issue Oct 7, 2024
…g invariant seen in #91 so that execution proceeds to merging. This then exhibits the bug targeted by #93.
@sageserpent-open
Owner Author

It's not just stack usage to worry about - the spikes at the end of the Kinetic Merge run captured as evidence for #91 show an alarming increase in heap and CPU usage when finding the longest common subsequence as part of merging:

[Screenshot 2024-10-18 at 08 42 33]

[Screenshot 2024-10-18 at 08 45 40]

This might be from the enclosing merge algorithm, though...

@sageserpent-open
Owner Author

sageserpent-open commented Oct 20, 2024

The story so far...

There are two different approaches being taken, currently leading to head commits 7850425 and 8f7683f.

The first approach changes LongestCommonSubsequence.of to use an exhaustive dynamic programming approach, avoiding recursive calls to gain stack safety. This works (and has been tried out with the original example of running the Kinetic Merge application), but still requires 10G of heap to avoid an out-of-memory fault.

The dynamic programming implementation still uses a cache of partial results, and there is a rather crude hack that tries to purge it dynamically; it is instructive to compare the size of the cache at the end of computing the top-level result with the theoretical upper bound on the number of entries it would hold without purging...

(NOTE: using dynamic programming results in a performance increase compared with the original implementation, but the purging technique is crude and drastically reduces performance to well below the original's.)
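For orientation, here is a hedged two-sequence sketch of the bottom-up technique (the real LongestCommonSubsequence.of works over three sides and builds the subsequence itself, so take this as illustrative only):

```scala
// Classic two-sequence LCS length computed bottom-up by dynamic programming.
// Every (i, j) subproblem is tabulated, so without purging the cache holds
// one entry per index combination - the second figure in each pair below is
// that theoretical upper bound for the real three-sided inputs.
def lcsLength[A](left: IndexedSeq[A], right: IndexedSeq[A]): Int = {
  val table = Array.ofDim[Int](left.size + 1, right.size + 1)
  for {
    i <- 1 to left.size
    j <- 1 to right.size
  } table(i)(j) =
    if (left(i - 1) == right(j - 1)) table(i - 1)(j - 1) + 1
    else math.max(table(i - 1)(j), table(i)(j - 1))
  table(left.size)(right.size)
}
```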

From running LongestCommonSubsequenceTest.theLongestCommonSubsequenceUnderpinsAllThreeResults:

(4001,85140)
(408,5508)
(324,1620)
(966,12992)
(1250,20000)
(2296,44772)
(1288,18676)
(2552,52316)
(1800,40500)
(1892,44462)
(1144,25168)
(2158,36504)
(297,1200)
(2813,42398)
(1807,21840)
(544,5984)
(4675,63168)
(3276,75348)
(801,8800)
(715,7150)
(200,3900)
(1508,22620)
(1920,32640)
(986,15283)

From running Kinetic Merge:

(117,288)
(8,8)
(57,144)
(36936,2123250)
(1303,14283)
(57,144)
(8,8)
(8,8)
(8,8)
(8,8)
(43,64)
(63,125)
(43,64)
(267,1331)
(213111,33775416)
(1422,14112)
(8,8)
(454,2940)
(87,216)
(929,8800)
(13782,558092)
(8,8)

So there is definitely some trimming going on, although perhaps it is only towards the end that it really starts to make a difference? Those entries:

(36936,2123250)
(1303,14283)

(213111,33775416)

(13782,558092)

are cause for concern. I had no idea that so many sections could be produced...

@sageserpent-open
Owner Author

OK, so the second approach goes back to what the original implementation did, doing a straightforward recursive decomposition (that is unfortunately not tail recursive) and using memoisation to gain the same benefit as dynamic programming, namely that any given subproblem isn't computed twice.

It turned out that the original implementation was only somewhat successful at memoisation; this has been cut over and the results are good. Let's revisit LongestCommonSubsequenceTest.theLongestCommonSubsequenceUnderpinsAllThreeResults; this time the first figure reflects the number of subproblem evaluations performed:


(85139,85140)
(5481,5508)
(1612,1620)
(12990,12992)
(19998,20000)
(6894,44772)
(16606,18676)
(52285,52316)
(40497,40500)
(44456,44462)
(25165,25168)
(5513,36504)
(1193,1200)
(16976,42398)
(21793,21840)
(1343,5984)
(63165,63168)
(22388,75348)
(446,8800)
(7149,7150)
(3862,3900)
(22617,22620)
(32637,32640)
(15282,15283)
(15707,15708)
(21915,21945)
(27986,28014)
(98788,98838)
(32719,47040)
(19962,19992)
(80353,80401)

What's going on here is that the subproblems are evaluated top-down by recursion, and only if they are actually needed - at each step the algorithm either drops a common element, which requires just one subproblem, or has to try dropping differing elements on each side, which requires multiple subproblems. This can avoid having to calculate every single subproblem, whereas the dynamic programming approach works bottom-up and thus lacks the calling context needed to avoid exploring subproblems.

Of course, the second approach is still recursive, and while it can avoid subproblems, it doesn't purge the cache.
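As a hedged two-sequence sketch of that top-down shape (the real code handles three sides, and stack safety is the remaining problem):

```scala
import scala.collection.mutable

// Top-down memoised recursion on two sequences. When the heads match we
// commit to a single subproblem; only when they differ do we fork and
// explore both drops. Memoisation means each (i, j) pair is evaluated at
// most once, and subproblems that are never reached are never evaluated at
// all - hence the first figure above coming in below the theoretical bound.
// Not stack-safe as written, just like the real thing.
def lcsLengthTopDown[A](left: IndexedSeq[A], right: IndexedSeq[A]): Int = {
  val cache = mutable.Map.empty[(Int, Int), Int]

  def of(i: Int, j: Int): Int =
    cache.getOrElseUpdate(
      (i, j),
      if (i == 0 || j == 0) 0
      else if (left(i - 1) == right(j - 1)) 1 + of(i - 1, j - 1) // one subproblem
      else math.max(of(i - 1, j), of(i, j - 1))                  // fork: two subproblems
    )

  of(left.size, right.size)
}
```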

@sageserpent-open
Owner Author

Checking with the first approach, the cache is being dynamically purged throughout the execution of the dynamic programming algorithm. So that's not to blame for the heap explosion - although it may be that the partial results (which are built out of Vector instances for each side of the result) aren't efficiently sharing structure, thus exploding the heap...

@sageserpent-open
Owner Author

sageserpent-open commented Oct 20, 2024

Would using a difference list in the first approach (or just a plain list followed by a reversal) allow better structure sharing between partial results? If so, that might get us out of the heap explosion problem.
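The idea, as a hedged sketch with hypothetical names (not the project's actual partial-result type):

```scala
// Prepending to an immutable List is O(1) and the extended partial result
// shares its entire tail with the one it came from, so sibling partial
// results can share most of their structure; a single reverse at the end
// restores the original order. Hypothetical names, illustrative only.
final case class PartialResult[Element](reversedCommonSubsequence: List[Element]) {
  def addCommon(element: Element): PartialResult[Element] =
    PartialResult(element :: reversedCommonSubsequence) // shares the old list wholesale

  def finished: Vector[Element] = reversedCommonSubsequence.reverse.toVector
}
```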

(This is probably going to affect work done on the other branch too.)

Another thing - the existing dynamic purging caches across index variations for the left and right sides, keeping the base index constant. In the interest of minimising the cache size, shouldn't we optimise the combination of sides? Not sure how much difference that would make in real life, though...

@sageserpent-open
Owner Author

sageserpent-open commented Oct 20, 2024

Meanwhile, back with the second approach, an idea is to stick with top-down memoisation, but to use continuations (in the form of Eval) to avoid stack overflow when executing subproblems. There is an extra twist in that the cached partial results are unevaluated lazy values, so the subproblem DAG can be evaluated in a breadth-first rather than depth-first fashion.

Hopefully this will allow the cache to be purged dynamically as the algorithm then works in horizontal swathes, so we can detect when all three sides have lost interest in cached partial results at a certain distance from the current 'wavefront'.
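A hedged sketch of the caching-unevaluated-subproblems idea, again on two sequences and with a plain mutable map rather than the StateT layering on the branch:

```scala
import cats.Eval
import scala.collection.mutable

// Each cache entry is a memoised Eval, so looking one up merely defers the
// work; Eval's trampolining keeps the recursion off the JVM stack, and
// memoisation still guarantees each subproblem is evaluated at most once.
// Evaluation order (depth- versus breadth-first) is a separate concern not
// shown here.
def lcsLengthEval[A](left: IndexedSeq[A], right: IndexedSeq[A]): Int = {
  val cache = mutable.Map.empty[(Int, Int), Eval[Int]]

  def of(i: Int, j: Int): Eval[Int] =
    cache.getOrElseUpdate(
      (i, j),
      Eval
        .defer {
          if (i == 0 || j == 0) Eval.now(0)
          else if (left(i - 1) == right(j - 1)) of(i - 1, j - 1).map(1 + _)
          else
            for {
              dropLeft  <- of(i - 1, j)
              dropRight <- of(i, j - 1)
            } yield math.max(dropLeft, dropRight)
        }
        .memoize
    )

  of(left.size, right.size).value
}
```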

@sageserpent-open
Owner Author

sageserpent-open commented Oct 21, 2024

Good news on the first approach, commit: 9f7cb5d...

From running LongestCommonSubsequenceTest.theLongestCommonSubsequenceUnderpinsAllThreeResults:

(3870,85140)
(408,5508)
(324,1620)
(896,12992)
(1250,20000)
(2296,44772)
(1288,18676)
(2552,52316)
(1800,40500)
(1892,44462)
(1144,25168)
(2028,36504)
(240,1200)
(2494,42398)
(1680,21840)
(544,5984)
(3948,63168)
(3276,75348)
(800,8800)
(650,7150)
(200,3900)
(1508,22620)
(1920,32640)
(986,15283)
(1848,15708)
(2310,21945)
(1932,28014)
(3468,98838)
(2940,47040)
(1428,19992)
(4346,80401)
(2310,48510)
(1152,18432)
(1134,9639)
(3956,85054)
(384,4416)
(1302,27342)
(2516,22644)
(2070,18630)
(2080,36400)
(462,8547)
(1456,24752)
(2100,32550)
(2418,47151)
(4784,100464)
(1664,23296)
(3864,63756)
(322,5796)
(1040,9880)
(2128,39368)
(3404,73186)
(112,1288)
(372,1302)
(2100,45150)
(2376,29700)
(1320,27720)
(4472,105092)
(1280,5760)
(696,12528)
(1472,16192)
(2204,20938)

From running Kinetic Merge:

(96,288)
(8,8)
(48,144)
(33972,2123250)
(1242,14283)
(48,144)
(8,8)
(8,8)
(8,8)
(8,8)
(32,64)
(50,125)
(32,64)
(242,1331)
(212424,33775416)
(1344,14112)
(8,8)
(420,2940)
(72,216)
(880,8800)
(13612,558092)
(8,8)

From VisualVM:
[Screenshot 2024-10-21 at 12 54 53]

So it's within 3G of heap during the merge / longest common subsequence when running Kinetic Merge.

@sageserpent-open
Owner Author

sageserpent-open commented Oct 21, 2024

Furthermore, there was no need to optimise structure sharing between the partial solutions - just trimming the cache was enough.

Commit 7850425 uses a map to implement the cache, whereas 9f7cb5d uses an array-backed deque, so possibly that also helps things?

So what would happen if this was applied to the other approach?
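For intuition on why trimming alone can be enough, here is a hedged two-sequence sketch: for the plain LCS length only the previous row of the table is ever consulted, so the cache can be kept to two rows regardless of input size. The real code has to retain enough partial results to reconstruct the subsequence across three sides, so its window is wider, but the principle is the same.

```scala
// Two-row variant of the bottom-up LCS length: purge everything except the
// previous row. Illustrative only; the cache in 9f7cb5d is an array-backed
// deque holding richer partial results.
def lcsLengthTwoRows[A](left: IndexedSeq[A], right: IndexedSeq[A]): Int = {
  var previous = Array.fill(right.size + 1)(0)
  var current  = Array.fill(right.size + 1)(0)
  for (i <- 1 to left.size) {
    for (j <- 1 to right.size)
      current(j) =
        if (left(i - 1) == right(j - 1)) previous(j - 1) + 1
        else math.max(previous(j), current(j - 1))
    // Swap the rows; index 0 is never written (stays 0) and every other slot
    // is overwritten on the next pass, so nothing needs clearing.
    val swap = previous; previous = current; current = swap
  }
  previous(right.size)
}
```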

@sageserpent-open
Owner Author

sageserpent-open commented Oct 21, 2024

NOTE: prior to closing this ticket, it might be worth revisiting the breakdown of filler sections that was fudged around in #43. That had to adopt a compromise where fillers are broken down into a prefix, a potential common part and a suffix, but perhaps we could be bolder and use one-token sections now. Hopefully the latest incarnation won't suffer from the merge alignment problems that were also present...

@sageserpent-open
Owner Author

Back on the second approach, 69a8c70 brings in the use of Eval layered within StateT to avoid stack overflow when doing a top-down memoised recursive breakdown of the subproblems.

Performance is lousy - LongestCommonSubsequenceTest suffers an eightfold slowdown compared with 38891f6, which is more or less what we started with, albeit with some housekeeping done on the old code.

This code uses a fussy pure-FP approach to caching, using an immutable map implementation. What happens if we:

  1. Go back to using an imperative cache implementation?
  2. Try to drop cached entries using a strategy similar to what was done for the first approach?

@sageserpent-open
Owner Author

sageserpent-open commented Oct 23, 2024

Commit 9cbe260 gives us back the imperative cache, and performance is improved, albeit not to the original level.

@sageserpent-open
Owner Author

sageserpent-open commented Oct 23, 2024

Running times for LongestCommonSubsequenceTest.

Baseline, 62dd55e: 20 seconds.

First approach...

Dynamic programming, a9d3f7f: 20 seconds.

Dynamic programming with aggressive purging of the cache, 9f7cb5d: 23 seconds.

Same but without lazy values, fa04ea4: 21 seconds.

Second approach...

Centralize caching, 8f7683f: 15 seconds.

More or less the same but without lazy values, 38891f6: 14 seconds.

Use an immutable cache implementation with StateT over Id, 1f77026: 58 seconds!

Same but with Eval instead of Id to gain stack-safety, 69a8c70: 115 seconds!

Go back to using an imperative cache, 9cbe260: 39 seconds.

@sageserpent-open
Owner Author

sageserpent-open commented Oct 23, 2024

It's fair to say that the second approach pulled ahead early but as soon as an immutable cache plus monadic workflow was adopted, performance went to the wall. It might have been interesting to see how StateT over Id with an imperative cache would have fared, but ultimately we need Eval.

Even putting the imperative cache back into the mix couldn't quite get back to earlier performance levels.

Now, there is still the outstanding business of trying to purge the cache on the second approach's branch. Doing this when working top-down requires the subproblems to be evaluated breadth-first in reverse order. Making that work in a continuation style means that all of the nodes in the call tree have to be encoded as continuations - so that is all of the subproblems evaluated top-down.

That means that even if the cache is incrementally purged, we still have to use heap space proportional to the number of evaluated subproblems - and that number can be high sometimes, even when taking the possible savings of the top-down approach into account.

That also ignores the added complexity of implementing a breadth-first evaluation strategy - unlike depth-first, it doesn't naturally emerge from the monad translation of top-down calls.

So it is with regret that I have to say, "Second approach, you're fired".

@sageserpent-open
Owner Author

Merged on to main in: 4ced27f.

@sageserpent-open
Owner Author

Went out in Release 1.4.0.
