Match merging has performance issues with submission sets of larger programs #1430
I profiled match merging for a large dataset (>400 submissions, ~1500 LoC each, with basecode enabled).

My current theory: due to the nature of match merging, smaller matches below the MTM are generated during greedy string tiling. This is required, because neighbors with lengths below the MTM can later be merged into eligible matches. However, it effectively forces the greedy string tiling to run down to a subsequence length of 2 (the default neighbor length), producing exponentially more subsequences and thus slowing the matching. So most of the overhead is algorithmic: for my dataset, enabling match merging comes with a ~4000% runtime overhead. Most samples come from

Besides that issue, I see two minor performance issues in the
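To illustrate the cost described above, here is a minimal Greedy String Tiling sketch (simplified from the classic Wise algorithm; this is NOT JPlag's actual implementation, and JPlag itself is written in Java). The point it shows: when the minimum match length is lowered from the usual MTM down to the neighbor length of 2, the tiling produces and scans many more small tiles instead of stopping early.

```python
def greedy_string_tiling(a, b, min_match_length):
    """Return the lengths of the tiles found between token sequences a and b.

    Simplified sketch of Greedy String Tiling: repeatedly find the longest
    unmarked common substring(s), mark them as tiles, and stop once no
    match of at least min_match_length remains.
    """
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_match = min_match_length
        queue = []  # (start_a, start_b, length) of the longest matches found
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_match:
                    max_match, queue = k, []
                if k == max_match and k >= min_match_length:
                    queue.append((i, j, k))
        if not queue:
            break
        for i, j, k in queue:
            # Skip matches occluded by tiles marked earlier in this round.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for x in range(k):
                marked_a[i + x] = True
                marked_b[j + x] = True
            tiles.append(k)
        if max_match == min_match_length:
            break
    return tiles


a = [1, 2, 3, 4, 5, 9, 1, 2, 7, 3, 4, 8]
b = [1, 2, 3, 4, 6, 1, 2, 5, 3, 4, 7, 8]
print(sorted(greedy_string_tiling(a, b, 4)))  # [4]       -- stops at MTM-sized tiles
print(sorted(greedy_string_tiling(a, b, 2)))  # [2, 2, 4] -- extra short tiles for merging
```

Even on this toy input, dropping the threshold to 2 yields additional tiles; on real submissions with ~1500 LoC each, the number of candidate subsequences of length 2 grows dramatically, which matches the observed algorithmic overhead.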
Hotfix: Setting
One idea to work around this problem would be to keep match merging activated by default but increase the neighbor length dynamically depending on the average number of tokens per submission. We would need measurements on different datasets with different neighbor lengths to see when the slowdown factor becomes excessive. Something like 2x to 4x is okay, but 40x is not. Then we would have a CLI parameter for match merging: off, auto, manual.
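The proposed "auto" mode could be sketched as follows. Everything here is hypothetical: the function name, the mode strings, and especially the token-count thresholds are invented placeholders; the real cut-offs would have to come from the measurements described above.

```python
def neighbor_length(mode, avg_tokens, manual_value=2):
    """Return the neighbor length to use for match merging, or None if off.

    Hypothetical heuristic: larger submissions get a larger neighbor length
    so that greedy string tiling does not have to run all the way down to
    subsequence length 2.
    """
    if mode == "off":
        return None
    if mode == "manual":
        return manual_value
    # "auto" mode: thresholds below are placeholders, not measured values.
    if avg_tokens < 1000:
        return 2   # small submissions: full merging is affordable
    if avg_tokens < 5000:
        return 5   # medium submissions: merge only moderately short neighbors
    return 9       # large submissions: merge only near-MTM neighbors


print(neighbor_length("auto", 800))     # 2
print(neighbor_length("auto", 20000))   # 9
print(neighbor_length("off", 800))      # None
```

A staged heuristic like this would keep the default behavior unchanged for typical course-sized submissions while bounding the slowdown on datasets like the one profiled above.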
When submissions exceed 1000 LoC, performance degrades. I have not yet investigated whether this is caused by CPU load, memory/paging issues, or a general algorithmic flaw.