Optimizer performance of APPROX_TOP_K function #51834
Hi @murphyatwork,
Hi @murphyatwork, I would love to know the difference in the time taken if you execute this.
hey dude, thanks for noticing this issue.
Why?
The idea to optimize the performance:
Also, if this algorithm (space-saving) is fundamentally inaccurate, we can consider other algorithms. But before that, we can focus on the performance optimization to make it practical.
Got it. Thanks for explaining.
Hi @murphyatwork I would like to proceed with the solution using the approach you suggested. Please find my attached proposal and assign me the issue.

Proposal: Optimizing the Approximate Top-K Algorithm (Space Saving)

Objective: Improve the performance and accuracy of the approximate top-K algorithm by optimizing the current sorting mechanism and increasing the number of counters tracked. These changes aim to reduce execution time and improve approximation accuracy without altering the core behavior of the algorithm.

Changes:
1. Optimize sorting with a heap-based approach: The current _maintain_ordering() function relies on a linear bubble sort to keep the counters sorted by frequency. I propose replacing this with a min-heap data structure, which allows more efficient insertion and deletion, reducing the time complexity from O(n) to O(log n) when maintaining the top-K counters (see the sketch after this comment).
2. Increase the number of counters for better accuracy: Currently, the algorithm uses approximately 2 * K counters, where K is the number of top elements to track. I propose increasing this to 4 * K counters (or a configurable multiple of K) to improve the accuracy of the approximation. While this introduces a slight memory overhead, it allows the algorithm to track more frequent elements, reducing the chance of incorrectly evicting important elements in memory-limited situations. This adjustment would be made within the get_k_and_counter_num() function, keeping the allocation within reasonable limits while boosting accuracy.

Expected impact:
- Performance: Moving from a linear sorting step to a heap-based approach should significantly reduce the cost of updating counters, especially in high-frequency scenarios, without affecting the core logic.
- Accuracy: Increasing the number of counters allows the algorithm to track more frequent items, improving the precision of the top-K approximation with minimal memory overhead.
- Compatibility: These optimisations do not alter the core functionality of the approximate top-K algorithm. The same input/output behaviour is maintained, the existing interface remains unchanged, and all serialisation, merging, and update operations should continue to work as expected.

Next steps: Any feedback and suggestions are welcome before I proceed with the implementation!
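As a rough illustration of change 1 above, here is a minimal, self-contained sketch of a space-saving counter set backed by a binary min-heap plus an item-to-index map. It is only an assumption of how the approach could look, not the project's existing code: the class name HeapSpaceSaving, the member names, and the use of std::string keys are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch: space-saving counters stored in a binary min-heap keyed
// by count, with a hash map from item to heap slot so increments can locate
// their counter directly. Both the increment path and the evict-and-replace
// path are O(log n) in the number of counters.
class HeapSpaceSaving {
public:
    explicit HeapSpaceSaving(size_t capacity) : _capacity(capacity) {}

    void update(const std::string& item) {
        auto it = _pos.find(item);
        if (it != _pos.end()) {
            // Existing counter: bump the count and restore the heap property.
            _heap[it->second].count++;
            _sift_down(it->second);
        } else if (_heap.size() < _capacity) {
            // Free slot: start a new counter with count 1 and no error.
            _heap.push_back({item, 1, 0});
            _pos[item] = _heap.size() - 1;
            _sift_up(_heap.size() - 1);
        } else {
            // Capacity reached: reuse the minimum counter (the heap root) for
            // the new item, carrying its old count forward as the error bound.
            Counter& root = _heap[0];
            _pos.erase(root.item);
            root.error = root.count;
            root.count += 1;
            root.item = item;
            _pos[item] = 0;
            _sift_down(0);
        }
    }

private:
    struct Counter {
        std::string item;
        uint64_t count;
        uint64_t error;
    };

    void _swap(size_t a, size_t b) {
        std::swap(_heap[a], _heap[b]);
        _pos[_heap[a].item] = a;
        _pos[_heap[b].item] = b;
    }

    void _sift_up(size_t i) {
        while (i > 0) {
            size_t parent = (i - 1) / 2;
            if (_heap[parent].count <= _heap[i].count) break;
            _swap(i, parent);
            i = parent;
        }
    }

    void _sift_down(size_t i) {
        while (true) {
            size_t smallest = i;
            size_t l = 2 * i + 1;
            size_t r = 2 * i + 2;
            if (l < _heap.size() && _heap[l].count < _heap[smallest].count) smallest = l;
            if (r < _heap.size() && _heap[r].count < _heap[smallest].count) smallest = r;
            if (smallest == i) break;
            _swap(i, smallest);
            i = smallest;
        }
    }

    size_t _capacity;
    std::vector<Counter> _heap;                    // min-heap ordered by count
    std::unordered_map<std::string, size_t> _pos;  // item -> index in _heap
};
```

The item-to-index map is what keeps the increment path at O(log n): without it, locating an item's counter inside the heap would itself cost O(n), and the heap would gain nothing over a sorted array.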
cool! can't wait for it
Hi @murphyatwork The current implementation of the space-saving algorithm already employs localised bubbling (not a full bubble sort) to maintain the ordering of counters efficiently. When a counter's count is updated (typically incremented by 1), it swaps positions with its immediate neighbours only if necessary. Because the changes in counts are usually small, the counter moves only a few positions, if at all, so the per-update operation is effectively O(1) in most cases (see the sketch below).

Key points:
- Localised bubbling: minimises the number of swaps needed to keep the counters sorted by count.
- Per-update complexity: O(1) on average, due to minimal movement of counters.
- Optimisation status: the algorithm is already optimised for both time and space complexity within the constraints of the space-saving algorithm.

If you think this is incorrect, we can do a quick catch-up and discuss this further.
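For reference, a minimal sketch of the localised bubbling described above, assuming the counters live in an array sorted by descending count; the struct and function names are made up for illustration, and a real implementation would also have to update its item-to-index map on every swap.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct Counter {
    std::string item;
    uint64_t count;
};

// `counters` is kept sorted by count in descending order. After a +1 update,
// the touched counter can only need to move left past neighbours that had the
// same count, so the loop usually runs zero or very few iterations.
void increment_and_bubble(std::vector<Counter>& counters, size_t i) {
    counters[i].count++;
    while (i > 0 && counters[i - 1].count < counters[i].count) {
        std::swap(counters[i - 1], counters[i]);  // a real version would also fix up item -> index here
        --i;
    }
}
```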
According to my previous profiling, I can still see that the bottleneck is counter maintenance.
I think your analysis is mostly correct: the algorithm's complexity is not a problem in the average case, but it can be in the worst case. If all items in the dataset have similar frequency, counters need to be removed and added very frequently, and that can become a bottleneck. For instance, when nearly every incoming item is absent from the counter set, each update has to evict the current minimum counter and re-establish the ordering.
I have no quick idea about how to handle this case; it looks like we need a new strategy for it.
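To get a feel for that worst case, the sketch below feeds an all-distinct stream into a deliberately simplified counter set and counts how often the evict-and-replace path is taken; the stream length, the counter budget, and the unordered_map stand-in are assumptions made purely for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unordered_map>

int main() {
    const uint64_t stream_len = 10000000;  // assumed stream size
    const size_t num_counters = 2 * 100;   // assumed counter budget: 2 * K with K = 100

    std::unordered_map<uint64_t, uint64_t> counters;
    uint64_t evictions = 0;

    for (uint64_t v = 0; v < stream_len; ++v) {  // every value is distinct
        auto it = counters.find(v);
        if (it != counters.end()) {
            it->second++;                        // cheap path: plain increment
        } else if (counters.size() < num_counters) {
            counters.emplace(v, 1);              // free slot, no eviction
        } else {
            // Evict-and-replace path: in the real sketch this also means
            // locating the minimum counter and restoring the ordering.
            counters.erase(counters.begin());    // stand-in for "evict the minimum"
            counters.emplace(v, 1);
            ++evictions;
        }
    }

    // With an all-distinct stream, almost every update is an eviction, so the
    // cost of maintaining the counter ordering dominates the aggregation.
    std::printf("evictions: %llu of %llu updates\n",
                (unsigned long long)evictions, (unsigned long long)stream_len);
    return 0;
}
```

With these assumed numbers, all but the first 200 updates go through the eviction path, which matches the intuition that near-uniform data defeats the cheap increment path.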
Hi @murphyatwork
Enhancement
It can be used to calculate top-k from a large dataset quickly, which is expected to be much faster than plain TOP-K.
But in practice it is not, and it is even slower than a plain GROUP-BY. Calculating top-k from ssb.lineorder: