
Optimizer performance of APPROX_TOP_K function #51834

Open
murphyatwork opened this issue Oct 12, 2024 · 11 comments
Labels: good first issue, type/enhancement

@murphyatwork
Contributor

Enhancement

APPROX_TOP_K can be used to calculate the top-k items of a large dataset quickly, and is expected to be much faster than a plain TOP-K query.

In practice it is not; it is even slower than a plain GROUP BY. Calculating the top-k of ssb.lineorder:

  • GROUP-BY: 1.413s
  • APPROX_TOP_K: 54.462s
MySQL root@127.1:ssb> select lo_orderkey, count(*) from lineorder group by lo_orderkey order by count(*) limit 5;
+-------------+----------+
| lo_orderkey | count(*) |
+-------------+----------+
| 30182       | 4        |
| 25441       | 4        |
| 13888       | 4        |
| 27172       | 4        |
| 1990        | 4        |
+-------------+----------+
5 rows in set
Time: 1.413s
MySQL root@127.1:ssb> select APPROX_TOP_K(lo_orderkey) from lineorder;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| approx_top_k(lo_orderkey)                                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [{"item":5998592,"count":32568},{"item":5988930,"count":32568},{"item":5977440,"count":32568},{"item":5995557,"count":32564},{"item":5991074,"count":32564}] |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
Time: 54.462s
murphyatwork added the type/enhancement and good first issue labels on Oct 12, 2024
@alphashiv27

Hi @murphyatwork,
I'm new to this project and would love guidance on how to start. Could you assign me this issue to get going?
Thank you!

@alphashiv27

Hi @murphyatwork,
The outputs of the two queries are different, and the per-item counts differ as well.
Both queries ran on the same table, yet the results are different.
If APPROX_TOP_K were merely slow, the results should still match, right?
Am I misunderstanding something here?

@alphashiv27

Hi @murphyatwork,
Your SQL query should be
select lo_orderkey, count(*) from lineorder group by lo_orderkey order by count(*) desc limit 5;

I would love to know the difference in the time taken when you execute this. Could you please run it and share the results?

@murphyatwork
Contributor Author

Hi @murphyatwork, your SQL query should be select lo_orderkey, count(*) from lineorder group by lo_orderkey order by count(*) desc limit 5; …

Hey dude, thanks for noticing this issue.

  1. The SQL: you're correct, it should be order by count(*) desc, but that doesn't affect the performance anyway.
  2. The different result: yes, the result is totally different even after I correct the SQL above. Since APPROX_TOP_K is merely approximate, some deviation is understandable, but a discrepancy this large is unexpected.

Why?

  1. APPROX_TOP_K uses the space-saving algorithm (sketched below).
  2. The algorithm relies on a frequency-inheritance assumption: when a new item arrives and all counters are occupied, it evicts the least-frequent counter and inherits its frequency plus 1. As a result, the estimated frequency can be much larger than the real value when the dataset has many items of similar frequency. For ssb.lineorder.lo_orderkey, that is exactly the case.
  3. The performance: the algorithm needs to find the least-frequent item among the counters on each update. Currently this is done with a sorted array, which must be re-ordered after every update. As far as I know, that is why it is so slow.
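For reference, here is a minimal, self-contained sketch of the space-saving update rule described above (illustrative only; the StarRocks implementation lives in be/src/exprs/agg/approx_top_k.h and differs in structure and naming):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Minimal space-saving sketch with at most `capacity` counters. When a new
// item arrives and all counters are occupied, the least-frequent counter is
// evicted and the new item inherits its count + 1 -- the "frequency
// inheritance" that inflates estimates on near-uniform data.
struct SpaceSaving {
    explicit SpaceSaving(size_t capacity) : capacity_(capacity) {}

    void process(int64_t item) {
        auto it = counts_.find(item);
        if (it != counts_.end()) {
            it->second += 1;              // known item: just increment
        } else if (counts_.size() < capacity_) {
            counts_.emplace(item, 1);     // free slot: start at 1
        } else {
            // Evict the minimum counter. The linear scan here is the step
            // the sorted array (and the heap idea below) tries to avoid.
            auto min_it = std::min_element(
                counts_.begin(), counts_.end(),
                [](const auto& a, const auto& b) { return a.second < b.second; });
            int64_t inherited = min_it->second + 1;
            counts_.erase(min_it);
            counts_.emplace(item, inherited);   // frequency inheritance
        }
    }

    size_t capacity_;
    std::unordered_map<int64_t, int64_t> counts_;
};
```

Feeding this sketch a stream of mostly-distinct keys makes every insertion take the eviction path, which is exactly the degenerate case discussed further down.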

Some ideas to optimize the performance:

  1. We can use a heap to find the least-frequent counter, avoiding the linear bubbling through a sorted array.
  2. Accuracy: we can keep more counters than required to improve accuracy, at the cost of a little extra space.

@murphyatwork
Contributor Author

Also, if this algorithm (space saving) is fundamentally inaccurate, we can consider other algorithms. But before that, let's focus on the performance optimization to make it practical.
If you want to contribute, I would suggest looking at the file be/src/exprs/agg/approx_top_k.h.

@alphashiv27

Got it. Thanks for explaining.
I will have a look into the file.

@alphashiv27

alphashiv27 commented Oct 14, 2024

Hi @murphyatwork, I would like to proceed with the solution using the approach you suggested. Please find my attached proposal and assign me the issue.

Proposal: Optimizing Approximate Top-K Algorithm (Space Saving)

Objective:

To improve the performance and accuracy of the approximate top-K algorithm by optimizing the current sorting mechanism and increasing the number of counters tracked. These changes aim to optimize the execution time and improve the approximation accuracy without altering the core behavior of the algorithm.

Changes:

Optimize Sorting with a Heap-Based Approach:

The current _maintain_ordering() function relies on a linear bubble sort to keep the counters sorted by frequency. I propose replacing this with a min-heap data structure, which will allow for more efficient insertion and deletion operations, reducing the time complexity from O(n) to O(log n) when maintaining the top-K counters.
This will improve the performance of updating and managing the counters during the process() method, especially when dealing with large datasets and frequent updates.
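As a rough sketch of this direction (assumed shape only; the actual _maintain_ordering() and its surrounding state in be/src/exprs/agg/approx_top_k.h differ), the counters could be kept in a binary min-heap keyed on count:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of a heap-backed ordering for the counters. heap[0] is always the
// least-frequent counter, so eviction is O(1) to find and O(log n) to fix
// up, instead of a linear pass over a sorted array.
struct CounterHeap {
    struct Counter { int64_t item; int64_t count; };
    std::vector<Counter> heap;

    // Restore the min-heap property downward from position i.
    void sift_down(size_t i) {
        size_t n = heap.size();
        while (true) {
            size_t smallest = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < n && heap[l].count < heap[smallest].count) smallest = l;
            if (r < n && heap[r].count < heap[smallest].count) smallest = r;
            if (smallest == i) break;
            std::swap(heap[i], heap[smallest]);
            i = smallest;
        }
    }

    // Space-saving eviction: replace the minimum with `item`, inheriting
    // min_count + 1, then sift down to restore the heap property.
    void replace_min(int64_t item) {
        heap[0] = {item, heap[0].count + 1};
        sift_down(0);
    }

    // Incrementing any counter only increases its key, so one sift_down
    // suffices. NOTE: a real version also needs an item -> heap-index map,
    // updated on every swap, so an existing counter can be located.
    void increment(size_t i) {
        heap[i].count += 1;
        sift_down(i);
    }
};
```

Note the trade-off: every increment changes a heap key and therefore pays a sift-down, so the heap mainly helps the worst case where a sorted array would need long bubbling passes.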

Increase Counter Size for Better Accuracy:

Currently, the algorithm uses approximately 2 * K counters, where K is the number of top elements to track. I propose increasing this number to 4 * K counters (or a configurable multiple of K) to improve the accuracy of the approximation. While this introduces a slight memory overhead, it will allow the algorithm to track more frequent elements, reducing the chances of incorrectly replacing important elements in memory-limited situations.

This adjustment will be made within the get_k_and_counter_num() function, keeping the allocation within reasonable limits while boosting accuracy.
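For illustration (a hypothetical helper; the real get_k_and_counter_num() may be shaped quite differently), the allocation could look like:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch of the adjusted allocation: track a configurable
// multiple of k counters, capped so memory stays bounded. The cap value
// here is arbitrary, chosen only for the example.
size_t counter_num_for(size_t k, size_t multiplier = 4,
                       size_t max_counters = 100000 /* arbitrary cap */) {
    return std::min(max_counters, k * multiplier);
}
```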

Expected Impact:

Performance: The change from a linear sorting algorithm to a heap-based approach should significantly reduce the time complexity when updating counters, especially in high-frequency scenarios, without affecting the core logic.

Accuracy: Increasing the number of counters will allow the algorithm to handle more frequent items, thereby improving the precision of the top-K approximation with minimal memory overhead.

Compatibility: These optimisations will not alter the core functionality of the approximate top-K algorithm. The same input/output behaviour will be maintained, and the existing interface for the algorithm remains unchanged. All serialisation, merging, and update operations should continue to work as expected.

Next Steps:

  1. Implement the heap-based sorting within the _maintain_ordering() function.
  2. Modify the counter allocation logic to allow for a configurable and slightly larger number of counters (e.g., 4 * K counters).
  3. Ensure that these changes do not impact the correctness by running all relevant unit tests and adding additional tests for performance and edge cases if required.

Any feedback and suggestions are welcome before I proceed with the implementation!

@murphyatwork
Contributor Author

Hi @murphyatwork, I would like to proceed with the solution using the approach you suggested. Please find my attached proposal and assign me the issue. …

cool! can't wait for it

@alphashiv27

Hi @murphyatwork
I started working on the proposed solution. However, I noticed something:

The current implementation of the space-saving algorithm already employs localised bubbling (and not bubble sort) to maintain the ordering of counters efficiently. This means that when a counter's count is updated (typically incremented by 1), it swaps positions with its immediate neighbours only if necessary. Because the changes in counts are usually small, the counter moves only a few positions, if at all. This localised swapping ensures that the per-update operation is effectively O(1) in most cases.

Key Points:

- Localised Bubbling:

Minimises the number of swaps needed to maintain the counters in sorted order by count.
Swapping is limited to immediate neighbours, reducing overhead.
Efficiently handles the frequent updates typical of data streams.
(A sketch of this bubbling step follows after this list.)

- Per-Update Operation Complexity:

Average case: O(1) time complexity due to minimal movement of counters.
Worst case: O(k) time complexity (where k is the number of counters), though the construction of the algorithm makes this unlikely.

- Optimisation Status:

The algorithm is already optimised for both time and space complexity within the constraints of the space-saving algorithm.
Further optimisations at the algorithmic level (such as replacing localised bubbling with a heap) are unlikely to yield significant performance gains.
A heap could even introduce additional overhead, because the heap property must be restored after every count update, which is unavoidable given how space saving increments counts.
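A sketch of the localised bubbling described above (illustrative only, not the actual _maintain_ordering() code):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Counter { int64_t item; int64_t count; };

// Localised bubbling sketch: counters are kept ascending by count, so the
// least-frequent counter sits at index 0. After counters[i].count is
// incremented by 1, swap it toward the back until order is restored.
// On skewed data it moves at most past a short run of equal counts (O(1));
// on near-uniform data the run can span all k counters (O(k)).
void maintain_ordering(std::vector<Counter>& counters, size_t i) {
    while (i + 1 < counters.size() &&
           counters[i].count > counters[i + 1].count) {
        std::swap(counters[i], counters[i + 1]);
        ++i;
    }
}
```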

If you think this is incorrect, we can do a quick catch-up and discuss it further.
Would love to have a quick chat!

@murphyatwork
Contributor Author

murphyatwork commented Oct 17, 2024

According to my previous profiling, the bottleneck is still the counter _maintain_ordering:

-   44.99%     1.72%  pip_exec_com     starrocks_be          [.] starrocks::AggregateFunctionBatchHelper<starrocks::ApproxTopKState<(starrocks::LogicalType)5>, starrocks::ApproxTopKAggregateFunction<(starrocks::LogicalType)5, int> >::upd…
   - 43.26% starrocks::AggregateFunctionBatchHelper<starrocks::ApproxTopKState<(starrocks::LogicalType)5>, starrocks::ApproxTopKAggregateFunction<(starrocks::LogicalType)5, int> >::update_batch_single_state
      - 35.39% starrocks::ApproxTopKState<(starrocks::LogicalType)5>::_maintain_ordering
           7.21% phmap::priv::raw_hash_set<phmap::priv::FlatHashMapPolicy<int, starrocks::ApproxTopKState<(starrocks::LogicalType)5>::Counter*>, starrocks::PhmapDefaultHashFunc<(starrocks::LogicalType)5, (starrocks::PhmapSeed)0>, phmap::…
      - 6.72% starrocks::ApproxTopKState<(starrocks::LogicalType)5>::process<true>
         + 1.61% starrocks::ApproxTopKState<(starrocks::LogicalType)5>::_min_index
        0.85% phmap::priv::raw_hash_set<phmap::priv::FlatHashMapPolicy<int, starrocks::ApproxTopKState<(starrocks::LogicalType)5>::Counter*>, starrocks::PhmapDefaultHashFunc<(starrocks::LogicalType)5, (starrocks::PhmapSeed)0>, phmap::Equ…
   + 1.72% start_thread

I think your analysis is mostly correct: the algorithmic complexity is not a problem in the average case, but it can be in the worst case. If all items in the dataset have similar frequency, counters need to be evicted and re-inserted very frequently, and that can be a bottleneck.

For instance:

  1. Suppose we have counters like [{a:1},{b:1},{c:1},{d:1},{e:1}].
  2. Processing a new item f, we find the min counter, evict it, and increment, giving [{b:1},{c:1},{d:1},{e:1},{f:2}]. Restoring the ordering requires iterating and swapping across all counters.
  3. If another new item comes in, the same process repeats and it ends up at the back.
  4. So in this case, processing each item needs to iterate over all the counters.
  5. Also, in this case the reported frequency is totally incorrect: it should be 1, but it can grow as large as the number of items processed.

I have no quick idea about how to handle this case; it looks like we need a new strategy for it. A tiny simulation of this degenerate case is sketched below.
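To make the failure mode concrete, here is a small self-contained simulation of the eviction rule on a stream of all-distinct items (illustrative only, not the StarRocks code; the numbers are made up for the example):

```cpp
#include <cstdio>
#include <vector>

// Degenerate case for space saving: a stream where every item is distinct.
// Once the k counters are full, each arrival evicts the current minimum and
// inherits min_count + 1, so stored counts climb even though every item's
// true frequency is 1.
int main() {
    const size_t k = 5;          // number of counters
    const long n = 1000000;      // stream length, all items distinct
    std::vector<long> counts;    // item identities don't matter here

    for (long i = 0; i < n; ++i) {
        if (counts.size() < k) {
            counts.push_back(1);                 // free slot: count starts at 1
        } else {
            size_t min_idx = 0;                  // linear scan for the minimum
            for (size_t j = 1; j < counts.size(); ++j)
                if (counts[j] < counts[min_idx]) min_idx = j;
            counts[min_idx] += 1;                // evict + inherit min_count + 1
        }
    }
    // Every count ends up around n / k (= 200000 here) instead of 1.
    for (long c : counts) std::printf("%ld ", c);
    std::printf("\n");
    return 0;
}
```

This is consistent with the inflated counts seen in the ssb.lineorder example above, where the true maximum frequency is 4 but APPROX_TOP_K reports counts in the tens of thousands.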

@alphashiv27

Hi @murphyatwork
Are we looking to optimize this by adopting a new algorithm other than space saving, or are we going to accept this behaviour, given that the worst case is infrequent?
