Reduce overhead and optimise partial sorting algorithm #53

mosullivan93 · 2023-06-26T23:22:37Z

The original implementation of the partial sorting algorithm was very basic: an initial avx512_qselect pass followed by avx512_qsort. This works, but unfortunately it can cause unnecessary comparisons and movement of data because the second sweep (using qsort) does not know what region (that surrounds $k$) is already sorted and what partitions had already been made during the first sweep.

This PR changes the way avx512_partial_qsort works so that it can specialise its behaviour while recursing. Compared with main, there was an almost 6% improvement in the partial sorting related methods.

Benchmarks were performed in the cloud on a VM with 8 vCPUs of an Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz.

Main vs PR Comparison

mosullivan@sprvm-1:~/x86-simd-sort/builddir$ $COMPARE benchmarks main.json pr.json
Comparing main.json to pr.json
Benchmark                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------
avx512_partial_qsort<float>/10                     -0.1554         -0.1548          4145          3501          4124          3486
avx512_partial_qsort<float>/100                    -0.1357         -0.1376          4359          3767          4339          3741
avx512_partial_qsort<float>/1000                   -0.0847         -0.0855          7363          6739          7344          6717
avx512_partial_qsort<float>/5000                   +0.0324         +0.0326         20576         21243         20558         21227
avx512_partial_qsort<uint32_t>/10                  -0.1730         -0.1745          3468          2868          3445          2844
avx512_partial_qsort<uint32_t>/100                 -0.1932         -0.1947          3553          2867          3529          2842
avx512_partial_qsort<uint32_t>/1000                -0.1125         -0.1130          5999          5324          5976          5301
avx512_partial_qsort<uint32_t>/5000                -0.0625         -0.0623         18569         17408         18552         17396
avx512_partial_qsort<int32_t>/10                   -0.1699         -0.1714          3412          2833          3391          2810
avx512_partial_qsort<int32_t>/100                  -0.1993         -0.2003          3542          2836          3518          2814
avx512_partial_qsort<int32_t>/1000                 -0.1192         -0.1193          5964          5253          5938          5230
avx512_partial_qsort<int32_t>/5000                 -0.0700         -0.0696         18375         17089         18356         17078
avx512_partial_qsort<double>/10                    -0.0267         -0.0268          6043          5882          6049          5887
avx512_partial_qsort<double>/100                   -0.0167         -0.0167          6179          6076          6184          6081
avx512_partial_qsort<double>/1000                  -0.0998         -0.0995         10208          9189         10211          9195
avx512_partial_qsort<double>/5000                  -0.0849         -0.0848         31744         29050         31752         29058
avx512_partial_qsort<uint64_t>/10                  -0.0160         -0.0158          6769          6661          6775          6667
avx512_partial_qsort<uint64_t>/100                 +0.0104         +0.0106          7101          7175          7106          7182
avx512_partial_qsort<uint64_t>/1000                -0.0634         -0.0634         15914         14904         15918         14908
avx512_partial_qsort<uint64_t>/5000                -0.0400         -0.0400         42757         41046         42766         41056
avx512_partial_qsort<int64_t>/10                   -0.0153         -0.0143          6774          6670          6773          6676
avx512_partial_qsort<int64_t>/100                  -0.0821         -0.0820          7734          7099          7741          7107
avx512_partial_qsort<int64_t>/1000                 -0.0559         -0.0558         15862         14975         15868         14982
avx512_partial_qsort<int64_t>/5000                 -0.0250         -0.0251         42110         41056         42119         41063
avx512_partial_qsort<uint16_t>/10                  +0.0132         +0.0136          3347          3391          3347          3393
avx512_partial_qsort<uint16_t>/100                 +0.0209         +0.0217          3447          3519          3446          3520
avx512_partial_qsort<uint16_t>/1000                -0.0723         -0.0713          5767          5350          5762          5351
avx512_partial_qsort<uint16_t>/5000                -0.0061         -0.0058         16529         16428         16525         16430
avx512_partial_qsort<int16_t>/10                   -0.0316         -0.0307          3425          3317          3425          3320
avx512_partial_qsort<int16_t>/100                  -0.0215         -0.0220          3527          3451          3527          3449
avx512_partial_qsort<int16_t>/1000                 -0.1035         -0.1033          5852          5247          5852          5248
avx512_partial_qsort<int16_t>/5000                 -0.0261         -0.0261         16579         16146         16581         16148
avx512_qselect<float>/10                           -0.1456         -0.1477          4108          3510          4085          3481
avx512_qselect<float>/100                          -0.1388         -0.1403          4200          3618          4178          3592
avx512_qselect<float>/1000                         -0.1409         -0.1422          4267          3666          4242          3638
avx512_qselect<float>/5000                         +0.0052         +0.0058          4135          4157          4109          4133
avx512_qselect<uint32_t>/10                        -0.1477         -0.1499          3344          2850          3323          2825
avx512_qselect<uint32_t>/100                       -0.1600         -0.1618          3402          2858          3380          2833
avx512_qselect<uint32_t>/1000                      -0.1605         -0.1621          3689          3097          3668          3073
avx512_qselect<uint32_t>/5000                      -0.0456         -0.0441          3888          3710          3860          3690
avx512_qselect<int32_t>/10                         -0.1767         -0.1779          3435          2828          3413          2806
avx512_qselect<int32_t>/100                        -0.1718         -0.1733          3418          2831          3397          2808
avx512_qselect<int32_t>/1000                       -0.1721         -0.1730          3693          3057          3672          3036
avx512_qselect<int32_t>/5000                       -0.0339         -0.0324          3889          3757          3861          3736
avx512_qselect<double>/10                          -0.0103         -0.0093          5871          5811          5873          5818
avx512_qselect<double>/100                         +0.0071         +0.0074          5845          5887          5849          5892
avx512_qselect<double>/1000                        -0.0016         -0.0006          5588          5579          5584          5581
avx512_qselect<double>/5000                        -0.0366         -0.0369          6839          6588          6845          6592
avx512_qselect<uint64_t>/10                        -0.0013         -0.0010          6728          6720          6731          6725
avx512_qselect<uint64_t>/100                       +0.0038         +0.0046          6690          6716          6692          6723
avx512_qselect<uint64_t>/1000                      +0.0121         +0.0141          8970          9078          8958          9084
avx512_qselect<uint64_t>/5000                      +0.0025         +0.0031          8752          8774          8755          8782
avx512_qselect<int64_t>/10                         -0.0010         -0.0001          6730          6723          6729          6728
avx512_qselect<int64_t>/100                        -0.0023         -0.0020          6750          6735          6753          6740
avx512_qselect<int64_t>/1000                       -0.0096         -0.0096          9073          8987          9077          8991
avx512_qselect<int64_t>/5000                       -0.0282         -0.0286          8752          8505          8756          8506
avx512_qselect<uint16_t>/10                        -0.0116         -0.0101          3304          3266          3300          3267
avx512_qselect<uint16_t>/100                       -0.0105         -0.0098          3353          3318          3349          3316
avx512_qselect<uint16_t>/1000                      -0.0096         -0.0080          3376          3344          3372          3345
avx512_qselect<uint16_t>/5000                      -0.0022         -0.0013          3651          3643          3650          3645
avx512_qselect<int16_t>/10                         -0.0155         -0.0152          3356          3304          3356          3305
avx512_qselect<int16_t>/100                        -0.0146         -0.0135          3393          3344          3392          3346
avx512_qselect<int16_t>/1000                       -0.0154         -0.0153          3433          3381          3431          3379
avx512_qselect<int16_t>/5000                       -0.0144         -0.0138          3691          3638          3689          3638
avx512_qselect<_Float16>/10                        -0.0361         -0.0360          4298          4143          4299          4144
avx512_qselect<_Float16>/100                       -0.0258         -0.0253          3754          3657          3753          3658
avx512_qselect<_Float16>/1000                      -0.0352         -0.0313          3557          3432          3555          3444
avx512_qselect<_Float16>/5000                      -0.0258         -0.0235          3463          3374          3456          3375
avx512_partial_qsort<_Float16>/10                  -0.0020         +0.0004          3575          3568          3568          3569
avx512_partial_qsort<_Float16>/100                 +0.0025         +0.0030          4105          4115          4104          4117
avx512_partial_qsort<_Float16>/1000                -0.1008         -0.1006          7399          6653          7399          6655
avx512_partial_qsort<_Float16>/5000                -0.0573         -0.0572         21202         19986         21198         19987
OVERALL_GEOMEAN                                    -0.0593         -0.0593             0             0             0             0

Originally there were additional commits that made changes that are now irrelevant. Originally, this PR made large changes to the operation of avx512_qsort, too, but it now focusses solely on specialising avx512_partial_qsort instead. There was a logical mistake that slipped through the test cases and had artificially improved the sorting performance.

mosullivan93 · 2023-06-28T08:14:15Z

I've identified a rather significant bug in my changes. I'll retest this with the fix applied.
I'm a little concerned that the issue wasn't picked up with the built in tests (at least when I ran locally). I'll see what I can learn.

The original partial sort function was a combination of a QuickSelect pass followed by a QuickSort pass. This can lead to unnecessary comparisons and movement of data. This update changes the internal method used by `avx512_partial_qsort` so that it can more efficiently sort or recurse into each partition as required and without needing to take a second sweep over the array. This commit also tweaks the conditional checks in `avx512_qselect` to avoid a redundant comparison.

mosullivan93 · 2023-06-28T16:39:48Z

I had a problem with the right recursion in the first version of this PR that I missed because all of the tests passed without complaint. The seeds are fixed as 42 in the functions in utils/rand_array.h (I'm guessing this is for consistency in the benchmarking) so I didn't originally pick up on the issue because k was always 4 in the partial range test and the recursion was always into the left partition until the bitonic sort would take over. (This also led me to believe my code was a lot faster than it really was 😬).

Anyway, this is working now so happy to get some feedback.

x86-simd-sort/utils/rand_array.h

Line 21 in 85f4e9c

e1.seed(42);

x86-simd-sort/utils/rand_array.h

Line 38 in 85f4e9c

gen.seed(42);

mosullivan93 · 2023-06-29T04:46:40Z

Ah. Sorry, I didn't run a full test on GitHub because I never pushed to my main branch. I see Google's changed their repository a little bit so the workflow cannot run. I've proposed a simple workaround on #54 and I can rebase these once this is resolved.

r-devulap · 2023-07-04T04:20:17Z

thanks for your PR, allow me some time to review it. will get back to you as soon as I can :)

r-devulap

[Edited: looks like I pasted avx512_qselect instead of avx512_partialsort]

I am not seeing any improvements on my SKX or my TGL.

on SKX:

Comparing avx512_partial_qsort (from ./builddir-main/benchexe) to avx512_partial_qsort (from ./builddir-partialsort/benchexe)
Benchmark                                                                        Time             CPU      Time Old      Time New       CPU Old       CPU New
-------------------------------------------------------------------------------------------------------------------------------------------------------------
[avx512_partial_qsort vs. avx512_partial_qsort]<float>/10                     +0.0036         +0.0030          5114          5132          5116          5131
[avx512_partial_qsort vs. avx512_partial_qsort]<float>/100                    +0.0163         +0.0161          5285          5371          5287          5372
[avx512_partial_qsort vs. avx512_partial_qsort]<float>/1000                   -0.0349         -0.0348          8575          8276          8577          8279
[avx512_partial_qsort vs. avx512_partial_qsort]<float>/5000                   +0.0370         +0.0369         23481         24349         23480         24347

on TGL:

Comparing avx512_partial_qsort<float> (from ./builddir-main/benchexe) to avx512_partial_qsort<float> (from ./builddir-partialsort/benchexe)
Benchmark                                                                            Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
[avx512_partial_qsort<float> vs. avx512_partial_qsort<float>]/10                  -0.0014         -0.0016          4468          4462          4454          4446
[avx512_partial_qsort<float> vs. avx512_partial_qsort<float>]/100                 -0.0052         -0.0018          4703          4679          4671          4663
[avx512_partial_qsort<float> vs. avx512_partial_qsort<float>]/1000                -0.0663         -0.0662          8002          7471          7983          7454
[avx512_partial_qsort<float> vs. avx512_partial_qsort<float>]/5000                +0.0205         +0.0214         22357         22815         22325         22803
OVERALL_GEOMEAN                                                                   -0.0136         -0.0126             0             0             0             0

r-devulap · 2023-08-03T20:07:04Z

src/avx512-16bit-common.h

- else if ((pivot != biggest) && (pos >= pivot_index))
- qselect_16bit_<vtype>(arr, pos, pivot_index, right, max_iters - 1);
+
+ if (pos < pivot_index) {


Would prefer to keep the if conditions the way they are, just more readable I think.

r-devulap · 2023-08-03T20:08:33Z

src/avx512-16bit-common.h

+}
+
+template <typename vtype, typename type_t>
+static void qselsort_16bit_(type_t *arr,


Think we need a better name. May be qpartialsort_*?

r-devulap · 2023-08-03T20:27:08Z

Your benchmark also reports unto a 14% improvement on avx512_qselect which is a bit surprising since there should be no changes to the way quick select runs with this patch. I ran quickselect benchmarks comparison multiple times on my SKX and don't see any change to it which is what I expect.

mosullivan93 · 2023-08-07T07:53:03Z

Thanks for taking a look, @r-devulap.

I'll rebase, address your comments, and investigate the performance when I have a chance over the next week or so.

mosullivan93 · 2023-09-14T03:47:48Z

@r-devulap
I apologize for the delay in revisiting this. Given the updates to the library since this PR was authored, I expect quite a few adjustments will be required during rebasing. I do intend to handle it eventually. If it's better for repository management, let me know if I should close the PR for the time being.

r-devulap · 2023-09-18T04:30:41Z

No worries. You can keep it open if you wish, I don't mind either way.

mosullivan93 force-pushed the singlepasspartial branch from 28abc6c to 03948d8 Compare June 27, 2023 14:15

mosullivan93 marked this pull request as draft June 28, 2023 05:53

mosullivan93 force-pushed the singlepasspartial branch from 03948d8 to bc73d86 Compare June 28, 2023 16:14

mosullivan93 marked this pull request as ready for review June 28, 2023 16:39

r-devulap reviewed Aug 3, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce overhead and optimise partial sorting algorithm #53

Reduce overhead and optimise partial sorting algorithm #53

mosullivan93 commented Jun 26, 2023 •

edited

Loading

mosullivan93 commented Jun 28, 2023

mosullivan93 commented Jun 28, 2023

mosullivan93 commented Jun 29, 2023

r-devulap commented Jul 4, 2023 •

edited

Loading

r-devulap left a comment •

edited

Loading

r-devulap Aug 3, 2023

r-devulap Aug 3, 2023

r-devulap commented Aug 3, 2023 •

edited

Loading

mosullivan93 commented Aug 7, 2023

mosullivan93 commented Sep 14, 2023

r-devulap commented Sep 18, 2023

Reduce overhead and optimise partial sorting algorithm #53

Are you sure you want to change the base?

Reduce overhead and optimise partial sorting algorithm #53

Conversation

mosullivan93 commented Jun 26, 2023 • edited Loading

mosullivan93 commented Jun 28, 2023

mosullivan93 commented Jun 28, 2023

mosullivan93 commented Jun 29, 2023

r-devulap commented Jul 4, 2023 • edited Loading

r-devulap left a comment • edited Loading

Choose a reason for hiding this comment

r-devulap Aug 3, 2023

Choose a reason for hiding this comment

r-devulap Aug 3, 2023

Choose a reason for hiding this comment

r-devulap commented Aug 3, 2023 • edited Loading

mosullivan93 commented Aug 7, 2023

mosullivan93 commented Sep 14, 2023

r-devulap commented Sep 18, 2023

mosullivan93 commented Jun 26, 2023 •

edited

Loading

r-devulap commented Jul 4, 2023 •

edited

Loading

r-devulap left a comment •

edited

Loading

r-devulap commented Aug 3, 2023 •

edited

Loading