Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce overhead and optimise partial sorting algorithm #53

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mosullivan93
Copy link
Contributor

@mosullivan93 mosullivan93 commented Jun 26, 2023

The original implementation of the partial sorting algorithm was very basic: an initial avx512_qselect pass followed by avx512_qsort. This works, but unfortunately it can cause unnecessary comparisons and movement of data because the second sweep (using qsort) does not know what region (that surrounds $k$) is already sorted and what partitions had already been made during the first sweep.

This PR changes the way avx512_partial_qsort works so that it can specialise its behaviour while recursing. Compared with main, there was an almost 6% improvement in the partial sorting related methods.


Benchmarks were performed in the cloud on a VM with 8 vCPUs of an Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz.

Main vs PR Comparison
mosullivan@sprvm-1:~/x86-simd-sort/builddir$ $COMPARE benchmarks main.json pr.json
Comparing main.json to pr.json
Benchmark                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------
avx512_partial_qsort<float>/10                     -0.1554         -0.1548          4145          3501          4124          3486
avx512_partial_qsort<float>/100                    -0.1357         -0.1376          4359          3767          4339          3741
avx512_partial_qsort<float>/1000                   -0.0847         -0.0855          7363          6739          7344          6717
avx512_partial_qsort<float>/5000                   +0.0324         +0.0326         20576         21243         20558         21227
avx512_partial_qsort<uint32_t>/10                  -0.1730         -0.1745          3468          2868          3445          2844
avx512_partial_qsort<uint32_t>/100                 -0.1932         -0.1947          3553          2867          3529          2842
avx512_partial_qsort<uint32_t>/1000                -0.1125         -0.1130          5999          5324          5976          5301
avx512_partial_qsort<uint32_t>/5000                -0.0625         -0.0623         18569         17408         18552         17396
avx512_partial_qsort<int32_t>/10                   -0.1699         -0.1714          3412          2833          3391          2810
avx512_partial_qsort<int32_t>/100                  -0.1993         -0.2003          3542          2836          3518          2814
avx512_partial_qsort<int32_t>/1000                 -0.1192         -0.1193          5964          5253          5938          5230
avx512_partial_qsort<int32_t>/5000                 -0.0700         -0.0696         18375         17089         18356         17078
avx512_partial_qsort<double>/10                    -0.0267         -0.0268          6043          5882          6049          5887
avx512_partial_qsort<double>/100                   -0.0167         -0.0167          6179          6076          6184          6081
avx512_partial_qsort<double>/1000                  -0.0998         -0.0995         10208          9189         10211          9195
avx512_partial_qsort<double>/5000                  -0.0849         -0.0848         31744         29050         31752         29058
avx512_partial_qsort<uint64_t>/10                  -0.0160         -0.0158          6769          6661          6775          6667
avx512_partial_qsort<uint64_t>/100                 +0.0104         +0.0106          7101          7175          7106          7182
avx512_partial_qsort<uint64_t>/1000                -0.0634         -0.0634         15914         14904         15918         14908
avx512_partial_qsort<uint64_t>/5000                -0.0400         -0.0400         42757         41046         42766         41056
avx512_partial_qsort<int64_t>/10                   -0.0153         -0.0143          6774          6670          6773          6676
avx512_partial_qsort<int64_t>/100                  -0.0821         -0.0820          7734          7099          7741          7107
avx512_partial_qsort<int64_t>/1000                 -0.0559         -0.0558         15862         14975         15868         14982
avx512_partial_qsort<int64_t>/5000                 -0.0250         -0.0251         42110         41056         42119         41063
avx512_partial_qsort<uint16_t>/10                  +0.0132         +0.0136          3347          3391          3347          3393
avx512_partial_qsort<uint16_t>/100                 +0.0209         +0.0217          3447          3519          3446          3520
avx512_partial_qsort<uint16_t>/1000                -0.0723         -0.0713          5767          5350          5762          5351
avx512_partial_qsort<uint16_t>/5000                -0.0061         -0.0058         16529         16428         16525         16430
avx512_partial_qsort<int16_t>/10                   -0.0316         -0.0307          3425          3317          3425          3320
avx512_partial_qsort<int16_t>/100                  -0.0215         -0.0220          3527          3451          3527          3449
avx512_partial_qsort<int16_t>/1000                 -0.1035         -0.1033          5852          5247          5852          5248
avx512_partial_qsort<int16_t>/5000                 -0.0261         -0.0261         16579         16146         16581         16148
avx512_qselect<float>/10                           -0.1456         -0.1477          4108          3510          4085          3481
avx512_qselect<float>/100                          -0.1388         -0.1403          4200          3618          4178          3592
avx512_qselect<float>/1000                         -0.1409         -0.1422          4267          3666          4242          3638
avx512_qselect<float>/5000                         +0.0052         +0.0058          4135          4157          4109          4133
avx512_qselect<uint32_t>/10                        -0.1477         -0.1499          3344          2850          3323          2825
avx512_qselect<uint32_t>/100                       -0.1600         -0.1618          3402          2858          3380          2833
avx512_qselect<uint32_t>/1000                      -0.1605         -0.1621          3689          3097          3668          3073
avx512_qselect<uint32_t>/5000                      -0.0456         -0.0441          3888          3710          3860          3690
avx512_qselect<int32_t>/10                         -0.1767         -0.1779          3435          2828          3413          2806
avx512_qselect<int32_t>/100                        -0.1718         -0.1733          3418          2831          3397          2808
avx512_qselect<int32_t>/1000                       -0.1721         -0.1730          3693          3057          3672          3036
avx512_qselect<int32_t>/5000                       -0.0339         -0.0324          3889          3757          3861          3736
avx512_qselect<double>/10                          -0.0103         -0.0093          5871          5811          5873          5818
avx512_qselect<double>/100                         +0.0071         +0.0074          5845          5887          5849          5892
avx512_qselect<double>/1000                        -0.0016         -0.0006          5588          5579          5584          5581
avx512_qselect<double>/5000                        -0.0366         -0.0369          6839          6588          6845          6592
avx512_qselect<uint64_t>/10                        -0.0013         -0.0010          6728          6720          6731          6725
avx512_qselect<uint64_t>/100                       +0.0038         +0.0046          6690          6716          6692          6723
avx512_qselect<uint64_t>/1000                      +0.0121         +0.0141          8970          9078          8958          9084
avx512_qselect<uint64_t>/5000                      +0.0025         +0.0031          8752          8774          8755          8782
avx512_qselect<int64_t>/10                         -0.0010         -0.0001          6730          6723          6729          6728
avx512_qselect<int64_t>/100                        -0.0023         -0.0020          6750          6735          6753          6740
avx512_qselect<int64_t>/1000                       -0.0096         -0.0096          9073          8987          9077          8991
avx512_qselect<int64_t>/5000                       -0.0282         -0.0286          8752          8505          8756          8506
avx512_qselect<uint16_t>/10                        -0.0116         -0.0101          3304          3266          3300          3267
avx512_qselect<uint16_t>/100                       -0.0105         -0.0098          3353          3318          3349          3316
avx512_qselect<uint16_t>/1000                      -0.0096         -0.0080          3376          3344          3372          3345
avx512_qselect<uint16_t>/5000                      -0.0022         -0.0013          3651          3643          3650          3645
avx512_qselect<int16_t>/10                         -0.0155         -0.0152          3356          3304          3356          3305
avx512_qselect<int16_t>/100                        -0.0146         -0.0135          3393          3344          3392          3346
avx512_qselect<int16_t>/1000                       -0.0154         -0.0153          3433          3381          3431          3379
avx512_qselect<int16_t>/5000                       -0.0144         -0.0138          3691          3638          3689          3638
avx512_qselect<_Float16>/10                        -0.0361         -0.0360          4298          4143          4299          4144
avx512_qselect<_Float16>/100                       -0.0258         -0.0253          3754          3657          3753          3658
avx512_qselect<_Float16>/1000                      -0.0352         -0.0313          3557          3432          3555          3444
avx512_qselect<_Float16>/5000                      -0.0258         -0.0235          3463          3374          3456          3375
avx512_partial_qsort<_Float16>/10                  -0.0020         +0.0004          3575          3568          3568          3569
avx512_partial_qsort<_Float16>/100                 +0.0025         +0.0030          4105          4115          4104          4117
avx512_partial_qsort<_Float16>/1000                -0.1008         -0.1006          7399          6653          7399          6655
avx512_partial_qsort<_Float16>/5000                -0.0573         -0.0572         21202         19986         21198         19987
OVERALL_GEOMEAN                                    -0.0593         -0.0593             0             0             0             0

Originally there were additional commits that made changes that are now irrelevant. Originally, this PR made large changes to the operation of avx512_qsort, too, but it now focusses solely on specialising avx512_partial_qsort instead. There was a logical mistake that slipped through the test cases and had artificially improved the sorting performance.

@mosullivan93 mosullivan93 marked this pull request as draft June 28, 2023 05:53
@mosullivan93
Copy link
Contributor Author

I've identified a rather significant bug in my changes. I'll retest this with the fix applied.
I'm a little concerned that the issue wasn't picked up with the built in tests (at least when I ran locally). I'll see what I can learn.

The original partial sort function was a combination of a QuickSelect
pass followed by a QuickSort pass. This can lead to unnecessary
comparisons and movement of data. This update changes the internal
method used by `avx512_partial_qsort` so that it can more efficiently
sort or recurse into each partition as required and without needing to
take a second sweep over the array. This commit also tweaks the
conditional checks in `avx512_qselect` to avoid a redundant comparison.
@mosullivan93
Copy link
Contributor Author

I had a problem with the right recursion in the first version of this PR that I missed because all of the tests passed without complaint. The seeds are fixed as 42 in the functions in utils/rand_array.h (I'm guessing this is for consistency in the benchmarking) so I didn't originally pick up on the issue because k was always 4 in the partial range test and the recursion was always into the left partition until the bitonic sort would take over. (This also led me to believe my code was a lot faster than it really was 😬).

Anyway, this is working now so happy to get some feedback.


e1.seed(42);

gen.seed(42);

@mosullivan93 mosullivan93 marked this pull request as ready for review June 28, 2023 16:39
@mosullivan93
Copy link
Contributor Author

Ah. Sorry, I didn't run a full test on GitHub because I never pushed to my main branch. I see Google's changed their repository a little bit so the workflow cannot run. I've proposed a simple workaround on #54 and I can rebase these once this is resolved.

@r-devulap
Copy link
Contributor

r-devulap commented Jul 4, 2023

thanks for your PR, allow me some time to review it. will get back to you as soon as I can :)

Copy link
Contributor

@r-devulap r-devulap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Edited: looks like I pasted avx512_qselect instead of avx512_partialsort]

I am not seeing any improvements on my SKX or my TGL.

on SKX:

Comparing avx512_partial_qsort (from ./builddir-main/benchexe) to avx512_partial_qsort (from ./builddir-partialsort/benchexe)
Benchmark                                                                        Time             CPU      Time Old      Time New       CPU Old       CPU New
-------------------------------------------------------------------------------------------------------------------------------------------------------------
[avx512_partial_qsort vs. avx512_partial_qsort]<float>/10                     +0.0036         +0.0030          5114          5132          5116          5131
[avx512_partial_qsort vs. avx512_partial_qsort]<float>/100                    +0.0163         +0.0161          5285          5371          5287          5372
[avx512_partial_qsort vs. avx512_partial_qsort]<float>/1000                   -0.0349         -0.0348          8575          8276          8577          8279
[avx512_partial_qsort vs. avx512_partial_qsort]<float>/5000                   +0.0370         +0.0369         23481         24349         23480         24347

on TGL:

Comparing avx512_partial_qsort<float> (from ./builddir-main/benchexe) to avx512_partial_qsort<float> (from ./builddir-partialsort/benchexe)
Benchmark                                                                            Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
[avx512_partial_qsort<float> vs. avx512_partial_qsort<float>]/10                  -0.0014         -0.0016          4468          4462          4454          4446
[avx512_partial_qsort<float> vs. avx512_partial_qsort<float>]/100                 -0.0052         -0.0018          4703          4679          4671          4663
[avx512_partial_qsort<float> vs. avx512_partial_qsort<float>]/1000                -0.0663         -0.0662          8002          7471          7983          7454
[avx512_partial_qsort<float> vs. avx512_partial_qsort<float>]/5000                +0.0205         +0.0214         22357         22815         22325         22803
OVERALL_GEOMEAN                                                                   -0.0136         -0.0126             0             0             0             0

else if ((pivot != biggest) && (pos >= pivot_index))
qselect_16bit_<vtype>(arr, pos, pivot_index, right, max_iters - 1);

if (pos < pivot_index) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer to keep the if conditions the way they are, just more readable I think.

}

template <typename vtype, typename type_t>
static void qselsort_16bit_(type_t *arr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we need a better name. May be qpartialsort_*?

@r-devulap
Copy link
Contributor

r-devulap commented Aug 3, 2023

Your benchmark also reports unto a 14% improvement on avx512_qselect which is a bit surprising since there should be no changes to the way quick select runs with this patch. I ran quickselect benchmarks comparison multiple times on my SKX and don't see any change to it which is what I expect.

@mosullivan93
Copy link
Contributor Author

Thanks for taking a look, @r-devulap.

I'll rebase, address your comments, and investigate the performance when I have a chance over the next week or so.

@mosullivan93
Copy link
Contributor Author

@r-devulap
I apologize for the delay in revisiting this. Given the updates to the library since this PR was authored, I expect quite a few adjustments will be required during rebasing. I do intend to handle it eventually. If it's better for repository management, let me know if I should close the PR for the time being.

@r-devulap
Copy link
Contributor

No worries. You can keep it open if you wish, I don't mind either way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants