-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce overhead and optimise partial sorting algorithm #53
base: main
Are you sure you want to change the base?
Conversation
28abc6c
to
03948d8
Compare
I've identified a rather significant bug in my changes. I'll retest this with the fix applied. |
The original partial sort function was a combination of a QuickSelect pass followed by a QuickSort pass. This can lead to unnecessary comparisons and movement of data. This update changes the internal method used by `avx512_partial_qsort` so that it can more efficiently sort or recurse into each partition as required and without needing to take a second sweep over the array. This commit also tweaks the conditional checks in `avx512_qselect` to avoid a redundant comparison.
03948d8
to
bc73d86
Compare
I had a problem with the right recursion in the first version of this PR that I missed because all of the tests passed without complaint. The seeds are fixed as 42 in the functions in Anyway, this is working now so happy to get some feedback. x86-simd-sort/utils/rand_array.h Line 21 in 85f4e9c
x86-simd-sort/utils/rand_array.h Line 38 in 85f4e9c
|
Ah. Sorry, I didn't run a full test on GitHub because I never pushed to my main branch. I see Google's changed their repository a little bit so the workflow cannot run. I've proposed a simple workaround on #54 and I can rebase these once this is resolved. |
thanks for your PR, allow me some time to review it. will get back to you as soon as I can :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Edited: looks like I pasted avx512_qselect
instead of avx512_partialsort
]
I am not seeing any improvements on my SKX or my TGL.
on SKX:
Comparing avx512_partial_qsort (from ./builddir-main/benchexe) to avx512_partial_qsort (from ./builddir-partialsort/benchexe)
Benchmark Time CPU Time Old Time New CPU Old CPU New
-------------------------------------------------------------------------------------------------------------------------------------------------------------
[avx512_partial_qsort vs. avx512_partial_qsort]<float>/10 +0.0036 +0.0030 5114 5132 5116 5131
[avx512_partial_qsort vs. avx512_partial_qsort]<float>/100 +0.0163 +0.0161 5285 5371 5287 5372
[avx512_partial_qsort vs. avx512_partial_qsort]<float>/1000 -0.0349 -0.0348 8575 8276 8577 8279
[avx512_partial_qsort vs. avx512_partial_qsort]<float>/5000 +0.0370 +0.0369 23481 24349 23480 24347
on TGL:
Comparing avx512_partial_qsort<float> (from ./builddir-main/benchexe) to avx512_partial_qsort<float> (from ./builddir-partialsort/benchexe)
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
[avx512_partial_qsort<float> vs. avx512_partial_qsort<float>]/10 -0.0014 -0.0016 4468 4462 4454 4446
[avx512_partial_qsort<float> vs. avx512_partial_qsort<float>]/100 -0.0052 -0.0018 4703 4679 4671 4663
[avx512_partial_qsort<float> vs. avx512_partial_qsort<float>]/1000 -0.0663 -0.0662 8002 7471 7983 7454
[avx512_partial_qsort<float> vs. avx512_partial_qsort<float>]/5000 +0.0205 +0.0214 22357 22815 22325 22803
OVERALL_GEOMEAN -0.0136 -0.0126 0 0 0 0
else if ((pivot != biggest) && (pos >= pivot_index)) | ||
qselect_16bit_<vtype>(arr, pos, pivot_index, right, max_iters - 1); | ||
|
||
if (pos < pivot_index) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would prefer to keep the if conditions the way they are, just more readable I think.
} | ||
|
||
template <typename vtype, typename type_t> | ||
static void qselsort_16bit_(type_t *arr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think we need a better name. May be qpartialsort_*
?
Your benchmark also reports unto a 14% improvement on |
Thanks for taking a look, @r-devulap. I'll rebase, address your comments, and investigate the performance when I have a chance over the next week or so. |
@r-devulap |
No worries. You can keep it open if you wish, I don't mind either way. |
The original implementation of the partial sorting algorithm was very basic: an initial$k$ ) is already sorted and what partitions had already been made during the first sweep.
avx512_qselect
pass followed byavx512_qsort
. This works, but unfortunately it can cause unnecessary comparisons and movement of data because the second sweep (usingqsort
) does not know what region (that surroundsThis PR changes the way
avx512_partial_qsort
works so that it can specialise its behaviour while recursing. Compared with main, there was an almost 6% improvement in the partial sorting related methods.Benchmarks were performed in the cloud on a VM with 8 vCPUs of an Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz.
Main vs PR Comparison
Originally there were additional commits that made changes that are now irrelevant. Originally, this PR made large changes to the operation of
avx512_qsort
, too, but it now focusses solely on specialisingavx512_partial_qsort
instead. There was a logical mistake that slipped through the test cases and had artificially improved the sorting performance.