Horner's rule for polynomial evaluation with symmetry idea discussed in discussions #461 #477
Conversation
I wouldn't go to the trouble of exploiting this when kernel support is less than the SIMD vector length. Overall I haven't benchmarked this component much, since it is almost always completely dominated by the application of the kernel to the uniform data (kernel generation is
Yes, it mostly matters in 1D, and I mostly test the speed in 1D. Without the symmetry trick, the original code on AMD Genoa in my test takes about 2s total for 1D spreading with width 15; method 2 improves this to 1.8s and method 1 to 1.6s. That seems to be a 10% difference, which is why I'm interested in understanding the aligned store, and I'm wondering whether you, @mreineck, see the 10% difference with the aligned store on an AMD Genoa CPU. I don't see this difference on my old Intel Xeon CPU. It seems that unaligned and aligned stores only differ in speed when the unaligned store ends up touching two cache lines? The measurements use a single core with a single thread. I like
Sorry, I only have access to slightly older hardware here and have never tested the code on AVX512, which is what you have on Genoa, as far as I can see. I fear that doing really exhaustive tests (Intel/AMD, AVX, AVX2, AVX512, single-thread/parallel, etc.) will not show a clear winner in all cases, and having many separate code paths for different architectures is bad for maintainability, so I personally try not to spend too much time on this kind of detail.
Yes, that's why I want to choose only one of the two methods to merge into master.
src/spreadinterp.cpp
Outdated
// process simd vecs
for (uint8_t i = 0; i < n_eval; i += simd_size) {
  auto k_odd = if_odd_degree ? simd_type::load_aligned(padded_coeffs[0].data() + i) : zerov;
this should be a constexpr lambda.
@lu1and10 Libin, are we ready to merge this?
I'm doing a rerun of your performance script since the master branch changed; do you want to double check and run your Julia performance script to make sure everything is OK?
@ahbarnett following is the new benchmark with the Julia bench script after master merged the interp changes and the new compiler flags. Compared to the plots above, for AVX2 the spreader speed in the master branch and this PR is no longer linear; for AVX512, it seems to be less impacted by the compiler flags.
src/spreadinterp.cpp
Outdated
simd_type k_odd, k_even, k_prev, k_sym = zerov;
for (uint8_t i = 0, offset = w - tail; i < (w + 1) / 2;
     i += simd_size, offset -= simd_size) {
  k_odd = if_odd_degree ? simd_type::load_aligned(padded_coeffs[0].data() + i)
this should be:
k_odd = [&]() constexpr noexcept {
  if constexpr (if_odd_degree) {
    return simd_type::load_aligned(padded_coeffs[0].data() + i);
  } else {
    return simd_type{0};
  }
}();
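(With if constexpr the untaken branch is discarded at compile time, so when the coefficients are loaded the zero-vector fallback never needs to exist at all, rather than relying on the optimizer to fold a runtime ternary on a compile-time constant.)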
src/spreadinterp.cpp
Outdated
const auto cji_odd = simd_type::load_aligned(padded_coeffs[j].data() + i);
k_odd = xsimd::fma(k_odd, z2v, cji_odd);
const auto cji_even =
    simd_type::load_aligned(padded_coeffs[j + 1].data() + i);
is [j+1] safe here? The loop goes up to nc; shouldn't it go up to (nc-1) or (nc & ~1)?
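For reference, a throwaway check of that concern (not part of the PR; it just mirrors the quoted loop bounds and the definition if_odd_degree = (nc + 1) % 2):

#include <cstdio>

int main() {
  for (int nc = 2; nc <= 16; ++nc) {
    const int if_odd_degree = (nc + 1) % 2;   // as in the snippet above
    int max_index = -1;                       // highest coefficient row touched
    for (int j = 1 + if_odd_degree; j < nc; j += 2)
      max_index = j + 1;                      // the padded_coeffs[j + 1] access
    if (max_index < 0)
      std::printf("nc=%2d: inner loop never runs\n", nc);
    else
      std::printf("nc=%2d: highest index %2d (%s)\n", nc, max_index,
                  max_index < nc ? "in range" : "OUT OF RANGE");
  }
}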
src/spreadinterp.cpp
Outdated
if constexpr (use_ker_sym) {
  static constexpr uint8_t tail = w % simd_size;
  static constexpr uint8_t if_odd_degree = ((nc + 1) % 2);
  static const simd_type zerov(0.0);
in my experiments returning {0} is often faster than saving it as a static const
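One way to read the two alternatives being compared, as an illustrative sketch with assumed names (the PR keeps the static const variant at this point):

#include <xsimd/xsimd.hpp>

using simd_type = xsimd::batch<double>;

// Zero vector cached in a function-local static const, as in the snippet
// above: the first call pays an initialization guard, and later uses may
// come back as loads from memory.
simd_type zero_from_static() {
  static const simd_type zerov(0.0);
  return zerov;
}

// Zero vector constructed at the point of use, as suggested in the review:
// the compiler can usually lower this to a register-zeroing idiom with no
// load and no guard.
simd_type zero_inline() { return simd_type(0); }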
src/spreadinterp.cpp
Outdated
static constexpr auto reverse_batch =
    xsimd::make_batch_constant<xsimd::as_unsigned_integer_t<FLT>, arch_t,
                               reverse_index<simd_size>>();

// process simd vecs
for (uint8_t i = 0, offset = w - simd_size; i < w / 2;
     i += simd_size, offset -= simd_size) {
  auto k_odd = if_odd_degree
                   ? simd_type::load_aligned(padded_coeffs[0].data() + i)
                   : zerov;
  auto k_even = simd_type::load_aligned(padded_coeffs[if_odd_degree].data() + i);
  for (uint8_t j = 1 + if_odd_degree; j < nc; j += 2) {
    const auto cji_odd = simd_type::load_aligned(padded_coeffs[j].data() + i);
    k_odd = xsimd::fma(k_odd, z2v, cji_odd);
    const auto cji_even =
        simd_type::load_aligned(padded_coeffs[j + 1].data() + i);
    k_even = xsimd::fma(k_even, z2v, cji_even);
  }
  // left part
  xsimd::fma(k_odd, zv, k_even).store_aligned(ker + i);
  // right part symmetric to the left part
  if (offset >= w / 2) {
    // reverse the order for symmetric part
    xsimd::swizzle(xsimd::fma(k_odd, -zv, k_even), reverse_batch)
        .store_aligned(ker + offset);
  }
}
}
} else {
const simd_type zv(z);

for (uint8_t i = 0; i < w; i += simd_size) {
  auto k = simd_type::load_aligned(padded_coeffs[0].data() + i);
  for (uint8_t j = 1; j < nc; ++j) {
    const auto cji = simd_type::load_aligned(padded_coeffs[j].data() + i);
    k = xsimd::fma(k, zv, cji);
  }
  k.store_aligned(ker + i);
There is quite a lot of repetition; I think you could do something like w/2 + (tail > 0) to merge most of the code.
Could you give more of a hint about what you're suggesting here? Or maybe Libin sees it...
@lu1and10 it would be great to incorporate Marco's review and have this merged for the meeting tomorrow...
I did the changes I understand; there are some reviews I don't understand, especially the possibly unsafe loop @DiamonDinoia mentioned (#477 (comment)). We should discuss it tomorrow; if it is indeed unsafe then there is a bug.
great - maybe you can sort it out tomorrow am.
Thanks, Alex
Oh no, @DiamonDinoia, could you please keep the commit history when merging this PR to master? There are some commits I want to try later and keep in master (for example Martin's unaligned store version). In PR #429, it seems more commits could have been kept but were not, given what is now in master. I have been careful with the commits, keeping them as few as possible, so that they don't need to be squashed.
OK, @DiamonDinoia. Done in #492 and #493, no need to worry about this, thanks.
Following @mreineck's idea in discussion #461, this PR uses the symmetry trick with SIMD evaluation for widths greater than the SIMD vector length.
For widths less than or equal to the SIMD vector length, one may need to batch non-uniform points and do the polynomial evaluations as in PR #467: this PR does 1 x w polynomial kernel evaluations, while in PR #467 one may do 2 x w, 4 x w, etc. polynomial evaluations (2, 4 being the number of non-uniform points batched per polynomial evaluation).
Interleaving the 2D/3D polynomial evaluations to save register loads of coefficients (discussed in #461) is not done in this PR.
This PR only tries out the symmetry trick on top of the current master branch code.
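As a rough scalar illustration of the even/odd split behind the symmetry trick (this is not the PR code; the coefficient layout and names below are assumptions): each of the w output positions has its own Horner coefficients but is evaluated at the same offset z, and the mirrored position w-1-i is the same polynomial evaluated at -z, so one pass over the left-half coefficients yields both halves.

#include <vector>

// coeffs[j][i] holds the degree-(nc-1-j) coefficient for output position i,
// for the left-half positions i < (w+1)/2 (assumed layout).
void horner_symmetric(double z, int w, int nc,
                      const std::vector<std::vector<double>> &coeffs,
                      double *ker) {
  const double z2 = z * z;
  const bool highest_odd = (nc % 2 == 0);      // leading term has odd degree
  for (int i = 0; i < (w + 1) / 2; ++i) {
    double k_odd  = highest_odd ? coeffs[0][i] : 0.0;
    double k_even = coeffs[highest_odd ? 1 : 0][i];
    for (int j = (highest_odd ? 2 : 1); j < nc; j += 2) {
      k_odd  = k_odd * z2 + coeffs[j][i];      // odd powers, Horner in z^2
      k_even = k_even * z2 + coeffs[j + 1][i]; // even powers, Horner in z^2
    }
    ker[i] = k_odd * z + k_even;               // left half
    if (i != w - 1 - i)                        // mirrored right half: same poly at -z
      ker[w - 1 - i] = -k_odd * z + k_even;
  }
}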
There are two versions to test out:
1. eval_kernel_vec_Horner utilizes the aligned store, which needs some SIMD shuffles of two SIMD vecs before the aligned store.
2. eval_kernel_vec_Horner_unaligned_store, which comes from @mreineck's code and does the SIMD vec calculations and then loops over the SIMD vec elements for storing.
In my limited tests, version 1 and version 2 are of similar speed in most cases, while on the AMD Genoa CPU nodes version 1 seems to be faster than version 2. @mreineck do you have any idea about the speed of the aligned store from when you implemented the symmetry trick?
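For context on the difference between the two stores, a minimal sketch of both strategies for writing the mirrored half (illustrative only; reverse_index here mirrors the idea in the PR diff but its exact definition may differ, and the xsimd signatures follow the ones quoted above):

#include <cstddef>
#include <xsimd/xsimd.hpp>

using arch_t    = xsimd::default_arch;
using simd_type = xsimd::batch<double, arch_t>;
static constexpr std::size_t simd_size = simd_type::size;

// Lane-reversing index generator in the spirit of the PR's reverse_index.
template <std::size_t N> struct reverse_index {
  static constexpr unsigned get(unsigned i, unsigned) { return N - 1 - i; }
};

// Version 1 flavour: reverse the lanes in-register with a swizzle, then do a
// single aligned store (dst assumed aligned to the SIMD width).
inline void store_mirrored_aligned(const simd_type &v, double *dst) {
  static constexpr auto reverse_batch =
      xsimd::make_batch_constant<xsimd::as_unsigned_integer_t<double>, arch_t,
                                 reverse_index<simd_size>>();
  xsimd::swizzle(v, reverse_batch).store_aligned(dst);
}

// Version 2 flavour: keep the vector as computed and write its elements one
// by one in reverse order, with no alignment requirement on dst.
inline void store_mirrored_scalar(const simd_type &v, double *dst) {
  for (std::size_t l = 0; l < simd_size; ++l)
    dst[simd_size - 1 - l] = v.get(l);
}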