
Horner's rule for polynomial evaluation with symmetry idea discussed in discussions #461 #477

Merged: 20 commits merged into flatironinstitute:master on Jul 17, 2024

Conversation

lu1and10 (Member) commented Jul 8, 2024

Following @mreineck's idea in discussion #461, this PR uses the symmetry trick with SIMD evaluation for kernel widths greater than the SIMD vector length.

For widths less than or equal to the SIMD vector length, one may need to batch non-uniform points and do polynomial evaluations as in PR #467. This PR does 1 x w polynomial kernel evaluations; in PR #467 one may do 2 x w, 4 x w, etc. polynomial evaluations (2 and 4 being the number of non-uniform points batched per polynomial evaluation).
Interleaving 2D/3D polynomial evaluations to save register loads of coefficients (discussed in #461) is not done in this PR.
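As an illustration of the symmetry trick, here is a minimal scalar sketch (illustration only, not code from this PR; it assumes coefficients are stored highest degree first, with c[j][i] the degree-(nc-1-j) coefficient of node i, and that by kernel symmetry node w-1-i uses node i's polynomial evaluated at -z):

```cpp
// Sketch only: scalar version of the even/odd symmetry trick. Each of the w
// nodes has its own degree-(nc-1) polynomial in the same argument z; by
// symmetry the polynomial of node w-1-i equals that of node i with z negated,
// so splitting into even/odd parts E(z^2) and O(z^2) gives both values:
//   ker[i]     = E + z*O
//   ker[w-1-i] = E - z*O
#include <array>
#include <cstddef>

template <std::size_t w, std::size_t nc>
void horner_symmetric(double z, const std::array<std::array<double, w>, nc> &c,
                      double *ker) {
  const double z2 = z * z;
  for (std::size_t i = 0; i < (w + 1) / 2; ++i) {
    double even = 0.0, odd = 0.0;
    for (std::size_t j = 0; j < nc; ++j) {
      const std::size_t degree = nc - 1 - j; // coefficients stored highest degree first
      if (degree % 2 == 0)
        even = even * z2 + c[j][i];          // Horner in z^2 over even-degree terms
      else
        odd = odd * z2 + c[j][i];            // Horner in z^2 over odd-degree terms
    }
    ker[i] = even + z * odd;                 // left half of the kernel
    ker[w - 1 - i] = even - z * odd;         // mirrored right half
    // For odd w the middle node is written twice; its odd part is zero by symmetry.
  }
}
```

The SIMD code in this PR does the same split across a whole vector of node indices i at once (k_even, k_odd, and fma with ±zv); the two versions below differ only in how the mirrored half is written back to ker.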

This PR only tries out the symmetry trick on the current master branch code.
There are two versions to test out:

  1. eval_kernel_vec_Horner uses aligned stores, which requires some SIMD shuffles of two SIMD vectors before each aligned store.
  2. eval_kernel_vec_Horner_unaligned_store comes from @mreineck's code; it does the SIMD vector calculations and then loops over the vector elements to store them (see the sketch below contrasting the two store strategies).
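To illustrate the difference between the two store strategies (a sketch only, not the PR code: a std::array stands in for one SIMD vector, and in the real version 1 the shuffle may combine lanes of two vectors so that every store lands on an aligned boundary):

```cpp
// Sketch only: two ways of writing out the mirrored (right-hand) half of the
// kernel, using std::array as a stand-in for a SIMD register.
#include <array>
#include <cstddef>

constexpr std::size_t simd_size = 8;
using vec = std::array<double, simd_size>; // stand-in for one SIMD vector

// Version 2 style (per-element store): keep the vector as computed and store
// its lanes one by one, in reverse order, at the mirrored offset. Each lane
// write may be unaligned relative to the vector width.
void store_reversed_scalar(const vec &v, double *ker, std::size_t offset) {
  for (std::size_t l = 0; l < simd_size; ++l)
    ker[offset + l] = v[simd_size - 1 - l];
}

// Version 1 style (aligned store): first reverse the lanes inside the
// "register", then write the whole vector at once. In the PR's SIMD code the
// reversal is a single register shuffle (xsimd::swizzle with a reverse index)
// and the write-out is a single aligned vector store; here plain loops stand
// in for both.
void store_reversed_shuffled(const vec &v, double *ker, std::size_t offset) {
  vec reversed;
  for (std::size_t l = 0; l < simd_size; ++l)
    reversed[l] = v[simd_size - 1 - l];
  for (std::size_t l = 0; l < simd_size; ++l)
    ker[offset + l] = reversed[l];
}
```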

In my limited tests, versions 1 and 2 are of similar speed in most cases, while on the AMD Genoa CPU nodes version 1 seems to be faster than version 2. @mreineck, do you have any idea about the speed of aligned stores from when you implemented the symmetry trick?

mreineck (Collaborator) commented Jul 8, 2024

I wouldn't go to the trouble of exploiting this when kernel support is less than the SIMD vector length.

Overall I haven't benchmarked this component much, since it is almost always completely dominated by the application of the kernel to the uniform data (kernel generation is O(support*ndim), spreading/interpolation is O(support**ndim)). If it is noticeable at all, it will be for 1D transforms. The only real reason I implemented this is that it reduces the size of the Horner coefficient array.
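For a concrete sense of those cost counts (illustrative arithmetic only, using a hypothetical width w = 8):

```cpp
// Illustrative only: per-nonuniform-point work for a hypothetical width w = 8.
// Kernel (Horner) evaluation scales like w*ndim, while applying the kernel to
// the uniform grid scales like w^ndim, so the kernel evaluation is only a
// comparable fraction of the total in 1D.
#include <cmath>
#include <cstdio>

int main() {
  const int w = 8;
  for (int ndim = 1; ndim <= 3; ++ndim)
    std::printf("ndim=%d: kernel eval ~ %d values, spread/interp ~ %.0f grid points\n",
                ndim, w * ndim, std::pow(double(w), ndim));
  return 0;
}
```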

lu1and10 (Member, Author) commented Jul 8, 2024

> I wouldn't go to the trouble of exploiting this when kernel support is less than the SIMD vector length.
>
> Overall I haven't benchmarked this component much, since it is almost always completely dominated by the application of the kernel to the uniform data (kernel generation is O(support*ndim), spreading/interpolation is O(support**ndim)). If it is noticeable at all, it will be for 1D transforms. The only real reason I implemented this is that it reduces the size of the Horner coefficient array.

Yes, it mostly matters in 1D, and I mostly test the speed in 1D. Without the symmetry trick, the original code's total time for 1D spreading at width 15 on AMD Genoa is around 2 s in my test; method 2 improves that to 1.8 s and method 1 to 1.6 s. That is roughly a 10% difference, which is why I'm interested in understanding the aligned store and wondering whether @mreineck sees the same 10% difference with aligned stores on an AMD Genoa CPU. I don't see this difference on my old Intel Xeon CPU. It seems that unaligned and aligned stores only differ in speed when the unaligned store ends up touching two cache lines? The measurement uses a single core with a single thread. I like eval_kernel_vec_Horner_unaligned_store more, it's neat, but I'm a bit bothered by the 10% difference in 1D on the AMD Genoa CPU.
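A small standalone check of that cache-line intuition (a sketch assuming 64-byte cache lines; not part of the benchmark above):

```cpp
// Sketch only: with 64-byte cache lines (assumed), a 64-byte (AVX-512 double)
// store crosses a line boundary for every misaligned 8-byte offset, while a
// 32-byte (AVX2 double) store crosses one only for some offsets. This is one
// possible reason aligned vs. unaligned stores matter more on wider vectors.
#include <cstddef>
#include <cstdio>

bool crosses_cache_line(std::size_t byte_offset, std::size_t store_bytes,
                        std::size_t line_bytes = 64) {
  return (byte_offset % line_bytes) + store_bytes > line_bytes;
}

int main() {
  for (std::size_t off = 0; off < 64; off += 8)
    std::printf("offset %2zu B: 32 B store crosses line? %d   64 B store? %d\n",
                off, (int)crosses_cache_line(off, 32), (int)crosses_cache_line(off, 64));
  return 0;
}
```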

mreineck (Collaborator) commented Jul 8, 2024

Sorry, I only have access to slightly older hardware here and have never tested the code on AVX512, which is what you have on Genoa, as far as I can see.

I fear that doing really exhaustive tests (Intel/AMD, AVX, AVX2, AVX512, single-thread/parallel, etc.) will not show a clear winner in all cases, and having many separate code paths for different architectures is bad for maintainability, so I personally try not to spend too much time on this kind of detail.

lu1and10 (Member, Author) commented Jul 8, 2024

> and having many separate code paths for different architectures is bad for maintainability

Yes, that's why I want to choose only one of the two methods to merge into master.

lu1and10 (Member, Author) commented Jul 9, 2024

Genoa AVX512, gcc11, M1e7_N1e6:
[benchmark plots: genoa-gcc11_nthr1_spread_M1.0e7_N1.0e6, genoa-gcc11_nthr1_interp_M1.0e7_N1.0e6]

Rome AVX2, gcc11, M1e7_N1e6:
[benchmark plots: rome-gcc11_nthr1_spread_M1.0e7_N1.0e6, rome-gcc11_nthr1_interp_M1.0e7_N1.0e6]

DiamonDinoia self-requested a review on July 9, 2024 at 19:48

Review comment from a Collaborator on the diff:

```cpp
// process simd vecs
for (uint8_t i = 0; i < n_eval; i += simd_size) {
  auto k_odd = if_odd_degree ? simd_type::load_aligned(padded_coeffs[0].data() + i) : zerov;
```

this should be a constexpr lambda.

ahbarnett (Collaborator) commented Jul 11, 2024

@lu1and10 Libin, are we ready to merge this?

lu1and10 (Member, Author) replied:

> @lu1and10 Libin, are we ready to merge this?

I'm rerunning your performance script since the master branch has changed; do you want to double-check and run your Julia performance script to make sure everything is OK?

lu1and10 (Member, Author) commented Jul 11, 2024

@ahbarnett the following is the new benchmark with the Julia bench script, after master merged the interp PR and the new compiler flags.

Compared to the plots above,

For AVX2, the spreader speed in the master branch and in this PR is no longer linear in w, possibly because of the flags (the master branch spreader code did not change). With this PR there is a slowdown at w == 8 on Rome (AVX2), meaning that with the new flags the kernel symmetry does not help the AVX2 (4-double SIMD size) instructions at w == 8; it seems to be hurt by the new flags, since without them w == 8 is not slower than master. 1D interp seems to speed up more at large w with kernel symmetry after master merged the interp PR.

For AVX512, the results seem to be less affected by the compiler flags.

Genoa AVX512, gcc11, M1e7_N1e6:
[benchmark plots: genoa-gcc11_nthr1_spread_M1.0e7_N1.0e6, genoa-gcc11_nthr1_interp_M1.0e7_N1.0e6]

Rome AVX2, gcc11, M1e7_N1e6:
[benchmark plots: rome-gcc11_nthr1_spread_M1.0e7_N1.0e6, rome-gcc11_nthr1_interp_M1.0e7_N1.0e6]

Review comment from @DiamonDinoia (Collaborator), Jul 12, 2024, on the diff:

```cpp
simd_type k_odd, k_even, k_prev, k_sym = zerov;
for (uint8_t i = 0, offset = w - tail; i < (w + 1) / 2;
     i += simd_size, offset -= simd_size) {
  k_odd = if_odd_degree ? simd_type::load_aligned(padded_coeffs[0].data() + i)
```

this should be:

```cpp
k_odd = [i]() constexpr noexcept {
  if constexpr (if_odd_degree) {
    return simd_type::load_aligned(padded_coeffs[0].data() + i);
  } else
    return simd_type{0};
}();
```

Review comment from a Collaborator on the diff:

```cpp
const auto cji_odd = simd_type::load_aligned(padded_coeffs[j].data() + i);
k_odd = xsimd::fma(k_odd, z2v, cji_odd);
const auto cji_even =
    simd_type::load_aligned(padded_coeffs[j + 1].data() + i);
```

Is [j + 1] safe here? The loop goes up to nc; shouldn't it go up to (nc - 1) or (nc & ~1)?
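A tiny standalone check one could run (a sketch, using the definitions from this PR, if_odd_degree = (nc + 1) % 2 and the loop starting at j = 1 + if_odd_degree): since the starting j always has the same parity as nc, the last j visited is nc - 2, so the [j + 1] access stays at nc - 1.

```cpp
// Sketch only: enumerate the loop bounds for a range of nc values and confirm
// that the last j+1 accessed is always nc-1, i.e. within bounds.
#include <cstdio>

int main() {
  for (int nc = 3; nc <= 16; ++nc) {
    const int if_odd_degree = (nc + 1) % 2; // as defined in this PR
    int last_j = -1;
    for (int j = 1 + if_odd_degree; j < nc; j += 2) last_j = j;
    std::printf("nc=%2d: last j=%2d, last j+1=%2d, in bounds: %s\n", nc, last_j,
                last_j + 1, (last_j + 1 < nc) ? "yes" : "no");
  }
  return 0;
}
```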

(A further review thread on src/spreadinterp.cpp was marked outdated and resolved.)
Review comment from a Collaborator on the diff:

```cpp
if constexpr (use_ker_sym) {
  static constexpr uint8_t tail = w % simd_size;
  static constexpr uint8_t if_odd_degree = ((nc + 1) % 2);
  static const simd_type zerov(0.0);
```

In my experiments, returning {0} is often faster than saving it as a static const.

Review comment from a Collaborator on lines 834 to 871 of the diff:

```cpp
      static constexpr auto reverse_batch =
          xsimd::make_batch_constant<xsimd::as_unsigned_integer_t<FLT>, arch_t,
                                     reverse_index<simd_size>>();

      // process simd vecs
      for (uint8_t i = 0, offset = w - simd_size; i < w / 2;
           i += simd_size, offset -= simd_size) {
        auto k_odd = if_odd_degree
                         ? simd_type::load_aligned(padded_coeffs[0].data() + i)
                         : zerov;
        auto k_even = simd_type::load_aligned(padded_coeffs[if_odd_degree].data() + i);
        for (uint8_t j = 1 + if_odd_degree; j < nc; j += 2) {
          const auto cji_odd = simd_type::load_aligned(padded_coeffs[j].data() + i);
          k_odd = xsimd::fma(k_odd, z2v, cji_odd);
          const auto cji_even =
              simd_type::load_aligned(padded_coeffs[j + 1].data() + i);
          k_even = xsimd::fma(k_even, z2v, cji_even);
        }
        // left part
        xsimd::fma(k_odd, zv, k_even).store_aligned(ker + i);
        // right part symmetric to the left part
        if (offset >= w / 2) {
          // reverse the order for symmetric part
          xsimd::swizzle(xsimd::fma(k_odd, -zv, k_even), reverse_batch)
              .store_aligned(ker + offset);
        }
      }
    }
  } else {
    const simd_type zv(z);

    for (uint8_t i = 0; i < w; i += simd_size) {
      auto k = simd_type::load_aligned(padded_coeffs[0].data() + i);
      for (uint8_t j = 1; j < nc; ++j) {
        const auto cji = simd_type::load_aligned(padded_coeffs[j].data() + i);
        k = xsimd::fma(k, zv, cji);
      }
      k.store_aligned(ker + i);
```

There is quite a lot of repetition; I think you could do something like w / 2 + (tail > 0) to merge most of the code.

Reply from a Collaborator:

Could you give more of a hint about what you're suggesting here? Or maybe Libin sees it...

ahbarnett (Collaborator) commented:

@lu1and10 it would be great to incorporate Marco's review and have this merged for the meeting tomorrow...

lu1and10 (Member, Author) replied:

> @lu1and10 it would be great to incorporate Marco's review and have this merged for the meeting tomorrow...

I did the changes I understood. There are some review comments I don't understand, especially the possibly unsafe loop @DiamonDinoia mentioned (#477 (comment)); we should discuss it tomorrow, since if it is unsafe then there is a bug.

ahbarnett (Collaborator) commented Jul 16, 2024 via email.

DiamonDinoia mentioned this pull request on Jul 17, 2024.
DiamonDinoia added this to the 3.0 milestone on Jul 17, 2024.
DiamonDinoia merged commit 4ea0096 into flatironinstitute:master on Jul 17, 2024; 34 checks passed.
lu1and10 (Member, Author) commented Jul 18, 2024

Oh no, @DiamonDinoia, could you please keep the commit history when merging this PR to master? There are some commits I want to try later and keep in master (for example Martin's unaligned-store version). In PR #429, there also seem to be more commits that could have been merged but were not, given what is now in master. I have been careful with the commits, keeping them as few as possible, so that they don't need to be squashed.

lu1and10 (Member, Author) replied:

> Oh no, @DiamonDinoia, could you please keep the commit history when merging this PR to master? There are some commits I want to try later and keep in master (for example Martin's unaligned-store version). In PR #429, there also seem to be more commits that could have been merged but were not, given what is now in master. I have been careful with the commits, keeping them as few as possible, so that they don't need to be squashed.

OK @DiamonDinoia, done in #492 and #493, no need to worry about this, thanks.
