Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pairwise summation floats #4303

Open
wants to merge 3 commits into
base: dev
Choose a base branch
from
Open

Conversation

robertbindar
Copy link
Contributor

@robertbindar robertbindar commented Aug 30, 2023

This adds a pairwise summation function for floating point types.


TYPE: FEATURE
DESC: Pairwise summation floats

@shortcut-integration
Copy link

This pull request has been linked to Shortcut Story #33201: Implement pairwise summation for floats.

Copy link
Contributor

@eric-hughes-tiledb eric-hughes-tiledb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better if this function were extensible to allow integral types. This is an interface issue, not a request for such code in the current PR. One goal would be to remove any explicit type dependency (such as log2f vs. log2l) from the unit tests and into a template of some sort.

Overflow would need to be dealt with. You'd also need a new equality operation (not operator==) for testing; it would admit the difference between exactly equal and approximately equal.

namespace tiledb::common {

template <class T, std::size_t Base = 128>
requires std::floating_point<T> T pairwise_sum(const std::span<T>& x) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs documentation, at minimum.

If I'm reading this correctly, this function is written in such a way as to allow the optimizer to generate vector instructions. If that's the purpose, on top of the two standard uses (1. main code and 2. unit tests) it has two additional uses: 3. Development performance testing and 4. Pre-deployment performance measurement.

Item 3: Development performance testing. This is the testing to validate that the particular algorithm actually generates optimal performance. This must always be measured, and always measured with respect to alternatives. The point about alternatives is that the interface between alternatives must be identical. This is hard to do with free functions, and is easier to do with classes. It's possible for a class to declare static operator() so that it looks just like a function.

Item 4: The size of cache lines varies between processors, and it's a parameter that a user can look up. It would be better if the Base argument were CacheLineSize or something. The maximum length of the loop can be computed as a constexpr. Doing this level of optimization will eventually need top-level configuration support, so it would be a good idea to declare a constexpr at the top of the file with the default used.

The name should first say what it does. In this case, because there need to be multiple versions for optimization, it should secondarily distinguish itself by how it works. I would call it IteratedSum_Recursive for this version in the PR at the time of this comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation, agree, I'll do that.
I wasn't aware of Item 3 and Item 4. Let's use this comment as basis for a meeting to draw the real requirements of this work.
Being cache line aware for the base case is definitely interesting and worth exploring. The numpy team for instance chose to go the manual loop unrolling direction as the first optimization and saw somewhere they claimed a 20% benefit 🤷 Very expensive to go down any direction for such simple function without a clear direction in mind and a benchmark on the table to test variations.

}
size_t mid = std::floor(x.size() / 2);
return pairwise_sum<T, Base>(x.first(mid)) +
pairwise_sum<T, Base>(x.subspan(mid));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recursion is going to slow down the operation, as is running a bunch of loops of variable length.

I'd do the following as a first pass implementation. N is the line length.

  • Validate that N is power of two. This is to allow the use of std::align in the next step.
  • Split the range into three parts, based on alignment. The first part is of length in the range [0,N). The second part is of length kN for some integer k > 0 and is aligned on memory boundary aligned with N. The third part is also of length in the range [0,N).
  • Loop over the parts in memory order, in three stages.
    • Use a fixed-length loop for the second stage. Fixed-length vector operations are a bit faster that variable-length ones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All sound good, I wouldn't go down this rabbit hole though without a benchmark in place to test the benefits of these optimizations.

@robertbindar
Copy link
Contributor Author

It would be better if this function were extensible to allow integral types. This is an interface issue, not a request for such code in the current PR. One goal would be to remove any explicit type dependency (such as log2f vs. log2l) from the unit tests and into a template of some sort.

@eric-hughes-tiledb I don't really understand why I would extend this to integral types, I didn't have much of a requirements list for this function, but I definitely saw that it's a floating point summation function, please let me know if I'm missing something.

Regarding type dependency, log2f and log2l I think can go away with c++23 when log2 gets added.

Overflow would need to be dealt with. You'd also need a new equality operation (not operator==) for testing; it would admit the difference between exactly equal and approximately equal.

Do we have plans to turn this into a floating point operations library of some sorts? This function was pitched to me like "1-2 hours of work", thus I added a simple pairwise summation implementation + a test.

@robertbindar
Copy link
Contributor Author

@eric-hughes-tiledb updated the code based on what we last discussed.
I have some comments/questions here:

  1. I adjusted generated vectors from 2^24 as we discussed to 2^22 as the test was quite slow.
  2. It's the first time I use catch2 benchmark, so I just put the small snippets in the unit test. Where are these benchmarks supposed to be placed? Do we want them to be run in the CI given they are quite slow?
  3. static operator() overloads need to wait till c++23

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants