-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pairwise summation floats #4303
base: dev
Are you sure you want to change the base?
Conversation
This pull request has been linked to Shortcut Story #33201: Implement pairwise summation for floats. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better if this function were extensible to allow integral types. This is an interface issue, not a request for such code in the current PR. One goal would be to remove any explicit type dependency (such as log2f
vs. log2l
) from the unit tests and into a template of some sort.
Overflow would need to be dealt with. You'd also need a new equality operation (not operator==
) for testing; it would admit the difference between exactly equal and approximately equal.
tiledb/common/pairwise_sum.h
Outdated
namespace tiledb::common { | ||
|
||
template <class T, std::size_t Base = 128> | ||
requires std::floating_point<T> T pairwise_sum(const std::span<T>& x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs documentation, at minimum.
If I'm reading this correctly, this function is written in such a way as to allow the optimizer to generate vector instructions. If that's the purpose, on top of the two standard uses (1. main code and 2. unit tests) it has two additional uses: 3. Development performance testing and 4. Pre-deployment performance measurement.
Item 3: Development performance testing. This is the testing to validate that the particular algorithm actually generates optimal performance. This must always be measured, and always measured with respect to alternatives. The point about alternatives is that the interface between alternatives must be identical. This is hard to do with free functions, and is easier to do with classes. It's possible for a class to declare static operator()
so that it looks just like a function.
Item 4: The size of cache lines varies between processors, and it's a parameter that a user can look up. It would be better if the Base
argument were CacheLineSize
or something. The maximum length of the loop can be computed as a constexpr
. Doing this level of optimization will eventually need top-level configuration support, so it would be a good idea to declare a constexpr
at the top of the file with the default used.
The name should first say what it does. In this case, because there need to be multiple versions for optimization, it should secondarily distinguish itself by how it works. I would call it IteratedSum_Recursive
for this version in the PR at the time of this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation, agree, I'll do that.
I wasn't aware of Item 3 and Item 4. Let's use this comment as basis for a meeting to draw the real requirements of this work.
Being cache line aware for the base case is definitely interesting and worth exploring. The numpy team for instance chose to go the manual loop unrolling direction as the first optimization and saw somewhere they claimed a 20% benefit 🤷 Very expensive to go down any direction for such simple function without a clear direction in mind and a benchmark on the table to test variations.
tiledb/common/pairwise_sum.h
Outdated
} | ||
size_t mid = std::floor(x.size() / 2); | ||
return pairwise_sum<T, Base>(x.first(mid)) + | ||
pairwise_sum<T, Base>(x.subspan(mid)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recursion is going to slow down the operation, as is running a bunch of loops of variable length.
I'd do the following as a first pass implementation. N
is the line length.
- Validate that
N
is power of two. This is to allow the use ofstd::align
in the next step. - Split the range into three parts, based on alignment. The first part is of length in the range
[0,N)
. The second part is of lengthkN
for some integerk > 0
and is aligned on memory boundary aligned withN
. The third part is also of length in the range[0,N)
. - Loop over the parts in memory order, in three stages.
- Use a fixed-length loop for the second stage. Fixed-length vector operations are a bit faster that variable-length ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All sound good, I wouldn't go down this rabbit hole though without a benchmark in place to test the benefits of these optimizations.
@eric-hughes-tiledb I don't really understand why I would extend this to integral types, I didn't have much of a requirements list for this function, but I definitely saw that it's a floating point summation function, please let me know if I'm missing something. Regarding type dependency,
Do we have plans to turn this into a floating point operations library of some sorts? This function was pitched to me like "1-2 hours of work", thus I added a simple pairwise summation implementation + a test. |
@eric-hughes-tiledb updated the code based on what we last discussed.
|
This adds a pairwise summation function for floating point types.
TYPE: FEATURE
DESC: Pairwise summation floats