-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recursive interpolation #102
base: main
Are you sure you want to change the base?
Conversation
cebc8e6
to
8ac909c
Compare
… type (but so far only T).
@Yurlungur, I give about a 50% chance that this solution is controversial. But it's a way to clean up all the nasty equations that span 30ish lines. It also extends the ability to use an index in place of a coordinate so that it can be applied to any dimension instead of only the hard-coded ones (assuming one uses the new interface rather than the existing interface that I left in place so that we discuss all the details that go into this change). |
It may be that this fulfills your desire for
and that becomes simply looking up the value with those indices. If the user instead does
then that's an interpolation. And they can mix and match as well, such as
Of course, one could argue that this becomes unclear and hard to understand the difference between interpolating and indexing, but maybe not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for building this @BrendanKKrueger ! I think this is a nice solution to the hand-coded mess. That said, I do have some performance concerns, listed below. I don't know if I have the time to do performance testing, and I'm reluctant to ask you to do it... but I think that's the point we should consider. It's very important that DataBox
is as performant as we can make it. Code cleanliness comes second vs that goal.
// Note: grids are in reverse order relative to arguments | ||
append_index_and_weights(iwlist + 1, grid - 1, other_args...); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inverting the order here is a little frightening... we might want some kind of (debug enabled) sanity check that the grid array is the same size as the arg list when we call this for the first time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, something has to be inverted:
- The order of the arguments is chosen to reflect the order of the interpolation recursion
- The order of
grids_
is reversed relative to the order of the arguments - The
append_index_and_weights
method could recurse in the opposite order as the interpolation, which would mean that the loop over grids wouldn't be backwards. But then either we're fillingiwlist
back to front or we're readingiwlist
back to front during the interpolation recursion.
I was assuming that assert(canInterpToReal_(N));
would do the kind of sanity checking that you're looking for. If not, what sort of check(s) would you consider sufficient here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly related: When filling iwlist
, I skip axes where the argument is already an index, because we don't need to store another copy of the same index and the weights are implicitly 1 and 0, so iwlist
is redundant for those axes. However, because I declare iwlist
as having length N
, the memory is already allocated. After thinking about it, I think that this is just one extra thing to remember and we're better off just using iwlist
consistently regardless of whether the argument is an index or a coordinate. This may be relevant if we change how iwlist
is filled (e.g., if we fill it back-to-front). So I'm changing that behavior to just always use iwlist
consistently.
spiner/databox.hpp
Outdated
const T v0 = interp_core( | ||
[&c = callable, i = current.index](auto... ii) { return c(i, ii...); }, | ||
iwlist + 1, other_args...); | ||
const T v1 = interp_core([&c = callable, i = current.index + 1]( | ||
auto... ii) { return c(i, ii...); }, | ||
iwlist + 1, other_args...); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit concerned that this recursion where you hide more and more index data inside closures might have performance implications over the gross hand-implemented stuff. I think it would be good to make sure this recursive treatment doesn't have performance implications.
We might also be able to avoid the nested closures though. Perhaps the actual interpolation rule could be expressed as a fold expression or something similar?
The reason for my concern here is when I was first writing this code, i found that the order of operations in those complicated interpolation routines mattered a lot for performance. And IMO spiner is a low level thing, so it's important it's performant even if it's ugly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced the nested closures with rotating the arguments and replacing them by indices so that after a single full rotation you have all the indices you need as arguments, and then you essentially just call dataView_(args...)
. I think it's a lot cleaner without the nested closures (and they were really bothering me, too), but take a look and see what you think.
spiner/databox.hpp
Outdated
return interp_core( | ||
[&c = callable, i = index](auto... ii) { return c(i, ii...); }, | ||
iwlist, // we didn't use iwlist so don't advance the pointer | ||
other_args...); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah clever. Hard to reason about, but clever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, "Hard to reason about, but clever," is very likely to be bad code. What part(s) do you find hard to reason about? I've cleaned up some things, but I'm still looking for ways to improve this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand it, here's where you add in the ability to mix in integer indexes, right? It was hard to chase down what bit of code was doing that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so a bit more clarifying documentation (in this case, probably code comments) seems called for. I'll add that. If you find anything unclear after I change that, we can continue this discussion to see how to improve readability and maintainability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Also---I do think we should do some performance testing before you dive too deep into this. I'm still concerned about the nested closures (i.e., nested lambdas).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After playing with it some more, I'm even more convinced that test/benchmark.cpp
isn't really reporting what we think it's reporting. I modified the test to do extra work (instead of just sampling one point in each GPU kernel, sample 3^3 points), and most of the times went down. So I think the per-kernel work is essentially negligible, and we're mostly timing random variations in something like transfer time to the GPU, kernel launch time, synchronization time of the reduction, or something along those lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I cranked it up to about 5,000 evaluations per GPU kernel. I'll put the actual data behind a fold (if GitHub will let me), but the punchline is: at this point the timings are more consistent, so we may be able to say something meaningful (but I'm still not sure, so take this with a healthy dose of skepticism). In this case, the recursive method is almost always faster than the hand-coded method. Occasionally the hand-coded method ends up faster, but only by a very small factor. The hand-coded method is usually slower by anywhere from a few percent to almost a factor of two.
runtime data
bkkrueger@cn136~> /vast/home/bkkrueger/spiner/worktrees/bkk_interpToReal/build/volta-x86/test/benchmark2.bin 1000 10000 100000 1000000
# ncoarse = 1000
# interpToReal 3D (hand-coded method)
# nfine time/point (ns) L2 error
10000 1.308670e-07 0.000000e+00
100000 1.975500e-11 0.000000e+00
1000000 1.798200e-14 0.000000e+00
# interpolate 3D (recursive method)
# nfine time/point (ns) L2 error
10000 7.284700e-08 0.000000e+00
100000 1.908000e-11 0.000000e+00
1000000 1.735100e-14 0.000000e+00
bkkrueger@cn136~> /vast/home/bkkrueger/spiner/worktrees/bkk_interpToReal/build/volta-x86/test/benchmark2.bin 1000 10000 100000 1000000
# ncoarse = 1000
# interpToReal 3D (hand-coded method)
# nfine time/point (ns) L2 error
10000 1.316370e-07 0.000000e+00
100000 1.954000e-11 0.000000e+00
1000000 2.921400e-14 0.000000e+00
# interpolate 3D (recursive method)
# nfine time/point (ns) L2 error
10000 1.002250e-07 0.000000e+00
100000 1.866700e-11 0.000000e+00
1000000 1.754400e-14 0.000000e+00
bkkrueger@cn136~> /vast/home/bkkrueger/spiner/worktrees/bkk_interpToReal/build/volta-x86/test/benchmark2.bin 1000 10000 100000 1000000
# ncoarse = 1000
# interpToReal 3D (hand-coded method)
# nfine time/point (ns) L2 error
10000 1.229280e-07 0.000000e+00
100000 1.953800e-11 0.000000e+00
1000000 2.972000e-14 0.000000e+00
# interpolate 3D (recursive method)
# nfine time/point (ns) L2 error
10000 1.002670e-07 0.000000e+00
100000 2.033900e-11 0.000000e+00
1000000 1.838300e-14 0.000000e+00
bkkrueger@cn136~> /vast/home/bkkrueger/spiner/worktrees/bkk_interpToReal/build/volta-x86/test/benchmark2.bin 1000 10000 100000 1000000
# ncoarse = 1000
# interpToReal 3D (hand-coded method)
# nfine time/point (ns) L2 error
10000 1.263160e-07 0.000000e+00
100000 1.975700e-11 0.000000e+00
1000000 1.807400e-14 0.000000e+00
# interpolate 3D (recursive method)
# nfine time/point (ns) L2 error
10000 1.003250e-07 0.000000e+00
100000 1.905700e-11 0.000000e+00
1000000 1.900200e-14 0.000000e+00
bkkrueger@cn136~> /vast/home/bkkrueger/spiner/worktrees/bkk_interpToReal/build/volta-x86/test/benchmark2.bin 1000 10000 100000 1000000
# ncoarse = 1000
# interpToReal 3D (hand-coded method)
# nfine time/point (ns) L2 error
10000 1.279110e-07 0.000000e+00
100000 1.978200e-11 0.000000e+00
1000000 1.831300e-14 0.000000e+00
# interpolate 3D (recursive method)
# nfine time/point (ns) L2 error
10000 1.046090e-07 0.000000e+00
100000 1.957700e-11 0.000000e+00
1000000 1.833200e-14 0.000000e+00
bkkrueger@cn136~> /vast/home/bkkrueger/spiner/worktrees/bkk_interpToReal/build/volta-x86/test/benchmark2.bin 1000 10000 100000 1000000
# ncoarse = 1000
# interpToReal 3D (hand-coded method)
# nfine time/point (ns) L2 error
10000 1.363300e-07 0.000000e+00
100000 1.942000e-11 0.000000e+00
1000000 3.694600e-14 0.000000e+00
# interpolate 3D (recursive method)
# nfine time/point (ns) L2 error
10000 9.183900e-08 0.000000e+00
100000 1.946100e-11 0.000000e+00
1000000 1.773400e-14 0.000000e+00
bkkrueger@cn136~> /vast/home/bkkrueger/spiner/worktrees/bkk_interpToReal/build/volta-x86/test/benchmark2.bin 1000 10000 100000 1000000
# ncoarse = 1000
# interpToReal 3D (hand-coded method)
# nfine time/point (ns) L2 error
10000 1.227620e-07 0.000000e+00
100000 1.960800e-11 0.000000e+00
1000000 1.834800e-14 0.000000e+00
# interpolate 3D (recursive method)
# nfine time/point (ns) L2 error
10000 9.942100e-08 0.000000e+00
100000 1.910700e-11 0.000000e+00
1000000 1.716100e-14 0.000000e+00
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm satisfied with this. Thanks for chasing this down. When we wrote the paper, @dholladay00 and I also used some instrumentation... @dholladay00 do you know where that code went?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't recall this, what do you mean by instrumentation code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we had a benchmarking tool... as well as nvprov. I know we had to scale it way up from the test that's in the repo. But maybe it's just that test scaled way up...
spiner/databox.hpp
Outdated
// (2) I think it would be worth testing if DataBox works for other | ||
// (non-real) types, such as std::complex. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree and I think there's no reason it shouldn't work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suspicion is that it might work, but that it's very likely for there to be improvements we could make. For example, the weights are declared to be type T
, but in practice they're always real. Similarly, I think DataBox
may compute the x values as type T
when they're also real. It's possible that may push the design towards something like adding a template for the independent variable's type and have it default to the dependent variable's type. Existing code would hopefully continue to work as-is, but people wanting to leverage newer features (like improved support for std::complex
) would be able to say "the dependent variable is std::complex<double>
but the independent variables are double
.
…a typo fix that already existed).
…ture removal of "interpToReal".
PR Summary
Recursive implementation of
interpToReal
to avoid the big, nasty formulas.PR Checklist