Recursive interpolation #102

Open · wants to merge 23 commits into main

Conversation

@BrendanKKrueger (Collaborator) commented Sep 30, 2024

PR Summary

Recursive implementation of interpToReal to avoid the big, nasty formulas.

PR Checklist

  • Code is formatted. (You can use the format_spiner make target.)
  • Adds a test for any bugs fixed. Adds tests for new features.
  • If preparing for a new release, update the version in cmake.

@BrendanKKrueger changed the title from "Bkk interp to real" to "interpToReal" on Sep 30, 2024
@BrendanKKrueger marked this pull request as ready for review on October 1, 2024 17:24
@BrendanKKrueger marked this pull request as draft on October 1, 2024 17:26
@BrendanKKrueger changed the title from "interpToReal" to "Recursive interpolation" on Oct 1, 2024
@BrendanKKrueger marked this pull request as ready for review on October 1, 2024 21:05
@BrendanKKrueger (Collaborator, Author):

@Yurlungur, I give about a 50% chance that this solution is controversial. But it's a way to clean up all the nasty equations that span 30-ish lines. It also extends the ability to use an index in place of a coordinate so that it can be applied to any dimension instead of only the hard-coded ones (assuming one uses the new interface rather than the existing interface, which I left in place so that we can discuss all the details that go into this change).

@BrendanKKrueger (Collaborator, Author):

> It also extends the ability to use an index in place of a coordinate so that it can be applied to any dimension instead of only the hard-coded ones

It may be that this fulfills your desire for DataBox to be used as a data array in addition to being used as an interpolator. I realize I'm proposing more sweeping changes, but this is just thinking about possible future evolution. If we rename interpToReal to operator() then the user can do something like

auto value = db(1,7,3);

and that becomes simply looking up the value with those indices. If the user instead does

auto value = db(1.0, 7.0, 3.0);

then that's an interpolation. And they can mix and match as well, such as

auto value = db(1.0, 7, 3.0);

Of course, one could argue that this makes the difference between interpolating and indexing unclear and hard to follow, but maybe not.
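
For illustration only (this is not code from the PR, and all names are hypothetical), the dispatch could be done with a C++17 if constexpr on the argument types, something like:

    #include <type_traits>

    // Hypothetical sketch of the proposed unified operator(): all-integer
    // arguments mean a plain index lookup, while any floating-point argument
    // triggers (possibly mixed) interpolation. Not Spiner's actual API.
    struct DataBoxSketch {
      template <typename... Args>
      double operator()(Args... args) const {
        if constexpr ((std::is_integral_v<std::decay_t<Args>> && ...)) {
          return lookup(args...);      // e.g. db(1, 7, 3)
        } else {
          return interpolate(args...); // e.g. db(1.0, 7.0, 3.0) or db(1.0, 7, 3.0)
        }
      }
      // Placeholders standing in for the real lookup and interpolation paths.
      template <typename... Args>
      double lookup(Args...) const { return 0.0; }
      template <typename... Args>
      double interpolate(Args...) const { return 0.0; }
    };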

@Yurlungur (Collaborator) left a comment:

Thanks for building this @BrendanKKrueger! I think this is a nice solution to the hand-coded mess. That said, I do have some performance concerns, listed below. I don't know if I have the time to do performance testing, and I'm reluctant to ask you to do it... but I think that's the point we should consider. It's very important that DataBox is as performant as we can make it. Code cleanliness comes second to that goal.

spiner/databox.hpp (review thread resolved; outdated)
Comment on lines +476 to +477
// Note: grids are in reverse order relative to arguments
append_index_and_weights(iwlist + 1, grid - 1, other_args...);
Collaborator:

inverting the order here is a little frightening... we might want some kind of (debug-enabled) sanity check that the grid array is the same size as the arg list when we call this for the first time.

Collaborator Author:

Unfortunately, something has to be inverted:

  • The order of the arguments is chosen to reflect the order of the interpolation recursion
  • The order of grids_ is reversed relative to the order of the arguments
  • The append_index_and_weights method could recurse in the opposite order from the interpolation, which would mean that the loop over grids wouldn't be backwards. But then either we're filling iwlist back to front or we're reading iwlist back to front during the interpolation recursion.

I was assuming that assert(canInterpToReal_(N)); would do the kind of sanity checking that you're looking for. If not, what sort of check(s) would you consider sufficient here?
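
For what it's worth, a debug-only check along these lines (hypothetical names; not code from the PR) could simply compare the argument count against the rank at the outermost call:

    #include <cassert>

    // Hypothetical sketch of a debug-only sanity check: on the outermost call,
    // verify that the number of interpolation arguments matches the number of
    // grids (the DataBox rank) before starting the recursion.
    template <typename... Args>
    void check_arg_count(int rank, Args&&...) {
      assert(static_cast<int>(sizeof...(Args)) == rank &&
             "argument count must match the DataBox rank");
      (void)rank; // avoid an unused-parameter warning when NDEBUG is defined
    }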

Collaborator Author:

Possibly related: when filling iwlist, I skip axes where the argument is already an index, because we don't need to store another copy of the same index and the weights are implicitly 1 and 0, so the iwlist entry is redundant for those axes. However, because I declare iwlist with length N, the memory is already allocated. After thinking about it, skipping those entries is just one extra thing to remember, and it may matter if we change how iwlist is filled (e.g., back to front). So I'm changing that behavior to always fill and use iwlist consistently, regardless of whether the argument is an index or a coordinate.
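
As a concrete illustration of the "weights are implicitly 1 and 0" point (hypothetical names; the PR's actual data structure may differ):

    // Hypothetical sketch: when an axis is given directly as an integer index,
    // its iwlist-style entry can still be filled uniformly, with trivial
    // weights, so the interpolation recursion treats every axis the same way.
    struct IndexAndWeights {
      int index;
      double w0; // weight on index     (1 for a direct index)
      double w1; // weight on index + 1 (0 for a direct index)
    };

    inline IndexAndWeights entry_for_index(int i) { return {i, 1.0, 0.0}; }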

spiner/databox.hpp (review thread resolved; outdated)
spiner/regular_grid_1d.hpp (review thread resolved; outdated)
Comment on lines 529 to 534
const T v0 = interp_core(
    [&c = callable, i = current.index](auto... ii) { return c(i, ii...); },
    iwlist + 1, other_args...);
const T v1 = interp_core(
    [&c = callable, i = current.index + 1](auto... ii) { return c(i, ii...); },
    iwlist + 1, other_args...);
Collaborator:

I am a bit concerned that this recursion, where you hide more and more index data inside closures, might have performance implications relative to the gross hand-implemented stuff. I think it would be good to make sure the recursive treatment doesn't hurt performance.

We might also be able to avoid the nested closures though. Perhaps the actual interpolation rule could be expressed as a fold expression or something similar?

The reason for my concern here is that when I was first writing this code, I found that the order of operations in those complicated interpolation routines mattered a lot for performance. And IMO spiner is a low-level thing, so it's important that it's performant even if it's ugly.
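
As a rough illustration of the fold-expression idea mentioned above (not code from the PR): the weight of any one corner of the multilinear stencil is just a product of per-axis weights, which a C++17 fold expression can write without nested closures.

    // Illustrative only: a C++17 fold expression giving the weight of one
    // corner of the interpolation stencil as the product of per-axis weights.
    template <typename... Ws>
    constexpr double corner_weight(Ws... per_axis_weights) {
      return (per_axis_weights * ...);
    }

    static_assert(corner_weight(0.5, 0.25, 1.0) == 0.125, "exact in binary");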

Collaborator Author:

I replaced the nested closures by rotating the arguments and replacing coordinates with indices, so that after a single full rotation you have all the indices you need as arguments and then essentially just call dataView_(args...). I think it's a lot cleaner without the nested closures (and they were really bothering me, too), but take a look and see what you think.
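
This is not the PR's actual code, but roughly the shape of the rotation trick, heavily simplified (unit-spaced grid, all arguments are coordinates, and the iwlist bookkeeping is omitted):

    #include <type_traits>

    // Terminal case: no coordinates left to consume, so every argument is now
    // an index and the call collapses to a plain data lookup.
    template <typename DataView, typename... Indices>
    double interp_rotate(std::integral_constant<int, 0>, const DataView& data,
                         Indices... idx) {
      return data(idx...);
    }

    // Recursive case: convert the leading coordinate into a bracketing index
    // plus weight, rotate that index to the back of the argument list, and
    // recurse on the remaining coordinates.
    template <int NLeft, typename DataView, typename... Args>
    double interp_rotate(std::integral_constant<int, NLeft>, const DataView& data,
                         double x, Args... rest) {
      const int i = static_cast<int>(x); // stand-in for the real grid lookup
      const double w = x - i;            // linear weight on a unit-spaced grid
      constexpr auto next = std::integral_constant<int, NLeft - 1>{};
      return (1.0 - w) * interp_rotate(next, data, rest..., i) +
             w * interp_rotate(next, data, rest..., i + 1);
    }

    // Usage sketch, for a callable `view(int, int) -> double`:
    //   double v = interp_rotate(std::integral_constant<int, 2>{}, view, 0.5, 1.25);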

Comment on lines 549 to 552
return interp_core(
    [&c = callable, i = index](auto... ii) { return c(i, ii...); },
    iwlist, // we didn't use iwlist so don't advance the pointer
    other_args...);
Collaborator:

Ah clever. Hard to reason about, but clever.

Collaborator Author:

To me, "Hard to reason about, but clever," is very likely to be bad code. What part(s) do you find hard to reason about? I've cleaned up some things, but I'm still looking for ways to improve this code.

Collaborator:

As I understand it, here's where you add in the ability to mix in integer indexes, right? It was hard to chase down what bit of code was doing that.

Collaborator Author:

Okay, so a bit more clarifying documentation (in this case, probably code comments) seems called for. I'll add that. If you find anything unclear after I change that, we can continue this discussion to see how to improve readability and maintainability.

Collaborator:

Sounds good. Also, I do think we should do some performance testing before you dive too deep into this. I'm still concerned about the nested closures (i.e., nested lambdas).

@BrendanKKrueger (Collaborator, Author) commented Oct 8, 2024:

After playing with it some more, I'm even more convinced that test/benchmark.cpp isn't really reporting what we think it's reporting. I modified the test to do extra work (instead of sampling just one point in each GPU kernel, it samples 3^3 points), and most of the times went down. So I think the per-kernel work is essentially negligible, and we're mostly timing random variations in something like transfer time to the GPU, kernel launch time, synchronization time of the reduction, or something along those lines.

Collaborator Author:

Okay, I cranked it up to about 5,000 evaluations per GPU kernel. I'll put the actual data behind a fold (if GitHub will let me), but the punchline is: at this point the timings are more consistent, so we may be able to say something meaningful (but I'm still not sure, so take this with a healthy dose of skepticism). In this case, the recursive method is almost always faster than the hand-coded method. Occasionally the hand-coded method ends up faster, but only by a very small factor. The hand-coded method is usually slower by anywhere from a few percent to almost a factor of two.

runtime data
bkkrueger@cn136~> /vast/home/bkkrueger/spiner/worktrees/bkk_interpToReal/build/volta-x86/test/benchmark2.bin 1000 10000 100000 1000000 
# ncoarse = 1000
# interpToReal 3D (hand-coded method)
#  nfine             time/point (ns)   L2 error       
   10000             1.308670e-07      0.000000e+00   
   100000            1.975500e-11      0.000000e+00   
   1000000           1.798200e-14      0.000000e+00   
# interpolate 3D (recursive method)
#  nfine             time/point (ns)   L2 error       
   10000             7.284700e-08      0.000000e+00   
   100000            1.908000e-11      0.000000e+00   
   1000000           1.735100e-14      0.000000e+00   
bkkrueger@cn136~> /vast/home/bkkrueger/spiner/worktrees/bkk_interpToReal/build/volta-x86/test/benchmark2.bin 1000 10000 100000 1000000 
# ncoarse = 1000
# interpToReal 3D (hand-coded method)
#  nfine             time/point (ns)   L2 error       
   10000             1.316370e-07      0.000000e+00   
   100000            1.954000e-11      0.000000e+00   
   1000000           2.921400e-14      0.000000e+00   
# interpolate 3D (recursive method)
#  nfine             time/point (ns)   L2 error       
   10000             1.002250e-07      0.000000e+00   
   100000            1.866700e-11      0.000000e+00   
   1000000           1.754400e-14      0.000000e+00   
bkkrueger@cn136~> /vast/home/bkkrueger/spiner/worktrees/bkk_interpToReal/build/volta-x86/test/benchmark2.bin 1000 10000 100000 1000000 
# ncoarse = 1000
# interpToReal 3D (hand-coded method)
#  nfine             time/point (ns)   L2 error       
   10000             1.229280e-07      0.000000e+00   
   100000            1.953800e-11      0.000000e+00   
   1000000           2.972000e-14      0.000000e+00   
# interpolate 3D (recursive method)
#  nfine             time/point (ns)   L2 error       
   10000             1.002670e-07      0.000000e+00   
   100000            2.033900e-11      0.000000e+00   
   1000000           1.838300e-14      0.000000e+00   
bkkrueger@cn136~> /vast/home/bkkrueger/spiner/worktrees/bkk_interpToReal/build/volta-x86/test/benchmark2.bin 1000 10000 100000 1000000 
# ncoarse = 1000
# interpToReal 3D (hand-coded method)
#  nfine             time/point (ns)   L2 error       
   10000             1.263160e-07      0.000000e+00   
   100000            1.975700e-11      0.000000e+00   
   1000000           1.807400e-14      0.000000e+00   
# interpolate 3D (recursive method)
#  nfine             time/point (ns)   L2 error       
   10000             1.003250e-07      0.000000e+00   
   100000            1.905700e-11      0.000000e+00   
   1000000           1.900200e-14      0.000000e+00   
bkkrueger@cn136~> /vast/home/bkkrueger/spiner/worktrees/bkk_interpToReal/build/volta-x86/test/benchmark2.bin 1000 10000 100000 1000000 
# ncoarse = 1000
# interpToReal 3D (hand-coded method)
#  nfine             time/point (ns)   L2 error       
   10000             1.279110e-07      0.000000e+00   
   100000            1.978200e-11      0.000000e+00   
   1000000           1.831300e-14      0.000000e+00   
# interpolate 3D (recursive method)
#  nfine             time/point (ns)   L2 error       
   10000             1.046090e-07      0.000000e+00   
   100000            1.957700e-11      0.000000e+00   
   1000000           1.833200e-14      0.000000e+00   
bkkrueger@cn136~> /vast/home/bkkrueger/spiner/worktrees/bkk_interpToReal/build/volta-x86/test/benchmark2.bin 1000 10000 100000 1000000 
# ncoarse = 1000
# interpToReal 3D (hand-coded method)
#  nfine             time/point (ns)   L2 error       
   10000             1.363300e-07      0.000000e+00   
   100000            1.942000e-11      0.000000e+00   
   1000000           3.694600e-14      0.000000e+00   
# interpolate 3D (recursive method)
#  nfine             time/point (ns)   L2 error       
   10000             9.183900e-08      0.000000e+00   
   100000            1.946100e-11      0.000000e+00   
   1000000           1.773400e-14      0.000000e+00   
bkkrueger@cn136~> /vast/home/bkkrueger/spiner/worktrees/bkk_interpToReal/build/volta-x86/test/benchmark2.bin 1000 10000 100000 1000000 
# ncoarse = 1000
# interpToReal 3D (hand-coded method)
#  nfine             time/point (ns)   L2 error       
   10000             1.227620e-07      0.000000e+00   
   100000            1.960800e-11      0.000000e+00   
   1000000           1.834800e-14      0.000000e+00   
# interpolate 3D (recursive method)
#  nfine             time/point (ns)   L2 error       
   10000             9.942100e-08      0.000000e+00   
   100000            1.910700e-11      0.000000e+00   
   1000000           1.716100e-14      0.000000e+00   

Collaborator:

I think I'm satisfied with this. Thanks for chasing this down. When we wrote the paper, @dholladay00 and I also used some instrumentation... @dholladay00 do you know where that code went?

Collaborator:

I don't recall this; what do you mean by instrumentation code?

Collaborator:

I thought we had a benchmarking tool... as well as nvprof. I know we had to scale it way up from the test that's in the repo. But maybe it's just that test scaled way up...

Comment on lines 219 to 220
// (2) I think it would be worth testing if DataBox works for other
// (non-real) types, such as std::complex.
Collaborator:

I agree and I think there's no reason it shouldn't work.

Collaborator Author:

My suspicion is that it might work, but it's very likely that there are improvements we could make. For example, the weights are declared to be type T, but in practice they're always real. Similarly, I think DataBox may compute the x values as type T when they're also real. That may push the design towards something like adding a template parameter for the independent variable's type and having it default to the dependent variable's type. Existing code would hopefully continue to work as-is, but people wanting to leverage newer features (like improved support for std::complex) would be able to say "the dependent variable is std::complex<double> but the independent variables are double."
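
One possible shape for that idea (purely a sketch; this is not the actual DataBox declaration, and the names are illustrative):

    #include <complex>

    // Hypothetical sketch: a second template parameter for the independent
    // variable type, defaulting to the dependent type so existing single-type
    // usage keeps working unchanged.
    template <typename T, typename IndepT = T>
    struct DataBoxSketch {
      // Coordinates and weights would use IndepT; stored values use T.
      T interpolate(IndepT /*x*/) const { return T{}; } // placeholder body
    };

    // Existing-style usage:                DataBoxSketch<double> a;
    // Complex dependent, real independent: DataBoxSketch<std::complex<double>, double> b;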

spiner/databox.hpp (review thread resolved; outdated)