
Support for gpu_spreadinterponly=1 #564

Merged: 19 commits into flatironinstitute:master on Oct 8, 2024

Conversation

@chaithyagr (Contributor)

Hi there,

Thank you again for the amazing work. We re-ran our MRI benchmark for different NUFFT suites, and we are pleased to announce that cufinufft is the fastest NUFFT in the West, and by a wide margin! Below are our benchmark results:

2D: [benchmark figure]

3D: [benchmark figure]

However, a major reason not to use it (as discussed in the past in #295) was the lack of density compensation estimation, which requires only the spreader and interpolator.
We understand that it takes a lot of effort to maintain API endpoints for the spreader and interpolator across Julia, Python, and MATLAB, so my earlier PR was not taken up.
However, we recently saw an internal option in cufinufft opts, gpu_spreadinterponly, and felt we could exploit it to allow density compensation estimators on our side.

This pull request implements just that (without exposing internal interfaces, unlike #308).

This could help move the MR community forward in using cufinufft. We are open to discussing the implementation in detail and how we could further contribute to (cu)finufft.

To understand the advantage of using density compensation, please see our example (where we plan to add cufinufft eventually): https://mind-inria.github.io/mri-nufft/generated/autoexamples/GPU/example_density.html
Feel free to see our other examples and notice how we use density compensation extensively.

@ahbarnett (Collaborator) left a comment

Dear Chaithya,

Thanks for the tests! This PR all seems very reasonable, and is a very light change to the API. I'm so glad that you are able to use our option to switch off FFTs. I see you also skipped the kernel fseries calc, a good idea. It's great you added the error code to the docs, but my main request is to add something about upsampfac=1.0 in the GPU docs, which is a new use case to us! :)

What I still don't understand is how you use our spreadinterp, specifically, what if we were to change our kernel coefficients (very possible, since we already rescaled them in going from v2.2 to v2.3) - how will the density compensation user know? Won't it break your math? Do you access our kernel eval yourself (by bypassing our API) ?

By the way, I really like your new docs at MRI-nufft, with the formulae, NUFFT usage, etc. Very useful! (I have only skimmed so far).

Assuming others are happy, we can bring this in in a matter of days. I'd be willing to consider a CPU version of the above too, if it helps you and MRI people.

Best, Alex

@ahbarnett (Collaborator) commented Sep 20, 2024

I should add two main questions on the topic of your tests:

  1. What speedup factor did you notice from cufinufft going from v2.2 to v2.3? It looks like it changed a lot from your 3-week-old tests to the ones you posted today, right? (i.e., cufinufft was not always the fastest, but now it is?)
  2. What GPU do you use? (and, for that matter, CPU, since they are included in the tests)

Thanks, Alex

@chaithyagr (Contributor, Author)

I'll detail the perftest based on old and new cufinufft soon; I need to dig up some results. gpuNUFFT was faster than cufinufft in the past for 3D multi-coil, from what I remember. The tests were on an A100 GPU, I think; I need to check the exact details and update here.

> By the way, I really like your new docs at MRI-nufft, with the formulae, NUFFT usage, etc. Very useful! (I have only skimmed so far).

Thank you for your interest. While it is a collaborative effort, I must say the majority of this work, formulation, and documentation was by @paquiteau (who also did the original setup for the benchmark).

@ahbarnett (Collaborator) commented Sep 20, 2024

Another note to clean up the PR: currently opts.gpu_spreadinterponly is listed in the docs as "diagnostic", only to be used as such, and as producing "garbage" (!). This PR should move it to be a "data handling option" and mention the MRI use case with upsampfac=1.0.

PS Thanks @paquiteau for the tests and great new docs here:
https://mind-inria.github.io/mri-nufft/index.html

@DiamonDinoia (Collaborator)

I will review next week when I have time. Having read the conversation: gpu_spreadinterponly is maintained since we use it in type 3. There is no harm in exposing this to the user, with proper documentation, as it configures cufinufft to not do a NUFFT and produces garbage in most contexts :)

I think having this option in finufft is a good idea. In that case, we should spend some time thinking about how to reduce code duplication in a similar fashion to cufinufft.

@paquiteau commented Sep 21, 2024

Hello there! Thanks, @ahbarnett, for the positive feedback on our docs; it really is a team effort. Feel free to open an issue on our side if something is unclear or missing (for instance, the explanation of the how and why of density compensation is still missing).

> What I still don't understand is how you use our spreadinterp, specifically, what if we were to change our kernel coefficients (very possible, since we already rescaled them in going from v2.2 to v2.3) - how will the density compensation user know? Won't it break your math? Do you access our kernel eval yourself (by bypassing our API)?

That's the best part: it won't break anything; it fixes everything :) Density compensation reweights the non-uniform points with respect to how they are processed together in the NUFFT.

Regarding the benchmark, hopefully, you can run it on your hardware as well:

https://github.com/mind-inria/mri-nufft-benchmark/

It is designed to stress-test the NUFFT in challenging MRI applications (notably high spatial resolution in 3D). We did not tune many of the parameters either (e.g., upsampfac, which indeed has an impact on both memory and performance).

> what speedup factor did you notice from cufinufft in v2.2 going to v2.3? It looks like it changed a lot from your 3-week-old tests to the ones you posted today, right?

cufinufft has always been faster in 2D; for 3D, it was more subtle: the multi-coil computations of gpuNUFFT (which uses async copies and computes) were putting it ahead, IIRC. Nevertheless, gpuNUFFT remains more memory efficient, as we don't have to create two separate plans for type 1 and type 2 like in cufinufft.

@chaithyagr (Contributor, Author)

I didn't have to add this change on this PR, as it seemed like someone had done it already:

```cpp
if (!d_plan->opts.gpu_spreadinterponly) {
  if ((ier = checkCudaErrors(cudaMallocWrapper(
           &d_plan->fw, maxbatchsize * nf1 * nf2 * nf3 * sizeof(cuda_complex<T>),
           stream, d_plan->supports_pools))))
    goto finalize;
  if ((ier = checkCudaErrors(
           cudaMallocWrapper(&d_plan->fwkerhalf1, (nf1 / 2 + 1) * sizeof(T), stream,
                             d_plan->supports_pools))))
    goto finalize;
  if ((ier = checkCudaErrors(
           cudaMallocWrapper(&d_plan->fwkerhalf2, (nf2 / 2 + 1) * sizeof(T), stream,
                             d_plan->supports_pools))))
    goto finalize;
  if ((ier = checkCudaErrors(
           cudaMallocWrapper(&d_plan->fwkerhalf3, (nf3 / 2 + 1) * sizeof(T), stream,
                             d_plan->supports_pools))))
    goto finalize;
}
```

It seems we don't allocate fw, fwkerhalf, etc. if we have gpu_spreadinterponly=1.

@DiamonDinoia (Collaborator)

This PR breaks type 3; could you fix it before I review?
That option is used for the type-3 plan, where we do not do an FFT but rely on the internal type-2 plan for the NUFFT.

Does this behaviour conflict with your use case?

@DiamonDinoia (Collaborator)

> Regarding the benchmark, hopefully, you can run it on your hardware as well: mind-inria/mri-nufft-benchmark

Have you installed finufft from pip? Did you try building from source? -march=native and GPU-native optimizations might change the performance dramatically.

@chaithyagr (Contributor, Author)

@DiamonDinoia I think I am good with this PR; I'll let you review it now.

@DiamonDinoia (Collaborator) left a comment

Looks good. Some minor changes and cleanup.

@chaithyagr (Contributor, Author)

Is there something left on my side on this PR? Just want to be sure that you are not waiting on me for anything.

@ahbarnett (Collaborator) left a comment

Great. Very clean in the end. Thx for adding the docs (I may tweak them). Best, Alex

@ahbarnett (Collaborator)

> That's the best part: it won't break anything, it fixes everything :) density compensation reweights the non-uniform points with respect to how they are processed together in the NUFFT

This is still a mystery to me - precisely, how your MRI application knows what kernel we use for spreadinterp, without accessing our spreading function another way (I could guess it does a single-NU-point call to extract the kernel first?).

I'm about to bring in the PR...

@ahbarnett (Collaborator)

Since the logic is tricky here (XNOR; t3, USF=1, etc.) and t3 broke and was fixed, I want to leave the individual commits.

@ahbarnett merged commit cabda72 into flatironinstitute:master on Oct 8, 2024
167 checks passed
@chaithyagr (Contributor, Author)

Thank you for the merge.

> This is still a mystery to me - precisely, how your MRI application knows what kernel we use for spreadinterp, without accessing our spreading function another way (I could guess it does a single-NU-point call to extract the kernel first?).

Well, the density is estimated by a spread (S) followed by an interp (I) call on all ones: I(S(1)). This way, we get an approximate estimate of how the spreading/interpolation kernels scale/weight each frequency element for a given sampling pattern. This can then be used iteratively to calculate how to "compensate" for it.

Here is the paper describing it in detail: https://pubmed.ncbi.nlm.nih.gov/21688320/ (Eq. 2).
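For readers following along, the iteration described above can be sketched without any GPU. Below is a minimal NumPy illustration of a Pipe-and-Menon-style density compensation loop; the dense Gaussian `spread`/`interp` pair is a toy stand-in for what cufinufft's spreader and interpolator provide with gpu_spreadinterponly=1 and upsampfac=1.0 (all names and parameters here are illustrative, not cufinufft's API).

```python
import numpy as np

def make_spread_interp(grid_size, pts, sigma=2.0):
    """Toy 1D spread/interp pair: a dense Gaussian kernel matrix
    stands in for the NUFFT spreader/interpolator (no FFT involved)."""
    grid = np.arange(grid_size)
    # S[j, g] = kernel weight between non-uniform point j and grid point g
    S = np.exp(-0.5 * ((grid[None, :] - pts[:, None]) / sigma) ** 2)
    def spread(c):   # non-uniform -> uniform grid (type-1-like, no FFT)
        return S.T @ c
    def interp(f):   # uniform grid -> non-uniform (type-2-like, no FFT)
        return S @ f
    return spread, interp

def pipe_density_weights(spread, interp, n_pts, n_iter=100):
    """Iterate w <- w / |I(S(w))|, starting from all ones."""
    w = np.ones(n_pts)
    for _ in range(n_iter):
        w = w / np.abs(interp(spread(w)))
    return w

rng = np.random.default_rng(0)
pts = np.sort(rng.uniform(20.0, 108.0, size=48))  # non-uniform sample locations
spread, interp = make_spread_interp(128, pts)
w = pipe_density_weights(spread, interp, len(pts))

# With good weights, I(S(w)) is nearly flat (close to 1 at every point),
# i.e. the sampling density has been compensated.
res0 = np.abs(interp(spread(np.ones(len(pts)))) - 1).max()
res = np.abs(interp(spread(w)) - 1).max()
print(res < 0.1 * res0)
```

Note that the scheme only needs black-box spread and interp calls, which is why it is robust to changes in the library's kernel coefficients: whatever kernel is used, I(S(1)) measures its effect directly.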

PS: Do you have an approximate idea of when we can get this PR into a release? If you anticipate it taking longer, would a minor release be possible, so that we can update mri-nufft accordingly?

@chaithyagr (Contributor, Author) commented Oct 24, 2024

@ahbarnett, do you plan a short release with this feature anytime soon? It would really help us to have, say, v2.3.1 with this feature, as it would streamline mri-nufft's use of it.
We have a PR mind-inria/mri-nufft#195 ready to tackle this specific support.

@chaithyagr deleted the fix_spreadinterponly branch on October 25, 2024
@janden (Collaborator) commented Nov 19, 2024

@chaithyagr Sorry for the delay on this, but I've cherry-picked your changes onto our v2.3.X branch (see here) in preparation for our 2.3.1 release. (The reason is that the main branch currently contains a lot of other changes, such as type 3 and a major reorganization of the code, that we're not quite ready to release yet).

Since there are no unit tests for this new feature, would you mind running your tests on this branch to make sure I've transferred all your changes properly? Even better, if you could supply a unit test of some sort, that would help us out greatly in the future to make sure things remain stable.

@chaithyagr (Contributor, Author)

Hey @janden , thank you for this. We will get back soon with updates.

@chaithyagr (Contributor, Author)

@janden it seems to work well. Please do let us know when you make a release. I will make a PR in the future testing it.

@DiamonDinoia (Collaborator) commented Nov 21, 2024

Great, thank you! May I ask how you test it and if it is possible to write a unit test for it?

I am worried about future changes breaking the code.

@chaithyagr (Contributor, Author) commented Nov 22, 2024 via email

@ahbarnett (Collaborator)

Hi @janden, thanks for drafting the 2.3.X branch. For @chaithyagr's commits to make sense, we'll also need to cherry-pick the following doc commits I did:
cf25e43
c7d8ac4
6e6d04f
They are precisely all the 10-8-2024 commits.
I am doing them now. It's a PITA since each causes conflicts.

I am a bit confused about the route then for what master is updated to in terms of version number - it can't be 2.3.anything since that's a different track from 2.4.0. So is it 2.4.0beta?

@ahbarnett (Collaborator)

Dear Chaithyagr, sorry, I had to also cherry-pick the docs to go with your changes. This also changed cufinufft/impl.h (just spacing/comment changes, but you should re-pull and check). We would only be able to run a unit test for this option if it didn't have an mri-nufft dependency. The best would be a simple Python file to check that the pure spreading works as you intend it to.

If you're OK with the above, I will make a tag for it, and we'll make a PyPI release 2.3.1. Thanks! Alex

@ahbarnett (Collaborator)

Dear @janden and @chaithyagr - Between us, Joakim & I made
https://github.com/flatironinstitute/finufft/releases/tag/v2.3.1
It wasn't much fun, due to git cherry-picking :(
However, we want users to be happy. So, please check it works for you before we do a pypi release for it.
Thanks! Alex
