
Support for gpu_spreadinterponly=1 #564

Merged: 19 commits into flatironinstitute:master on Oct 8, 2024

Conversation

@chaithyagr (Contributor)

Hi there,

Thank you again for the amazing work. We re-ran our MRI benchmark for different NUFFT suites, and we are pleased to announce that cufinufft is the fastest NUFFT in the West, and by a wide margin! Below are our benchmark results:

2D: [benchmark figure]

3D: [benchmark figure]

However, a major reason not to use it (as discussed in the past in #295) was the lack of density compensation estimation, which requires only the spreader and interpolator.
We understand that it takes a lot of effort to maintain API endpoints for the spreader and interpolator across Julia, Python, and MATLAB, so my earlier PR was not taken up.
However, we recently saw an internal option in cufinufft opts, gpu_spreadinterponly, and felt we could exploit it to allow density compensation estimators on our side.

This pull request implements just that (without exposing internal interfaces, unlike #308).

This could help move the MR community forward in using cufinufft. We are open to discussing the implementation in detail and how we could further contribute to (cu)finufft.

To understand the advantage of using density compensation, please see our example (where we plan to add cufinufft eventually): https://mind-inria.github.io/mri-nufft/generated/autoexamples/GPU/example_density.html
Feel free to see our other examples and notice how we use density compensation extensively.

@ahbarnett (Collaborator) left a comment

Dear Chaithya,

Thanks for the tests! This PR all seems very reasonable, and is a very light change to the API. I'm so glad that you are able to use our option to switch off FFTs. I see you also skipped the kernel fseries calc, a good idea. It's great you added the error code to the docs, but my main request is to add something about upsampfac=1.0 in the GPU docs, which is a new use case to us! :)

What I still don't understand is how you use our spreadinterp, specifically, what if we were to change our kernel coefficients (very possible, since we already rescaled them in going from v2.2 to v2.3) - how will the density compensation user know? Won't it break your math? Do you access our kernel eval yourself (by bypassing our API) ?

By the way, I really like your new docs at MRI-nufft, with the formulae, NUFFT usage, etc. Very useful! (I have only skimmed so far).

Assuming others are happy, we can bring this in in a matter of days. I'd be willing to consider a CPU version of the above too, if it helps you and MRI people.

Best, Alex

@ahbarnett (Collaborator) commented Sep 20, 2024

I should add two main questions on the topic of your tests:

  1. What speedup factor did you notice from cufinufft going from v2.2 to v2.3? It looks like it changed a lot from your 3-week-old tests to the ones you posted today, right? (i.e., cufinufft was not always the fastest, but now it is?)
  2. What GPU do you use? (and, for that matter, CPU, since they are included in the tests)

Thanks, Alex

@chaithyagr (Contributor, Author)

I'll detail the perftest based on old and new cufinufft soon; I need to dig up some results. gpuNUFFT was faster than cufinufft in the past for 3D multi-coil, from what I remember. The tests were on an A100 GPU, I think; I need to check the exact details and update here.

> By the way, I really like your new docs at MRI-nufft, with the formulae, NUFFT usage, etc. Very useful! (I have only skimmed so far).

Thank you for your interest. While it is a collaborative effort, I must say the majority of this work, formulation, and documentation was by @paquiteau (who also did the original setup for the benchmark).

@ahbarnett (Collaborator) commented Sep 20, 2024

Another note to clean up the PR: currently opts.gpu_spreadinterponly is listed in the docs as "diagnostic", only to be used as such, and as producing "garbage" (!). This PR should move it to be a "data handling option" and mention the MRI use case with upsampfac=1.0.

PS Thanks @paquiteau for the tests and great new docs here:
https://mind-inria.github.io/mri-nufft/index.html

@DiamonDinoia (Collaborator)

I will review next week when I have time. Having read the conversation: gpu_spreadinterponly is maintained since we use it in type 3. There is no harm in exposing this to the user, with proper documentation, as it configures cufinufft to not do a NUFFT and produces garbage in most contexts :)

I think having this option in finufft is a good idea. In that case, we should spend some time thinking about how to reduce code duplication in a similar fashion to cufinufft.

@paquiteau commented Sep 21, 2024

Hello there! Thanks, @ahbarnett, for the positive feedback on our docs; it really is a team effort. Feel free to open an issue on our side if something is unclear or missing (for instance, the explanation of the how and why of density compensation is still missing).

> What I still don't understand is how you use our spreadinterp, specifically, what if we were to change our kernel coefficients (very possible, since we already rescaled them in going from v2.2 to v2.3) - how will the density compensation user know? Won't it break your math? Do you access our kernel eval yourself (by bypassing our API)?

That's the best part: it won't break anything; it fixes everything :) Density compensation reweights the non-uniform points with respect to how they are processed together in the NUFFT.

Regarding the benchmark, hopefully, you can run it on your hardware as well:

https://github.com/mind-inria/mri-nufft-benchmark/

It is designed to stress-test the NUFFT in challenging MRI applications (notably high spatial resolution in 3D). We did not tune many of the parameters either (e.g., upsampfac, which indeed has an impact on both memory and performance).

> what speedup factor did you notice from cufinufft in v2.2 going to v2.3? It looks like it changed a lot from your 3-week-old tests to the ones you posted today, right?

cufinufft has always been faster in 2D; for 3D, it was more subtle: the multi-coil computations of gpuNUFFT (which uses async copies and computes) were putting it ahead, IIRC. Nevertheless, gpuNUFFT remains more memory efficient, as we don't have to create two separate plans for type 1 and type 2 like in cufinufft.

@chaithyagr (Contributor, Author)

I didn't have to add this change on this PR, as it seemed like someone had done it already:

```cpp
if (!d_plan->opts.gpu_spreadinterponly) {
  if ((ier = checkCudaErrors(cudaMallocWrapper(
           &d_plan->fw, maxbatchsize * nf1 * nf2 * nf3 * sizeof(cuda_complex<T>),
           stream, d_plan->supports_pools))))
    goto finalize;
  if ((ier = checkCudaErrors(
           cudaMallocWrapper(&d_plan->fwkerhalf1, (nf1 / 2 + 1) * sizeof(T), stream,
                             d_plan->supports_pools))))
    goto finalize;
  if ((ier = checkCudaErrors(
           cudaMallocWrapper(&d_plan->fwkerhalf2, (nf2 / 2 + 1) * sizeof(T), stream,
                             d_plan->supports_pools))))
    goto finalize;
  if ((ier = checkCudaErrors(
           cudaMallocWrapper(&d_plan->fwkerhalf3, (nf3 / 2 + 1) * sizeof(T), stream,
                             d_plan->supports_pools))))
    goto finalize;
}
```

It seems we don't allocate fw, fwkerhalf, etc. if we have gpu_spreadinterponly=1.

@DiamonDinoia (Collaborator)

This PR breaks type 3; could you fix it before I review?
That option is used for the type-3 plan, where we do not do an FFT but rely on the internal type-2 plan for the NUFFT.

Does this behaviour conflict with your use case?

@DiamonDinoia (Collaborator)

> Regarding the benchmark, hopefully, you can run it on your hardware as well: mind-inria/mri-nufft-benchmark

Have you installed finufft from pip? Did you try building from source? -march=native and GPU-native optimizations might change the performance dramatically.

@chaithyagr (Contributor, Author)

@DiamonDinoia I think I am good with this PR; I'll let you review it now.

@DiamonDinoia (Collaborator) left a comment

Looks good. Some minor changes and cleanup.

@chaithyagr (Contributor, Author)

Is there something left on my side on this PR? Just want to be sure that you are not waiting on me for anything.

@ahbarnett (Collaborator) left a comment

Great. Very clean in the end. Thx for adding the docs (I may tweak them). Best, Alex

@ahbarnett (Collaborator)

> That's the best part: it won't break anything, it fixes everything :) density compensation reweights the non-uniform points with respect to how they are processed together in the NUFFT

This is still a mystery to me - precisely, how your MRI application knows what kernel we use for spreadinterp, without accessing our spreading function another way (I could guess it does a single-NU-point call to extract the kernel first?).

I'm about to bring in the PR...

@ahbarnett (Collaborator)

Since the logic is tricky here (XNOR; t3, USF=1, etc.) and t3 broke and was fixed, I want to leave the individual commits.

@ahbarnett merged commit cabda72 into flatironinstitute:master on Oct 8, 2024
167 checks passed
@chaithyagr (Contributor, Author)

Thank you for the merge.

> This is still a mystery to me - precisely, how your MRI application knows what kernel we use for spreadinterp, without accessing our spreading function another way (I could guess it does a single-NU-point call to extract the kernel first?).

Well, the density is estimated by a spread (S) followed by an interp (I) call on all ones: I(S(1)). This way, we get an approximate estimate of how the spreading/interpolation kernels scale/weight each frequency element for a given sampling pattern. This can then be used iteratively to calculate how to "compensate" for it.

Here is the paper describing it in detail: https://pubmed.ncbi.nlm.nih.gov/21688320/ (Eq. 2).
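For readers following along, the iteration described above can be sketched without any GPU. Below is a minimal NumPy illustration of a Pipe-and-Menon-style density compensation loop; the dense Gaussian `spread`/`interp` pair is a toy stand-in for what cufinufft's spreader and interpolator provide with gpu_spreadinterponly=1 and upsampfac=1.0 (all names and parameters here are illustrative, not cufinufft's API).

```python
import numpy as np

def make_spread_interp(grid_size, pts, sigma=2.0):
    """Toy 1D spread/interp pair: a dense Gaussian kernel matrix
    stands in for the NUFFT spreader/interpolator (no FFT involved)."""
    grid = np.arange(grid_size)
    # S[j, g] = kernel weight between non-uniform point j and grid point g
    S = np.exp(-0.5 * ((grid[None, :] - pts[:, None]) / sigma) ** 2)
    def spread(c):   # non-uniform -> uniform grid (type-1-like, no FFT)
        return S.T @ c
    def interp(f):   # uniform grid -> non-uniform (type-2-like, no FFT)
        return S @ f
    return spread, interp

def pipe_density_weights(spread, interp, n_pts, n_iter=100):
    """Iterate w <- w / |I(S(w))|, starting from all ones."""
    w = np.ones(n_pts)
    for _ in range(n_iter):
        w = w / np.abs(interp(spread(w)))
    return w

rng = np.random.default_rng(0)
pts = np.sort(rng.uniform(20.0, 108.0, size=48))  # non-uniform sample locations
spread, interp = make_spread_interp(128, pts)
w = pipe_density_weights(spread, interp, len(pts))

# With good weights, I(S(w)) is nearly flat (close to 1 at every point),
# i.e. the sampling density has been compensated.
res0 = np.abs(interp(spread(np.ones(len(pts)))) - 1).max()
res = np.abs(interp(spread(w)) - 1).max()
print(res < 0.1 * res0)
```

Note that the scheme only needs black-box spread and interp calls, which is why it is robust to changes in the library's kernel coefficients: whatever kernel is used, I(S(1)) measures its effect directly.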

PS: Do you have an approximate idea of when we can get this PR into a release? If you anticipate it taking longer, would a minor release be possible, so that we can update mri-nufft accordingly?

@chaithyagr (Contributor, Author) commented Oct 24, 2024

@ahbarnett, do you plan a short release with this feature anytime soon? It would really help us to have, say, v2.3.1 with this feature, as it would streamline mri-nufft's use of it.
We have a PR mind-inria/mri-nufft#195 ready to tackle this specific support.

@chaithyagr deleted the fix_spreadinterponly branch on October 25, 2024
@janden (Collaborator) commented Nov 19, 2024

@chaithyagr Sorry for the delay on this, but I've cherry-picked your changes onto our v2.3.X branch (see here) in preparation for our 2.3.1 release. (The reason is that the main branch currently contains a lot of other changes, such as type 3 and a major reorganization of the code, that we're not quite ready to release yet).

Since there are no unit tests for this new feature, would you mind running your tests on this branch to make sure I've transferred all your changes properly? Even better, if you could supply a unit test of some sort, that would help us out greatly in the future to make sure things remain stable.

@chaithyagr (Contributor, Author)

Hey @janden , thank you for this. We will get back soon with updates.

@chaithyagr (Contributor, Author)

@janden it seems to work well. Please do let us know when you make a release. I will make a PR in the future testing it.

@DiamonDinoia (Collaborator) commented Nov 21, 2024

Great, thank you! May I ask how you test it and if it is possible to write a unit test for it?

I am worried about future changes breaking the code.

@chaithyagr (Contributor, Author) commented Nov 22, 2024 via email

@ahbarnett (Collaborator)

Hi @janden, thanks for drafting the 2.3.X branch. For @chaithyagr's commits to make sense, we'll also need to cherry-pick the following doc commits I did:
cf25e43
c7d8ac4
6e6d04f
They are precisely all the 10-8-2024 commits.
I am doing them now. It's a PITA since each causes conflicts.

I am a bit confused about the route then for what master is updated to in terms of version number - it can't be 2.3.anything since that's a different track from 2.4.0. So is it 2.4.0beta?

@ahbarnett (Collaborator)

Dear Chaithyagr, sorry, I had to also cherry-pick the docs to go with your changes. This also changed cufinufft/impl.h (just spacing/comment changes, but you should re-pull and check). We would only be able to run a unit test for this option if it didn't have an mri-nufft dependency. The best would be a simple Python file to check that the pure spreading works as you intend it to.

If you're OK with the above, I will make a tag for it, and we'll make a PyPI release 2.3.1. Thanks! Alex

@ahbarnett (Collaborator)

Dear @janden and @chaithyagr - Between us, Joakim & I made
https://github.com/flatironinstitute/finufft/releases/tag/v2.3.1
It wasn't much fun, due to git cherry-picking :(
However, we want users to be happy. So, please check it works for you before we do a pypi release for it.
Thanks! Alex
