Please tell us about your use of FINUFFT! #398
Replies: 9 comments 3 replies
-
As an example of a GPU use case with all of the parameters listed: #323 (reply in thread)
-
Hello (again)! We (the team doing Compressed Sensing for MRI at NeuroSpin, France) use both finufft and cufinufft in our pipeline for iterative reconstruction of non-Cartesian MRI. In practice, we work with data that is:
NB: a key aspect of the non-Cartesian trajectories is that they are composed of "shots", each of which passes through the center point of k-space. Due to the number of coils and the potential use of extra interpolators (to correct for static field inhomogeneities [3]), the number of calls to the type 1 (adjoint) and type 2 (forward) NUFFT is in the hundreds if not thousands. As there are many NUFFT implementations available, we have developed mri-nufft (in Python, with numpy, cupy, and torch backends):
In terms of parametrization:
finufft and cufinufft are the most stable implementations we could find. There are some performance challengers (well, just gpuNUFFT in some cases [4], but the recent merging of #330 calls for a new study). In terms of feature requests, the discussion at #306 and #308 is a first point: getting preconditioning weights (similar to what is done in astronomy) helps for iterative reconstruction, and is essential for deep-learning-based approaches [5][6]. @chaithyagr, @philouc, @matthieutrs
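The role of the preconditioning / density-compensation weights mentioned here can be illustrated without finufft at all. Below is a minimal numpy sketch (all sizes and names are illustrative, not from mri-nufft): a dense 1D NDFT matrix stands in for the NUFFT, and periodic trapezoid weights stand in for proper density compensation. The weighted adjoint recovers the Fourier coefficients far better than the plain adjoint.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 400, 8                        # nonuniform samples, modes |k| <= K
x = np.sort(rng.uniform(-np.pi, np.pi, M))
k = np.arange(-K, K + 1)
A = np.exp(1j * np.outer(x, k))      # type-2 "NDFT" matrix, shape (M, 2K+1)

f = rng.standard_normal(2 * K + 1) + 1j * rng.standard_normal(2 * K + 1)
b = A @ f                            # noiseless nonuniform samples

# Density-compensation weights: periodic trapezoid rule on the sample spacing
w = ((np.roll(x, -1) - np.roll(x, 1)) % (2 * np.pi)) / 2

rec_w = A.conj().T @ (w * b) / (2 * np.pi)   # weighted adjoint ("gridding") recon
rec_u = A.conj().T @ b / M                   # naive adjoint with uniform weights

err_w = np.linalg.norm(rec_w - f) / np.linalg.norm(f)
err_u = np.linalg.norm(rec_u - f) / np.linalg.norm(f)
assert err_w < err_u                 # compensation sharply reduces the error
```

The weights sum to 2*pi, so the weighted adjoint is a quadrature approximation of the Fourier-coefficient integral; with irregular sampling the uniform-weight adjoint is badly biased, which is exactly why iterative and learned reconstructions want these weights exposed.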
-
I missed this discussion post and therefore accidentally created an issue explaining how finufft is used in MRI! I also see @paquiteau has already answered quite thoroughly. Many of the things he describes are also the case here, where I do my research in Compressed Sensing at Umeå University. We mainly work with 4D-flow MRI data (3D + time) rather than 4D functional MRI, but I imagine this library is used in much the same way. We also, due to multiple coils and multiple velocity/phase encodings, have to do potentially hundreds or thousands of NUFFTs every iteration. As I didn't take the time to rewrite my issue post for this discussion, I have simply pasted what I wrote there, with some extra information added.

ISSUE POST: I created this issue as a discussion on what other extensions the finufft library could provide that are still within the grasp of what it tries to achieve. Mainly, I hope it can shed some light on the reply @ahbarnett provided in #306. The listed points are written as if in answer to that comment, so it is probably best to read that comment first, and perhaps also the expertly written inverse NUFFT tutorial.

First, I see how exposing the spreading/interpolation operations is problematic. I agree that the NUFFT is the crucial feature of the library, and if other spreaders/interpolators were found that would make the NUFFT faster, they should be prioritized! I think exposing just the spreader and interpolator in C/C++ would still be useful, but users would have to be careful, or at least aware, that they may change between versions and so on. If you ask me, by no means make this a priority. However, as mentioned in @ahbarnett's reply, the sinc^2 weights of Greengard, Inati, et al. would be very useful! So if it is feasible to implement an interface to sinc^2 quadrature weights, that would be great, and if exposing the fast sinc^2 in the process seems reasonable, that would also be nice. As pointed out.
In MRI applications one often tries to minimize ||Ax - b||, where A represents a NUFFT; A may be a 2D or a 3D NUFFT depending on the particular problem. The iterative solvers have to repeatedly apply A^H A, the normal operator. Sometimes, to improve convergence, one instead solves ||W(Ax - b)||, where W is a diagonal matrix, so that the iterations compute A^H W A instead, trading the best likelihood estimate for convergence speed. Preconditioning might seem like a better alternative, and sometimes it is, but not always: preconditioning changes some of the proximal algorithms and makes them more difficult. The W used for convergence speedup is also often the same as the "density compensation weights" mentioned.

Sometimes the density compensation weights are also used to perform a "gridding reconstruction": one finds W such that, instead of solving ||Ax - b|| with an iterative solver, one just computes x = A^H(Wb). This is approximate, but fast!

Also mentioned in the inverse tutorial (and often used in MRI), the Toeplitz approach can greatly increase the speed of applying A^H A. But this approach is sometimes not viable. For some problems, several different A^H A (with differing coordinates) have to be applied in each iteration, and storing the Toeplitz kernels for all of them takes too much memory, so computing A^H A via NUFFTs has to be used instead. Other modifications to the minimization problem, for example taking off-resonance effects into account, also make the Toeplitz approach less feasible. These problems are actually discussed in the Fessler article referenced in the inversion tutorial.

After all this text, I would like to list some features that would be really useful for MRI users and that hopefully are still within the finufft grasp. I list them in order of importance.
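The Toeplitz trick mentioned above can be checked in a few lines. A minimal numpy sketch (1D, with a dense NDFT matrix as a stand-in for the NUFFT, toy sizes): the Gram matrix A^H A has entries T_{k-k'} with T_m = sum_j e^{-i m x_j}, so applying it is a discrete convolution, computable by embedding T in a circulant and using FFTs.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 300, 16                       # nonuniform samples, Fourier modes k = 0..N-1
x = rng.uniform(-np.pi, np.pi, M)
k = np.arange(N)
A = np.exp(1j * np.outer(x, k))      # (M, N) type-2 "NDFT" matrix
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Direct normal operator: two "NUFFT" applications
direct = A.conj().T @ (A @ f)

# Toeplitz kernel T_m = sum_j e^{-i m x_j}, m = -(N-1)..(N-1)
m = np.arange(-(N - 1), N)
T = np.exp(-1j * np.outer(m, x)).sum(axis=1)

# Embed in a circulant of length L >= 2N-1 and apply by FFT
L = 2 * N
c = np.zeros(L, dtype=complex)
c[:N] = T[N - 1:]                    # T_0 .. T_{N-1}
c[L - (N - 1):] = T[:N - 1]          # T_{-(N-1)} .. T_{-1}
fast = np.fft.ifft(np.fft.fft(c) * np.fft.fft(f, L))[:N]

assert np.allclose(direct, fast)
```

Once T is precomputed (one adjoint NUFFT of all-ones data), each A^H A application costs two FFTs of length ~2N instead of two NUFFTs; the memory objection in the post is that one such kernel must be stored per distinct coordinate set.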
As I see it, these features are all within the scope of what finufft strives to achieve, especially the first one, at least if one considers the inverse NUFFT a goal. I also think that changes to the inner workings of how the NUFFT is performed could be made without this interface becoming unstable. EXTRA INFO
Another feature I didn't list, but that would be appreciated, is the 1.25 upsampling factor mentioned in the cufinufft repo (issue 126).
-
The way we use it is quite different. Our entire 4D-MRI reconstruction is based on XD-GRASP [1]. We implemented everything in C++. Given the structure of the algorithm, we do some pre-processing and then split the reconstruction into independent conjugate-gradient problems along the Z-dimension. This has two advantages:
We further split the problem into multiple respiration phases, which lets us perform multiple 2D transforms. Since these transforms do not saturate the GPU, we queue several of them in parallel. At each reconstruction we do thousands of 2D reconstructions. To further minimize overhead, we use only the guru interface with FFTW_MEASURE, and we create the plans once per thread, re-using them until the end of the program. Details:
-
Our use case is CMB lensing, and we work on iterative lensing reconstruction [1]. The Wiener filtering is done with conjugate-gradient (CG) descent. For each CG iteration, we have to perform the lensing operation. To do the lensing operation using non-uniform FFTs, we:
This approach is tremendously better and faster than previous approaches, as discussed in our lenspyx paper [2], which implements a CPU code. We are currently exploring how much faster the lensing operation is on a GPU using cuFINUFFT. All the enumerated steps above have to be put on the GPU. We use SHTns for the GPU SHT calculation. The spin-0 implementation of the lensing operation is almost done, and the speed-ups look quite good at the moment! The plan is to also include spin-2 (or spin-n, really), and eventually integrate this into delensalot.
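The NUFFT step of a lensing operation, evaluating a band-limited field at deflected positions, can be sketched in 1D. A toy numpy example (the deflection field and sizes are made up; a naive O(MN) sum stands in for a type-2 finufft call): for f(x) = cos(3x) the deflected samples f(x + d(x)) come out exactly.

```python
import numpy as np

# Band-limited field f(x) = cos(3x): Fourier coeffs F_{+-3} = 1/2
N = 9
k = np.arange(-(N // 2), N // 2 + 1)
F = np.zeros(N, dtype=complex)
F[k == 3] = 0.5
F[k == -3] = 0.5

# Deflected ("lensed") evaluation points y_j = x_j + d(x_j)
M = 50
x = 2 * np.pi * np.arange(M) / M
d = 0.1 * np.sin(x)                  # toy deflection field
y = x + d

# Type-2 step: evaluate the Fourier series at the nonuniform points y_j
g = (F[None, :] * np.exp(1j * np.outer(y, k))).sum(axis=1)

assert np.allclose(g.real, np.cos(3 * y))   # equals f(x + d(x)) exactly
assert np.allclose(g.imag, 0)
```

In the real pipeline the synthesis onto deflected angles is done on the sphere (SHTs via SHTns plus cuFINUFFT on iso-latitude rings), but the evaluate-at-displaced-points structure is the same.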
-
We are using FINUFFT for a Poisson solver and I have a basic question. When the samples are non-equally spaced, is there aliasing when taking the type 1 and type 2 NUFFT? Is it possible to eliminate the aliasing? If not, how best to deal with it?
-
Hello Nikos,
Aliasing can be interpreted as quadrature error. To take a simple case: applying the FFT to a regular grid of samples to estimate the Fourier series coefficients of a function induces aliasing error (shifted copies of the spectrum), which can be completely understood as the quadrature error of the periodic trapezoid rule. The exact answer would be the Euler-Fourier formula, an integral, which the grid approximates (with equal weights).

Now, if you are using a NUFFT (e.g. type 1 to get to Fourier series coefficients), then you must have used non-constant quadrature weights. And yes, that would induce aliasing, but not straightforward shifted copies. Usually a similar Nyquist principle for the maximum quadrature node spacing ensures small aliasing errors.

I suggest going through my tutorial examples to play with this:
https://finufft.readthedocs.io/en/latest/tut.html
Best, Alex
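The quadrature-error view of aliasing in this answer can be verified numerically. A small numpy sketch (toy function, not from the thread) using f(x) = 1/(1 - a e^{ix}), whose true Fourier coefficients are c_k = a^k for k >= 0: the trapezoid-rule/FFT estimate equals the true coefficient plus all its shifted copies c_{k+n}, c_{k+2n}, ..., exactly.

```python
import numpy as np

a, n = 0.8, 16
x = 2 * np.pi * np.arange(n) / n
f = 1.0 / (1.0 - a * np.exp(1j * x))   # exact Fourier coeffs: c_k = a^k (k >= 0)

c_hat = np.fft.fft(f) / n              # periodic trapezoid estimate of c_0..c_{n-1}
k = np.arange(n)
aliased = a**k / (1 - a**n)            # = a^k + a^(k+n) + a^(k+2n) + ...

assert np.allclose(c_hat, aliased)     # aliasing == trapezoid quadrature error
assert abs(c_hat[0] - 1.0) > 0.01      # the error is visibly nonzero for a = 0.8
```

Increasing n shrinks the aliasing term a^n geometrically, the Nyquist-type principle mentioned above; with nonuniform nodes and non-constant weights the error is no longer a clean sum of shifted copies, but the same quadrature interpretation applies.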
-
Hi, I made a Python package that relies heavily on FINUFFT. The package is named PyEPRI and is dedicated to Electron Paramagnetic Resonance Imaging (EPRI). In a few words, EPRI is an imaging technique for paramagnetic species, with applications in biomedical sciences and chemistry. Many demonstration examples are available in the PyEPRI documentation.

In terms of modeling, a standard EPRI acquisition corresponds to a sequence of 1D projections (called a sinogram); each projection roughly corresponds to the convolution between the spectral signature of the paramagnetic sample (called the reference spectrum) and the Radon transform of the image to be reconstructed (whose direction depends on the projection and is controlled during the acquisition).

The typical use case is 3D, although 2D is sometimes considered to reduce the acquisition time. The input and output data sizes depend on the experimental constraints (mainly the acquisition time). For instance, in one particular in vitro experiment, the input dataset contains 8836 projections, each with 360 measurement points.

Modern image reconstruction techniques rely on variational models and iterative algorithms, and most standard optimization schemes require repeated evaluation of the direct operator and its adjoint.

Thanks to FINUFFT, the PyEPRI package is CPU & GPU compatible.

I would like to thank you for developing and maintaining such a great library 🙏. Its impressive speed and low memory consumption, as well as its GPU compatibility, greatly motivated me to create the PyEPRI package. Thank you also for your support and the fix related to #454. My roadmap for the next release of PyEPRI contains many items; I may post several suggestions for additional features I would like to add to PyEPRI in the near future. Best,
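The link between projections and NUFFTs comes from the Fourier slice theorem: the 1D Fourier transform of a projection equals a central slice of the 2D Fourier transform of the image. At angle zero the slice lands on the uniform grid and a plain FFT suffices; at other angles the slice samples are nonuniform, which is where FINUFFT enters. A minimal numpy check of the angle-zero case (toy random image, not from PyEPRI):

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.standard_normal((32, 32))

# Projection onto the x-axis: sum over y
proj = img.sum(axis=0)

# Fourier slice theorem: the 1D DFT of the projection equals the
# ky = 0 line of the 2D DFT of the image
F2 = np.fft.fft2(img)
assert np.allclose(np.fft.fft(proj), F2[0, :])
```

For a projection at angle theta, the corresponding slice samples the 2D spectrum at frequencies (r cos(theta), r sin(theta)), which is exactly a type-2 NUFFT evaluation; the adjoint (type 1) grids all sinogram slices back, giving the direct/adjoint operator pair the iterative schemes need.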
-
Hi everyone, I am Sriram, a researcher in the Math department at the Jülich Supercomputing Centre, Germany. I use FINUFFT, or specifically cuFINUFFT, for Particle-In-Fourier simulations of kinetic plasma. Here are my answers to the questions:
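A Particle-In-Fourier step pairs a type 1 NUFFT (deposit particle charges onto Fourier modes) with a Fourier-space field solve and a type 2 NUFFT (evaluate the field back at the particle positions). A toy 1D numpy sketch with naive O(MN) sums in place of the cuFINUFFT calls (all sizes illustrative), using a cos(x) charge density whose Poisson solution -phi'' = rho is known exactly:

```python
import numpy as np

# "Particles" carrying a cos(x) charge density on [0, 2*pi)
M = 64
xp = 2 * np.pi * np.arange(M) / M
q = np.cos(xp) * (2 * np.pi / M)         # charge = density * cell size

# Type-1 step: deposit charges onto Fourier modes rho_k = sum_j q_j e^{-i k x_j}
K = 8
k = np.arange(-K, K + 1)
rho_k = (q[None, :] * np.exp(-1j * np.outer(k, xp))).sum(axis=1)

# Field solve in Fourier space: -phi'' = rho  =>  phi_k = rho_k / k^2 (k != 0)
phi_k = np.zeros_like(rho_k)
nz = k != 0
phi_k[nz] = rho_k[nz] / k[nz] ** 2

# Type-2 step: evaluate phi at (nonuniform) positions
xe = np.array([0.3, 1.7, 4.1])
phi = (phi_k[None, :] * np.exp(1j * np.outer(xe, k))).sum(axis=1) / (2 * np.pi)

assert np.allclose(phi.real, np.cos(xe))  # exact solution of -phi'' = cos(x)
```

In a real PIC/PIF loop the particle positions change every step, so the deposit/evaluate pair runs at every iteration, which is why NUFFT throughput dominates.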
-
This is a discussion thread for users to post how they use FINUFFT. This will help us optimize and extend the software.
The main questions to answer are:
what type (1,2,3) and dimension?
what number of modes N = (N1,N2,...), i.e. uniform grid shape? (if type 1 or 2)
what number of nonuniform pts M? And are these points clustered or quasi-uniform?
what tolerance? (and do you use single- or double-precision library?)
what ntransf (if you use vectorized interface)?
Do you use CPU, GPU, or both?
For CPU, how many threads do you use? Do you use single-threaded calls in parallel?
Do you use the guru or the simple interfaces?
What language wrappers do you use?
What is your application area? (possibly link to paper or your package)
What feature requests do you have?
If you want to give a link to the part of your code that uses FINUFFT, that might also be useful.
Thanks so much! Alex