From d1034a8ef4cdffe3657cf02558e6a99399576087 Mon Sep 17 00:00:00 2001
From: ahbarnett <alex.h.barnett@gmail.com>
Date: Tue, 8 Oct 2024 13:52:59 -0400
Subject: [PATCH] fix after cherry-pick cf25e43

---
 CHANGELOG      |  5 +++++
 docs/c_gpu.rst |  9 ++++++---
 docs/error.rst |  2 +-
 docs/opts.rst  | 36 +++++++++++++++++++-----------------
 4 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/CHANGELOG b/CHANGELOG
index 41c310842..01107e1bf 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,6 +1,11 @@
 List of features / changes made / release notes, in reverse chronological order.
 If not stated, FINUFFT is assumed (cuFINUFFT <=1.3 is listed separately).
 
+V 2.3.1 (10/8/24) support release
+
+* Support and docs for opts.gpu_spreadinterponly=1 for MRI "density compensation
+  estimation" type 1&2 use-case with upsampfac=1.0 PR564 (Chaithya G R).
+
 V 2.3.0 (9/5/24)
 
 * Switched C++ standards from C++14 to C++17, allowing various templating
diff --git a/docs/c_gpu.rst b/docs/c_gpu.rst
index a5f050220..e67a9c533 100644
--- a/docs/c_gpu.rst
+++ b/docs/c_gpu.rst
@@ -1,3 +1,5 @@
+.. _c_gpu:
+
 C interface (GPU)
 =================
 
@@ -311,8 +313,9 @@ while ``modeord=1`` selects FFT-style ordering starting at zero and wrapping ove
 
 **gpu_device_id**: Sets the GPU device ID. Leave at default unless you know what you're doing. [To be documented]
 
-**gpu_spreadinterponly**: if ``0`` do the NUFFT as intended. If ``1``, omit the FFT and kernel FT deconvolution steps and return garbage answers.
-Nonzero value can be used *only* with `upsampfac=1` and `kerevalmeth=1` for using cufinufft for only spread and interpolate mode. This has applications into estimating the density compensation, which is conventionally used in MRI. (See [MRI-NUFFT](https://mind-inria.github.io/mri-nufft/nufft.html))
+**gpu_spreadinterponly**: if ``0`` do the NUFFT as intended. If ``1``, omit the FFT and deconvolution (diagonal division by kernel Fourier transform) steps, which returns *garbage answers as a NUFFT*, but allows advanced users to perform an isolated spreading or interpolation using the usual type 1 or type 2 ``cufinufft`` interface. To do this, the nonzero flag value must be used *only* with ``upsampfac=1.0`` (since the input and output grids are the same size, and neither represents Fourier coefficients), and ``kerevalmeth=1``. The known use-case here is estimating so-called density compensation, conventionally used in MRI. (See [MRI-NUFFT](https://mind-inria.github.io/mri-nufft/nufft.html)) Please note that this flag is also internally used by type 3 transforms (which was its original use case).
+
+
 
 
 Algorithm performance options
@@ -323,7 +326,7 @@ Algorithm performance options
 * ``gpu_method=0`` : makes an automatic choice of one of the below methods, based on our heuristics.
 
 * ``gpu_method=1`` : uses a nonuniform points-driven method, either unsorted which is referred to as GM in our paper, or sorted which is called GM-sort in our paper, depending on option ``gpu_sort`` below
-  
+
 * ``gpu_method=2`` : for spreading only, ie, type 1 transforms, uses a shared memory output-block driven method, referred to as SM in our paper. Has no effect for interpolation (type 2 transforms)
 
 * ``gpu_method>2`` : (various upsupported experimental methods due to Melody Shih, not for regular users. Eg ``3`` tests an idea of Paul Springer's to group NU points when spreading, ``4`` is a block gather method of possible interest.)
diff --git a/docs/error.rst b/docs/error.rst
index 6e12d2859..447fda939 100644
--- a/docs/error.rst
+++ b/docs/error.rst
@@ -29,7 +29,7 @@ has the following meanings (see codes in ``include/finufft_errors.h``):
   18 size of bins for subprob/blockgather invalid
   19 GPU shmem too small for subprob/blockgather parameters
   20 invalid number of nonuniform points: nj or nk negative, or too big (see defs.h)
-  23 invalid upsampfac set while using gpu_spreadinterponly mode.
+  23 invalid upsampfac set while using gpu_spreadinterponly mode
 
 When ``ier=1`` (warning only) the transform(s) is/are still completed, at the smallest epsilon achievable, so, with that caveat, the answer should still be usable.
 
diff --git a/docs/opts.rst b/docs/opts.rst
index 8e32829d8..423ddb844 100644
--- a/docs/opts.rst
+++ b/docs/opts.rst
@@ -20,7 +20,7 @@ to the simple, vectorized, or guru makeplan routines.
 Recall how to do this from C++:
 
 .. code-block:: C++
-                
+
   // (... set up M,x,c,tol,N, and allocate F here...)
   finufft_opts* opts;
   finufft_default_opts(opts);
@@ -30,7 +30,7 @@ Recall how to do this from C++:
 This setting produces more timing output to ``stdout``.
 
 .. warning::
-   
+
  In C/C++ and Fortran, don't forget to call the command which sets default options
  (``finufft_default_opts`` or ``finufftf_default_opts``)
  before you start changing them and passing them to FINUFFT.
@@ -51,9 +51,9 @@ Here are their default settings (from ``src/finufft.cpp:finufft_default_opts``):
 .. literalinclude:: ../src/finufft.cpp
    :start-after: @defopts_start
    :end-before: @defopts_end
-  
+
 As for quick advice, the main options you'll want to play with are:
-  
+
 - ``modeord`` to flip ("fftshift") the Fourier mode ordering
 - ``debug`` to look at timing output (to determine if your problem is spread/interpolation dominated, vs FFT dominated)
 - ``nthreads`` to run with a different number of threads than the current maximum available through OpenMP (a large number can sometimes be detrimental, and very small problems can sometimes run faster on 1 thread)
@@ -92,7 +92,7 @@ Data handling options
   .. note:: The index *sets* are the same in the two ``modeord`` choices; their ordering differs only by a cyclic shift. The FFT ordering cyclically shifts the CMCL indices $\mbox{floor}(N/2)$ to the left (often called an "fftshift").
 
 **chkbnds**: [DEPRECATED] has no effect.
-  
+
 
 Diagnostic options
 ~~~~~~~~~~~~~~~~~~~~~~~
@@ -100,7 +100,7 @@ Diagnostic options
 **debug**: Controls the amount of overall debug/timing output to stdout.
 
 * ``debug=0`` : silent
-  
+
 * ``debug=1`` : print some information
 
 * ``debug=2`` : prints more information
@@ -113,11 +113,11 @@ Diagnostic options
 
   * ``spread_debug=2`` : prints lots. This can print thousands of lines since it includes one line per *subproblem*.
 
-   
+
 **showwarn**: Whether to print warnings (these go to stderr).
-    
+
 * ``showwarn=0`` : suppresses such warnings
-  
+
 * ``showwarn=1`` : prints warnings
 
 
@@ -160,29 +160,31 @@ There is thus little reason for the nonexpert to mess with this option.
 
 **upsampfac**: This is the internal real factor by which the FFT (fine grid)
 is chosen larger than
-the number of requested modes in each dimension, for type 1 and 2 transforms.
+the number of requested modes in each dimension, for type 1 and 2 transforms. It must be greater than 1.
 We have built efficient kernels
-for only two settings, as follows. Otherwise, setting it to zero chooses a good heuristic:
+for only two settings that are greater than 1, as follows. Otherwise, setting it to zero chooses a good heuristic:
 
-* ``upsampfac=0.0`` : use heuristics to choose ``upsampfac`` as one of the below values, and use this value internally. The value chosen is visible in the text output via setting ``debug>=2``. This setting is recommended for basic users; however, if you seek more performance it is quick to try the other of the below.
+* ``upsampfac=0.0`` : use heuristics to choose ``upsampfac`` as one of the below values exceeding 1, and use this value internally. The value chosen is visible in the text output via setting ``debug>=2``. This setting is recommended for basic users; however, if you seek more performance it is quick to try the other of the values exceeding 1.
 
-* ``upsampfac=2.0`` : standard setting of upsampling. This is necessary if you need to exceed 9 digits of accuracy.
+* ``upsampfac=2.0`` : standard setting of upsampling. Due to kernel width restrictions, this is necessary if you need to exceed 9 digits of accuracy.
 
 * ``upsampfac=1.25`` : low-upsampling option, with lower RAM, smaller FFTs, but wider spreading kernel. The latter can be much faster than the standard when the number of nonuniform points is similar or smaller to the number of modes, and/or if low accuracy is required. It is especially much (2 to 3 times) faster for type 3 transforms. However, the kernel widths :math:`w` are about 50% larger in each dimension, which can lead to slower spreading (it can also be faster due to the smaller size of the fine grid). Because the kernel width is limited to 16, currently, thus only 9-digit accuracy can currently be reached when using ``upsampfac=1.25``.
 
+* ``upsampfac=1.0`` : an obscure advanced setting peculiar to a "spread/interpolate only" mode specific to the GPU (see :ref:`GPU usage<c_gpu>`), which does not even in fact execute a NUFFT. Thus, do not use this unless you know what you are doing!
+
 **spread_thread**: in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), controls how multithreading is used to spread/interpolate each batch of data.
 
 * ``spread_thread=0`` : makes an automatic choice between the below. Recommended.
-  
+
 * ``spread_thread=1`` : acts on each vector in the batch in sequence, using multithreaded spread/interpolate on that vector. It can be slightly better than ``2`` for large problems.
-    
+
 * ``spread_thread=2`` : acts on all vectors in a batch (of size chosen typically to be the number of threads) simultaneously, assigning each a thread which performs a single-threaded spread/interpolate.  It is much better than ``1`` for all but large problems. (Historical note: this was used by Melody Shih for the original "2dmany" interface in 2018.)
 
   .. note::
-  
+
     Historical note: A former option ``3`` has been removed. This was like ``2`` except allowing nested OMP parallelism, so multi-threaded spread-interpolate was used for each of the vectors in a batch in parallel. This was used by Andrea Malleo in 2019. We have not yet found a case where this beats both ``1`` and ``2``, hence removed it due to complications with changing the OMP nesting state in both old and new OMP versions.
 
-     
+
 **maxbatchsize**:  in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), set the largest batch size of data vectors.
 Here ``0`` makes an automatic choice. If you are unhappy with this, then for small problems it should equal the number of threads, while for large problems it appears that ``1`` often better (since otherwise too much simultaneous RAM movement occurs). Some further work is needed to optimize this parameter.