Gpu type 3 #517
Changes from 76 commits
@@ -51,6 +51,8 @@ Developer notes

* CMake compiling on Linux at the Flatiron Institute (Rusty cluster): we have had a report that if you want to use LLVM, you need to ``module load llvm/16.0.3``; otherwise the default ``llvm/14.0.6`` does not find ``OpenMP_CXX``.

* Note to the nvcc developer: nvcc with debug symbols causes a stack overflow that is undetected at both compile time and runtime. This goes undetected until ``ns>=10``; for ``ns<10`` one can use ``-G`` and debug the code with cuda-gdb. The way to avoid it is to not use debug symbols; ``--generate-line-info`` might work (not tested). As a side note, the compute sanitizers do not detect the issue.

  Review comment: great. Maybe mention only d=3?

* Testing cufinufft (for FI, mostly):

  .. code-block:: sh
@@ -20,7 +20,7 @@ to the simple, vectorized, or guru makeplan routines.
Recall how to do this from C++:

.. code-block:: C++

   // (... set up M,x,c,tol,N, and allocate F here...)
   finufft_opts* opts = new finufft_opts;   // must point to allocated storage
   finufft_default_opts(opts);
@@ -30,7 +30,7 @@ Recall how to do this from C++:
This setting produces more timing output to ``stdout``.

.. warning::

   In C/C++ and Fortran, don't forget to call the command which sets default options
   (``finufft_default_opts`` or ``finufftf_default_opts``)
   before you start changing them and passing them to FINUFFT.
@@ -51,9 +51,9 @@ Here are their default settings (from ``src/finufft.cpp:finufft_default_opts``):
.. literalinclude:: ../src/finufft.cpp
   :start-after: @defopts_start
   :end-before: @defopts_end

As for quick advice, the main options you'll want to play with are:

- ``modeord`` to flip ("fftshift") the Fourier mode ordering
- ``debug`` to look at timing output (to determine if your problem is spread/interpolation dominated vs FFT dominated)
- ``nthreads`` to run with a different number of threads than the current maximum available through OpenMP (a large number can sometimes be detrimental, and very small problems can sometimes run faster on 1 thread)
@@ -92,15 +92,15 @@ Data handling options
.. note:: The index *sets* are the same in the two ``modeord`` choices; their ordering differs only by a cyclic shift. The FFT ordering cyclically shifts the CMCL indices $\mbox{floor}(N/2)$ to the left (often called an "fftshift").
**chkbnds**: [DEPRECATED] has no effect.


Diagnostic options
~~~~~~~~~~~~~~~~~~~~~~~

**debug**: Controls the amount of overall debug/timing output to stdout.

* ``debug=0`` : silent

* ``debug=1`` : prints some information

* ``debug=2`` : prints more information

@@ -113,11 +113,11 @@ Diagnostic options

* ``spread_debug=2`` : prints lots. This can print thousands of lines, since it includes one line per *subproblem*.

**showwarn**: Whether to print warnings (these go to stderr).

* ``showwarn=0`` : suppresses such warnings

* ``showwarn=1`` : prints warnings
@@ -173,17 +173,17 @@ for only two settings, as follows. Otherwise, setting it to zero chooses a good
**spread_thread**: in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), controls how multithreading is used to spread/interpolate each batch of data.

* ``spread_thread=0`` : makes an automatic choice between the below. Recommended.

* ``spread_thread=1`` : acts on each vector in the batch in sequence, using multithreaded spread/interpolate on that vector. It can be slightly better than ``2`` for large problems.

* ``spread_thread=2`` : acts on all vectors in a batch (of size typically chosen to be the number of threads) simultaneously, assigning each a thread which performs a single-threaded spread/interpolate. It is much better than ``1`` for all but large problems. (Historical note: this was used by Melody Shih for the original "2dmany" interface in 2018.)

.. note::

   Historical note: a former option ``3`` has been removed. This was like ``2`` except that it allowed nested OMP parallelism, so multithreaded spread/interpolate was used for each of the vectors in a batch in parallel. This was used by Andrea Malleo in 2019. We have not yet found a case where this beats both ``1`` and ``2``, hence we removed it due to complications with changing the OMP nesting state in both old and new OMP versions.
**maxbatchsize**: in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), sets the largest batch size of data vectors.

**batchsize**: in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), sets the largest batch size of data vectors.

Review comment: I'm confused - this seems to rename a CPU option? We don't want to change them. I thought we were changing maxbatchsize -> batchsize only internally, and only in the GPU code?

Here ``0`` makes an automatic choice. If you are unhappy with this, then for small problems it should equal the number of threads, while for large problems ``1`` often appears better (since otherwise too much simultaneous RAM movement occurs). Some further work is needed to optimize this parameter.

**spread_nthr_atomic**: if non-negative: for numbers of threads up to this value, an OMP critical block for ``add_wrapped_subgrid`` is used in spreading (type 1 transforms). Above this value, OMP atomic writes are used instead, which scale better for large thread numbers. If negative, the heuristic default in the spreader is used, set in ``src/spreadinterp.cpp:setup_spreader()``.
@@ -7,27 +7,37 @@
#include <finufft_errors.h>
#include <finufft_spread_opts.h>

#include <complex.h>
#include <complex>
#include <optional>

Review comment: Used?
namespace cufinufft {
namespace common {
template<typename T>
__global__ void fseries_kernel_compute(int nf1, int nf2, int nf3, T *f,
                                       cuDoubleComplex *a, T *fwkerhalf1, T *fwkerhalf2,
__global__ void fseries_kernel_compute(int nf1, int nf2, int nf3, T *f, T *a,
                                       T *fwkerhalf1, T *fwkerhalf2, T *fwkerhalf3,
                                       int ns);
template<typename T>
__global__ void fseries_kernel_compute(int nf1, int nf2, int nf3, T *f, T *a, T *kx,
                                       T *ky, T *kz, T *fwkerhalf1, T *fwkerhalf2,
                                       T *fwkerhalf3, int ns);
template<typename T>
int cufserieskernelcompute(int dim, int nf1, int nf2, int nf3, T *d_f,
                           cuDoubleComplex *d_a, T *d_fwkerhalf1, T *d_fwkerhalf2,
int cufserieskernelcompute(int dim, int nf1, int nf2, int nf3, T *d_f, T *d_a,
                           T *d_fwkerhalf1, T *d_fwkerhalf2, T *d_fwkerhalf3, int ns,
                           cudaStream_t stream);
template<typename T>
int cufserieskernelcompute(int dim, int nf1, int nf2, int nf3, T *d_f, T *d_a, T *d_kx,
                           T *d_ky, T *d_kz, T *d_fwkerhalf1, T *d_fwkerhalf2,
                           T *d_fwkerhalf3, int ns, cudaStream_t stream);
template<typename T>
int setup_spreader_for_nufft(finufft_spread_opts &spopts, T eps, cufinufft_opts opts);

void set_nf_type12(CUFINUFFT_BIGINT ms, cufinufft_opts opts, finufft_spread_opts spopts,
                   CUFINUFFT_BIGINT *nf, CUFINUFFT_BIGINT b);
template<typename T>
void onedim_fseries_kernel(CUFINUFFT_BIGINT nf, T *fwkerhalf, finufft_spread_opts opts);
template<typename T>
void onedim_fseries_kernel_precomp(CUFINUFFT_BIGINT nf, T *f, std::complex<double> *a,
// template<typename T>
// void onedim_fseries_kernel(CUFINUFFT_BIGINT nf, T *fwkerhalf, finufft_spread_opts
// opts);
template<typename T, bool>
void onedim_fseries_kernel_precomp(CUFINUFFT_BIGINT nf, T *f, T *a,
                                   finufft_spread_opts opts);
template<typename T>
void onedim_fseries_kernel_compute(CUFINUFFT_BIGINT nf, T *f, std::complex<double> *a,
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,168 @@ | ||
#ifndef FINUFFT_INCLUDE_CUFINUFFT_CONTRIB_HELPER_MATH_H | ||
ahbarnett marked this conversation as resolved.
Show resolved
Hide resolved
|
||
#define FINUFFT_INCLUDE_CUFINUFFT_CONTRIB_HELPER_MATH_H | ||
|
||
#include <cuComplex.h> | ||
|
||
// Addition for cuDoubleComplex (double) with cuDoubleComplex (double) | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator+( | ||
const cuDoubleComplex &a, const cuDoubleComplex &b) noexcept { | ||
return cuCadd(a, b); | ||
} | ||
|
||
// Subtraction for cuDoubleComplex (double) with cuDoubleComplex (double) | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator-( | ||
const cuDoubleComplex &a, const cuDoubleComplex &b) noexcept { | ||
return cuCsub(a, b); | ||
} | ||
|
||
// Multiplication for cuDoubleComplex (double) with cuDoubleComplex (double) | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator*( | ||
const cuDoubleComplex &a, const cuDoubleComplex &b) noexcept { | ||
return cuCmul(a, b); | ||
} | ||
|
||
// Division for cuDoubleComplex (double) with cuDoubleComplex (double) | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator/( | ||
const cuDoubleComplex &a, const cuDoubleComplex &b) noexcept { | ||
return cuCdiv(a, b); | ||
} | ||
|
||
// Equality for cuDoubleComplex (double) with cuDoubleComplex (double) | ||
__host__ __device__ __forceinline__ bool operator==(const cuDoubleComplex &a, | ||
const cuDoubleComplex &b) noexcept { | ||
return cuCreal(a) == cuCreal(b) && cuCimag(a) == cuCimag(b); | ||
} | ||
|
||
// Inequality for cuDoubleComplex (double) with cuDoubleComplex (double) | ||
__host__ __device__ __forceinline__ bool operator!=(const cuDoubleComplex &a, | ||
const cuDoubleComplex &b) noexcept { | ||
return !(a == b); | ||
} | ||
|
||
// Addition for cuDoubleComplex (double) with double | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator+(const cuDoubleComplex &a, | ||
double b) noexcept { | ||
return make_cuDoubleComplex(cuCreal(a) + b, cuCimag(a)); | ||
} | ||
|
||
__host__ __device__ __forceinline__ cuDoubleComplex operator+( | ||
double a, const cuDoubleComplex &b) noexcept { | ||
return make_cuDoubleComplex(a + cuCreal(b), cuCimag(b)); | ||
} | ||
|
||
// Subtraction for cuDoubleComplex (double) with double | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator-(const cuDoubleComplex &a, | ||
double b) noexcept { | ||
return make_cuDoubleComplex(cuCreal(a) - b, cuCimag(a)); | ||
} | ||
|
||
__host__ __device__ __forceinline__ cuDoubleComplex operator-( | ||
double a, const cuDoubleComplex &b) noexcept { | ||
return make_cuDoubleComplex(a - cuCreal(b), -cuCimag(b)); | ||
} | ||
|
||
// Multiplication for cuDoubleComplex (double) with double | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator*(const cuDoubleComplex &a, | ||
double b) noexcept { | ||
return make_cuDoubleComplex(cuCreal(a) * b, cuCimag(a) * b); | ||
} | ||
|
||
__host__ __device__ __forceinline__ cuDoubleComplex operator*( | ||
double a, const cuDoubleComplex &b) noexcept { | ||
return make_cuDoubleComplex(a * cuCreal(b), a * cuCimag(b)); | ||
} | ||
|
||
// Division for cuDoubleComplex (double) with double | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator/(const cuDoubleComplex &a, | ||
double b) noexcept { | ||
return make_cuDoubleComplex(cuCreal(a) / b, cuCimag(a) / b); | ||
} | ||
|
||
__host__ __device__ __forceinline__ cuDoubleComplex operator/( | ||
double a, const cuDoubleComplex &b) noexcept { | ||
double denom = cuCreal(b) * cuCreal(b) + cuCimag(b) * cuCimag(b); | ||
return make_cuDoubleComplex((a * cuCreal(b)) / denom, (-a * cuCimag(b)) / denom); | ||
} | ||
// Addition for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ cuFloatComplex operator+(
    const cuFloatComplex &a, const cuFloatComplex &b) noexcept {
  return cuCaddf(a, b);
}

// Subtraction for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ cuFloatComplex operator-(
    const cuFloatComplex &a, const cuFloatComplex &b) noexcept {
  return cuCsubf(a, b);
}

// Multiplication for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ cuFloatComplex operator*(
    const cuFloatComplex &a, const cuFloatComplex &b) noexcept {
  return cuCmulf(a, b);
}

// Division for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ cuFloatComplex operator/(
    const cuFloatComplex &a, const cuFloatComplex &b) noexcept {
  return cuCdivf(a, b);
}

// Equality for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ bool operator==(const cuFloatComplex &a,
                                                    const cuFloatComplex &b) noexcept {
  return cuCrealf(a) == cuCrealf(b) && cuCimagf(a) == cuCimagf(b);
}

// Inequality for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ bool operator!=(const cuFloatComplex &a,
                                                    const cuFloatComplex &b) noexcept {
  return !(a == b);
}

// Addition for cuFloatComplex (float) with float
__host__ __device__ __forceinline__ cuFloatComplex operator+(const cuFloatComplex &a,
                                                             float b) noexcept {
  return make_cuFloatComplex(cuCrealf(a) + b, cuCimagf(a));
}

__host__ __device__ __forceinline__ cuFloatComplex operator+(
    float a, const cuFloatComplex &b) noexcept {
  return make_cuFloatComplex(a + cuCrealf(b), cuCimagf(b));
}

// Subtraction for cuFloatComplex (float) with float
__host__ __device__ __forceinline__ cuFloatComplex operator-(const cuFloatComplex &a,
                                                             float b) noexcept {
  return make_cuFloatComplex(cuCrealf(a) - b, cuCimagf(a));
}

__host__ __device__ __forceinline__ cuFloatComplex operator-(
    float a, const cuFloatComplex &b) noexcept {
  return make_cuFloatComplex(a - cuCrealf(b), -cuCimagf(b));
}

// Multiplication for cuFloatComplex (float) with float
__host__ __device__ __forceinline__ cuFloatComplex operator*(const cuFloatComplex &a,
                                                             float b) noexcept {
  return make_cuFloatComplex(cuCrealf(a) * b, cuCimagf(a) * b);
}

__host__ __device__ __forceinline__ cuFloatComplex operator*(
    float a, const cuFloatComplex &b) noexcept {
  return make_cuFloatComplex(a * cuCrealf(b), a * cuCimagf(b));
}

// Division for cuFloatComplex (float) with float
__host__ __device__ __forceinline__ cuFloatComplex operator/(const cuFloatComplex &a,
                                                             float b) noexcept {
  return make_cuFloatComplex(cuCrealf(a) / b, cuCimagf(a) / b);
}

__host__ __device__ __forceinline__ cuFloatComplex operator/(
    float a, const cuFloatComplex &b) noexcept {
  float denom = cuCrealf(b) * cuCrealf(b) + cuCimagf(b) * cuCimagf(b);
  return make_cuFloatComplex((a * cuCrealf(b)) / denom, (-a * cuCimagf(b)) / denom);
}

#endif // FINUFFT_INCLUDE_CUFINUFFT_CONTRIB_HELPER_MATH_H
@@ -1,15 +1,18 @@
#ifndef CUFINUFFT_DEFS_H
#define CUFINUFFT_DEFS_H

#include <complex>

Review comment: Used?

#include <limits>

// constants needed within common
// upper bound on w, ie nspread, even when padded (see evaluate_kernel_vector); also for
// common
#define MAX_NSPREAD 16
#define MAX_NSPREAD 16

// max number of positive quadr nodes
#define MAX_NQUAD 100
#define MAX_NQUAD 100

// Fraction growth cut-off in utils:arraywidcen; sets when translate in type-3
#define ARRAYWIDCEN_GROWFRAC 0.1

// FIXME: If cufft ever takes N > INT_MAX...
constexpr int32_t MAX_NF = std::numeric_limits<int32_t>::max();

Review comment: using a separate * for each line here implies they are distinct features. Suggest using a hyphen as below to group a sublist for GPU.