Gpu type 3 #517

DiamonDinoia · 2024-08-08T19:32:02Z

This PR implements Type3 on GPU using CUDA.

DiamonDinoia · 2024-08-28T18:01:25Z

Changelog:

- Support for type 3 in 1D, 2D, and 3D in the GPU library cufinufft.
- Removed the CPU fseries computation (it was only used for benchmarking and is no longer needed).
- Added complex arithmetic support for the cuda_complex type.
- Added tests for type 3 in 1D, 2D, and 3D, and for cuda_complex arithmetic.
- Minor fixes to the GPU code:
  - removed memory leaks in case of errors
  - renamed maxbatchsize to batchsize

DiamonDinoia · 2024-08-29T16:10:42Z

The test failure is not an issue. I will disable that test later.

ahbarnett

Great job! A big one. Just a few things to tidy up; it should take at most 1-2 hrs if you don't address the phase-winding tidy-up.

[However, have a look at removing the whole phase-winding and "a" stuff (it is way more confusing than it needs to be, due to old code left by Melody and Robert). I could have a stab at it after the merge anyway, unless you want to. It might add 0.5-1 days of work for you. Sit down and write out the math, or go back to the CPU code sans phase-winding, just using plain cos(). If it's not clear, we can go over it together after Sept 3.]

ahbarnett · 2024-08-29T18:55:52Z

CHANGELOG

@@ -1,6 +1,15 @@
 List of features / changes made / release notes, in reverse chronological order.
 If not stated, FINUFFT is assumed (cuFINUFFT <=1.3 is listed separately).

+V 2.4.0 (08/28/24)
+* Support for type 3 in 1D, 2D, and 3D in the GPU library cufinufft (PR #517).
+* Removed the CPU fseries computation (only used for benchmark no longer needed).


Using a separate * for each line here implies they are distinct features. I suggest using hyphens, as below, to group a sublist for the GPU changes.

ahbarnett · 2024-08-29T18:56:24Z

docs/devnotes.rst

@@ -51,6 +51,8 @@ Developer notes

 * CMake compiling on linux at Flatiron Institute (Rusty cluster): We have had a report that if you want to use LLVM, you need to ``module load llvm/16.0.3`` otherwise the default ``llvm/14.0.6`` does not find ``OpenMP_CXX``.

+* Note to the nvcc developer. nvcc with debug symbols causes a stack overflow that is undetected at both compile and runtime. This goes undetected until ns>=10, for ns<10, one can use -G and debug the code with cuda-gdb. The way to avoid is to not use Debug symbols, possibly using ``--generate-line-info`` might work (not tested). As a side note, compute-sanitizers do not detect the issue.


great. Maybe mention only d=3?

ahbarnett · 2024-08-29T18:57:51Z

docs/opts.rst

-     
-**maxbatchsize**:  in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), set the largest batch size of data vectors.
+
+**batchsize**:  in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), set the largest batch size of data vectors.


I'm confused: this seems to rename a CPU option? We don't want to change those. I thought we were changing maxbatchsize -> batchsize only internally and only in the GPU code?

include/cufinufft/contrib/helper_math.h

include/cufinufft/impl.h

ahbarnett · 2024-08-30T03:28:09Z

test/cuda/cufinufft1d_test.cu

+    }
+    s.resize(N1);
+    for (int i = 0; i < N1; i++) {
+      s[i] = M_PI * randm11();


In test/finufft1d_test.cpp the "s" array is scaled by S = N1/2. I suggest you do this here too, so that the space-frequency product (and hence the FFT size) grows with the number of modes. This matches the CPU testers. The same applies to 2D and 3D, below.
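A minimal host-side sketch of one reading of this suggestion, assuming the frequencies are simply drawn as `S * randm11()` with `S = N1/2`. The `randm11` below is a hypothetical stand-in for the testers' helper, and `make_scaled_freqs` is an illustrative name that does not appear in the PR; the actual CPU tester may apply additional offsets.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Hypothetical stand-in for the testers' randm11(): uniform in [-1, 1).
inline double randm11(std::mt19937 &rng) {
  return std::uniform_real_distribution<double>(-1.0, 1.0)(rng);
}

// Illustrative helper (not from the PR): fill N1 target frequencies in
// [-S, S) with S = N1 / 2, so the frequency range grows with the mode count.
std::vector<double> make_scaled_freqs(int N1, std::mt19937 &rng) {
  const double S = N1 / 2.0;
  std::vector<double> s(N1);
  for (auto &sk : s) sk = S * randm11(rng);
  return s;
}
```

With the nonuniform points kept in [-pi, pi), doubling N1 doubles the frequency half-width S, so the space-frequency product scales with the number of modes rather than staying fixed.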

test/cuda/cufinufft1d_test.cu

ahbarnett · 2024-08-30T03:29:41Z

test/cuda/cufinufft2d_test.cu

+    s.resize(N1 * N2);
+    t.resize(N1 * N2);
+    for (int i = 0; i < N1 * N2; i++) {
+      s[i] = M_PI * randm11();


See the previous review comment on scaling the s, t test frequency points by N1, N2.

ahbarnett · 2024-08-30T03:30:45Z

test/cuda/cufinufft3d_test.cu

+    t.resize(N1 * N2 * N3);
+    u.resize(N1 * N2 * N3);
+    for (int i = 0; i < N1 * N2 * N3; i++) {
+      s[i] = M_PI * randm11();


Ditto: see the scaling comment above.

ahbarnett · 2024-08-30T03:33:46Z

test/cuda/cufinufft_math_test.cu

+
+// Helper function to compare cuComplex with std::complex<T> using 1 - ratio as error
+template<typename T>
+bool compareComplex(const cuda_complex<T> &a, const std::complex<T> &b,


As we discussed today, this is too harsh a test, since when a complex number "lands" near the real axis, its imaginary part (by itself) may have a high relative error, and that's ok. Instead of testing the Re and Im relative errors separately, use cabs(a-b)/cabs(a) < epsilon. Ask if this still doesn't make sense.
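The criterion above can be sketched on the host with `std::complex`. The helper name `close_rel` and the zero-denominator handling are assumptions for illustration, not code from the PR; the real test compares a `cuda_complex<T>` against a `std::complex<T>`.

```cpp
#include <cassert>
#include <complex>

// Hypothetical helper: true when |a - b| <= tol * |a|, i.e. the relative
// error is measured on the complex modulus, not on Re and Im separately.
template <typename T>
bool close_rel(const std::complex<T> &a, const std::complex<T> &b, T tol) {
  const T denom = std::abs(a);
  if (denom == T(0)) return std::abs(b) == T(0); // avoid dividing by zero
  return std::abs(a - b) <= tol * denom;
}
```

For example, a = 1 + 1e-12i and b = 1 + 2e-12i differ by 100% in their imaginary parts alone, yet |a-b|/|a| is about 1e-12, so they pass with a tolerance of 1e-10.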

janden

Looks good on my end. Just a few questions.

janden · 2024-09-03T09:45:12Z

include/cufinufft/common.h

@@ -7,27 +7,37 @@
 #include <finufft_errors.h>
 #include <finufft_spread_opts.h>

-#include <complex.h>
+#include <complex>
+#include <optional>


janden · 2024-09-03T09:49:15Z

include/cufinufft/defs.h

@@ -1,15 +1,18 @@
 #ifndef CUFINUFFT_DEFS_H
 #define CUFINUFFT_DEFS_H

+#include <complex>


include/cufinufft/types.h

janden · 2024-09-03T10:12:39Z

include/cufinufft/types.h

  cuda_complex<T> *c;
  cuda_complex<T> *fw;
  cuda_complex<T> *fk;

+  // Type 3 specific
+  struct {
+    T X1, C1, S1, D1, h1, gam1; // x dim: X=halfwid C=center D=freqcen h,gam=rescale


include/cufinufft/utils.h

DiamonDinoia · 2024-09-04T21:21:21Z

Integrated most of the requested changes. Commented on the PR when not applicable.
On the timing, it may be worth relying on the NVIDIA profiler, as it provides nice output and ships with CUDA. Otherwise I will think about it, since one has to use the CUDA event API to measure the time.

The fseries should be reworked in a separate PR, as it can be combined with the flipwind changes.

DiamonDinoia added 30 commits July 3, 2024 09:43

basic benchmarks

45333fa

added plotting script

b95a082

optimised plotting

ae55ca5

fixed plotting and metrics

16e27f0

fixed the plot script

49d1f21

bin_size_x is as function of the shared memory available

2fdae68

bin_size_x is as function of the shared memory available

c0d9923

minor optimizations in 1D

907797c

otpimized nupts driven

60f4780

Optimized 1D and 2D

35dcc66

Merge branch 'master' into gpu-optimizations

e1ad9bb

3D integer operations

366295d

3D SM and GM optimized

24bf6be

bump cuda version

960117a

Merge remote-tracking branch 'flatiron/master' into gpu-optimizations

4295a86

changed matlab to generate necessary cuda upsampfact files

c1b14c6

added new coeffs

f300d2d

Merge remote-tracking branch 'refs/remotes/origin/gpu-optimizations' …

e86c762

…into gpu-optimizations

restoring .m from master

db0457a

updated hook

d0ce11e

updated matlab upsampfact

513ce4b

updated coefficients

798717d

new coeffs

282baf5

updated cufinufft to new coeff

12822a2

Merge remote-tracking branch 'flatiron/master' into gpu-optimizations

badf22f

Merge remote-tracking branch 'flatiron/master' into gpu-optimizations

bf6328b

picked good defaults for method

ae783da

update configuration

d29fcf5

upated build system

73f937b

fixing jenkins

0724866

DiamonDinoia added 5 commits August 28, 2024 12:34

XMerge remote-tracking branch 'flatiron/master' into gpu-type-3

d29cbba

added extended lambda flag to tests

71ad464

CleanUP

a494518

Updated changelog

5788320

fixed printf warning

4c7388e

DiamonDinoia marked this pull request as ready for review August 28, 2024 18:01

DiamonDinoia requested review from lu1and10, blackwer, janden and ahbarnett August 28, 2024 18:01

DiamonDinoia added 4 commits August 28, 2024 15:08

restored fftw behaviour

46eb1d4

Added devnotes on the issue

0ada7a0

removed sprurious changes

671e4ac

Minor cleanup

7a7cff5

fixed math test

9b0da66

ahbarnett requested changes Aug 30, 2024

janden requested changes Sep 3, 2024

Addressed review comments

d3d4d34

DiamonDinoia requested review from janden and ahbarnett September 4, 2024 21:21

DiamonDinoia added 4 commits September 11, 2024 16:11

Merge remote-tracking branch 'flatiron/master' into gpu-type-3

52cd6cc

splitting onedim_f_series in two functions

1355818

GPU flipwind type 1-2; fseries and nuft renaming to match CPU code

bc64a92

fixed complex math test

96980d3

DiamonDinoia merged commit 481b70e into flatironinstitute:master Sep 12, 2024
166 of 167 checks passed

DiamonDinoia deleted the gpu-type-3 branch September 12, 2024 17:07

ahbarnett mentioned this pull request Sep 13, 2024

Type 3 on GPU #489

Closed
