Optimising foldrescale #440

DiamonDinoia · 2024-05-08T20:56:15Z

Optimized foldrescale using @mreineck's suggestions. This removed the range limitation and made the code faster. Results are in the comments.

List of changes:

DiamonDinoia · 2024-05-08T21:01:25Z

Performance: ./spreadtestnd 3 1e07 10e7 1e-6 1 0 1
Before:

setup_spreader (kerevalmeth=1) eps=1e-06 sigma=2: chose ns=7 beta=16.1
	sorted (1 threads):	0.00121 s
	spread 3D (M=1; N1=464,N2=464,N3=464; pir=0), nthr=1
	zero output array	0.212 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	3e-05 s (1 subprobs)
making random data...
spreadinterp 3D, 9.99e+07 U pts, dir=1, tol=1e-06: nspread=7
	sorted (1 threads):	0.332 s
	spread 3D (M=10000000; N1=464,N2=464,N3=464; pir=0), nthr=1
	zero output array	0.212 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	6.45 s (100 subprobs)
    1e+07 NU pts in 7 s 	1.43e+06 pts/s 	4.9e+08 spread pts/s
    rel err in total over grid:      3.28e-07
making more random NU pts...
spreadinterp 3D, 9.99e+07 U pts, dir=2, tol=1e-06: nspread=7
	sorted (1 threads):	0.364 s
	interp 3D (M=10000000; N1=464,N2=464,N3=464; pir=0), nthr=1
	t2 spreading loop: 	4.75 s
    1e+07 NU pts in 5.11 s 	1.96e+06 pts/s 	6.71e+08 spread pts/s
    max rel err in values at NU pts: 3.39e-06

Using the new FOLDRESCALE, force inlining:

setup_spreader (kerevalmeth=1) eps=1e-06 sigma=2: chose ns=7 beta=16.1
	sorted (1 threads):	0.00127 s
	spread 3D (M=1; N1=464,N2=464,N3=464; pir=0), nthr=1
	zero output array	0.216 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	2.9e-05 s (1 subprobs)
making random data...
spreadinterp 3D, 9.99e+07 U pts, dir=1, tol=1e-06: nspread=7
	sorted (1 threads):	0.314 s
	spread 3D (M=10000000; N1=464,N2=464,N3=464; pir=0), nthr=1
	zero output array	0.215 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	5.8 s (100 subprobs)
    1e+07 NU pts in 6.33 s 	1.58e+06 pts/s 	5.42e+08 spread pts/s
    rel err in total over grid:      3.28e-07
making more random NU pts...
spreadinterp 3D, 9.99e+07 U pts, dir=2, tol=1e-06: nspread=7
	sorted (1 threads):	0.313 s
	interp 3D (M=10000000; N1=464,N2=464,N3=464; pir=0), nthr=1
	t2 spreading loop: 	4.75 s
    1e+07 NU pts in 5.07 s 	1.97e+06 pts/s 	6.77e+08 spread pts/s
    max rel err in values at NU pts: 3.39e-06

According to the run below, the change made the code slower. However, this is not a fair comparison, because the test now runs with pirange=1 instead of 0, which is slower (see #436):

setup_spreader (kerevalmeth=1) eps=1e-06 sigma=2: chose ns=7 beta=16.1
	sorted (1 threads):	0.00172 s
	spread 3D (M=1; N1=464,N2=464,N3=464), nthr=1
	zero output array	0.214 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	3.5e-05 s (1 subprobs)
making random data...
spreadinterp 3D, 9.99e+07 U pts, dir=1, tol=1e-06: nspread=7
	sorted (1 threads):	0.313 s
	spread 3D (M=10000000; N1=464,N2=464,N3=464), nthr=1
	zero output array	0.214 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	6.77 s (100 subprobs)
    1e+07 NU pts in 7.29 s 	1.37e+06 pts/s 	4.7e+08 spread pts/s
    rel err in total over grid:      8.88e-07
making more random NU pts...
spreadinterp 3D, 9.99e+07 U pts, dir=2, tol=1e-06: nspread=7
	sorted (1 threads):	0.316 s
	interp 3D (M=10000000; N1=464,N2=464,N3=464), nthr=1
	t2 spreading loop: 	5.41 s
    1e+07 NU pts in 5.73 s 	1.75e+06 pts/s 	5.99e+08 spread pts/s
    max rel err in values at NU pts: 3.36e-06

ahbarnett · 2024-05-09T18:01:10Z

CMakeLists.txt

@@ -31,6 +31,7 @@ option(FINUFFT_USE_OPENMP "Whether to use OpenMP for parallelization. If disable
 option(FINUFFT_USE_CUDA "Whether to build CUDA accelerated FINUFFT library (libcufinufft). This is completely independent of the main FINUFFT library" OFF)
 option(FINUFFT_USE_CPU "Whether to build the ordinary FINUFFT library (libfinufft)." ON)
 option(FINUFFT_STATIC_LINKING "Whether to link the static FINUFFT library (libfinufft_static)." ON)
+option(FINUFTT_BUILD_DEVEL "Whether to build developement executables" OFF)


typo FINUFFT

also typo development

CMakeLists.txt

ahbarnett

Just some minor tweaks. The main one I'd prefer to revert is the introduction of all the DEFAULT macros. It is a matter of taste (I can see potential advantages were we to test whether the user has changed an opt from its default), but the disadvantage of unreadability is worse, to my mind. It has also broken the Sphinx doc system, which serves that piece of source code to the docs. Currently the only benefit is in testing whether a single deprecated opt has been changed from its default (and see below). I would prefer to leave that default hard-coded as 1, and test whether it changed from 1 where the warning is given. In general, if we as the devs want to insert code to test whether the user has changed an opt from its default, we can create a new struct, call finufft_default_opts() on it, and then compare that field.

Would you be able to revert just the DEFAULT macros aspect (if no-one else has major feelings)? The rest is great, and you can proceed with killing pirange (I haven't checked docs/*.rst yet; we will all have to). Thanks! Alex
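The "fresh struct" approach described above can be sketched as follows. This is only an illustration: `sketch_opts`, `sketch_default_opts` and `chkbnds_changed` are hypothetical stand-ins (the real `finufft_opts` has many more fields), and the `modeord` value is a placeholder.

```cpp
#include <cassert>

// Hypothetical stand-in for finufft_opts; only two fields are shown.
struct sketch_opts {
  int chkbnds;
  int modeord;
};

// Stand-in for finufft_default_opts(): defaults stay hard-coded and readable.
inline void sketch_default_opts(sketch_opts *o) {
  assert(o != nullptr);
  o->chkbnds = 1; // deprecated opt; default of 1 as stated in the review
  o->modeord = 0; // placeholder value for illustration only
}

// True if the user changed the deprecated field away from its default.
// A fresh struct is filled with defaults and the single field is compared.
inline bool chkbnds_changed(const sketch_opts &user) {
  sketch_opts def;
  sketch_default_opts(&def);
  return user.chkbnds != def.chkbnds;
}
```

With this pattern no named macro is needed: the warning site calls something like `chkbnds_changed(opts)` and the default remains visible as a literal in the defaults function.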

ahbarnett · 2024-05-09T18:04:55Z

include/finufft_opts.h

@@ -5,6 +5,30 @@
 #ifndef FINUFFT_OPTS_H
 #define FINUFFT_OPTS_H

+// Marco Barbone: 5.8.2024
+// These are user-facing to that the user can reset to the default value


The user would have to recompile finufft lib to change these, correct? Couldn't they also just change finufft_default_opts() function themselves instead?

To change the macros? They are not supposed to be changed.
I think having values lying around instead of a named macro or constant is error-prone. If we start adding or removing options, this can get out of hand. I would rather pay the price now and have something like this than delay it to the future and let the dust pile up. @blackwer what do you think?

I'm on the fence. The macros are a lot of pollution, but it's nice having everything in one place, especially when you need it multiple times. My inclination is to make a constexpr finufft_opts FINUFFT_DEFAULT_OPTS{...}; in a header so that it can be autodocced and essentially namespace itself, grouping all relevant opts under the same header (alternatively, prefixing FINUFFT_DEFAULT_OPTS_{OPTION} is reasonable, if we fix the doc issue). With the constexpr, finufft_default_opts() can just copy it. When you need a comparison, it won't pay the cost of a lookup (as compared to a static variable somewhere). There might be a problem I'm overlooking, but thoughts?
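A minimal sketch of the constexpr idea, under stated assumptions: `demo_opts`, `DEMO_DEFAULT_OPTS` and `demo_default_opts` are hypothetical names, and only a handful of fields are shown. The values for `maxbatchsize`, `spread_nthr_atomic` and `spread_max_sp_size` are taken from the diff in this PR; the others are placeholders.

```cpp
#include <cassert>

// Hypothetical stand-in for finufft_opts with a few illustrative fields.
struct demo_opts {
  int modeord;
  int chkbnds;
  int maxbatchsize;
  int spread_nthr_atomic;
  int spread_max_sp_size;
};

// All defaults grouped in one constexpr object (values in declaration order).
constexpr demo_opts DEMO_DEFAULT_OPTS{0, 1, 0, -1, 0};

// The defaults function simply copies the constant.
inline void demo_default_opts(demo_opts *o) {
  assert(o != nullptr);
  *o = DEMO_DEFAULT_OPTS;
}

// Comparing against a default reads a compile-time constant.
inline bool chkbnds_is_default(const demo_opts &o) {
  return o.chkbnds == DEMO_DEFAULT_OPTS.chkbnds;
}
```

One point to check before adopting this: if the opts header must stay usable from C, the constexpr object would have to live in a C++-only header.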

ahbarnett · 2024-05-09T18:05:54Z

src/finufft.cpp

@@ -136,6 +136,13 @@ int setup_spreader_for_nufft(finufft_spread_opts &spopts, FLT eps, finufft_opts
    spopts.atomic_threshold = opts.spread_nthr_atomic;
  if (opts.spread_max_sp_size>0)      // overrides
    spopts.max_subproblem_size = opts.spread_max_sp_size;
+  if (opts.chkbnds != FINUFFT_CHKBND_DEFAULT) {


OK, I see why the default macros are needed in this case.

ahbarnett · 2024-05-09T18:07:47Z

src/finufft.cpp

-  o->maxbatchsize = 0;
-  o->spread_nthr_atomic = -1;
-  o->spread_max_sp_size = 0;
+  o->modeord = FINUFFT_MODEORD_DEFAULT;


I find this now too abstract and hard to track down for the user reading the code - also the defaults cannot be viewed in the sphinx docs (note tags). Let's decide about this...

ahbarnett · 2024-05-09T18:09:22Z

src/spreadinterp.cpp

@@ -178,11 +160,9 @@ int spreadcheck(BIGINT N1, BIGINT N2, BIGINT N3, BIGINT M, FLT *kx, FLT *ky,
 /* This does just the input checking and reporting for the spreader.
   See spreadinterp() for input arguments and meaning of returned value.
   Split out by Melody Shih, Jun 2018. Finiteness chk Barnett 7/30/18.
-   Bypass FOLDRESCALE macro which has inevitable rounding err even nr +pi,
-   giving fake invalids well inside the [-3pi,3pi] domain, 4/9/21.
+   Marco Barbone 5.8.24 removed bounds check as new foldrescale is not limited to [-3pi,3pi)


good thks for the docs

ahbarnett · 2024-05-09T18:10:56Z

src/spreadinterp.cpp

-    kx,ky,kz - length-M arrays of real coords of NU pts, in the domain
-               for FOLDRESCALE, which includes [0,N1], [0,N2], [0,N3]
-               respectively, if opts.pirange=0; or [-pi,pi] if opts.pirange=1.
+    kx,ky,kz - length-M arrays of real coords of NU pts.


for this PR we'll want to keep that doc comment (pirange not gone yet:)

ahbarnett · 2024-05-09T18:11:54Z

test/dumbinputs.cpp

@@ -21,6 +21,7 @@
   Either precision with dual-prec lib funcs 7/3/20.
   Added a chkbnds case to 1d1, 4/9/21.
   Made pass-fail, obviating results/dumbinputs.refout. Barnett 6/16/23.
+   Removed the chkbnds case to 1d1, 05/08/2024.


good, nice catch.

ahbarnett · 2024-05-09T18:12:57Z

src/spreadinterp.cpp

-
-  }   // namespace
+/* local NU coord fold+rescale macro: does the following affine transform to x:
+     when p=true:   x mod PI    each to [0,N)


you mean (x+PI) mod 2PI
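For reference, the corrected transform, ((x+PI) mod 2PI) rescaled to [0,N), can be written with a floor so that it works for any real x. The following is a sketch for illustration and not necessarily the code in this PR; `fold_rescale_sketch` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Sketch only: fold any real x periodically (period 2*pi) and rescale to [0, N).
// Mathematically this is ((x + pi) mod 2*pi) * N / (2*pi).
// Note: in floating point the result can round to exactly N in rare cases,
// so production code would need to guard against that.
template <typename T>
inline T fold_rescale_sketch(T x, long N) {
  assert(N > 0);
  const T inv_two_pi = T(0.5) / std::acos(T(-1)); // 1/(2*pi)
  const T r = x * inv_two_pi + T(0.5);            // x = -pi -> 0, x = 0 -> 0.5
  return (r - std::floor(r)) * T(N);              // fractional part, scaled to grid
}
```

Because the fractional part is taken via `floor`, negative inputs and inputs far outside [-pi, pi) need no special handling, which is what removes the old [-3pi, 3pi) restriction.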

ahbarnett · 2024-05-09T18:23:00Z

Re docs, I saw docs/math.rst still has 3pi in it. Maybe there are other places to remove from the docs (matlab, etc)? Did you findgrep on 3pi or 3\pi or 3 pi or 3 \pi ? :)

docs/cguru.doc

docs/cguru.docsrc

docs/matlabhelp.doc

matlab/finufft1d1.m

lu1and10

Looks good!
minor typo in .m files and matlabhelp.doc, outsied->outside

python/finufft/finufft/_interfaces.py

lu1and10

Thanks! Almost done! I just found some cleanup needed in the comments.

lu1and10 · 2024-05-10T19:25:17Z

docs/opts.rst

+<<<<<<< Updated upstream
+**chkbnds**: [DEPRECATED] has no effect.

+=======
+**chkbnds**: [DEPRECATED] It does nothing now.
+>>>>>>> Stashed changes


Is this from a merge conflict?

lu1and10 · 2024-05-10T21:31:39Z

docs/cguru.doc

-       type 2, "targets". In contrast, for type 3 there are no restrictions on
-       them, or on s, t, u, other than the resulting size of the internal fine
+     * For type 1 and 2, the values in x (and if nonempty, y and z) can be in any
+       interval, they will be folded to [-pi, pi]. Note: for large numbers outside


there are some places using [-pi, pi], and some using [-pi, pi), maybe stick to [-pi, pi)?

lu1and10 · 2024-05-10T21:31:46Z

CHANGELOG

@@ -9,6 +9,7 @@ If not stated, FINUFFT is assumed (cuFINUFFT <=1.3 is listed separately).
 * MAX_NF increased from 1e11 to 1e12, since machines grow.
 * improved GPU python docs: migration guide; usage from cupy, numba, torch,
  pycuda. PyPI pkg still at 2.2.0beta.
+* Used new foldrescale and removed tests for the range


also pirange is removed?

lu1and10 · 2024-05-10T21:32:16Z

src/spreadinterp.cpp

+<<<<<<< Updated upstream
    kx,ky,kz - length-M arrays of real coords of NU pts, in the domain
               for FOLDRESCALE, which includes [0,N1], [0,N2], [0,N3]
               respectively, if opts.pirange=0; or [-pi,pi] if opts.pirange=1.
+=======
+    kx,ky,kz - length-M arrays of real coords of NU pts. Domain is [-pi, pi],
+                points outside are folded in.
+>>>>>>> Stashed changes


seems also from git

ahbarnett

I may tweak some docs once in master

ahbarnett · 2024-05-14T02:33:49Z

perftest/spreadtestnd.cpp

@@ -146,10 +144,10 @@ int main(int argc, char* argv[])
    unsigned int se=MY_OMP_GET_THREAD_NUM();  // needed for parallel random #s
 #pragma omp for schedule(dynamic,1000000) reduction(+:strre,strim)
    for (BIGINT i=0; i<M; ++i) {
-      kx[i]=rand01r(&se)*N;
+      kx[i]=randm11r(&se)*3*M_PI;


should be randm11r * M_PI/2 here, unless you explicitly want to test folding.

etc., same 5x below. I can fix this minor tweak after merging.

Ok, Marco says this is to test folding, deliberately.
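Sampling beyond [-pi, pi) exercises the property that a periodic fold maps x and x ± 2pi to the same grid coordinate. A minimal sketch of such a check is given here; `fold_to_grid` and `max_fold_mismatch` are hypothetical helpers written for this illustration, and `std::mt19937` stands in for the test's own `randm11r`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <initializer_list>
#include <random>

// Hypothetical helper: fold x (any real) periodically onto [0, N).
inline double fold_to_grid(double x, long N) {
  const double two_pi = 2.0 * std::acos(-1.0);
  const double r = x / two_pi + 0.5;
  return (r - std::floor(r)) * static_cast<double>(N);
}

// Draw M points in [-pi, pi), shift each by -2*pi and +2*pi (so the shifted
// points cover [-3*pi, 3*pi)), and return the worst circular mismatch on the grid.
inline double max_fold_mismatch(long M, long N, unsigned seed) {
  assert(M > 0 && N > 0);
  const double pi = std::acos(-1.0);
  std::mt19937 gen(seed);
  std::uniform_real_distribution<double> u(-pi, pi);
  double worst = 0.0;
  for (long i = 0; i < M; ++i) {
    const double x = u(gen);
    const double base = fold_to_grid(x, N);
    for (double shift : {-2.0 * pi, 2.0 * pi}) {
      double d = std::fabs(fold_to_grid(x + shift, N) - base);
      d = std::min(d, static_cast<double>(N) - d); // distance on the periodic grid
      worst = std::max(worst, d);
    }
  }
  return worst;
}
```

A mismatch at the level of rounding error indicates that the fold is periodic as intended; the circular distance avoids false alarms for points that land next to the wrap-around at 0 and N.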

ahbarnett · 2024-05-14T02:35:41Z

examples/cuda/getting_started.cpp

@@ -53,7 +53,7 @@ int main() {
    c = (float _Complex *)malloc(M * sizeof(float _Complex));
    f = (float _Complex *)malloc(N * sizeof(float _Complex));

-    // Fill with random numbers. Frequencies must be in the interval [-pi, pi]
+    // Fill with random numbers. Frequencies must be in the interval [-pi, pi)


freqs can be anything due to folding. Just a minor tweak I can fix after merging.

ahbarnett · 2024-05-14T19:51:48Z

docs/matlabhelp.doc

@@ -11,7 +11,9 @@
     f(k1) =  SUM c[j] exp(+/-i k1 x(j))  for -ms/2 <= k1 <= (ms-1)/2
              j=1
   Inputs:
-     x     locations of nonuniform sources on interval [-3pi,3pi), length nj
+     x     locations of nonuniform sources on interval [-pi, pi) length nj.


If you read docs/README (or makefile) you'll see this file docs/matlabhelp.doc is overwritten in the make docs process. So, will go away.

ahbarnett · 2024-05-14T22:44:11Z

PR #440 tests on AMD laptop 5700U CPU (8-core)

We pick tests in 1D with very poor tol (so that spreading is negligible)

MASTER branch 79de0847 :  ........................................

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=1 perftest/spreadtestnd 1 1e7 1e6 1e-1 1 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	0.000317 s
	spread 1D (M=1; N1=1000000,N2=1,N3=1; pir=0), nthr=1
	zero output array	0.00144 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	2.3e-05 s (1 subprobs)
making random data...
spreadinterp 1D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (1 threads):	0.136 s
	spread 1D (M=10000000; N1=1000000,N2=1,N3=1; pir=0), nthr=1
	zero output array	0.00144 s
	capping subproblem sizes to max of 10000
	t1 fancy spread: 	0.237 s (1000 subprobs)
    1e+07 NU pts in 0.382 s 	2.62e+07 pts/s 	5.24e+07 spread pts/s
    rel err in total over grid:      0.04
making more random NU pts...
spreadinterp 1D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	sorted (1 threads):	0.133 s
	interp 1D (M=10000000; N1=1000000,N2=1,N3=1; pir=0), nthr=1
	t2 spreading loop: 	0.339 s
    1e+07 NU pts in 0.478 s 	2.09e+07 pts/s 	4.18e+07 spread pts/s
    max rel err in values at NU pts: 0.0954

[note for single-thread t2: sorting helps, but default opt=2 doesn't choose it]

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=8 perftest/spreadtestnd 1 1e8 1e6 1e-1 2 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	0.000287 s
	spread 1D (M=1; N1=1000000,N2=1,N3=1; pir=0), nthr=8
	zero output array	0.00139 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	0.000771 s (1 subprobs)
making random data...
spreadinterp 1D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (8 threads):	0.631 s
	spread 1D (M=100000000; N1=1000000,N2=1,N3=1; pir=0), nthr=8
	zero output array	0.00154 s
	capping subproblem sizes to max of 10000
	t1 fancy spread: 	1.04 s (10000 subprobs)
    1e+08 NU pts in 1.77 s 	5.66e+07 pts/s 	1.13e+08 spread pts/s
    rel err in total over grid:      0.0303
making more random NU pts...
spreadinterp 1D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	not sorted (sort=2): 	0.0647 s
	interp 1D (M=100000000; N1=1000000,N2=1,N3=1; pir=0), nthr=8
	t2 spreading loop: 	0.769 s
    1e+08 NU pts in 0.905 s 	1.1e+08 pts/s 	2.21e+08 spread pts/s
    max rel err in values at NU pts: 0.0954

[note for multi-thread t2: sorting doesn't help and default opt=2 doesn't choose it... good]

fold PR #440 ..........................................

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=1 perftest/spreadtestnd 1 1e7 1e6 1e-1 1 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	0.000316 s
	spread 1D (M=1; N1=1000000,N2=1,N3=1), nthr=1
	zero output array	0.00142 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	3.4e-05 s (1 subprobs)
making random data...
spreadinterp 1D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (1 threads):	0.136 s
	spread 1D (M=10000000; N1=1000000,N2=1,N3=1), nthr=1
	zero output array	0.00145 s
	capping subproblem sizes to max of 10000
	t1 fancy spread: 	0.223 s (1000 subprobs)
    1e+07 NU pts in 0.367 s 	2.72e+07 pts/s 	5.44e+07 spread pts/s
    rel err in total over grid:      0.0475
making more random NU pts...
spreadinterp 1D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	sorted (1 threads):	0.134 s
	interp 1D (M=10000000; N1=1000000,N2=1,N3=1), nthr=1
	t2 spreading loop: 	0.308 s
    1e+07 NU pts in 0.448 s 	2.23e+07 pts/s 	4.46e+07 spread pts/s
    max rel err in values at NU pts: 0.0954

(base) alex@ross /home/alex/numerics/finufft>  OMP_NUM_THREADS=8 perftest/spreadtestnd 1 1e8 1e6 1e-1 2 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	0.00028 s
	spread 1D (M=1; N1=1000000,N2=1,N3=1), nthr=8
	zero output array	0.00137 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	0.000328 s (1 subprobs)
making random data...
spreadinterp 1D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (8 threads):	0.634 s
	spread 1D (M=100000000; N1=1000000,N2=1,N3=1), nthr=8
	zero output array	0.00137 s
	capping subproblem sizes to max of 10000
	t1 fancy spread: 	1.04 s (10000 subprobs)
    1e+08 NU pts in 1.77 s 	5.65e+07 pts/s 	1.13e+08 spread pts/s
    rel err in total over grid:      0.0477
making more random NU pts...
spreadinterp 1D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	not sorted (sort=2): 	0.064 s
	interp 1D (M=100000000; N1=1000000,N2=1,N3=1), nthr=8
	t2 spreading loop: 	0.759 s
    1e+08 NU pts in 0.895 s 	1.12e+08 pts/s 	2.24e+08 spread pts/s
    max rel err in values at NU pts: 0.0954

............................

1D Concl: single-thread 7% speedup interp (dir=2) - nothing to do with sorting
                        5% speedup spread dir=1.
          multi-thread  no significant change (~1% level).       

Also noted: PR #440 compile time for spreadinterp.o is 10x longer than before (~5 sec)


=================================================
3D tests: (poor tol to give foldrescale a chance to shine; 3 coords done each NU pt):

MASTER:

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=1 perftest/spreadtestnd 3 1e7 1e6 1e-1 1 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	2.9e-05 s
	spread 3D (M=1; N1=100,N2=100,N3=100; pir=0), nthr=1
	zero output array	0.00141 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	2.5e-05 s (1 subprobs)
making random data...
spreadinterp 3D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (1 threads):	0.137 s
	spread 3D (M=10000000; N1=100,N2=100,N3=100; pir=0), nthr=1
	zero output array	0.00136 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	0.782 s (100 subprobs)
    1e+07 NU pts in 0.927 s 	1.08e+07 pts/s 	8.63e+07 spread pts/s
    rel err in total over grid:      0.189
making more random NU pts...
spreadinterp 3D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	sorted (1 threads):	0.134 s
	interp 3D (M=10000000; N1=100,N2=100,N3=100; pir=0), nthr=1
	t2 spreading loop: 	0.752 s
    1e+07 NU pts in 0.892 s 	1.12e+07 pts/s 	8.97e+07 spread pts/s
    max rel err in values at NU pts: 0.315
    
(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=8 perftest/spreadtestnd 3 1e8 1e6 1e-1 2 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	1.7e-05 s
	spread 3D (M=1; N1=100,N2=100,N3=100; pir=0), nthr=8
	zero output array	0.00147 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	0.000397 s (1 subprobs)
making random data...
spreadinterp 3D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (8 threads):	0.315 s
	spread 3D (M=100000000; N1=100,N2=100,N3=100; pir=0), nthr=8
	zero output array	0.00138 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	1.91 s (1000 subprobs)
    1e+08 NU pts in 2.32 s 	4.31e+07 pts/s 	3.45e+08 spread pts/s
    rel err in total over grid:      0.165
making more random NU pts...
spreadinterp 3D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	sorted (8 threads):	0.311 s
	interp 3D (M=100000000; N1=100,N2=100,N3=100; pir=0), nthr=8
	t2 spreading loop: 	2.04 s
    1e+08 NU pts in 2.45 s 	4.08e+07 pts/s 	3.26e+08 spread pts/s
    max rel err in values at NU pts: 0.315


PR #440:

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=1 perftest/spreadtestnd 3 1e7 1e6 1e-1 1 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	2e-05 s
	spread 3D (M=1; N1=100,N2=100,N3=100), nthr=1
	zero output array	0.00142 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	3.3e-05 s (1 subprobs)
making random data...
spreadinterp 3D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (1 threads):	0.136 s
	spread 3D (M=10000000; N1=100,N2=100,N3=100), nthr=1
	zero output array	0.00135 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	0.794 s (100 subprobs)
    1e+07 NU pts in 0.937 s 	1.07e+07 pts/s 	8.53e+07 spread pts/s
    rel err in total over grid:      0.143
making more random NU pts...
spreadinterp 3D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	sorted (1 threads):	0.135 s
	interp 3D (M=10000000; N1=100,N2=100,N3=100), nthr=1
	t2 spreading loop: 	0.687 s
    1e+07 NU pts in 0.829 s 	1.21e+07 pts/s 	9.65e+07 spread pts/s
    max rel err in values at NU pts: 0.315
    
(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=8 perftest/spreadtestnd 3 1e8 1e6 1e-1 2 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	1.8e-05 s
	spread 3D (M=1; N1=100,N2=100,N3=100), nthr=8
	zero output array	0.0014 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	0.000358 s (1 subprobs)
making random data...
spreadinterp 3D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (8 threads):	0.31 s
	spread 3D (M=100000000; N1=100,N2=100,N3=100), nthr=8
	zero output array	0.00132 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	1.92 s (1000 subprobs)
    1e+08 NU pts in 2.33 s 	4.29e+07 pts/s 	3.43e+08 spread pts/s
    rel err in total over grid:      0.167
making more random NU pts...
spreadinterp 3D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	sorted (8 threads):	0.319 s
	interp 3D (M=100000000; N1=100,N2=100,N3=100), nthr=8
	t2 spreading loop: 	2.02 s
    1e+08 NU pts in 2.44 s 	4.1e+07 pts/s 	3.28e+08 spread pts/s
    max rel err in values at NU pts: 0.315

concl: single-thread: spread no change; interp is 9% faster
       8-thread :    spread no change; interp no change.

Overall: only affects single-core perf, and by 9% or less.

(Of course, advantage of no 3pi-restriction is good too)
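The comment does not say how the percentages were derived from the logs. Two plausible measures are sketched here as an assumption: the relative reduction in a timed stage, and the relative gain in the reported pts/s. `time_saving` and `throughput_gain` are hypothetical helper names.

```cpp
#include <cassert>

// Fractional reduction in wall time going from t_before to t_after (seconds).
inline double time_saving(double t_before, double t_after) {
  assert(t_before > 0.0);
  return (t_before - t_after) / t_before;
}

// Fractional gain in throughput (pts/s) going from before to after.
inline double throughput_gain(double before, double after) {
  assert(before > 0.0);
  return after / before - 1.0;
}
```

For example, the 3D single-thread interp loop goes from 0.752 s to 0.687 s, a time saving of about 8.6%, while the 1D single-thread interp throughput goes from 2.09e+07 to 2.23e+07 pts/s, a gain of about 6.7%.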

lu1and10 reviewed May 8, 2024

ahbarnett reviewed May 9, 2024
