Optimising foldrescale #440

Merged

DiamonDinoia
Collaborator

@DiamonDinoia DiamonDinoia commented May 8, 2024

Optimized foldrescale using @mreineck's suggestions. This removed the range limitation and made the code faster. Results are in the comments.

List of changes:

  • Integrate @mreineck version
  • Change from macro to inline function
  • remove bounds checking
  • deprecate chkbnds
  • update docs for chkbnds
  • change tests to remove chkbnds
  • remove pirange
  • update docs for pirange
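
For readers unfamiliar with the idea, a fold-and-rescale step written as an inline function with no range restriction can be sketched as below. This is a minimal illustration under stated assumptions, not the code merged in this PR: the name `fold_rescale_sketch`, the template parameter and the `std::int64_t` grid size are all hypothetical. It treats the input as 2π-periodic, keeps the fractional part of the shifted coordinate, and scales it to grid units.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Illustrative sketch only (hypothetical name, not the merged code).
// Folds any finite x, treated as 2*pi-periodic, into one period and rescales
// it to grid units in [0, N), up to rounding at the period boundary.
template <typename T>
inline T fold_rescale_sketch(T x, std::int64_t N) noexcept {
  assert(N > 0);
  constexpr T inv_2pi = T(0.159154943091895335768883763372514362L);  // 1/(2*pi)
  const T t = x * inv_2pi + T(0.5);   // x = -pi maps to 0, x = +pi maps to 1
  return (t - std::floor(t)) * T(N);  // keep the fractional part, then scale
}
```

Because the fractional part is taken with `std::floor`, no branch on the input range is needed, which is the property that lets the old bounds check go away. Inputs that land exactly on a period boundary are subject to rounding, so a caller that needs a strict half-open interval should guard that case separately.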

@DiamonDinoia
Collaborator Author

DiamonDinoia commented May 8, 2024

Performance: ./spreadtestnd 3 1e07 10e7 1e-6 1 0 1
Before:

setup_spreader (kerevalmeth=1) eps=1e-06 sigma=2: chose ns=7 beta=16.1
	sorted (1 threads):	0.00121 s
	spread 3D (M=1; N1=464,N2=464,N3=464; pir=0), nthr=1
	zero output array	0.212 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	3e-05 s (1 subprobs)
making random data...
spreadinterp 3D, 9.99e+07 U pts, dir=1, tol=1e-06: nspread=7
	sorted (1 threads):	0.332 s
	spread 3D (M=10000000; N1=464,N2=464,N3=464; pir=0), nthr=1
	zero output array	0.212 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	6.45 s (100 subprobs)
    1e+07 NU pts in 7 s 	1.43e+06 pts/s 	4.9e+08 spread pts/s
    rel err in total over grid:      3.28e-07
making more random NU pts...
spreadinterp 3D, 9.99e+07 U pts, dir=2, tol=1e-06: nspread=7
	sorted (1 threads):	0.364 s
	interp 3D (M=10000000; N1=464,N2=464,N3=464; pir=0), nthr=1
	t2 spreading loop: 	4.75 s
    1e+07 NU pts in 5.11 s 	1.96e+06 pts/s 	6.71e+08 spread pts/s
    max rel err in values at NU pts: 3.39e-06

Using the new FOLDRESCALE, force inlining:

setup_spreader (kerevalmeth=1) eps=1e-06 sigma=2: chose ns=7 beta=16.1
	sorted (1 threads):	0.00127 s
	spread 3D (M=1; N1=464,N2=464,N3=464; pir=0), nthr=1
	zero output array	0.216 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	2.9e-05 s (1 subprobs)
making random data...
spreadinterp 3D, 9.99e+07 U pts, dir=1, tol=1e-06: nspread=7
	sorted (1 threads):	0.314 s
	spread 3D (M=10000000; N1=464,N2=464,N3=464; pir=0), nthr=1
	zero output array	0.215 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	5.8 s (100 subprobs)
    1e+07 NU pts in 6.33 s 	1.58e+06 pts/s 	5.42e+08 spread pts/s
    rel err in total over grid:      3.28e-07
making more random NU pts...
spreadinterp 3D, 9.99e+07 U pts, dir=2, tol=1e-06: nspread=7
	sorted (1 threads):	0.313 s
	interp 3D (M=10000000; N1=464,N2=464,N3=464; pir=0), nthr=1
	t2 spreading loop: 	4.75 s
    1e+07 NU pts in 5.07 s 	1.97e+06 pts/s 	6.77e+08 spread pts/s
    max rel err in values at NU pts: 3.39e-06

According to the test below, the change made the code slower. However, this is not a fair comparison, because the test now runs with pirange=1 instead of 0, which is slower (see #436):

setup_spreader (kerevalmeth=1) eps=1e-06 sigma=2: chose ns=7 beta=16.1
	sorted (1 threads):	0.00172 s
	spread 3D (M=1; N1=464,N2=464,N3=464), nthr=1
	zero output array	0.214 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	3.5e-05 s (1 subprobs)
making random data...
spreadinterp 3D, 9.99e+07 U pts, dir=1, tol=1e-06: nspread=7
	sorted (1 threads):	0.313 s
	spread 3D (M=10000000; N1=464,N2=464,N3=464), nthr=1
	zero output array	0.214 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	6.77 s (100 subprobs)
    1e+07 NU pts in 7.29 s 	1.37e+06 pts/s 	4.7e+08 spread pts/s
    rel err in total over grid:      8.88e-07
making more random NU pts...
spreadinterp 3D, 9.99e+07 U pts, dir=2, tol=1e-06: nspread=7
	sorted (1 threads):	0.316 s
	interp 3D (M=10000000; N1=464,N2=464,N3=464), nthr=1
	t2 spreading loop: 	5.41 s
    1e+07 NU pts in 5.73 s 	1.75e+06 pts/s 	5.99e+08 spread pts/s
    max rel err in values at NU pts: 3.36e-06

CMakeLists.txt Outdated
@@ -31,6 +31,7 @@ option(FINUFFT_USE_OPENMP "Whether to use OpenMP for parallelization. If disable
option(FINUFFT_USE_CUDA "Whether to build CUDA accelerated FINUFFT library (libcufinufft). This is completely independent of the main FINUFFT library" OFF)
option(FINUFFT_USE_CPU "Whether to build the ordinary FINUFFT library (libfinufft)." ON)
option(FINUFFT_STATIC_LINKING "Whether to link the static FINUFFT library (libfinufft_static)." ON)
option(FINUFTT_BUILD_DEVEL "Whether to build developement executables" OFF)
Collaborator

typo FINUFFT

Member

also typo development

Collaborator

@ahbarnett ahbarnett left a comment

Just some minor tweaks. The main one I'd prefer to revert is the introduction of all the DEFAULT macros - it is a matter of taste (I can see potential advantages were we to test if the user has changed from default), but the disadvantage of unreadability is worse, to my mind. Also it has broken the sphinx doc system which actually serves that piece of source code to the docs. Currently the only help is in testing if a single deprecated opt is changed from default (and see below). I would prefer to leave that default hard-coded as 1, and test if it changed from 1 where the warning is given. In general if we as the devs want to insert code to test if the user has changed an opt from default, we can create a new struct, call finufft_default_opts() on it, and then compare that field. Would you be able to revert just the DEFAULT macros aspect (if no-one else has major feelings)? Rest is great, and you can proceed with killing pirange (I haven't checked docs/*.rst yet - we will all have to) Thanks! Alex
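
The "create a new struct, call the defaults function on it, and compare that field" approach described in this review could look roughly like the sketch below. It uses stand-in names (`demo_opts`, `demo_default_opts`, `chkbnds_changed`) rather than the real library types so that it compiles on its own; the default of 1 for `chkbnds` comes from the comment above, while the value shown for `modeord` is an assumption made for illustration.

```cpp
#include <cassert>

// Stand-in for the library's options struct; only two fields are shown.
struct demo_opts {
  int modeord;
  int chkbnds;  // deprecated in this PR
};

// Stand-in for the defaults function: values stay hard-coded in one place.
// chkbnds = 1 follows the review comment; modeord = 0 is an assumption.
inline void demo_default_opts(demo_opts *o) {
  assert(o != nullptr);
  o->modeord = 0;
  o->chkbnds = 1;
}

// True if the user changed the deprecated option away from its default.
inline bool chkbnds_changed(const demo_opts &user) {
  demo_opts ref;
  demo_default_opts(&ref);  // fresh struct holding the defaults
  return user.chkbnds != ref.chkbnds;
}
```

The trade-off is one extra struct initialisation at the point of comparison, in exchange for keeping the literal defaults readable where they are assigned.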

@@ -5,6 +5,30 @@
#ifndef FINUFFT_OPTS_H
#define FINUFFT_OPTS_H

// Marco Barbone: 5.8.2024
// These are user-facing to that the user can reset to the default value
Collaborator

The user would have to recompile finufft lib to change these, correct? Couldn't they also just change finufft_default_opts() function themselves instead?

Collaborator Author

To change the MACROS? They are not supposed to be changed.
I think having values lying around instead of a named macro or constant is error prone. If we start adding or removing options this can get out of hand. I would rather pay the price now and have something like this than delay it and let the dust pile up. @blackwer what do you think?

Member

I'm on the fence. The macros are a lot of pollution, but it's nice having everything in one place -- especially when you need it multiple times. My inclination is to make a constexpr finufft_opts FINUFFT_DEFAULT_OPTS{...}; in a header so that it can be autodocced and essentially namespace itself, grouping all relevant opts under the same header (alternatively prefixing FINUFFT_DEFAULT_OPTS_{OPTION} is reasonable, if we fix the doc issue). with the constexpr, finufft_default_opts() can just copy it. when you need a comparison it won't pay a cost of lookup (as compared to a static variable somewhere). there might be a problem I'm overlooking, but thoughts?
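
A minimal sketch of the constexpr idea floated in this comment follows. All names here (`sketch_opts`, `SKETCH_DEFAULT_OPTS`, `sketch_default_opts`, `chkbnds_is_default`) are hypothetical stand-ins, the struct shows only three fields, and the default values are taken from the snippets quoted in this thread where available and otherwise assumed.

```cpp
#include <cassert>

// Stand-in for the options struct (hypothetical subset of fields).
struct sketch_opts {
  int modeord;
  int chkbnds;
  int spread_max_sp_size;
};

// One constexpr object groups every default under a single name.
constexpr sketch_opts SKETCH_DEFAULT_OPTS{/*modeord=*/0, /*chkbnds=*/1,
                                          /*spread_max_sp_size=*/0};

// The defaults function just copies the constant.
inline void sketch_default_opts(sketch_opts *o) {
  assert(o != nullptr);
  *o = SKETCH_DEFAULT_OPTS;
}

// Comparing against a default reads a compile-time constant, not a global.
inline bool chkbnds_is_default(const sketch_opts &o) {
  return o.chkbnds == SKETCH_DEFAULT_OPTS.chkbnds;
}

static_assert(SKETCH_DEFAULT_OPTS.chkbnds == 1, "default is visible at compile time");
```

This keeps all defaults in one initialiser, which is the "everything in one place" benefit mentioned above, without adding one macro per option.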

src/finufft.cpp Outdated
@@ -136,6 +136,13 @@ int setup_spreader_for_nufft(finufft_spread_opts &spopts, FLT eps, finufft_opts
spopts.atomic_threshold = opts.spread_nthr_atomic;
if (opts.spread_max_sp_size>0) // overrides
spopts.max_subproblem_size = opts.spread_max_sp_size;
if (opts.chkbnds != FINUFFT_CHKBND_DEFAULT) {
Collaborator

ok, I see why default macros needed in this case.

src/finufft.cpp Outdated
o->maxbatchsize = 0;
o->spread_nthr_atomic = -1;
o->spread_max_sp_size = 0;
o->modeord = FINUFFT_MODEORD_DEFAULT;
Collaborator

I find this now too abstract and hard to track down for the user reading the code - also the defaults cannot be viewed in the sphinx docs (note tags). Let's decide about this...

@@ -178,11 +160,9 @@ int spreadcheck(BIGINT N1, BIGINT N2, BIGINT N3, BIGINT M, FLT *kx, FLT *ky,
/* This does just the input checking and reporting for the spreader.
See spreadinterp() for input arguments and meaning of returned value.
Split out by Melody Shih, Jun 2018. Finiteness chk Barnett 7/30/18.
Bypass FOLDRESCALE macro which has inevitable rounding err even nr +pi,
giving fake invalids well inside the [-3pi,3pi] domain, 4/9/21.
Marco Barbone 5.8.24 removed bounds check as new foldrescale is not limited to [-3pi,3pi)
Collaborator

good thks for the docs

kx,ky,kz - length-M arrays of real coords of NU pts, in the domain
for FOLDRESCALE, which includes [0,N1], [0,N2], [0,N3]
respectively, if opts.pirange=0; or [-pi,pi] if opts.pirange=1.
kx,ky,kz - length-M arrays of real coords of NU pts.
Collaborator

for this PR we'll want to keep that doc comment (pirange not gone yet:)

@@ -21,6 +21,7 @@
Either precision with dual-prec lib funcs 7/3/20.
Added a chkbnds case to 1d1, 4/9/21.
Made pass-fail, obviating results/dumbinputs.refout. Barnett 6/16/23.
Removed the chkbnds case to 1d1, 05/08/2024.
Collaborator

good, nice catch.


} // namespace
/* local NU coord fold+rescale macro: does the following affine transform to x:
when p=true: x mod PI each to [0,N)
Collaborator

you mean (x+PI) mod 2PI
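
Written out, the corrected fold for the p=true case reads as follows. This combines the reviewer's remark with the "[0,N)" target stated in the quoted comment; the scaling factor N/(2π) is an assumption about how the rescale is applied, not a quote from the source, and "mod" is taken to return a non-negative remainder.

```latex
\[
  x \;\longmapsto\; \frac{N}{2\pi}\,\bigl((x+\pi) \bmod 2\pi\bigr) \;\in\; [0, N),
  \qquad
  -\pi \mapsto 0, \quad 0 \mapsto \tfrac{N}{2}, \quad x \to \pi^{-} \mapsto N^{-}.
\]
```

With "x mod PI" the map would have period π rather than 2π, so two distinct points half a period apart would land on the same grid location.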

@ahbarnett
Collaborator

Re docs, I saw docs/math.rst still has 3pi in it. Maybe there are other places to remove from the docs (matlab, etc)? Did you findgrep on 3pi or 3\pi or 3 pi or 3 \pi ? :)

Member

@lu1and10 lu1and10 left a comment

Looks good!
minor typo in .m files and matlabhelp.doc, outsied->outside

Member

@lu1and10 lu1and10 left a comment

thanks! almost done! just found some cleanup needed in the comments.

docs/opts.rst Outdated
Comment on lines 94 to 99
<<<<<<< Updated upstream
**chkbnds**: [DEPRECATED] has no effect.

=======
**chkbnds**: [DEPRECATED] It does nothing now.
>>>>>>> Stashed changes
Member

Is this from merge conflict?

docs/cguru.doc Outdated
type 2, "targets". In contrast, for type 3 there are no restrictions on
them, or on s, t, u, other than the resulting size of the internal fine
* For type 1 and 2, the values in x (and if nonempty, y and z) can be in any
interval, they will be folded to [-pi, pi]. Note: for large numbers outside
Member

there are some places using [-pi, pi], and some using [-pi, pi), maybe stick to [-pi, pi)?

@@ -9,6 +9,7 @@ If not stated, FINUFFT is assumed (cuFINUFFT <=1.3 is listed separately).
* MAX_NF increased from 1e11 to 1e12, since machines grow.
* improved GPU python docs: migration guide; usage from cupy, numba, torch,
pycuda. PyPI pkg still at 2.2.0beta.
* Used new foldrescale and removed tests for the range
Member

also pirange is removed?

Collaborator Author

yes

Comment on lines 187 to 194
<<<<<<< Updated upstream
kx,ky,kz - length-M arrays of real coords of NU pts, in the domain
for FOLDRESCALE, which includes [0,N1], [0,N2], [0,N3]
respectively, if opts.pirange=0; or [-pi,pi] if opts.pirange=1.
=======
kx,ky,kz - length-M arrays of real coords of NU pts. Domain is [-pi, pi],
points outside are folded in.
>>>>>>> Stashed changes
Member

seems also from git

Collaborator

@ahbarnett ahbarnett left a comment

I may tweak some docs once in master

@@ -146,10 +144,10 @@ int main(int argc, char* argv[])
unsigned int se=MY_OMP_GET_THREAD_NUM(); // needed for parallel random #s
#pragma omp for schedule(dynamic,1000000) reduction(+:strre,strim)
for (BIGINT i=0; i<M; ++i) {
kx[i]=rand01r(&se)*N;
kx[i]=randm11r(&se)*3*M_PI;
Collaborator

should be randm11r * M_PI/2 here, unless you explicitly want to test folding.

Collaborator

etc same 5x below. I can fix this minor tweak after merging.

Collaborator

Ok, Marco says this is to test folding, deliberately.
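
To make the intent of deliberately wide test inputs concrete, here is a small sketch of generating points in [-3π, 3π) and checking that folding brings them back into the central period. It is not the test driver from this PR: it uses `std::mt19937` instead of the project's own random helpers, and the names `fold_to_period`, `make_wide_points` and `max_abs_folded` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: fold a 2*pi-periodic coordinate into [-pi, pi).
inline double fold_to_period(double x) {
  const double two_pi = 6.283185307179586476925286766559;
  return x - two_pi * std::floor(x / two_pi + 0.5);
}

// Draw M points uniformly in [-3*pi, 3*pi), i.e. deliberately beyond the
// central period, so that any consumer has to fold them.
inline std::vector<double> make_wide_points(std::size_t M, unsigned seed) {
  assert(M > 0);
  const double pi = 3.14159265358979323846;
  std::mt19937 gen(seed);
  std::uniform_real_distribution<double> dist(-3 * pi, 3 * pi);
  std::vector<double> x(M);
  for (auto &xi : x) xi = dist(gen);
  return x;
}

// Largest |fold(x)| over the set; should not exceed pi (up to rounding).
inline double max_abs_folded(const std::vector<double> &x) {
  double m = 0.0;
  for (double xi : x) m = std::max(m, std::fabs(fold_to_period(xi)));
  return m;
}
```

Points drawn only from the central period would never exercise the fold, which is why a narrower range would not test it.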

@@ -53,7 +53,7 @@ int main() {
c = (float _Complex *)malloc(M * sizeof(float _Complex));
f = (float _Complex *)malloc(N * sizeof(float _Complex));

// Fill with random numbers. Frequencies must be in the interval [-pi, pi]
// Fill with random numbers. Frequencies must be in the interval [-pi, pi)
Collaborator

freqs can be anything due to folding. Just a minor tweak I can fix after merging.

@@ -11,7 +11,9 @@
f(k1) = SUM c[j] exp(+/-i k1 x(j)) for -ms/2 <= k1 <= (ms-1)/2
j=1
Inputs:
x locations of nonuniform sources on interval [-3pi,3pi), length nj
x locations of nonuniform sources on interval [-pi, pi) length nj.
Collaborator

If you read docs/README (or makefile) you'll see this file docs/matlabhelp.doc is overwritten in the make docs process. So, will go away.

@ahbarnett
Collaborator

PR #440 tests on AMD laptop 5700U CPU (8-core)

We pick tests in 1D with very poor tol (so that spreading is negligible)

MASTER branch 79de0847 :  ........................................

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=1 perftest/spreadtestnd 1 1e7 1e6 1e-1 1 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	0.000317 s
	spread 1D (M=1; N1=1000000,N2=1,N3=1; pir=0), nthr=1
	zero output array	0.00144 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	2.3e-05 s (1 subprobs)
making random data...
spreadinterp 1D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (1 threads):	0.136 s
	spread 1D (M=10000000; N1=1000000,N2=1,N3=1; pir=0), nthr=1
	zero output array	0.00144 s
	capping subproblem sizes to max of 10000
	t1 fancy spread: 	0.237 s (1000 subprobs)
    1e+07 NU pts in 0.382 s 	2.62e+07 pts/s 	5.24e+07 spread pts/s
    rel err in total over grid:      0.04
making more random NU pts...
spreadinterp 1D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	sorted (1 threads):	0.133 s
	interp 1D (M=10000000; N1=1000000,N2=1,N3=1; pir=0), nthr=1
	t2 spreading loop: 	0.339 s
    1e+07 NU pts in 0.478 s 	2.09e+07 pts/s 	4.18e+07 spread pts/s
    max rel err in values at NU pts: 0.0954

[note for single-thread t2: sorting helps, but default opt=2 doesn't choose it]

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=8 perftest/spreadtestnd 1 1e8 1e6 1e-1 2 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	0.000287 s
	spread 1D (M=1; N1=1000000,N2=1,N3=1; pir=0), nthr=8
	zero output array	0.00139 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	0.000771 s (1 subprobs)
making random data...
spreadinterp 1D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (8 threads):	0.631 s
	spread 1D (M=100000000; N1=1000000,N2=1,N3=1; pir=0), nthr=8
	zero output array	0.00154 s
	capping subproblem sizes to max of 10000
	t1 fancy spread: 	1.04 s (10000 subprobs)
    1e+08 NU pts in 1.77 s 	5.66e+07 pts/s 	1.13e+08 spread pts/s
    rel err in total over grid:      0.0303
making more random NU pts...
spreadinterp 1D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	not sorted (sort=2): 	0.0647 s
	interp 1D (M=100000000; N1=1000000,N2=1,N3=1; pir=0), nthr=8
	t2 spreading loop: 	0.769 s
    1e+08 NU pts in 0.905 s 	1.1e+08 pts/s 	2.21e+08 spread pts/s
    max rel err in values at NU pts: 0.0954

[note for multi-thread t2: sorting doesn't help and default opt=2 doesn't choose it... good]

fold PR #440 ..........................................

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=1 perftest/spreadtestnd 1 1e7 1e6 1e-1 1 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	0.000316 s
	spread 1D (M=1; N1=1000000,N2=1,N3=1), nthr=1
	zero output array	0.00142 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	3.4e-05 s (1 subprobs)
making random data...
spreadinterp 1D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (1 threads):	0.136 s
	spread 1D (M=10000000; N1=1000000,N2=1,N3=1), nthr=1
	zero output array	0.00145 s
	capping subproblem sizes to max of 10000
	t1 fancy spread: 	0.223 s (1000 subprobs)
    1e+07 NU pts in 0.367 s 	2.72e+07 pts/s 	5.44e+07 spread pts/s
    rel err in total over grid:      0.0475
making more random NU pts...
spreadinterp 1D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	sorted (1 threads):	0.134 s
	interp 1D (M=10000000; N1=1000000,N2=1,N3=1), nthr=1
	t2 spreading loop: 	0.308 s
    1e+07 NU pts in 0.448 s 	2.23e+07 pts/s 	4.46e+07 spread pts/s
    max rel err in values at NU pts: 0.0954

(base) alex@ross /home/alex/numerics/finufft>  OMP_NUM_THREADS=8 perftest/spreadtestnd 1 1e8 1e6 1e-1 2 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	0.00028 s
	spread 1D (M=1; N1=1000000,N2=1,N3=1), nthr=8
	zero output array	0.00137 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	0.000328 s (1 subprobs)
making random data...
spreadinterp 1D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (8 threads):	0.634 s
	spread 1D (M=100000000; N1=1000000,N2=1,N3=1), nthr=8
	zero output array	0.00137 s
	capping subproblem sizes to max of 10000
	t1 fancy spread: 	1.04 s (10000 subprobs)
    1e+08 NU pts in 1.77 s 	5.65e+07 pts/s 	1.13e+08 spread pts/s
    rel err in total over grid:      0.0477
making more random NU pts...
spreadinterp 1D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	not sorted (sort=2): 	0.064 s
	interp 1D (M=100000000; N1=1000000,N2=1,N3=1), nthr=8
	t2 spreading loop: 	0.759 s
    1e+08 NU pts in 0.895 s 	1.12e+08 pts/s 	2.24e+08 spread pts/s
    max rel err in values at NU pts: 0.0954

............................

1D Concl: single-thread 7% speedup interp (dir=2) - nothing to do with sorting
                        5% speedup spread dir=1.
          multi-thread  no significant change (~1% level).       

Also noted: PR #440 compile time for spreadinterp.o is 10x longer than before (~5 sec)


=================================================
3D tests: (poor tol to give foldrescale a chance to shine; 3 coords done each NU pt):

MASTER:

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=1 perftest/spreadtestnd 3 1e7 1e6 1e-1 1 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	2.9e-05 s
	spread 3D (M=1; N1=100,N2=100,N3=100; pir=0), nthr=1
	zero output array	0.00141 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	2.5e-05 s (1 subprobs)
making random data...
spreadinterp 3D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (1 threads):	0.137 s
	spread 3D (M=10000000; N1=100,N2=100,N3=100; pir=0), nthr=1
	zero output array	0.00136 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	0.782 s (100 subprobs)
    1e+07 NU pts in 0.927 s 	1.08e+07 pts/s 	8.63e+07 spread pts/s
    rel err in total over grid:      0.189
making more random NU pts...
spreadinterp 3D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	sorted (1 threads):	0.134 s
	interp 3D (M=10000000; N1=100,N2=100,N3=100; pir=0), nthr=1
	t2 spreading loop: 	0.752 s
    1e+07 NU pts in 0.892 s 	1.12e+07 pts/s 	8.97e+07 spread pts/s
    max rel err in values at NU pts: 0.315
    
(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=8 perftest/spreadtestnd 3 1e8 1e6 1e-1 2 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	1.7e-05 s
	spread 3D (M=1; N1=100,N2=100,N3=100; pir=0), nthr=8
	zero output array	0.00147 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	0.000397 s (1 subprobs)
making random data...
spreadinterp 3D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (8 threads):	0.315 s
	spread 3D (M=100000000; N1=100,N2=100,N3=100; pir=0), nthr=8
	zero output array	0.00138 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	1.91 s (1000 subprobs)
    1e+08 NU pts in 2.32 s 	4.31e+07 pts/s 	3.45e+08 spread pts/s
    rel err in total over grid:      0.165
making more random NU pts...
spreadinterp 3D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	sorted (8 threads):	0.311 s
	interp 3D (M=100000000; N1=100,N2=100,N3=100; pir=0), nthr=8
	t2 spreading loop: 	2.04 s
    1e+08 NU pts in 2.45 s 	4.08e+07 pts/s 	3.26e+08 spread pts/s
    max rel err in values at NU pts: 0.315


PR #440:

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=1 perftest/spreadtestnd 3 1e7 1e6 1e-1 1 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	2e-05 s
	spread 3D (M=1; N1=100,N2=100,N3=100), nthr=1
	zero output array	0.00142 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	3.3e-05 s (1 subprobs)
making random data...
spreadinterp 3D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (1 threads):	0.136 s
	spread 3D (M=10000000; N1=100,N2=100,N3=100), nthr=1
	zero output array	0.00135 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	0.794 s (100 subprobs)
    1e+07 NU pts in 0.937 s 	1.07e+07 pts/s 	8.53e+07 spread pts/s
    rel err in total over grid:      0.143
making more random NU pts...
spreadinterp 3D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	sorted (1 threads):	0.135 s
	interp 3D (M=10000000; N1=100,N2=100,N3=100), nthr=1
	t2 spreading loop: 	0.687 s
    1e+07 NU pts in 0.829 s 	1.21e+07 pts/s 	9.65e+07 spread pts/s
    max rel err in values at NU pts: 0.315
    
(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=8 perftest/spreadtestnd 3 1e8 1e6 1e-1 2 0 1
setup_spreader (kerevalmeth=1) eps=0.1 sigma=2: chose ns=2 beta=4.4
	sorted (1 threads):	1.8e-05 s
	spread 3D (M=1; N1=100,N2=100,N3=100), nthr=8
	zero output array	0.0014 s
	using low-density speed rescue nb=M...
	t1 fancy spread: 	0.000358 s (1 subprobs)
making random data...
spreadinterp 3D, 1e+06 U pts, dir=1, tol=0.1: nspread=2
	sorted (8 threads):	0.31 s
	spread 3D (M=100000000; N1=100,N2=100,N3=100), nthr=8
	zero output array	0.00132 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	1.92 s (1000 subprobs)
    1e+08 NU pts in 2.33 s 	4.29e+07 pts/s 	3.43e+08 spread pts/s
    rel err in total over grid:      0.167
making more random NU pts...
spreadinterp 3D, 1e+06 U pts, dir=2, tol=0.1: nspread=2
	sorted (8 threads):	0.319 s
	interp 3D (M=100000000; N1=100,N2=100,N3=100), nthr=8
	t2 spreading loop: 	2.02 s
    1e+08 NU pts in 2.44 s 	4.1e+07 pts/s 	3.28e+08 spread pts/s
    max rel err in values at NU pts: 0.315

concl: single-thread: spread no change; interp is 9% faster
       8-thread :    spread no change; interp no change.

Overall: only affects single-core perf, and by 9% or less.

(Of course, advantage of no 3pi-restriction is good too)

@ahbarnett ahbarnett merged commit 401273d into flatironinstitute:master May 14, 2024
9 checks passed
@DiamonDinoia DiamonDinoia mentioned this pull request Jul 17, 2024
8 tasks
@DiamonDinoia DiamonDinoia deleted the optimising-foldrescale branch July 23, 2024 15:53