Gpu type 3 #517
Changes from 76 commits
@@ -51,6 +51,8 @@ Developer notes

* CMake compiling on Linux at the Flatiron Institute (Rusty cluster): we have had a report that if you want to use LLVM, you need to ``module load llvm/16.0.3``; otherwise the default ``llvm/14.0.6`` does not find ``OpenMP_CXX``.

* Note to the nvcc developer: nvcc with debug symbols causes a stack overflow that is undetected at both compile time and runtime. This goes undetected until ``ns>=10``; for ``ns<10`` one can use ``-G`` and debug the code with cuda-gdb. The way to avoid it is to not use debug symbols; ``--generate-line-info`` might work (not tested). As a side note, the compute sanitizers do not detect the issue.

  Review comment: great. Maybe mention only d=3?

* Testing cufinufft (for FI, mostly):

  .. code-block:: sh
@@ -20,7 +20,7 @@ to the simple, vectorized, or guru makeplan routines.
Recall how to do this from C++:

.. code-block:: C++

   // (... set up M,x,c,tol,N, and allocate F here...)
   finufft_opts* opts = new finufft_opts;   // must point to allocated storage
   finufft_default_opts(opts);
@@ -30,7 +30,7 @@ Recall how to do this from C++:
This setting produces more timing output to ``stdout``.

.. warning::

   In C/C++ and Fortran, don't forget to call the command which sets default options
   (``finufft_default_opts`` or ``finufftf_default_opts``)
   before you start changing them and passing them to FINUFFT.
@@ -51,9 +51,9 @@ Here are their default settings (from ``src/finufft.cpp:finufft_default_opts``):
.. literalinclude:: ../src/finufft.cpp
   :start-after: @defopts_start
   :end-before: @defopts_end

As for quick advice, the main options you'll want to play with are:

- ``modeord`` to flip ("fftshift") the Fourier mode ordering
- ``debug`` to look at timing output (to determine if your problem is spread/interpolation dominated vs FFT dominated)
- ``nthreads`` to run with a different number of threads than the current maximum available through OpenMP (a large number can sometimes be detrimental, and very small problems can sometimes run faster on 1 thread)
@@ -92,15 +92,15 @@ Data handling options
.. note:: The index *sets* are the same in the two ``modeord`` choices; their ordering differs only by a cyclic shift. The FFT ordering cyclically shifts the CMCL indices $\mbox{floor}(N/2)$ to the left (often called an "fftshift").
**chkbnds**: [DEPRECATED] has no effect.


Diagnostic options
~~~~~~~~~~~~~~~~~~~~~~~

**debug**: Controls the amount of overall debug/timing output to stdout.

* ``debug=0`` : silent

* ``debug=1`` : prints some information

* ``debug=2`` : prints more information

@@ -113,11 +113,11 @@ Diagnostic options

* ``spread_debug=2`` : prints lots. This can print thousands of lines, since it includes one line per *subproblem*.

**showwarn**: Whether to print warnings (these go to stderr).

* ``showwarn=0`` : suppresses such warnings

* ``showwarn=1`` : prints warnings
@@ -173,17 +173,17 @@ for only two settings, as follows. Otherwise, setting it to zero chooses a good
**spread_thread**: in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), controls how multithreading is used to spread/interpolate each batch of data.

* ``spread_thread=0`` : makes an automatic choice between the below. Recommended.

* ``spread_thread=1`` : acts on each vector in the batch in sequence, using multithreaded spread/interpolate on that vector. It can be slightly better than ``2`` for large problems.

* ``spread_thread=2`` : acts on all vectors in a batch (of size typically chosen to be the number of threads) simultaneously, assigning each a thread which performs a single-threaded spread/interpolate. It is much better than ``1`` for all but large problems. (Historical note: this was used by Melody Shih for the original "2dmany" interface in 2018.)

.. note::

   Historical note: a former option ``3`` has been removed. This was like ``2`` except that it allowed nested OMP parallelism, so multithreaded spread/interpolate was used for each of the vectors in a batch in parallel. This was used by Andrea Malleo in 2019. We have not yet found a case where this beats both ``1`` and ``2``, hence we removed it due to complications with changing the OMP nesting state in both old and new OMP versions.
**maxbatchsize**: in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), sets the largest batch size of data vectors.

**batchsize**: in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), sets the largest batch size of data vectors.

Review comment: I'm confused - this seems to rename a CPU option? We don't want to change them. I thought we were changing maxbatchsize -> batchsize only internally, and only in the GPU code?

Here ``0`` makes an automatic choice. If you are unhappy with this, then for small problems it should equal the number of threads, while for large problems ``1`` often appears better (since otherwise too much simultaneous RAM movement occurs). Some further work is needed to optimize this parameter.

**spread_nthr_atomic**: if non-negative: for numbers of threads up to this value, an OMP critical block for ``add_wrapped_subgrid`` is used in spreading (type 1 transforms). Above this value, OMP atomic writes are used instead, which scale better for large thread numbers. If negative, the heuristic default in the spreader is used, set in ``src/spreadinterp.cpp:setup_spreader()``.
@@ -7,27 +7,37 @@
#include <finufft_errors.h>
#include <finufft_spread_opts.h>

#include <complex.h>
#include <complex>
#include <optional>

Review comment: Used?
namespace cufinufft {
namespace common {
template<typename T>
__global__ void fseries_kernel_compute(int nf1, int nf2, int nf3, T *f,
                                       cuDoubleComplex *a, T *fwkerhalf1, T *fwkerhalf2,
__global__ void fseries_kernel_compute(int nf1, int nf2, int nf3, T *f, T *a,
                                       T *fwkerhalf1, T *fwkerhalf2, T *fwkerhalf3,
                                       int ns);
template<typename T>
__global__ void fseries_kernel_compute(int nf1, int nf2, int nf3, T *f, T *a, T *kx,
                                       T *ky, T *kz, T *fwkerhalf1, T *fwkerhalf2,
                                       T *fwkerhalf3, int ns);
template<typename T>
int cufserieskernelcompute(int dim, int nf1, int nf2, int nf3, T *d_f,
                           cuDoubleComplex *d_a, T *d_fwkerhalf1, T *d_fwkerhalf2,
int cufserieskernelcompute(int dim, int nf1, int nf2, int nf3, T *d_f, T *d_a,
                           T *d_fwkerhalf1, T *d_fwkerhalf2, T *d_fwkerhalf3, int ns,
                           cudaStream_t stream);
template<typename T>
int cufserieskernelcompute(int dim, int nf1, int nf2, int nf3, T *d_f, T *d_a, T *d_kx,
                           T *d_ky, T *d_kz, T *d_fwkerhalf1, T *d_fwkerhalf2,
                           T *d_fwkerhalf3, int ns, cudaStream_t stream);
template<typename T>
int setup_spreader_for_nufft(finufft_spread_opts &spopts, T eps, cufinufft_opts opts);

void set_nf_type12(CUFINUFFT_BIGINT ms, cufinufft_opts opts, finufft_spread_opts spopts,
                   CUFINUFFT_BIGINT *nf, CUFINUFFT_BIGINT b);
template<typename T>
void onedim_fseries_kernel(CUFINUFFT_BIGINT nf, T *fwkerhalf, finufft_spread_opts opts);
template<typename T>
void onedim_fseries_kernel_precomp(CUFINUFFT_BIGINT nf, T *f, std::complex<double> *a,
// template<typename T>
// void onedim_fseries_kernel(CUFINUFFT_BIGINT nf, T *fwkerhalf, finufft_spread_opts
// opts);
template<typename T, bool>
void onedim_fseries_kernel_precomp(CUFINUFFT_BIGINT nf, T *f, T *a,
                                   finufft_spread_opts opts);
template<typename T>
void onedim_fseries_kernel_compute(CUFINUFFT_BIGINT nf, T *f, std::complex<double> *a,
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,168 @@ | ||
#ifndef FINUFFT_INCLUDE_CUFINUFFT_CONTRIB_HELPER_MATH_H | ||
ahbarnett marked this conversation as resolved.
Show resolved
Hide resolved
|
||
#define FINUFFT_INCLUDE_CUFINUFFT_CONTRIB_HELPER_MATH_H | ||
|
||
#include <cuComplex.h> | ||
|
||
// Addition for cuDoubleComplex (double) with cuDoubleComplex (double) | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator+( | ||
const cuDoubleComplex &a, const cuDoubleComplex &b) noexcept { | ||
return cuCadd(a, b); | ||
} | ||
|
||
// Subtraction for cuDoubleComplex (double) with cuDoubleComplex (double) | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator-( | ||
const cuDoubleComplex &a, const cuDoubleComplex &b) noexcept { | ||
return cuCsub(a, b); | ||
} | ||
|
||
// Multiplication for cuDoubleComplex (double) with cuDoubleComplex (double) | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator*( | ||
const cuDoubleComplex &a, const cuDoubleComplex &b) noexcept { | ||
return cuCmul(a, b); | ||
} | ||
|
||
// Division for cuDoubleComplex (double) with cuDoubleComplex (double) | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator/( | ||
const cuDoubleComplex &a, const cuDoubleComplex &b) noexcept { | ||
return cuCdiv(a, b); | ||
} | ||
|
||
// Equality for cuDoubleComplex (double) with cuDoubleComplex (double) | ||
__host__ __device__ __forceinline__ bool operator==(const cuDoubleComplex &a, | ||
const cuDoubleComplex &b) noexcept { | ||
return cuCreal(a) == cuCreal(b) && cuCimag(a) == cuCimag(b); | ||
} | ||
|
||
// Inequality for cuDoubleComplex (double) with cuDoubleComplex (double) | ||
__host__ __device__ __forceinline__ bool operator!=(const cuDoubleComplex &a, | ||
const cuDoubleComplex &b) noexcept { | ||
return !(a == b); | ||
} | ||
|
||
// Addition for cuDoubleComplex (double) with double | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator+(const cuDoubleComplex &a, | ||
double b) noexcept { | ||
return make_cuDoubleComplex(cuCreal(a) + b, cuCimag(a)); | ||
} | ||
|
||
__host__ __device__ __forceinline__ cuDoubleComplex operator+( | ||
double a, const cuDoubleComplex &b) noexcept { | ||
return make_cuDoubleComplex(a + cuCreal(b), cuCimag(b)); | ||
} | ||
|
||
// Subtraction for cuDoubleComplex (double) with double | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator-(const cuDoubleComplex &a, | ||
double b) noexcept { | ||
return make_cuDoubleComplex(cuCreal(a) - b, cuCimag(a)); | ||
} | ||
|
||
__host__ __device__ __forceinline__ cuDoubleComplex operator-( | ||
double a, const cuDoubleComplex &b) noexcept { | ||
return make_cuDoubleComplex(a - cuCreal(b), -cuCimag(b)); | ||
} | ||
|
||
// Multiplication for cuDoubleComplex (double) with double | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator*(const cuDoubleComplex &a, | ||
double b) noexcept { | ||
return make_cuDoubleComplex(cuCreal(a) * b, cuCimag(a) * b); | ||
} | ||
|
||
__host__ __device__ __forceinline__ cuDoubleComplex operator*( | ||
double a, const cuDoubleComplex &b) noexcept { | ||
return make_cuDoubleComplex(a * cuCreal(b), a * cuCimag(b)); | ||
} | ||
|
||
// Division for cuDoubleComplex (double) with double | ||
__host__ __device__ __forceinline__ cuDoubleComplex operator/(const cuDoubleComplex &a, | ||
double b) noexcept { | ||
return make_cuDoubleComplex(cuCreal(a) / b, cuCimag(a) / b); | ||
} | ||
|
||
__host__ __device__ __forceinline__ cuDoubleComplex operator/( | ||
double a, const cuDoubleComplex &b) noexcept { | ||
double denom = cuCreal(b) * cuCreal(b) + cuCimag(b) * cuCimag(b); | ||
return make_cuDoubleComplex((a * cuCreal(b)) / denom, (-a * cuCimag(b)) / denom); | ||
} | ||
// Addition for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ cuFloatComplex operator+(
    const cuFloatComplex &a, const cuFloatComplex &b) noexcept {
  return cuCaddf(a, b);
}

// Subtraction for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ cuFloatComplex operator-(
    const cuFloatComplex &a, const cuFloatComplex &b) noexcept {
  return cuCsubf(a, b);
}

// Multiplication for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ cuFloatComplex operator*(
    const cuFloatComplex &a, const cuFloatComplex &b) noexcept {
  return cuCmulf(a, b);
}

// Division for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ cuFloatComplex operator/(
    const cuFloatComplex &a, const cuFloatComplex &b) noexcept {
  return cuCdivf(a, b);
}

// Equality for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ bool operator==(const cuFloatComplex &a,
                                                    const cuFloatComplex &b) noexcept {
  return cuCrealf(a) == cuCrealf(b) && cuCimagf(a) == cuCimagf(b);
}

// Inequality for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ bool operator!=(const cuFloatComplex &a,
                                                    const cuFloatComplex &b) noexcept {
  return !(a == b);
}

// Addition for cuFloatComplex (float) with float
__host__ __device__ __forceinline__ cuFloatComplex operator+(const cuFloatComplex &a,
                                                             float b) noexcept {
  return make_cuFloatComplex(cuCrealf(a) + b, cuCimagf(a));
}

__host__ __device__ __forceinline__ cuFloatComplex operator+(
    float a, const cuFloatComplex &b) noexcept {
  return make_cuFloatComplex(a + cuCrealf(b), cuCimagf(b));
}

// Subtraction for cuFloatComplex (float) with float
__host__ __device__ __forceinline__ cuFloatComplex operator-(const cuFloatComplex &a,
                                                             float b) noexcept {
  return make_cuFloatComplex(cuCrealf(a) - b, cuCimagf(a));
}

__host__ __device__ __forceinline__ cuFloatComplex operator-(
    float a, const cuFloatComplex &b) noexcept {
  return make_cuFloatComplex(a - cuCrealf(b), -cuCimagf(b));
}

// Multiplication for cuFloatComplex (float) with float
__host__ __device__ __forceinline__ cuFloatComplex operator*(const cuFloatComplex &a,
                                                             float b) noexcept {
  return make_cuFloatComplex(cuCrealf(a) * b, cuCimagf(a) * b);
}

__host__ __device__ __forceinline__ cuFloatComplex operator*(
    float a, const cuFloatComplex &b) noexcept {
  return make_cuFloatComplex(a * cuCrealf(b), a * cuCimagf(b));
}

// Division for cuFloatComplex (float) with float
__host__ __device__ __forceinline__ cuFloatComplex operator/(const cuFloatComplex &a,
                                                             float b) noexcept {
  return make_cuFloatComplex(cuCrealf(a) / b, cuCimagf(a) / b);
}

__host__ __device__ __forceinline__ cuFloatComplex operator/(
    float a, const cuFloatComplex &b) noexcept {
  float denom = cuCrealf(b) * cuCrealf(b) + cuCimagf(b) * cuCimagf(b);
  return make_cuFloatComplex((a * cuCrealf(b)) / denom, (-a * cuCimagf(b)) / denom);
}

#endif // FINUFFT_INCLUDE_CUFINUFFT_CONTRIB_HELPER_MATH_H
@@ -1,15 +1,18 @@
#ifndef CUFINUFFT_DEFS_H
#define CUFINUFFT_DEFS_H

#include <complex>

Review comment: Used?

#include <limits>

// constants needed within common
// upper bound on w, ie nspread, even when padded (see evaluate_kernel_vector); also for
// common
#define MAX_NSPREAD 16
#define MAX_NSPREAD 16

// max number of positive quadr nodes
#define MAX_NQUAD 100
#define MAX_NQUAD 100

// Fraction growth cut-off in utils:arraywidcen; sets when translate in type-3
#define ARRAYWIDCEN_GROWFRAC 0.1

// FIXME: If cufft ever takes N > INT_MAX...
constexpr int32_t MAX_NF = std::numeric_limits<int32_t>::max();

Review comment: using a separate * for each line here implies they are distinct features. Suggest using a hyphen as below to group a sublist for GPU.