
Gpu type 3 #517

Merged (82 commits) on Sep 12, 2024
Commits
45333fa
basic benchmarks
DiamonDinoia Jul 3, 2024
b95a082
added plotting script
DiamonDinoia Jul 4, 2024
ae55ca5
optimised plotting
DiamonDinoia Jul 8, 2024
16e27f0
fixed plotting and metrics
DiamonDinoia Jul 8, 2024
49d1f21
fixed the plot script
DiamonDinoia Jul 8, 2024
2fdae68
bin_size_x is a function of the shared memory available
DiamonDinoia Jul 8, 2024
c0d9923
bin_size_x is a function of the shared memory available
DiamonDinoia Jul 8, 2024
907797c
minor optimizations in 1D
DiamonDinoia Jul 9, 2024
60f4780
optimized nupts driven
DiamonDinoia Jul 12, 2024
35dcc66
Optimized 1D and 2D
DiamonDinoia Jul 15, 2024
e1ad9bb
Merge branch 'master' into gpu-optimizations
DiamonDinoia Jul 15, 2024
366295d
3D integer operations
DiamonDinoia Jul 18, 2024
24bf6be
3D SM and GM optimized
DiamonDinoia Jul 18, 2024
960117a
bump cuda version
DiamonDinoia Jul 18, 2024
4295a86
Merge remote-tracking branch 'flatiron/master' into gpu-optimizations
DiamonDinoia Jul 23, 2024
c1b14c6
changed matlab to generate necessary cuda upsampfact files
DiamonDinoia Jul 23, 2024
f300d2d
added new coeffs
DiamonDinoia Jul 23, 2024
e86c762
Merge remote-tracking branch 'refs/remotes/origin/gpu-optimizations' …
DiamonDinoia Jul 23, 2024
db0457a
restoring .m from master
DiamonDinoia Jul 23, 2024
d0ce11e
updated hook
DiamonDinoia Jul 23, 2024
513ce4b
updated matlab upsampfact
DiamonDinoia Jul 23, 2024
798717d
updated coefficients
DiamonDinoia Jul 23, 2024
282baf5
new coeffs
DiamonDinoia Jul 23, 2024
12822a2
updated cufinufft to new coeff
DiamonDinoia Jul 23, 2024
badf22f
Merge remote-tracking branch 'flatiron/master' into gpu-optimizations
DiamonDinoia Jul 23, 2024
bf6328b
Merge remote-tracking branch 'flatiron/master' into gpu-optimizations
DiamonDinoia Jul 23, 2024
ae783da
picked good defaults for method
DiamonDinoia Jul 24, 2024
d29fcf5
update configuration
DiamonDinoia Jul 24, 2024
73f937b
updated build system
DiamonDinoia Jul 25, 2024
0724866
fixing jenkins
DiamonDinoia Jul 25, 2024
8cd50fc
using cuda 11.2
DiamonDinoia Jul 25, 2024
49a9d7e
using sm90 atomics
DiamonDinoia Jul 25, 2024
041a536
updated script
DiamonDinoia Jul 25, 2024
54683c3
fixed bin sizes
DiamonDinoia Jul 26, 2024
4f19103
Merge branch 'master' into gpu-optimizations
DiamonDinoia Jul 26, 2024
dc3a628
using floor in fold_rescale; updated changelog
DiamonDinoia Jul 26, 2024
b3237f7
fixed a mistake
DiamonDinoia Jul 26, 2024
db80aad
added comments for review
DiamonDinoia Jul 26, 2024
c225fb5
fixing review comments
DiamonDinoia Jul 31, 2024
394550f
Merge remote-tracking branch 'flatiron/master' into gpu-optimizations
DiamonDinoia Jul 31, 2024
5606aa0
merged master
DiamonDinoia Jul 31, 2024
74ccd71
fixed cmake
DiamonDinoia Jul 31, 2024
ee28d05
Gcc-9 fixes; Ker size fixed too
DiamonDinoia Aug 1, 2024
466ddff
windows compatibility tweak; unit testing the 1.25 upsampfact
DiamonDinoia Aug 1, 2024
3f60ca4
Merge remote-tracking branch 'flatiron/master' into gpu-optimizations
DiamonDinoia Aug 1, 2024
fb48ff8
added forgotten c++17 flag
DiamonDinoia Aug 1, 2024
8c42061
Merge remote-tracking branch 'flatiron/master' into gpu-type-3
DiamonDinoia Aug 2, 2024
b64f68e
Preliminary type 3 commit. Incomplete setpts but greatly simplifies t…
DiamonDinoia Aug 8, 2024
7c810a5
Merge remote-tracking branch 'flatiron/master' into gpu-type-3
DiamonDinoia Aug 8, 2024
9d44993
testing
DiamonDinoia Aug 13, 2024
074dda5
Adding prephase and deconv with tests
DiamonDinoia Aug 14, 2024
332b5b7
first 3D working version
DiamonDinoia Aug 15, 2024
53a7c63
First working version, Horner breaks
DiamonDinoia Aug 15, 2024
9f517e3
Type 3 working
DiamonDinoia Aug 16, 2024
096cf1e
added 1D & 2D type 3
DiamonDinoia Aug 20, 2024
3cfe406
fixed tests for type3
DiamonDinoia Aug 20, 2024
1842f68
fixed possible memory leaks
DiamonDinoia Aug 21, 2024
c13a6a9
minor changes, mainly for debug
DiamonDinoia Aug 21, 2024
f0a0fa4
small fixes
DiamonDinoia Aug 22, 2024
066906e
adding debug prints
DiamonDinoia Aug 23, 2024
6da956b
testing inner plan2 & using cudamemcpyasync
DiamonDinoia Aug 23, 2024
6098edc
testing the inner type 2 completely
DiamonDinoia Aug 26, 2024
e89a4f9
fixed type 3 without horner
DiamonDinoia Aug 27, 2024
fe1da53
type3 many support
DiamonDinoia Aug 27, 2024
d415f0d
type3 many tests for one target
DiamonDinoia Aug 28, 2024
289fb4f
updated docstring
DiamonDinoia Aug 28, 2024
bca0a73
removed small transf tests
DiamonDinoia Aug 28, 2024
d29cbba
Merge remote-tracking branch 'flatiron/master' into gpu-type-3
DiamonDinoia Aug 28, 2024
71ad464
added extended lambda flag to tests
DiamonDinoia Aug 28, 2024
a494518
CleanUP
DiamonDinoia Aug 28, 2024
5788320
Updated changelog
DiamonDinoia Aug 28, 2024
4c7388e
fixed printf warning
DiamonDinoia Aug 28, 2024
46eb1d4
restored fftw behaviour
DiamonDinoia Aug 28, 2024
0ada7a0
Added devnotes on the issue
DiamonDinoia Aug 28, 2024
671e4ac
removed spurious changes
DiamonDinoia Aug 28, 2024
7a7cff5
Minor cleanup
DiamonDinoia Aug 28, 2024
9b0da66
fixed math test
DiamonDinoia Aug 29, 2024
d3d4d34
Addressed review comments
DiamonDinoia Sep 4, 2024
52cd6cc
Merge remote-tracking branch 'flatiron/master' into gpu-type-3
DiamonDinoia Sep 11, 2024
1355818
splitting onedim_f_series in two functions
DiamonDinoia Sep 11, 2024
bc64a92
GPU flipwind type 1-2; fseries and nuft renaming to match CPU code
DiamonDinoia Sep 12, 2024
96980d3
fixed complex math test
DiamonDinoia Sep 12, 2024
8 changes: 7 additions & 1 deletion CHANGELOG
@@ -2,9 +2,15 @@ List of features / changes made / release notes, in reverse chronological order.
If not stated, FINUFFT is assumed (cuFINUFFT <=1.3 is listed separately).

Master (9/10/24)

* reduced roundoff error in a[n] phase calc in CPU onedim_fseries_kernel().
#534 (Barnett).
* Support for type 3 in 1D, 2D, and 3D in the GPU library cufinufft (PR #517).
- Removed the CPU fseries computation (it was only used for benchmarking and is no longer needed).
- Added complex arithmetic support for the cuda_complex type.
- Added tests for type 3 in 1D, 2D, and 3D, and for cuda_complex arithmetic.
- Minor fixes on the GPU code:
a) removed memory leaks in case of errors
b) renamed maxbatchsize to batchsize

V 2.3.0 (9/5/24)

2 changes: 2 additions & 0 deletions docs/devnotes.rst
@@ -54,6 +54,8 @@ Developer notes

* CMake compiling on linux at Flatiron Institute (Rusty cluster): We have had a report that if you want to use LLVM, you need to ``module load llvm/16.0.3`` otherwise the default ``llvm/14.0.6`` does not find ``OpenMP_CXX``.

* Note for nvcc developers: compiling with debug symbols causes a stack overflow that is detected neither at compile time nor at runtime. The problem only manifests for ``ns>=10`` and ``dim=3``; for ``ns<10`` or ``dim<3`` one can use ``-G`` and debug the code with cuda-gdb. The way to avoid it is to not use debug symbols; possibly ``--generate-line-info`` might work instead (not tested). As a side note, compute-sanitizer does not detect the issue.

* Testing cufinufft (for FI, mostly):

.. code-block:: sh
28 changes: 17 additions & 11 deletions include/cufinufft/common.h
@@ -7,31 +7,37 @@
#include <finufft_errors.h>
#include <finufft_spread_opts.h>

#include <complex.h>
#include <complex>

namespace cufinufft {
namespace common {
template<typename T>
__global__ void fseries_kernel_compute(int nf1, int nf2, int nf3, T *f,
cuDoubleComplex *a, T *fwkerhalf1, T *fwkerhalf2,
__global__ void fseries_kernel_compute(int nf1, int nf2, int nf3, T *f, T *a,
T *fwkerhalf1, T *fwkerhalf2, T *fwkerhalf3,
int ns);
template<typename T>
__global__ void cu_nuft_kernel_compute(int nf1, int nf2, int nf3, T *f, T *z, T *kx,
T *ky, T *kz, T *fwkerhalf1, T *fwkerhalf2,
T *fwkerhalf3, int ns);
template<typename T>
int cufserieskernelcompute(int dim, int nf1, int nf2, int nf3, T *d_f,
cuDoubleComplex *d_a, T *d_fwkerhalf1, T *d_fwkerhalf2,
T *d_fwkerhalf3, int ns, cudaStream_t stream);
int fseries_kernel_compute(int dim, int nf1, int nf2, int nf3, T *d_f, T *d_phase,
T *d_fwkerhalf1, T *d_fwkerhalf2, T *d_fwkerhalf3, int ns,
cudaStream_t stream);
template<typename T>
int nuft_kernel_compute(int dim, int nf1, int nf2, int nf3, T *d_f, T *d_z, T *d_kx,
T *d_ky, T *d_kz, T *d_fwkerhalf1, T *d_fwkerhalf2,
T *d_fwkerhalf3, int ns, cudaStream_t stream);
template<typename T>
int setup_spreader_for_nufft(finufft_spread_opts &spopts, T eps, cufinufft_opts opts);

void set_nf_type12(CUFINUFFT_BIGINT ms, cufinufft_opts opts, finufft_spread_opts spopts,
CUFINUFFT_BIGINT *nf, CUFINUFFT_BIGINT b);

template<typename T>
void onedim_fseries_kernel(CUFINUFFT_BIGINT nf, T *fwkerhalf, finufft_spread_opts opts);
template<typename T>
void onedim_fseries_kernel_precomp(CUFINUFFT_BIGINT nf, T *f, std::complex<double> *a,
void onedim_fseries_kernel_precomp(CUFINUFFT_BIGINT nf, T *f, T *a,
finufft_spread_opts opts);
template<typename T>
void onedim_fseries_kernel_compute(CUFINUFFT_BIGINT nf, T *f, std::complex<double> *a,
T *fwkerhalf, finufft_spread_opts opts);
void onedim_nuft_kernel_precomp(T *f, T *zout, finufft_spread_opts opts);

template<typename T>
std::size_t shared_memory_required(int dim, int ns, int bin_size_x, int bin_size_y,
173 changes: 173 additions & 0 deletions include/cufinufft/contrib/helper_math.h
@@ -0,0 +1,173 @@
#ifndef FINUFFT_INCLUDE_CUFINUFFT_CONTRIB_HELPER_MATH_H
#define FINUFFT_INCLUDE_CUFINUFFT_CONTRIB_HELPER_MATH_H

#include <cuComplex.h>

// This header provides helper functions for the cuComplex types.
// It mainly wraps the existing CUDA implementations as operator overloads,
// e.g. cuCadd, cuCsub, cuCmul, cuCdiv, cuCreal, cuCimag, cuCabs, cuConj are
// all provided by CUDA.

// Addition for cuDoubleComplex (double) with cuDoubleComplex (double)
__host__ __device__ __forceinline__ cuDoubleComplex operator+(
const cuDoubleComplex &a, const cuDoubleComplex &b) noexcept {
return cuCadd(a, b);
}

// Subtraction for cuDoubleComplex (double) with cuDoubleComplex (double)
__host__ __device__ __forceinline__ cuDoubleComplex operator-(
const cuDoubleComplex &a, const cuDoubleComplex &b) noexcept {
return cuCsub(a, b);
}

// Multiplication for cuDoubleComplex (double) with cuDoubleComplex (double)
__host__ __device__ __forceinline__ cuDoubleComplex operator*(
const cuDoubleComplex &a, const cuDoubleComplex &b) noexcept {
return cuCmul(a, b);
}

// Division for cuDoubleComplex (double) with cuDoubleComplex (double)
__host__ __device__ __forceinline__ cuDoubleComplex operator/(
const cuDoubleComplex &a, const cuDoubleComplex &b) noexcept {
return cuCdiv(a, b);
}

// Equality for cuDoubleComplex (double) with cuDoubleComplex (double)
__host__ __device__ __forceinline__ bool operator==(const cuDoubleComplex &a,
const cuDoubleComplex &b) noexcept {
return cuCreal(a) == cuCreal(b) && cuCimag(a) == cuCimag(b);
}

// Inequality for cuDoubleComplex (double) with cuDoubleComplex (double)
__host__ __device__ __forceinline__ bool operator!=(const cuDoubleComplex &a,
const cuDoubleComplex &b) noexcept {
return !(a == b);
}

// Addition for cuDoubleComplex (double) with double
__host__ __device__ __forceinline__ cuDoubleComplex operator+(const cuDoubleComplex &a,
double b) noexcept {
return make_cuDoubleComplex(cuCreal(a) + b, cuCimag(a));
}

__host__ __device__ __forceinline__ cuDoubleComplex operator+(
double a, const cuDoubleComplex &b) noexcept {
return make_cuDoubleComplex(a + cuCreal(b), cuCimag(b));
}

// Subtraction for cuDoubleComplex (double) with double
__host__ __device__ __forceinline__ cuDoubleComplex operator-(const cuDoubleComplex &a,
double b) noexcept {
return make_cuDoubleComplex(cuCreal(a) - b, cuCimag(a));
}

__host__ __device__ __forceinline__ cuDoubleComplex operator-(
double a, const cuDoubleComplex &b) noexcept {
return make_cuDoubleComplex(a - cuCreal(b), -cuCimag(b));
}

// Multiplication for cuDoubleComplex (double) with double
__host__ __device__ __forceinline__ cuDoubleComplex operator*(const cuDoubleComplex &a,
double b) noexcept {
return make_cuDoubleComplex(cuCreal(a) * b, cuCimag(a) * b);
}

__host__ __device__ __forceinline__ cuDoubleComplex operator*(
double a, const cuDoubleComplex &b) noexcept {
return make_cuDoubleComplex(a * cuCreal(b), a * cuCimag(b));
}

// Division for cuDoubleComplex (double) with double
__host__ __device__ __forceinline__ cuDoubleComplex operator/(const cuDoubleComplex &a,
double b) noexcept {
return make_cuDoubleComplex(cuCreal(a) / b, cuCimag(a) / b);
}

__host__ __device__ __forceinline__ cuDoubleComplex operator/(
double a, const cuDoubleComplex &b) noexcept {
double denom = cuCreal(b) * cuCreal(b) + cuCimag(b) * cuCimag(b);
return make_cuDoubleComplex((a * cuCreal(b)) / denom, (-a * cuCimag(b)) / denom);
}

// Addition for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ cuFloatComplex operator+(
const cuFloatComplex &a, const cuFloatComplex &b) noexcept {
return cuCaddf(a, b);
}

// Subtraction for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ cuFloatComplex operator-(
const cuFloatComplex &a, const cuFloatComplex &b) noexcept {
return cuCsubf(a, b);
}

// Multiplication for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ cuFloatComplex operator*(
const cuFloatComplex &a, const cuFloatComplex &b) noexcept {
return cuCmulf(a, b);
}

// Division for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ cuFloatComplex operator/(
const cuFloatComplex &a, const cuFloatComplex &b) noexcept {
return cuCdivf(a, b);
}

// Equality for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ bool operator==(const cuFloatComplex &a,
const cuFloatComplex &b) noexcept {
return cuCrealf(a) == cuCrealf(b) && cuCimagf(a) == cuCimagf(b);
}

// Inequality for cuFloatComplex (float) with cuFloatComplex (float)
__host__ __device__ __forceinline__ bool operator!=(const cuFloatComplex &a,
const cuFloatComplex &b) noexcept {
return !(a == b);
}

// Addition for cuFloatComplex (float) with float
__host__ __device__ __forceinline__ cuFloatComplex operator+(const cuFloatComplex &a,
float b) noexcept {
return make_cuFloatComplex(cuCrealf(a) + b, cuCimagf(a));
}

__host__ __device__ __forceinline__ cuFloatComplex operator+(
float a, const cuFloatComplex &b) noexcept {
return make_cuFloatComplex(a + cuCrealf(b), cuCimagf(b));
}

// Subtraction for cuFloatComplex (float) with float
__host__ __device__ __forceinline__ cuFloatComplex operator-(const cuFloatComplex &a,
float b) noexcept {
return make_cuFloatComplex(cuCrealf(a) - b, cuCimagf(a));
}

__host__ __device__ __forceinline__ cuFloatComplex operator-(
float a, const cuFloatComplex &b) noexcept {
return make_cuFloatComplex(a - cuCrealf(b), -cuCimagf(b));
}

// Multiplication for cuFloatComplex (float) with float
__host__ __device__ __forceinline__ cuFloatComplex operator*(const cuFloatComplex &a,
float b) noexcept {
return make_cuFloatComplex(cuCrealf(a) * b, cuCimagf(a) * b);
}

__host__ __device__ __forceinline__ cuFloatComplex operator*(
float a, const cuFloatComplex &b) noexcept {
return make_cuFloatComplex(a * cuCrealf(b), a * cuCimagf(b));
}

// Division for cuFloatComplex (float) with float
__host__ __device__ __forceinline__ cuFloatComplex operator/(const cuFloatComplex &a,
float b) noexcept {
return make_cuFloatComplex(cuCrealf(a) / b, cuCimagf(a) / b);
}

__host__ __device__ __forceinline__ cuFloatComplex operator/(
float a, const cuFloatComplex &b) noexcept {
float denom = cuCrealf(b) * cuCrealf(b) + cuCimagf(b) * cuCimagf(b);
return make_cuFloatComplex((a * cuCrealf(b)) / denom, (-a * cuCimagf(b)) / denom);
}

#endif // FINUFFT_INCLUDE_CUFINUFFT_CONTRIB_HELPER_MATH_H
8 changes: 5 additions & 3 deletions include/cufinufft/defs.h
@@ -2,14 +2,16 @@
#define CUFINUFFT_DEFS_H

#include <limits>

// constants needed within common
// upper bound on w, ie nspread, even when padded (see evaluate_kernel_vector); also for
// common
#define MAX_NSPREAD 16
#define MAX_NSPREAD 16

// max number of positive quadr nodes
#define MAX_NQUAD 100
#define MAX_NQUAD 100

// Fractional growth cut-off in utils::arraywidcen; sets when to translate in type-3
#define ARRAYWIDCEN_GROWFRAC 0.1

// FIXME: If cufft ever takes N > INT_MAX...
constexpr int32_t MAX_NF = std::numeric_limits<int32_t>::max();