-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gpu type 3 #517
Gpu type 3 #517
Conversation
…into gpu-optimizations
Changelog:
|
The test failure is not an issue. I will disable that test later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job! A big one. Just a few things to tidy up, should be at most 1-2 hrs if you don't address the the phase-winding tidy-up.
[However, have a look at removing the whole phase-winding and "a" stuff (it is way more confusing that it needs be, due to old code left by Melody and Robert), which I could have a stab at after merge, anyway, unless you want to. It might add 0.5-1 days work for you. Sit down and write out the math, or go back to the CPU code sans phase-winding, just using plain cos(). If it's not clear, we can together after Sept 3.]
CHANGELOG
Outdated
@@ -1,6 +1,15 @@ | |||
List of features / changes made / release notes, in reverse chronological order. | |||
If not stated, FINUFFT is assumed (cuFINUFFT <=1.3 is listed separately). | |||
|
|||
V 2.4.0 (08/28/24) | |||
* Support for type 3 in 1D, 2D, and 3D in the GPU library cufinufft (PR #517). | |||
* Removed the CPU fseries computation (only used for benchmark no longer needed). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using a separate * for each line here implies they are distinct features. Suggest use hyphen as below to group a sublist for GPU.
docs/devnotes.rst
Outdated
@@ -51,6 +51,8 @@ Developer notes | |||
|
|||
* CMake compiling on linux at Flatiron Institute (Rusty cluster): We have had a report that if you want to use LLVM, you need to ``module load llvm/16.0.3`` otherwise the default ``llvm/14.0.6`` does not find ``OpenMP_CXX``. | |||
|
|||
* Note to the nvcc developer. nvcc with debug symbols causes a stack overflow that is undetected at both compile and runtime. This goes undetected until ns>=10, for ns<10, one can use -G and debug the code with cuda-gdb. The way to avoid is to not use Debug symbols, possibly using ``--generate-line-info`` might work (not tested). As a side note, compute-sanitizers do not detect the issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great. Maybe mention only d=3?
docs/opts.rst
Outdated
**maxbatchsize**: in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), set the largest batch size of data vectors. | ||
|
||
**batchsize**: in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), set the largest batch size of data vectors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused -this seems to rename a CPU option? We don't want to change them. I thought we were changing maxbatchsize -> batchsize only internally and only in the GPU code ?
test/cuda/cufinufft1d_test.cu
Outdated
} | ||
s.resize(N1); | ||
for (int i = 0; i < N1; i++) { | ||
s[i] = M_PI * randm11(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In test/finufft1d_test.cpp the "s" array is scaled by S = N1/2. I suggest you do this here too, to make the space-freq product grow (hence the FFT size) as the number of modes. This matches CPU testers. Also for 2d, 3d, below.
test/cuda/cufinufft2d_test.cu
Outdated
s.resize(N1 * N2); | ||
t.resize(N1 * N2); | ||
for (int i = 0; i < N1 * N2; i++) { | ||
s[i] = M_PI * randm11(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see prev review on N1, N2 scaling of the s,t test freq pts.
test/cuda/cufinufft3d_test.cu
Outdated
t.resize(N1 * N2 * N3); | ||
u.resize(N1 * N2 * N3); | ||
for (int i = 0; i < N1 * N2 * N3; i++) { | ||
s[i] = M_PI * randm11(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto scaling above.
test/cuda/cufinufft_math_test.cu
Outdated
|
||
// Helper function to compare cuComplex with std::complex<T> using 1 - ratio as error | ||
template<typename T> | ||
bool compareComplex(const cuda_complex<T> &a, const std::complex<T> &b, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed today, this is too harsh a test, since when a complex number "lands" near the real axis, its imag part (by itself) may have high relative error, and that's ok. Instead of separately testing Re and Im rel errors, use cabs(a-b)/cabs(a) < epsilon. Ask if still doesn't make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good on my end. Just a few questions.
include/cufinufft/common.h
Outdated
@@ -7,27 +7,37 @@ | |||
#include <finufft_errors.h> | |||
#include <finufft_spread_opts.h> | |||
|
|||
#include <complex.h> | |||
#include <complex> | |||
#include <optional> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used?
include/cufinufft/defs.h
Outdated
@@ -1,15 +1,18 @@ | |||
#ifndef CUFINUFFT_DEFS_H | |||
#define CUFINUFFT_DEFS_H | |||
|
|||
#include <complex> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used?
include/cufinufft/types.h
Outdated
cuda_complex<T> *c; | ||
cuda_complex<T> *fw; | ||
cuda_complex<T> *fk; | ||
|
||
// Type 3 specific | ||
struct { | ||
T X1, C1, S1, D1, h1, gam1; // x dim: X=halfwid C=center D=freqcen h,gam=rescale |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
S = ?
Integrated most of the requested changes. Commented on the PR when not applicable. the fseries should be reworked in a separate PR as it can be combined with the flipwind changes. |
481b70e
into
flatironinstitute:master
This PR implements Type3 on GPU using CUDA.