Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gpu type 3 #517

Merged
merged 82 commits into from
Sep 12, 2024
Merged

Gpu type 3 #517

merged 82 commits into from
Sep 12, 2024

Conversation

DiamonDinoia
Copy link
Collaborator

@DiamonDinoia DiamonDinoia commented Aug 8, 2024

This PR implements Type3 on GPU using CUDA.

@DiamonDinoia DiamonDinoia marked this pull request as ready for review August 28, 2024 18:01
@DiamonDinoia
Copy link
Collaborator Author

DiamonDinoia commented Aug 28, 2024

Changelog:

  • Support for type 3 in 1D, 2D, and 3D in the GPU library cufinufft
  • Removed the CPU fseries computation (only used for benchmark no longer needed).
  • Added complex arithmetic support for cuda_complex type
  • Added tests for type 3 in 1D, 2D, and 3D and cuda_complex arithmetic
  • Minor fixes on the GPU code:
    • removed memory leaks in case of errors
    • renamed maxbatchsize to batchsize

@DiamonDinoia
Copy link
Collaborator Author

The test failure is not an issue. I will disable that test later.

Copy link
Collaborator

@ahbarnett ahbarnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! A big one. Just a few things to tidy up, should be at most 1-2 hrs if you don't address the the phase-winding tidy-up.

[However, have a look at removing the whole phase-winding and "a" stuff (it is way more confusing that it needs be, due to old code left by Melody and Robert), which I could have a stab at after merge, anyway, unless you want to. It might add 0.5-1 days work for you. Sit down and write out the math, or go back to the CPU code sans phase-winding, just using plain cos(). If it's not clear, we can together after Sept 3.]

CHANGELOG Outdated
@@ -1,6 +1,15 @@
List of features / changes made / release notes, in reverse chronological order.
If not stated, FINUFFT is assumed (cuFINUFFT <=1.3 is listed separately).

V 2.4.0 (08/28/24)
* Support for type 3 in 1D, 2D, and 3D in the GPU library cufinufft (PR #517).
* Removed the CPU fseries computation (only used for benchmark no longer needed).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using a separate * for each line here implies they are distinct features. Suggest use hyphen as below to group a sublist for GPU.

@@ -51,6 +51,8 @@ Developer notes

* CMake compiling on linux at Flatiron Institute (Rusty cluster): We have had a report that if you want to use LLVM, you need to ``module load llvm/16.0.3`` otherwise the default ``llvm/14.0.6`` does not find ``OpenMP_CXX``.

* Note to the nvcc developer. nvcc with debug symbols causes a stack overflow that is undetected at both compile and runtime. This goes undetected until ns>=10, for ns<10, one can use -G and debug the code with cuda-gdb. The way to avoid is to not use Debug symbols, possibly using ``--generate-line-info`` might work (not tested). As a side note, compute-sanitizers do not detect the issue.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great. Maybe mention only d=3?

docs/opts.rst Outdated
**maxbatchsize**: in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), set the largest batch size of data vectors.

**batchsize**: in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), set the largest batch size of data vectors.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused -this seems to rename a CPU option? We don't want to change them. I thought we were changing maxbatchsize -> batchsize only internally and only in the GPU code ?

include/cufinufft/contrib/helper_math.h Show resolved Hide resolved
include/cufinufft/impl.h Show resolved Hide resolved
}
s.resize(N1);
for (int i = 0; i < N1; i++) {
s[i] = M_PI * randm11();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In test/finufft1d_test.cpp the "s" array is scaled by S = N1/2. I suggest you do this here too, to make the space-freq product grow (hence the FFT size) as the number of modes. This matches CPU testers. Also for 2d, 3d, below.

test/cuda/cufinufft1d_test.cu Show resolved Hide resolved
s.resize(N1 * N2);
t.resize(N1 * N2);
for (int i = 0; i < N1 * N2; i++) {
s[i] = M_PI * randm11();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see prev review on N1, N2 scaling of the s,t test freq pts.

t.resize(N1 * N2 * N3);
u.resize(N1 * N2 * N3);
for (int i = 0; i < N1 * N2 * N3; i++) {
s[i] = M_PI * randm11();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto scaling above.


// Helper function to compare cuComplex with std::complex<T> using 1 - ratio as error
template<typename T>
bool compareComplex(const cuda_complex<T> &a, const std::complex<T> &b,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed today, this is too harsh a test, since when a complex number "lands" near the real axis, its imag part (by itself) may have high relative error, and that's ok. Instead of separately testing Re and Im rel errors, use cabs(a-b)/cabs(a) < epsilon. Ask if still doesn't make sense.

Copy link
Collaborator

@janden janden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good on my end. Just a few questions.

@@ -7,27 +7,37 @@
#include <finufft_errors.h>
#include <finufft_spread_opts.h>

#include <complex.h>
#include <complex>
#include <optional>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used?

@@ -1,15 +1,18 @@
#ifndef CUFINUFFT_DEFS_H
#define CUFINUFFT_DEFS_H

#include <complex>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used?

include/cufinufft/types.h Show resolved Hide resolved
include/cufinufft/types.h Show resolved Hide resolved
cuda_complex<T> *c;
cuda_complex<T> *fw;
cuda_complex<T> *fk;

// Type 3 specific
struct {
T X1, C1, S1, D1, h1, gam1; // x dim: X=halfwid C=center D=freqcen h,gam=rescale
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

S = ?

include/cufinufft/utils.h Show resolved Hide resolved
@DiamonDinoia
Copy link
Collaborator Author

Integrated most of the requested changes. Commented on the PR when not applicable.
On the timing, maybe worth relying on the nvidia profiler as it provides a nice output and is shipped with cuda. Otherwise I will think about it, as one has to use the CUDA event API to measure the time.

the fseries should be reworked in a separate PR as it can be combined with the flipwind changes.

@DiamonDinoia DiamonDinoia merged commit 481b70e into flatironinstitute:master Sep 12, 2024
166 of 167 checks passed
@DiamonDinoia DiamonDinoia deleted the gpu-type-3 branch September 12, 2024 17:07
@ahbarnett ahbarnett mentioned this pull request Sep 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants