[GSOC24] Addition of CUDA and GPU Acceleration to FGMRES Linear Solver in SU2 #2346

areenraj · 2024-08-27T10:01:24Z

Proposed Changes

This is a modified version of the SU2 code that adds CUDA support to the FGMRES solver and allows the use of NVBLAS. The main focus is offloading the matrix-vector product in the FGMRES solver to the GPU using CUDA kernels. The implementation shows promise, with marginally better run times (all benchmarks were carried out with GPU error checking switched off and in debug mode, to check that the correct functions were being called).

The use of NVBLAS is secondary: the functionality has been added to make it usable, but it is not activated because it does not produce an appreciable increase in performance.

Compilation and Usage

Compile with the following Meson flag:

-Denable-cuda=true

Then activate the GPU functions with the following config file option:

ENABLE_CUDA=YES

NOTE ON IMPLEMENTATION

I've decided to go with a single version of the code in which the CPU and GPU implementations coexist in the same linear solver and can be disabled or switched using a combination of Meson and config file options. This is why I have defined three classes: an overarching class named CExecutionPath with two child classes, CCpuExecution and CGpuExecution. Each child class contains the correct member function for its path, CPU or GPU.

All of this could also be achieved with an if statement that switches between the two, but that implementation would evaluate the condition on every call. In our case, once a matrix-vector product object is created, it immediately knows whether to use the CPU or GPU mode of execution.

Recommendations to improve this implementation are most welcome.
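A minimal sketch of the pattern described above, assuming a much simpler interface than the real one: the class names come from this PR, but the `MatVec` signature, the dense `std::vector` storage, and the `MakeExecutionPath` factory are hypothetical and chosen only so the example compiles without CUDA. The GPU class delegates to the CPU loop as a placeholder for a kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified interface. The real SU2 classes work on
// CSysMatrix/CSysVector objects, not on dense std::vector storage.
class CExecutionPath {
 public:
  virtual ~CExecutionPath() = default;
  // Computes prod = mat * vec for a dense row-major n x n matrix.
  virtual void MatVec(const std::vector<double>& mat, const std::vector<double>& vec,
                      std::vector<double>& prod) const = 0;
  virtual std::string Name() const = 0;
};

class CCpuExecution final : public CExecutionPath {
 public:
  void MatVec(const std::vector<double>& mat, const std::vector<double>& vec,
              std::vector<double>& prod) const override {
    const std::size_t n = vec.size();
    assert(mat.size() == n * n);
    prod.assign(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t j = 0; j < n; ++j) prod[i] += mat[i * n + j] * vec[j];
  }
  std::string Name() const override { return "CPU"; }
};

class CGpuExecution final : public CExecutionPath {
 public:
  void MatVec(const std::vector<double>& mat, const std::vector<double>& vec,
              std::vector<double>& prod) const override {
    // Placeholder: a real implementation would copy data and launch a CUDA
    // kernel here. The CPU loop is reused so the sketch runs without a GPU.
    CCpuExecution().MatVec(mat, vec, prod);
  }
  std::string Name() const override { return "GPU"; }
};

// Hypothetical factory: the CPU/GPU decision is taken once, at creation time.
std::unique_ptr<CExecutionPath> MakeExecutionPath(bool enableCuda) {
  if (enableCuda) return std::unique_ptr<CExecutionPath>(new CGpuExecution());
  return std::unique_ptr<CExecutionPath>(new CCpuExecution());
}
```

With this layout the config option is read once when the object is built; later calls go through virtual dispatch on the stored pointer rather than re-testing the option. Whether that is measurably cheaper than a well-predicted branch is not something this sketch demonstrates.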

PR Checklist

Warnings do appear (only at warning level 3), but they are all of the following type:

style of line directive is a GCC extension

Should the documentation for compiling with CUDA be added by forking the SU2 site repo and making the relevant changes there, or do I need to contact someone to change things on the site itself?

The Doxygen documentation and the config template are all updated.

I am submitting my contribution to the develop branch.
My contribution generates no new compiler warnings (try with --warnlevel=3 when using meson).
My contribution is commented and consistent with SU2 style (https://su2code.github.io/docs_v7/Style-Guide/).
I used the pre-commit hook to prevent dirty commits and used pre-commit run --all to format old commits.
I have added a test case that demonstrates my contribution, if necessary.
I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.

pcarruscag

Nice work

pcarruscag · 2024-08-27T12:04:27Z

Common/src/linear_algebra/GPU_lin_alg.cu

+
+  gpuErrChk(cudaMemcpy((void*)(d_matrix), (void*)&matrix[0], (sizeof(ScalarType)*mat_size), cudaMemcpyHostToDevice));
+  gpuErrChk(cudaMemcpy((void*)(d_vec), (void*)&vec[0], (sizeof(ScalarType)*vec_size), cudaMemcpyHostToDevice));
+  gpuErrChk(cudaMemcpy((void*)(d_prod), (void*)&prod[0], (sizeof(ScalarType)*vec_size), cudaMemcpyHostToDevice));


you don't need to copy the product, you just need to memset to 0

Good catch, will add this. Thank you

pcarruscag · 2024-08-27T12:05:20Z

Common/src/linear_algebra/GPU_lin_alg.cu

+  double xDim = (double) 1024.0/(nVar*nEqn);
+  dim3 blockDim(floor(xDim), nVar, nEqn);
+  double gridx = (double) nPointDomain/xDim;
+  dim3 gridDim(ceil(gridx), 1, 1);
+


Can you document the choice of work distribution between blocks and threads?

Full report with Benchmark Results and other Explanations

Complete definition of how the Kernel is initialized

Explanation as to why the blocks are multi-directional

Algorithm for Block Matrix Calculations
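As a rough illustration of the launch arithmetic in the diff above, the sketch below reproduces it on the host and compares it with an integer ceiling division. The constant 1024 is taken from the diff; the struct, the function names, and the reading that `blockDim.x` is the number of block rows handled per thread block are assumptions, since the excerpt does not show how the kernel maps thread indices to rows.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the launch shape; not an SU2 or CUDA type.
struct LaunchGeometry {
  unsigned rowsPerBlock;  // would become blockDim.x
  unsigned blockY;        // would become blockDim.y (nVar)
  unsigned blockZ;        // would become blockDim.z (nEqn)
  unsigned numBlocks;     // would become gridDim.x
};

// Same arithmetic as the diff: the grid size divides by the unfloored xDim.
LaunchGeometry GeometryAsInDiff(unsigned nPointDomain, unsigned nVar, unsigned nEqn) {
  const double xDim = 1024.0 / (nVar * nEqn);
  const double gridx = static_cast<double>(nPointDomain) / xDim;
  return {static_cast<unsigned>(std::floor(xDim)), nVar, nEqn,
          static_cast<unsigned>(std::ceil(gridx))};
}

// Alternative: integer ceiling division by the rows a block actually holds.
LaunchGeometry GeometryIntegerCeil(unsigned nPointDomain, unsigned nVar, unsigned nEqn) {
  const unsigned rowsPerBlock = 1024u / (nVar * nEqn);
  assert(rowsPerBlock > 0);  // nVar * nEqn must not exceed 1024
  const unsigned numBlocks = (nPointDomain + rowsPerBlock - 1u) / rowsPerBlock;
  return {rowsPerBlock, nVar, nEqn, numBlocks};
}
```

When `nVar * nEqn` divides 1024 the two agree (for example 4 x 4 blocks give 64 rows per block). When it does not, as with 5 x 5 blocks, `floor(xDim)` is 40 while the grid size is computed from 40.96, so `numBlocks * rowsPerBlock` can fall short of `nPointDomain`. That only matters if each thread handles exactly one row without striding, which the excerpt does not confirm.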

pcarruscag · 2024-08-27T12:08:29Z

Common/src/linear_algebra/GPU_lin_alg.cu

+      for(int index = d_row_ptr[i]; index<d_row_ptr[i+1]; index++)
+      {
+        int matrix_index = index * nVar * nEqn;
+        int vec_index = d_col_ind[index] * nEqn;
+
+        res += matrix[matrix_index + (j * nEqn + k)] * vec[vec_index + k];
+      }


Is this based on some publication? Did you experiment with other divisions of work?
For example, I see you are going for coalesced access to the matrix blocks, but this requires multiple reads of the same vector entries.

I haven't experimented with different implementations; I went with this one because it seemed optimal. It does access the same vector elements repeatedly while going through an entire row.

I haven't checked the publications yet, but you're right: there may be a better way to do this, and I'll look into it. If you have recommendations for improvements, let me know.

Also, the current bottleneck is the memory copy between the CPU and GPU: the kernel launch itself is 20 times faster than the repeated copy. Any insights on that would be greatly appreciated.

Our current approach to circumvent this is to port not only single subroutines such as the matrix-vector multiplication, but the entire Krylov solver loop, in which it iterates over all the search directions in the subspace, to the GPU. This would cut down on the repeated memory transfers.
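For reference, the indexing in the kernel loop above can be mirrored by a serial host routine, which is handy for checking GPU results against the CPU. This is a sketch under assumptions: the function name and the `std::vector` arguments are hypothetical, and writing the result to `prod[i * nVar + j]` is a guess, because the excerpt does not show where the kernel stores `res`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial block-CSR matrix-vector product using the same index arithmetic as
// the kernel excerpt. Each stored block is nVar x nEqn, row-major.
std::vector<double> BlockCsrMatVec(const std::vector<std::size_t>& row_ptr,
                                   const std::vector<std::size_t>& col_ind,
                                   const std::vector<double>& matrix,
                                   const std::vector<double>& vec,
                                   std::size_t nVar, std::size_t nEqn) {
  assert(!row_ptr.empty());
  const std::size_t nRows = row_ptr.size() - 1;
  std::vector<double> prod(nRows * nVar, 0.0);
  for (std::size_t i = 0; i < nRows; ++i) {
    for (std::size_t index = row_ptr[i]; index < row_ptr[i + 1]; ++index) {
      const std::size_t matrix_index = index * nVar * nEqn;
      const std::size_t vec_index = col_ind[index] * nEqn;
      for (std::size_t j = 0; j < nVar; ++j) {
        for (std::size_t k = 0; k < nEqn; ++k) {
          // vec[vec_index + k] is read once for every j, i.e. nVar times per
          // block: the repeated vector reads discussed in the review.
          prod[i * nVar + j] += matrix[matrix_index + (j * nEqn + k)] * vec[vec_index + k];
        }
      }
    }
  }
  return prod;
}
```

On a GPU, each (j, k) pair maps to its own thread in the PR's layout, so the matrix reads are contiguous within a block while the same `vec` entries are fetched by the threads of every j.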

You can do something like a flag in CSysMatrix that is set to true when the matrix is uploaded to GPU and set to false when the matrix changes (for example when we clear the matrix to write new blocks).
You can use pinned host memory to make the transfers faster.
You can try uploading the matrix in chunks and overlap the uploads with the CPU work of filling another chunk.
Ultimately, the issue of transferring the matrix only goes away by porting the entire code 😅
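The first suggestion, a flag that records whether the device copy is current, could look roughly like the sketch below. Everything in it is hypothetical: the class name, its methods, and the second `std::vector` standing in for device memory (a real version would live in or next to CSysMatrix and call `cudaMemcpy` where the vector is copied).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cache that uploads the matrix only when the host copy changed.
class DeviceMatrixCache {
 public:
  explicit DeviceMatrixCache(std::size_t size) : host_(size, 0.0) {}

  // Any write to the host matrix invalidates the device copy.
  void SetValue(std::size_t i, double value) {
    host_[i] = value;
    onDevice_ = false;
  }

  // Clearing the matrix to write new blocks also invalidates the device copy.
  void SetValZero() {
    std::fill(host_.begin(), host_.end(), 0.0);
    onDevice_ = false;
  }

  // Returns the "device" data, uploading first only if it is stale.
  const std::vector<double>& DeviceView() {
    if (!onDevice_) {
      device_ = host_;  // stand-in for a host-to-device cudaMemcpy
      ++uploads_;
      onDevice_ = true;
    }
    return device_;
  }

  unsigned UploadCount() const { return uploads_; }

 private:
  std::vector<double> host_;
  std::vector<double> device_;  // stand-in for a device buffer
  bool onDevice_ = false;
  unsigned uploads_ = 0;
};
```

Repeated matrix-vector products within one linear solve would then trigger a single upload, since the matrix does not change between FGMRES iterations; only the vectors would still need transferring on each call.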

Regarding recommendations I have to be a bit cryptic because of my current job, but the general goals are coalesced access, and avoid reading or writing the same global memory location more than once.
I read this paper before my current job.
Optimization of Block Sparse Matrix-Vector Multiplication on Shared-Memory Parallel Architectures
But like you said, there is much more to gain by porting more linear algebra operations than to micro-optimize the multiplications.

I'll go through the paper and follow up on the pinned memory lead.

I will continue to work on porting the solver; let's hope we get some interesting results in time for the conference 😄

If necessary, I'll contact you with updates either on this thread or catch you in the next dev meeting. Thanks for the help, Pedro.

kursatyurt · 2024-10-07T18:24:04Z

I could not follow the work here; I just watched the conference presentation. As much work has been done, have you considered using an existing linear algebra library such as https://github.com/ginkgo-project/ginkgo?

AFAIK, the only thing you need to do is copy the matrix to the Ginkgo format on the GPU. Ginkgo will then provide an efficient, scalable solver that works not only on NVIDIA but also on AMD and Intel.

GMRES and ILU preconditioners are available there, so it is pretty much ready to go for all problems.

areenraj · 2024-10-07T18:46:09Z

@kursatyurt Hello, thank you so much for the lead.

Our initial scope mostly involved writing our own kernels, and I did explore some libraries at the start. I was planning on using CUSP as well, but my main concern was that it has not been updated for compatibility with the newer versions of the toolkit. cuSolver and cuBLAS do exist, but I chose to go ahead with a "simple" kernel implementation to have more control. I also felt that if I could keep the block size of the grid in optimal territory, then the kernels could be just as fast as those options (please do correct me if my reading of the literature or the situation was incorrect).

I was not aware of Ginkgo and I will surely give it a go and try to produce some comparative results. I am currently super busy for this month and will get to working on the code with some delay.

Again, thank you for the lead!

areenraj · 2024-10-07T18:47:06Z

Also, could you explain what you meant by "you could not follow the work here"? If there is a specific doubt about the work, I would love to clarify it over Slack whenever I get the time.

kursatyurt · 2024-10-07T19:41:46Z

Also, could you explain what you meant by "you could not follow the work here"? If there is a specific doubt about the work, I would love to clarify it over Slack whenever I get the time.

I meant that I am not very familiar with the linear solver implementation in SU2; it was not about the work itself.

kursatyurt · 2024-10-07T20:12:53Z

@kursatyurt Hello, thank you so much for the lead.

Our initial scope mostly involved writing our own kernels, and I did explore some libraries at the start. I was planning on using CUSP as well, but my main concern was that it has not been updated for compatibility with the newer versions of the toolkit. cuSolver and cuBLAS do exist, but I chose to go ahead with a "simple" kernel implementation to have more control. I also felt that if I could keep the block size of the grid in optimal territory, then the kernels could be just as fast as those options (please do correct me if my reading of the literature or the situation was incorrect).

It's a good idea for learning the basics, but for large-scale projects I prefer using existing libraries if possible.
Those libraries generally exploit state-of-the-art techniques such as mixed-precision computing. A gaming GPU is not much faster than a good CPU in double precision, but it is much faster in single precision; most of them have a 64:1 ratio, whereas server-class GPUs have a 2:1 ratio. When available, these libraries also use vendor libraries such as cuBLAS or hipBLAS. It is always nice when you only have to care about the interface and somebody else makes the solver as performant as possible. In the future they will probably provide more and more solvers, and those will work automagically.

It is kind of light-weight too, not a huge dependency like Trilinos or PETSc.

I was not aware of Ginkgo and I will surely give it a go and try to produce some comparative results. I am currently super busy for this month and will get to working on the code with some delay.

Again, thank you for the lead!

I can test on various GPUs (P100/V100/A100 and 4070 Mobile), including single-node multi-GPU setups.

areenraj added 30 commits July 26, 2024 17:53

refresh everything

34c0bda

readme update

c953af8

final push

231968d

Enable GPU Mat Vec

4bddc30

New Branch and Optimized Memory Alloc on GPU Slightly

2da57f6

Finished GPU Mat-Vec with CPU Accuracy and Block Matrix Parallelizati…

aa35779

…on as well

Added fully template kernels to prevent type errors

8866936

Disabled NVBLAS Implementation in the DG Solver

12f01db

Added options and error check

d7cbd5e

reverting stuff to debug

103ec39

fixed turbulent case but Error Check is a performance hit

38d658b

Updated README for final report

02b9eb8

readme update

1a902ae

readme update

3fa00a9

readme update

57ffb74

Final Graph Changes

b8f14d3

Image Changes

3a000e2

Added Runtime Polymorphism for selecting execution Path

c31b9df

Added Runtime Polymorphism to select between CPU and GPU Execution

e131308

Added runtime polymorphism to select execution path

7695314

Added runtime polymorphism to select execution path

809c2d0

Added Runtime Polymorphism to select between CPU and GPU Execution

9832f33

Added Runtime Polymorphism to select between CPU and GPU Execution

1f41592

Create REPORT.md

de0810a

Created Final Report

Delete REPORT.md

66878d1

Added Preprocessor Directives

729bfc8

Making Repo PR Ready

3f351db

Making Repo PR Ready

f363809

Making it PR Ready

a0e09d7

Pre-Commit Hook Ran

b489d09

PR Ready

4926d34

areenraj changed the title ~~Addition of CUDA and GPU Execution Capabilities to FGMRES Linear Solver in SU2~~ Addition of CUDA and GPU Acceleration to FGMRES Linear Solver in SU2 Aug 27, 2024

pcarruscag reviewed Aug 27, 2024

areenraj changed the title ~~Addition of CUDA and GPU Acceleration to FGMRES Linear Solver in SU2~~ [GSOC24] Addition of CUDA and GPU Acceleration to FGMRES Linear Solver in SU2 Aug 29, 2024

added some fixes and error handling

c691960
