
[GSOC24] Addition of CUDA and GPU Acceleration to FGMRES Linear Solver in SU2 #2346

Open · wants to merge 32 commits into develop

Conversation


@areenraj areenraj commented Aug 27, 2024

Proposed Changes

This is a modified version of the SU2 code that supports CUDA for the FGMRES solver and the use of NVBLAS. The main focus is offloading the matrix-vector product in the FGMRES solver to the GPU using CUDA kernels. This implementation shows promise, with marginally better run times (all benchmarks were carried out with GPU error checking switched off, and in debug mode to check that the correct functions were being called).

The use of NVBLAS is secondary; while functionality has been added to make it usable, it is not activated, as it doesn't give an appreciable increase in performance.

Compilation and Usage

Compile using the following Meson flag:

-Denable-cuda=true

Then activate the functions using the following config file option:

ENABLE_CUDA=YES

NOTE ON IMPLEMENTATION

I've decided to go with a single version of the code in which the CPU and GPU implementations co-exist in the same linear solver and can be disabled or switched using a combination of Meson and config file options. This is why I have defined three classes: one overarching class named CExecutionPath, with two child classes, CCpuExecution and CGpuExecution. These child classes contain the correct member function for each path - CPU or GPU execution.

All of this could also be achieved with an if statement that switches between the two, but that implementation would evaluate the branch on every call. In our case, once a matrix-vector product object is created, it immediately knows whether to use the CPU or GPU mode of execution; a minimal sketch of this structure is shown below.
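
For readers of this thread, here is a minimal sketch of the structure described above (the class names follow the PR text; the member-function signature and template parameter are placeholders for illustration, not the actual SU2 interface):

```cpp
// Hypothetical sketch of the dispatch described above; everything beyond the
// class names is assumed for illustration only.
template <class ScalarType>
class CExecutionPath {
 public:
  virtual ~CExecutionPath() = default;
  // Computes prod = matrix * vec on the chosen device.
  virtual void MatrixVectorProduct(const ScalarType* matrix, const ScalarType* vec,
                                   ScalarType* prod) const = 0;
};

template <class ScalarType>
class CCpuExecution final : public CExecutionPath<ScalarType> {
 public:
  void MatrixVectorProduct(const ScalarType* matrix, const ScalarType* vec,
                           ScalarType* prod) const override {
    // existing CPU sparse matrix-vector product path
  }
};

template <class ScalarType>
class CGpuExecution final : public CExecutionPath<ScalarType> {
 public:
  void MatrixVectorProduct(const ScalarType* matrix, const ScalarType* vec,
                           ScalarType* prod) const override {
    // upload the data and launch the CUDA kernel
  }
};

// The product object picks its execution path once, at construction, instead of
// branching on every call:
//   exec_path = useCuda ? std::unique_ptr<CExecutionPath<ScalarType>>(new CGpuExecution<ScalarType>())
//                       : std::unique_ptr<CExecutionPath<ScalarType>>(new CCpuExecution<ScalarType>());
```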

Recommendations to improve this implementation are most welcome.

PR Checklist

Warnings do appear (but only at level 3), and they are all of the following type:

style of line directive is a GCC extension

Does the documentation for compiling with CUDA need to be added by forking the SU2 site repo and adding the relevant changes there, or do I need to contact someone to change things on the site itself?

Doxygen documentation and the config template are all updated.

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with --warnlevel=3 when using meson).
  • My contribution is commented and consistent with SU2 style (https://su2code.github.io/docs_v7/Style-Guide/).
  • I used the pre-commit hook to prevent dirty commits and used pre-commit run --all to format old commits.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.

@areenraj areenraj changed the title from "Addition of CUDA and GPU Execution Capabilities to FGMRES Linear Solver in SU2" to "Addition of CUDA and GPU Acceleration to FGMRES Linear Solver in SU2" on Aug 27, 2024
Member

@pcarruscag pcarruscag left a comment


Nice work


gpuErrChk(cudaMemcpy((void*)(d_matrix), (void*)&matrix[0], (sizeof(ScalarType)*mat_size), cudaMemcpyHostToDevice));
gpuErrChk(cudaMemcpy((void*)(d_vec), (void*)&vec[0], (sizeof(ScalarType)*vec_size), cudaMemcpyHostToDevice));
gpuErrChk(cudaMemcpy((void*)(d_prod), (void*)&prod[0], (sizeof(ScalarType)*vec_size), cudaMemcpyHostToDevice));
Member

You don't need to copy the product; you just need to memset it to 0.
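
(For context, the suggestion amounts to replacing the third copy above with something like the following sketch, reusing the names from the quoted snippet:)

```cpp
// prod is fully recomputed on the device, so zero-initialize it there
// instead of copying the host array over.
gpuErrChk(cudaMemset((void*)d_prod, 0, sizeof(ScalarType) * vec_size));
```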

Author

@areenraj areenraj Aug 29, 2024

Good catch, will add this. Thank you

Comment on lines +88 to +92
double xDim = (double) 1024.0/(nVar*nEqn);
dim3 blockDim(floor(xDim), nVar, nEqn);
double gridx = (double) nPointDomain/xDim;
dim3 gridDim(ceil(gridx), 1, 1);

Member

Can you document the choice of work distribution between blocks and threads?

Author

@areenraj areenraj Aug 29, 2024

Comment on lines +56 to +62
for(int index = d_row_ptr[i]; index<d_row_ptr[i+1]; index++)
{
int matrix_index = index * nVar * nEqn;
int vec_index = d_col_ind[index] * nEqn;

res += matrix[matrix_index + (j * nEqn + k)] * vec[vec_index + k];
}
Member

Is this based on some publication? Did you experiment with other divisions of work?
For example, I see you are going for coalesced access to the matrix blocks, but this requires multiple reads of the same vector entries.

Author

@areenraj areenraj Aug 29, 2024

I haven't experimented with different implementations; I went with this one because it seemed optimal. It does access the same vector elements repeatedly while going through an entire row.

I haven't particularly checked out publications yet, but you're right, there may be a better way to do this and I'll look into it. If you do have some recommendations for improvements, let me know.

Author

Also, the current bottleneck is the memory copy between the CPU and GPU; the kernel launch itself is 20 times faster than the repeated copy. Any insights on that would also be greatly appreciated.

Our current approach to circumvent this is to port to the GPU not only single subroutines like the matrix-vector multiplication, but the entire Krylov solver loop that searches over all the search directions in the subspace. This would cut down on the repeated memory transfers.

Member

You can add something like a flag in CSysMatrix that is set to true when the matrix is uploaded to the GPU and set to false when the matrix changes (for example, when we clear the matrix to write new blocks).
You can use pinned host memory to make the transfers faster.
You can try uploading the matrix in chunks and overlap the uploads with the CPU work of filling another chunk.
Ultimately, the issue of transferring the matrix only goes away by porting the entire code 😅
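
A rough sketch of the pinned-memory and chunked-upload ideas (hypothetical code: num_chunks, the chunking scheme, and the stream handling are illustrative, not taken from the PR):

```cpp
// Pinned (page-locked) host buffer: enables faster and truly asynchronous H2D copies.
ScalarType* h_matrix = nullptr;
gpuErrChk(cudaMallocHost((void**)&h_matrix, sizeof(ScalarType) * mat_size));

cudaStream_t stream;
gpuErrChk(cudaStreamCreate(&stream));

// Upload the matrix in chunks; each cudaMemcpyAsync returns immediately, so the CPU
// can keep filling the next chunk while the previous one is still in flight.
const size_t chunk = mat_size / num_chunks;
for (size_t c = 0; c < num_chunks; ++c) {
  const size_t offset = c * chunk;
  const size_t count = (c + 1 == num_chunks) ? mat_size - offset : chunk;
  // ... CPU fills h_matrix[offset, offset + count) here ...
  gpuErrChk(cudaMemcpyAsync(d_matrix + offset, h_matrix + offset,
                            sizeof(ScalarType) * count, cudaMemcpyHostToDevice, stream));
}
gpuErrChk(cudaStreamSynchronize(stream));
```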

Member

Regarding recommendations, I have to be a bit cryptic because of my current job, but the general goals are coalesced access and avoiding reading or writing the same global memory location more than once.
I read this paper before my current job:
Optimization of Block Sparse Matrix-Vector Multiplication on Shared-Memory Parallel Architectures
But like you said, there is much more to gain by porting more linear algebra operations than by micro-optimizing the multiplications.

Author

I'll go through the paper and follow up on the pinned memory lead.

I'll continue working on porting the solver; let's hope we get some interesting work in time for the conference 😄

If necessary, I'll contact you with updates either on this thread or catch you in the next dev meeting. Thanks for the help, Pedro.

@areenraj areenraj changed the title from "Addition of CUDA and GPU Acceleration to FGMRES Linear Solver in SU2" to "[GSOC24] Addition of CUDA and GPU Acceleration to FGMRES Linear Solver in SU2" on Aug 29, 2024
@kursatyurt
Contributor

I could not follow the work here; I just watched the conference presentation. Since much work has already been done, have you considered using an existing linear algebra library such as https://github.com/ginkgo-project/ginkgo?

AFAIK the only thing you need to do is copy the matrix into Ginkgo's format on the GPU. Then Ginkgo will provide an efficient, scalable solver that works not only on NVIDIA but also on AMD and Intel GPUs.

GMRES and ILU preconditioners are available there, so it is pretty much ready to go for all problems.
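
For anyone following along, the Ginkgo usage would look roughly like the sketch below. This is based on Ginkgo's documented executor/factory pattern; the exact option and method names (and whether gko::lend is needed for apply) should be checked against the Ginkgo version actually used, and the matrix files are placeholders.

```cpp
#include <fstream>
#include <ginkgo/ginkgo.hpp>

int main() {
  // Run on GPU 0; the host side is handled by an OpenMP executor.
  auto exec = gko::CudaExecutor::create(0, gko::OmpExecutor::create());

  // System matrix in Ginkgo's CSR format, read from MatrixMarket files (placeholders).
  using mtx = gko::matrix::Csr<double, int>;
  auto A = gko::share(gko::read<mtx>(std::ifstream("A.mtx"), exec));
  auto b = gko::read<gko::matrix::Dense<double>>(std::ifstream("b.mtx"), exec);
  auto x = gko::read<gko::matrix::Dense<double>>(std::ifstream("x0.mtx"), exec);

  // GMRES with an ILU preconditioner, stopping on iteration count or residual reduction.
  auto solver =
      gko::solver::Gmres<double>::build()
          .with_preconditioner(gko::preconditioner::Ilu<>::build().on(exec))
          .with_criteria(
              gko::stop::Iteration::build().with_max_iters(500u).on(exec),
              gko::stop::ResidualNorm<double>::build().with_reduction_factor(1e-8).on(exec))
          .on(exec)
          ->generate(A);

  solver->apply(b, x);  // older Ginkgo versions expect gko::lend(b), gko::lend(x)
}
```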

@areenraj
Author

areenraj commented Oct 7, 2024

@kursatyurt Hello, thank you so much for the lead.

Our initial scope mostly involved writing our own kernels, and I did explore some libraries at the start - I was planning on using CUSP as well, but my main concern was that it has not been kept up to date with newer versions of the toolkit. cuSolver and cuBLAS do exist, but I chose to go ahead with a "simple" kernel implementation to have more control. I also felt that if I could keep the block size of the grid in optimal territory, then it could be just as fast as those options (please do correct me if my reading of the literature or the situation is incorrect).

I was not aware of Ginkgo and I will surely give it a go and try to produce some comparative results. I am currently super busy for this month and will get to working on the code with some delay.

Again, thank you for the lead!

@areenraj
Author

areenraj commented Oct 7, 2024

Also, could you mention what you meant by "you could not follow the work here"? If there is a specific doubt about the work, I would love to clarify it over Slack whenever I get the time.

@kursatyurt
Contributor

Also, could you mention what you meant by "you could not follow the work here"? If there is a specific doubt about the work, I would love to clarify it over Slack whenever I get the time.

I am not very familiar with the linear solver implementation in SU2; it is not about the work itself.

@kursatyurt
Contributor

@kursatyurt Hello, thank you so much for the lead.

Our initial scope mostly involved writing our own kernels, and I did explore some libraries at the start - I was planning on using CUSP as well, but my main concern was that it has not been kept up to date with newer versions of the toolkit. cuSolver and cuBLAS do exist, but I chose to go ahead with a "simple" kernel implementation to have more control. I also felt that if I could keep the block size of the grid in optimal territory, then it could be just as fast as those options (please do correct me if my reading of the literature or the situation is incorrect).

For learning the basics it's a good idea, but for large-scale projects I prefer using existing libraries if possible.
Those libraries generally exploit state-of-the-art techniques like mixed-precision computing. A gaming GPU is not much faster than a good CPU in double precision, but it is much faster in single precision; most gaming GPUs have a 64:1 FP32-to-FP64 throughput ratio, whereas server-class GPUs have a 2:1 ratio. Also, when available, they use vendor libraries like cuBLAS or hipBLAS. It is always nice when you only have to care about the connection and somebody else handles the solver as performantly as possible. In the future they will probably provide more and more solvers, and it will all just work automagically.

It is also fairly lightweight, not a huge dependency like Trilinos or PETSc.

I was not aware of Ginkgo and I will surely give it a go and try to produce some comparative results. I am currently super busy for this month and will get to working on the code with some delay.

Again, thank you for the lead!

I can test on various GPUs (P100/V100/A100 and a 4070 Mobile), on single-node multi-GPU setups, etc.
