-
Notifications
You must be signed in to change notification settings - Fork 843
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GSOC24] Addition of CUDA and GPU Acceleration to FGMRES Linear Solver in SU2 #2346
base: develop
Are you sure you want to change the base?
Conversation
Created Final Report
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work
|
||
gpuErrChk(cudaMemcpy((void*)(d_matrix), (void*)&matrix[0], (sizeof(ScalarType)*mat_size), cudaMemcpyHostToDevice)); | ||
gpuErrChk(cudaMemcpy((void*)(d_vec), (void*)&vec[0], (sizeof(ScalarType)*vec_size), cudaMemcpyHostToDevice)); | ||
gpuErrChk(cudaMemcpy((void*)(d_prod), (void*)&prod[0], (sizeof(ScalarType)*vec_size), cudaMemcpyHostToDevice)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to copy the product, you just need to memset to 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, will add this. Thank you
double xDim = (double) 1024.0/(nVar*nEqn); | ||
dim3 blockDim(floor(xDim), nVar, nEqn); | ||
double gridx = (double) nPointDomain/xDim; | ||
dim3 gridDim(ceil(gridx), 1, 1); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you document the choice of work distribution between blocks and threads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for(int index = d_row_ptr[i]; index<d_row_ptr[i+1]; index++) | ||
{ | ||
int matrix_index = index * nVar * nEqn; | ||
int vec_index = d_col_ind[index] * nEqn; | ||
|
||
res += matrix[matrix_index + (j * nEqn + k)] * vec[vec_index + k]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this based on some publication? Did you experiment with other divisions of work?
For example, I see you are going for coalesced access to the matrix blocks, but this requires multiple reads of the same vector entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't experimented with different implementations as I went with this because it seemed optimal. It does access the same vector elements repeatedly while going through an entire row.
Haven't particularly checked out publications yet but you're right, there may be a better way to do this and I'll look into them. If you do have some recommendations on improvements then let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the current bottleneck is based on memory copy between the CPU and GPU while the kernel launch itself is 20 times faster than the repeated copy. Any insights on that as well would be very grateful.
Our current approach to circumvent this is to port not only single subroutines like matrix vector multiplication but the entire Krylov Solver loop where it searches over all the search directions in the subspace to the GPU. This would cut down on the repeated memory transfers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do something like a flag in CSysMatrix that is set to true when the matrix is uploaded to GPU and set to false when the matrix changes (for example when we clear the matrix to write new blocks).
You can use pinned host memory to make the transfers faster.
You can try uploading the matrix in chunks and overlap the uploads with the CPU work of filling another chunk.
Ultimately, the issue of transferring the matrix only goes away by porting the entire code 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding recommendations I have to be a bit cryptic because of my current job, but the general goals are coalesced access, and avoid reading or writing the same global memory location more than once.
I read this paper before my current job.
Optimization of Block Sparse Matrix-Vector Multiplication on Shared-Memory
Parallel Architectures
But like you said, there is much more to gain by porting more linear algebra operations than to micro-optimize the multiplications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll go through the paper and follow up the pinned memory lead
Will continue to work on porting the solver, lets hope we get some interesting work in time for the conference 😄
If necessary, I'll contact you with updates either on this thread or catch you in the next dev meeting. Thanks for the help Pedro
I could not follow the work here; I just watched the conference presentation. As much work has been done, have you considered using an existing linear algebra library such as https://github.com/ginkgo-project/ginkgo? AFAIK the only thing you need to do is copy the matrix to Gingko format on GPU. Then Gingko will provide an efficient, scalable solver that works not only on NVIDIA but also on AMD and Intel. GMRES and ILU preconditioners are available there, so it is pretty much ready to go for all problems. |
@kursatyurt Hello, thank you so much for the lead. Our initial scope mostly involved writing our own kernels and I did explore some libraries at the start - I was planning on using CUSP as well but my main concern was its lack of being updated to the newly compatible versions of the toolkit. cuSolver and cuBLAS do exist, but I chose to go ahead with a "simple" kernel implementation to have more control. I also felt that if I could keep the block size of the grid in optimal territory then they could be just as fast as those options (please do correct me if my reading of the literature or the situation was incorrect) I was not aware of Ginkgo and I will surely give it a go and try to produce some comparative results. I am currently super busy for this month and will get to working on the code with some delay. Again, thank you for the lead! |
Also could you mention what you meant by "you could not follow the work here" If there is a specific doubt in the work then I would love to clarify it over slack whenever I get the time |
I am not very familiar with the linear solver implementation in SU2 not about the work itself |
To learn the basics, it's a good idea, but for large-scale projects, I prefer using existing libraries if possible. It is kind of light-weight too, not a huge dependency like Trilinos or PETSc.
I can test on various GPUs (P100/V100/A100 and 4070Mobile) on single node multi-gpu etc. |
Proposed Changes
This is the modified version of SU2 code that supports CUDA usage for the FGMRES solver and the use of NVBLAS. The main focus is the offloading of the Matrix Vector Product in the FGMRES solver to the GPU using CUDA Kernels. This implementation shows promise with marginally better run times (all benchmarks were carried out with the GPU Error Checking switched off and in debug mode to check if the correct functions were being called).
The use of NVBLAS is secondary and while functionality has been added to make it usable, it is not activated as it doesn't cause an appreciative increase in performance.
Compilation and Usage
Compile using the following MESON Flag
And activate the functions using the following Config File Option
NOTE ON IMPLEMENTATION
I've decided to go with a single version of the code where the CPU and GPU implementations co-exist in the same linear solver and can be disabled or switched using a combination of Meson and Config File options. This is why I have defined three classes - one over-arching class that is named CExecutionPath that has two child classes - CCpuExecution and CGpuExecution. These child classes contain the correct member function for each path - CPU or GPU functioning.
All of this could also be easily achieved with an if statement that switches between the two - but that particular implementation will access and run the statement for each call. In our case once a Matrix Vector Product object is created, it will immediately know whether to use CPU or GPU mode of execution.
Recommendations are most welcome to improve or make this implementation better
PR Checklist
Warning Levels do come (only at level 3) but they are all of the following type
The documentation for compiling with CUDA needs to be added by forking the SU2 site repo and adding the relevant changes to it? Or do I need to contact someone to change things on the site itself?
DOxygen Documentations and config template are all updated.
pre-commit run --all
to format old commits.