GPU support with CuPy #37

brownbaerchen · 2024-02-13T17:26:15Z

As discussed in #14, GPU support can be achieved relatively easily by swapping numpy for cupy in a bunch of places. However, the communication is a bit tricky because even CUDA-aware MPI is not as streamlined with GPU data as with CPU data. On JUWELS Booster, calling Alltoallw from mpi4py directly leads to many small individual send/receive operations with copies between host and device. This is very slow to the point that the CPU implementation is faster. This may be specific to Jülich machines. I cannot test this.
A remedy was implementing a custom Alltoallw replacement. However, I am not an expert on this and chose a fairly simple scheme which roughly follows this. It gives decent weak scaling but poor strong scaling. In particular, MPI requires synchronisation between device and host after the send buffer has been prepared, which means strong scaling is inherently difficult.
I also played around with NCCL for Alltoallw using the same simple communication scheme (NCCL doesn't have Alltoallw). NCCL allows to forgo a lot of synchronisation and strong scales better. However, I am a beginner with GPUs and my feeling is that an expert can point out a much better communication scheme with NCCL, possibly with no explicit synchronisation at all. See below a plot of strong scaling, comparing to NVIDIAs cuFFTMp:

Note that cuFFTMp does not have easy to use python bindings and the data was generated using no python at all. Because it uses NVSHMEM, it can do without much synchronisation and strong scales really well. The orange lines use the mpi4py-fft plus CuPy version in this PR with different communication backends.

I am unfortunately not an expert on GPUs or FFTs. I am sure there is plenty of room for improvement in both the communication and maybe also the calls to the CuPy FFT functions. For more details, please see the discussion in #14. Any help is appreciated!
Please point out issues with my programming and inconsistencies with your conventions as well!

brownbaerchen · 2024-02-29T13:51:17Z

I was experimenting with CUDA graphs and it seems to work really well as the following plots show:

I record separate graphs for forward and backward operations for each transfer class. Once these are recorded, it seems they are very efficient. Keep in mind that the measured times do not include recording the graph. But this overhead should not be a large concern in practice.

I ran the unit tests and I always make sure that the error after transforming back and forth is on the order of machine precision. So if you don't find any more errors, this implementation works well enough for my needs.
I am not entirely happy with all my changes. Particularly, the implementation of copyto in the wrappers of the FFT backends is a bit messy. If you have a better idea for this, please tell me. Also, I am sure you can point out a lot more places where the code could be more neat.

I am not sure if it is worth keeping the CustomMPI communication backend. As the NCCL+graphs thing is so much better, I can remove this if you prefer.
The cupyx-scipy backend for serial FFTs is also a bit unnecessary as it behaves the same as the cupy one.

Please let me know all the necessary steps to merge this PR regarding testing, code style, ...
I have to say that the library is really well written and approachable! I want to deteriorate that as little as possible.

brownbaerchen · 2024-03-21T14:31:40Z

I want to comment a bit more on Alltoallw, since I anticipate a comment on this. To quote the paper you wrote to go along mpi4py-fft:

MPI_ALLTOALL(V) works on contiguous dataarrays in both ends, both for sending and receiving ranks, and there are highly optimised versions avail- able on several architectures. On the contrary, the subarray type used by MPI_ALLTOALLW is in general discontiguous, and there are to the authors’ knowledge no architecture-specific optimizations available.

Indeed, I observe much better performance with Alltoall(v) compared to Alltoallw on GPUs, presumably due to lack of architecture specific optimisation. I wrote a small script for measuring the performance on contiguous arrays with size divisible the number of processes, such that I can use either communication method. The results are as follows:

At the largest data size, which is close to maxing out the memory of the GPUs, there is more than two orders of magnitude difference between Alltoall(v) and w. Clearly, this is more difference than copying to a contiguous buffer first.

I write this as a justification of the communication backend I implemented in NCCL. This is neither MPI, nor does it avoid the intermediate step of copying to contiguous buffers with Alltoallw. I can see that you may reasonably object to adding features to your code that don't fit the premise in the name or in the accompanying paper. However, This is really crucial to good performance on GPUs on my specific machine and I hope that you agree that the roughly 10x speedup from GPUs vs. CPUs I measured for FFTs on a given number of compute nodes is worth it.

mikaem · 2024-04-08T08:18:53Z

Hi.

This looks great, but unfortunately I don't know how to test this myself. I'm on a mac and as far as I can tell cupy is not supported.

However, this PR makes cupy a hard dependency, and that is not ok. You need to hide imports to cupy such that regular usage of mpi4py-fft does not break for anyone without cupy (like myself).

brownbaerchen · 2024-05-24T07:49:15Z

By the way, I talked to a member of the MPI Forum and developer of OpenMPI at a conference and showed him the plots with poor performance of Alltoallw. He was not surprised at all. Apparently Alltoallw on GPUs is very low priority for them. So there is no point in waiting for this.
Also, I don't think that MPI will become stream aware anytime soon, which means explicit synchronisation between host and device is required before every MPI operation to make sure the buffers are ready. Instead of submitting a communication kernel to a stream on the GPU, MPI will just do the communication directly. This synchronisation means we cannot submit the next computation kernel before communication is finished, lowering throughput on the GPU. This kills strong scaling.
Hence, even if performant Alltoallw becomes available in MPI implementations, it still cannot compete with NCCL. NCCL does submit the communication to a stream and does not require the synchronisation.
The reason the Forum doesn't want stream awareness in the standard, as far as I know, is that they don't want any vendor specific stuff in there...

So, while it's not great that this implementation is totally specific to NVIDIA GPUs, especially with the CUDA graphs, I really see no other way that also gives competitive performance on NVIDIA hardware.

brownbaerchen · 2024-08-19T07:22:47Z

It seems there have been some developments in testing open source code on GPUs, see here. If I understand correctly, you could apply for this program to have the code tested on GPUs free of charge. Does this sound like an option for you?

Thomas added 16 commits December 8, 2023 12:31

Implemented CuPy version of DistArray

a50137f

Cleanup and tests

e770b56

Added cupy backend for FFTs

424fdb1

Removed useCuPy argument from NewDistArray

44be7bc

Fixed tests

90551b4

Added cupyx-scipy backend

202f30c

Implemented NCCL for Alltoallw operations

e5d4b56

Minor refactor

d0a493e

Changed Alltoallw communication pattern

6b47e36

Added custom MPI based Alltoallw implementation

21805de

Cleanup

ceccc7d

Overlapped all communication within a single Alltoallw in NCCL backend

c3c20ff

Cosmetic changes

405fb4d

Swapped array[...]=data for cupy.copyto

b69a303

Removed unnecessary rescaling in CuPy backend

b9ac0f7

Use CUDA graphs in NCCL Alltoallw

77b96f3

brownbaerchen marked this pull request as ready for review February 29, 2024 13:51

Thomas added 2 commits April 16, 2024 07:06

Removed dependence on CuPy

75f0343

Merge remote-tracking branch 'upstream/master' into cupy_implementation

a920dca

Fixes

7330b6b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU support with CuPy #37

GPU support with CuPy #37

brownbaerchen commented Feb 13, 2024

brownbaerchen commented Feb 29, 2024

brownbaerchen commented Mar 21, 2024

mikaem commented Apr 8, 2024

brownbaerchen commented May 24, 2024

brownbaerchen commented Aug 19, 2024

GPU support with CuPy #37

Are you sure you want to change the base?

GPU support with CuPy #37

Conversation

brownbaerchen commented Feb 13, 2024

brownbaerchen commented Feb 29, 2024

brownbaerchen commented Mar 21, 2024

mikaem commented Apr 8, 2024

brownbaerchen commented May 24, 2024

brownbaerchen commented Aug 19, 2024