GPU support with CuPy #37
base: master
Conversation
I want to comment a bit more on Alltoallw, since I anticipate a comment on this. To quote the paper you wrote to go along with mpi4py-fft:
Indeed, I observe much better performance with Alltoall(v) compared to Alltoallw on GPUs, presumably due to a lack of architecture-specific optimisation. I wrote a small script for measuring the performance on contiguous arrays with size divisible by the number of processes, such that I can use either communication method. The results are as follows:

I write this as justification for the NCCL communication backend I implemented. It is neither MPI, nor does it avoid the intermediate step of copying to contiguous buffers the way Alltoallw does. I can see that you may reasonably object to adding features to your code that don't fit the premise in the name or in the accompanying paper. However, this is really crucial to good performance on GPUs on my specific machine, and I hope you agree that the roughly 10x speedup of GPUs vs. CPUs that I measured for FFTs on a given number of compute nodes is worth it.
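For reference, here is a stripped-down sketch of the kind of timing script I mean, assuming CUDA-aware MPI, float64 data and equally sized contiguous chunks; the array size and iteration count are illustrative, not the values behind the measurements above:

```python
# Minimal sketch: time Alltoall vs. Alltoallw on contiguous GPU buffers.
# Assumes a CUDA-aware MPI build and mpi4py >= 3.1 (CUDA array interface).
import time

import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

N = 2**24                       # total number of doubles, divisible by `size`
chunk = N // size
sendbuf = cp.random.random(N)   # contiguous float64 buffers on the GPU
recvbuf = cp.empty_like(sendbuf)

# Alltoallw message: equal counts, displacements in bytes, one datatype per rank
counts = [chunk] * size
displs = [i * chunk * sendbuf.itemsize for i in range(size)]
types = [MPI.DOUBLE] * size


def timeit(op, iters=10):
    cp.cuda.Device().synchronize()
    comm.Barrier()
    t0 = time.perf_counter()
    for _ in range(iters):
        op()
    cp.cuda.Device().synchronize()
    comm.Barrier()
    return (time.perf_counter() - t0) / iters


t_a2a = timeit(lambda: comm.Alltoall(sendbuf, recvbuf))
t_a2aw = timeit(lambda: comm.Alltoallw([sendbuf, counts, displs, types],
                                       [recvbuf, counts, displs, types]))
if comm.rank == 0:
    print(f"Alltoall:  {t_a2a:.4e} s")
    print(f"Alltoallw: {t_a2aw:.4e} s")
```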
Hi. This looks great, but unfortunately I don't know how to test this myself. I'm on a Mac and, as far as I can tell, cupy is not supported there. However, this PR makes cupy a hard dependency, and that is not OK. You need to hide the imports of cupy such that regular usage of mpi4py-fft does not break for anyone without cupy (like myself).
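One way to keep cupy optional is a guarded import with a numpy fallback; a minimal sketch (the function and flag names are illustrative, not the actual layout of this PR):

```python
# Sketch of keeping cupy an optional dependency: try the import once and fall
# back to numpy when cupy is unavailable.
import numpy as np

try:
    import cupy as cp
    HAS_CUPY = True
except ImportError:
    cp = None
    HAS_CUPY = False


def get_array_module(use_gpu=False):
    """Return cupy when requested and available, otherwise numpy."""
    if use_gpu:
        if not HAS_CUPY:
            raise RuntimeError("GPU backend requested but cupy is not installed")
        return cp
    return np
```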
By the way, I talked to a member of the MPI Forum and OpenMPI developer at a conference and showed him the plots with the poor Alltoallw performance. He was not surprised at all: apparently Alltoallw on GPUs is a very low priority for them, so there is no point in waiting for that to improve. While it's not great that this implementation is entirely specific to NVIDIA GPUs, especially with the CUDA graphs, I really see no other way that also gives competitive performance on NVIDIA hardware.
It seems there have been some developments in testing open source code on GPUs, see here. If I understand correctly, you could apply to this program to have the code tested on GPUs free of charge. Does this sound like an option for you?
As discussed in #14, GPU support can be achieved relatively easily by swapping `numpy` for `cupy` in a bunch of places. However, the communication is a bit tricky, because even CUDA-aware MPI is not as streamlined with GPU data as with CPU data. On JUWELS Booster, calling `Alltoallw` from mpi4py directly leads to many small individual send/receive operations with copies between host and device. This is slow to the point that the CPU implementation is faster. This may be specific to Jülich machines; I cannot test that.

A remedy was to implement a custom `Alltoallw` replacement. However, I am not an expert on this and chose a fairly simple scheme, which roughly follows this. It gives decent weak scaling but poor strong scaling. In particular, MPI requires synchronisation between device and host after the send buffer has been prepared, which makes strong scaling inherently difficult.
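For illustration, a stripped-down sketch of such a pairwise exchange, simplified to equally sized contiguous blocks over CUDA-aware MPI; the actual replacement in this PR is more general:

```python
# Sketch of a simple Alltoallw-style replacement: exchange blocks of a
# contiguous GPU buffer pairwise with non-blocking point-to-point calls
# through CUDA-aware MPI. Simplified to equally sized contiguous blocks.
import cupy as cp
from mpi4py import MPI


def pairwise_alltoall(comm, sendbuf, recvbuf):
    """Exchange equally sized blocks of `sendbuf` between all ranks."""
    size = comm.Get_size()
    rank = comm.Get_rank()
    chunk = sendbuf.size // size

    # The device must be done filling the send buffer before MPI posts the
    # sends -- this is the host/device synchronisation mentioned above.
    cp.cuda.Device().synchronize()

    requests = []
    for i in range(size):
        peer = (rank + i) % size
        requests.append(comm.Irecv(recvbuf[peer * chunk:(peer + 1) * chunk], source=peer))
        requests.append(comm.Isend(sendbuf[peer * chunk:(peer + 1) * chunk], dest=peer))
    MPI.Request.Waitall(requests)
```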
I also played around with NCCL for `Alltoallw`, using the same simple communication scheme (NCCL doesn't have `Alltoallw`); a sketch of that exchange follows the plot description below. NCCL allows one to forgo a lot of synchronisation and strong-scales better. However, I am a beginner with GPUs, and my feeling is that an expert could point out a much better communication scheme with NCCL, possibly with no explicit synchronisation at all. See below a plot of strong scaling, comparing to NVIDIA's cuFFTMp:
Note that cuFFTMp does not have easy-to-use Python bindings, and that data was generated using no Python at all. Because it uses NVSHMEM, it can do without much synchronisation and strong-scales really well. The orange lines use the mpi4py-fft plus CuPy version in this PR with different communication backends.
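For completeness, a similarly simplified sketch of the NCCL variant of that exchange; the communicator setup and all names shown here are illustrative, and the actual backend in this PR is more involved:

```python
# Sketch of the same block exchange through NCCL instead of MPI, using CuPy's
# NCCL bindings. Simplified to equally sized contiguous float64 blocks; the
# NCCL unique id is shared between ranks via mpi4py.
import cupy as cp
from cupy.cuda import nccl
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

# Rank 0 creates the NCCL unique id and broadcasts it over MPI.
uid = nccl.get_unique_id() if rank == 0 else None
uid = comm.bcast(uid, root=0)
nccl_comm = nccl.NcclCommunicator(size, uid, rank)


def nccl_alltoall(sendbuf, recvbuf, stream=cp.cuda.Stream.null):
    """Exchange equally sized float64 blocks between all ranks via NCCL."""
    chunk = sendbuf.size // size
    # Group the send/recv pairs so NCCL can schedule them together on the
    # given stream, without host/device synchronisation in between.
    nccl.groupStart()
    for peer in range(size):
        nccl_comm.send(sendbuf[peer * chunk:].data.ptr, chunk,
                       nccl.NCCL_FLOAT64, peer, stream.ptr)
        nccl_comm.recv(recvbuf[peer * chunk:].data.ptr, chunk,
                       nccl.NCCL_FLOAT64, peer, stream.ptr)
    nccl.groupEnd()
```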
I am unfortunately not an expert on GPUs or FFTs. I am sure there is plenty of room for improvement, both in the communication and possibly also in the calls to the CuPy FFT functions. For more details, please see the discussion in #14. Any help is appreciated!
Please also point out issues with my programming and any inconsistencies with your conventions!