Implementation of CUDA-ised transform #602
base: main
Conversation
Nice progress!
We should now really add GPU CI, then you can add CI tests for those things as well.
```julia
ilons = rings[j_north]              # in-ring indices northern ring

# FOURIER TRANSFORM in zonal direction, northern latitude
view(f_north, 1:nfreq, 1:nlayers, j) .= rfft_plan * grids.data[ilons, :]
```
@vchuravy is there a way to apply a cuFFT plan such that the result is directly written into some pre-allocated array, instead of allocating on the right-hand side here and then writing the result into the view of an array on the left?
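For reference, a minimal CPU-side sketch of what such an in-place application could look like, assuming the `LinearAlgebra.mul!` interface that AbstractFFTs defines for plans (CUFFT plans in CUDA.jl follow the same plan interface; the array names below are illustrative only):

```julia
using FFTW, LinearAlgebra

n, nlayers = 48, 8
ring = randn(Float64, n, nlayers)          # one ring of grid points, all layers
rfft_plan = plan_rfft(ring, 1)             # batched real FFT along dimension 1
f = zeros(ComplexF64, n÷2 + 1, nlayers)    # pre-allocated output

mul!(f, rfft_plan, ring)                   # writes the transform into f, no new allocation
```

Whether `mul!` also accepts a non-contiguous view as the output would need testing; FFTW is picky about output memory layout.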
@jackleland because at the moment it sounds like we would just need `rfft!`/`brfft!` methods that call this line, either written with `LinearAlgebra.mul!` for CPUs or with something else for GPUs, so the `_fourier_batched!` function is actually the same and we wouldn't need any code duplication here, just new methods for `rfft!`, `brfft!`.
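A minimal sketch of what that interface could look like; the method names come from the comment above, the `mul!`-based CPU path is an assumption, and the GPU extension would add its own methods dispatching on CUFFT plan types:

```julia
using LinearAlgebra

# CPU path (sketch): both wrappers forward to the plan's in-place application,
# so _fourier_batched! itself stays architecture-agnostic
rfft!(f_out, rfft_plan, grid_in) = mul!(f_out, rfft_plan, grid_in)
brfft!(grid_out, brfft_plan, f_in) = mul!(grid_out, brfft_plan, f_in)
```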
```julia
# SPEEDYWEATHER MODULES
using ..LowerTriangularMatrices
using ..RingGrids

# import SpeedyWeatherCUDAExt
using SpeedyWeather
Base.get_extension(SpeedyWeather, :SpeedyWeatherCUDAExt)
```
Ouh, why is that needed? Wouldn't that load CUDA every time SpeedyTransforms is loaded, i.e. every time SpeedyWeather is loaded?
Accidental commit, will roll back.
@milankl After my vacation, in early December, I could look into setting up CI for GPU. What do you think?
I'll try to take care of it early in December when I'm back.

@jackleland When you are writing unit tests, just commit them to a separate file in the test folder, and don't include them in the standard test suite. In principle we are likely able to set up a separate CI pipeline on Buildkite that does the GPU-only tests, so a reduced version of our tests just for the GPU functionality.
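A sketch of what such a separate file could look like; the file name, test set name, and `SpectralGrid` arguments are illustrative only:

```julia
# test/gpu_tests.jl — a separate entry point, deliberately not included
# from the standard test suite, so it only runs on a GPU pipeline
using Test
using SpeedyWeather
using CUDA

@testset "GPU spectral transform" begin
    spectral_grid = SpectralGrid(trunc=31, nlayers=8)
    # transform round-trip tests on CuArray data would go here
    @test true   # placeholder
end
```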
@maximilian-gelbrecht Possibly not the place to discuss this, but how does GPU CI work? Do you pay for public cloud hosting or is there some free service available?
I didn't look into this in all detail yet, but there seems to be a free service for open source Julia projects hosted by Buildkite.
ext/SpeedyWeatherCUDAExt/legendre.jl (outdated)

```julia
# range of the running indices lm in a l-column (degrees of spherical harmonics)
# given the column index m (order of harmonics)
get_lm_range(m, lmax) = ij2k(m, m, lmax):ij2k(lmax, m, lmax)
```
This function assumes 1-based `m`, `lmax` inputs. It uses the `ij2k` function from LowerTriangularMatrices to transform between the `i, j` indices of a matrix (here `l, m`) and the running index `k`, as it's called in that module (here `lm`).
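To make the mapping concrete, a small stand-alone sketch; the `ij2k` definition below is an illustrative stand-in for the column-major running index of a lower triangular matrix and may differ from the package's exact implementation:

```julia
# running index k of entry (i, j) in an lmax×lmax lower triangular matrix,
# counting down columns and skipping the zero upper triangle
ij2k(i, j, lmax) = i + (j-1)*lmax - j*(j-1)÷2

get_lm_range(m, lmax) = ij2k(m, m, lmax):ij2k(lmax, m, lmax)

get_lm_range(1, 5)   # 1:5   (column m=1 holds degrees l = 1..5)
get_lm_range(2, 5)   # 6:9
get_lm_range(5, 5)   # 15:15
```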
```julia
# are m, lmax 0-based here or 1-based?
lm_range = get_lm_range(m, lmax)    # assumes 1-based
```
This needs to be adapted (+ comment maybe!) for 0-based `m`, `lmax`.
ext/SpeedyWeatherCUDAExt/legendre.jl (outdated)

```julia
g_south .= 0

# INVERSE LEGENDRE TRANSFORM by looping over wavenumbers l, m and layer k
kernel = CUDA.@cuda launch=false phase_factor_kernel!(
```
Renamed this to `kernel`.
ext/SpeedyWeatherCUDAExt/legendre.jl (outdated)

```julia
# (inverse) legendre transform kernel, called from _legendre!
function phase_factor_kernel!(
```
this change hasn't been pushed yet?
Not sure what change you're referring to?
Where is it actually called from?
It's invoked by the `@cuda` macro, so lines 95 (and 109).
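For context, a self-contained sketch of the standard CUDA.jl compile-then-launch pattern used in the diff above, with a dummy kernel standing in for `phase_factor_kernel!`:

```julia
using CUDA

# dummy stand-in for phase_factor_kernel!; fills out[i] = i
function dummy_kernel!(out)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    i <= length(out) && (out[i] = i)
    return nothing
end

out = CUDA.zeros(Int32, 1024)
kernel = CUDA.@cuda launch=false dummy_kernel!(out)   # compile without launching
config = CUDA.launch_configuration(kernel.fun)        # occupancy-based heuristic
threads = min(length(out), config.threads)
blocks = cld(length(out), threads)
kernel(out; threads, blocks)                          # now launch explicitly
```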
The name is bad, I'll change it
We talked about load balancing last week, which isn't trivial with the lower triangular matrices we have when mapping rows or columns to threads. However, one can pair two columns/rows, one long and one short, similar to how the triangle number can be computed explicitly. So instead of mapping row/column m = 1, ..., n to thread 1, ..., n one could do

```julia
julia> m = 15
15

julia> for (r, i) in enumerate(0:m÷2)
           if r == 1 && isodd(m)
               println("thread $r: column $m")
           elseif r == 1
               println("thread $r: (idle)")
           else
               println("thread $r: column $i, $(m-i+iseven(m))")
           end
       end
thread 1: column 15
thread 2: column 1, 14
thread 3: column 2, 13
thread 4: column 3, 12
thread 5: column 4, 11
thread 6: column 5, 10
thread 7: column 6, 9
thread 8: column 7, 8

julia> m = 16
16

julia> for (r, i) in enumerate(0:m÷2)
           if r == 1 && isodd(m)
               println("thread $r: column $m")
           elseif r == 1
               println("thread $r: (idle)")
           else
               println("thread $r: column $i, $(m-i+iseven(m))")
           end
       end
thread 1: (idle)
thread 2: column 1, 16
thread 3: column 2, 15
thread 4: column 3, 14
thread 5: column 4, 13
thread 6: column 5, 12
thread 7: column 6, 11
thread 8: column 7, 10
thread 9: column 8, 9
```

(Two examples, one with odd m and one with even m.) Don't know whether this is actually helpful in practice but it might be worth thinking about.
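A per-thread version of the same pairing, as one might write it inside a kernel; `paired_columns` is a hypothetical helper, not part of the PR:

```julia
# columns handled by 1-based thread index r, for m columns in total;
# thread 1 takes the lone longest column when m is odd, else idles
function paired_columns(r, m)
    r == 1 && return isodd(m) ? (m,) : ()
    i = r - 1
    return (i, m - i + iseven(m))
end

paired_columns(2, 15)   # (1, 14)
paired_columns(1, 16)   # ()
paired_columns(9, 16)   # (8, 9)
```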
I'll look into CI. For GPU CI, we'd just run those tests that absolutely need to run on GPU, on GPU. So, we need to define another test set. I'd suggest to just put all GPU tests in a new file.

@jackleland Do you already have some unit tests that you use locally for this?
Excited to see SpeedyWeather running on GPUs! If it might be of any help, you can profile a run with `nsys profile --trace=cuda julia --project mytest.jl` and it is really great to see what the code does.
Have been doing a lot of manual testing with notebooks, formalising into unit tests now.
GPU CI is on the way. We have access now, and I'll do a PR to set up the pipelines within the next few days.
With apologies to @milankl for the delay on getting this PR made!

This is a work-in-progress implementation of the transform on CUDA, currently with a functioning forward Fourier transform component utilising CUFFT. A couple of things to note:

- Some code had to be duplicated between `fourier_batched!` and `fourier_serial`. There might be a way around this but I have not been able to find one!
- `fourier_batched` was done, again, with slices, which is slow and allocatey, but again I could not find a way around it. This doesn't appear to be a limitation for the serial implementation however...
- Speed-wise we get a significant speedup for larger grids (trunc=127, nlayers=64) compared to the CPU, but it is 10x slower than the CPU at the default grid size.

Feedback much appreciated as this is my first Julia PR. I'll be back on this next week to have a go at the Legendre transform.

Addresses #575