Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of CUDA-ised transform #602

Draft
wants to merge 35 commits into
base: main
Choose a base branch
from
Draft

Conversation

jackleland
Copy link
Collaborator

With apologies to @milankl for the delay on getting this PR made!

This is a working-progress implementation of the transform on CUDA, currently with a functioning forward fourier transform component utilising CUFFT. A couple of things to note:

  1. I had to refactor the plan creation as CUFFT doesn't seem to play nicely while using views of the CuArray, so it instead has to slice them. I thought initially that the slicing would also work for the CPU arrays but that seems to break the use of views in the actual fourier_batched! and fourier_serial. There might be a way around this but I have not been able to find one!
  2. The actual running of the FFT within fourier_batched was done, again, with slices i.e.
view(f_north, 1:nfreq, 1:nlayers, j) .= rfft_plan * grids.data[ilons, :]

which is slow and allocatey but again I could not find a way around. This doesn't appear to be a limitation for the serial implementation however...

Speed-wise we get significant speedup for larger grids (trunc=127, nlayers=64) compared to the CPU, but is 10x slower than the CPU at default grid size.

Feedback much appreciated as this is my first Julia PR. I'll be back on this next week to have a go at the Legendre.

Addresses #575

@jackleland jackleland marked this pull request as draft November 1, 2024 11:30
Copy link
Member

@maximilian-gelbrecht maximilian-gelbrecht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice progress!

We should now really add GPU CI, then you can add CI tests for those things as well.

src/SpeedyTransforms/SpeedyTransforms.jl Outdated Show resolved Hide resolved
src/SpeedyTransforms/spectral_transform.jl Outdated Show resolved Hide resolved
ext/SpeedyWeatherCUDAExt/fourier.jl Show resolved Hide resolved
ext/SpeedyWeatherCUDAExt/fourier.jl Show resolved Hide resolved
ilons = rings[j_north] # in-ring indices northern ring

# FOURIER TRANSFORM in zonal direction, northern latitude
view(f_north, 1:nfreq, 1:nlayers, j) .= rfft_plan * grids.data[ilons, :]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vchuravy is there a way to apply a cuFFT plan such that the result is directly written into some pre-allocated array? Instead of the allocation here on the right side and then writing it into the view of an array on the left?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackleland because at the moment it sounds we would just need an rfft!/brfft! method that calls this line, either written with LinearAlgebra.mul! for CPUs or with something else for GPUs, so the _fourier_batched! function is actually the same and we wouldn't need any code duplication here, just a new method for rfft!, brfft!


# SPEEDYWEATHER MODULES
using ..LowerTriangularMatrices
using ..RingGrids

# import SpeedyWeatherCUDAExt
using SpeedyWeather
Base.get_extension(SpeedyWeather, :SpeedyWeatherCUDAExt)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouh why is that needed? Wouldn't that load CUDA everytime SpeedyTransforms is loaded, i.e. everytime SpeedyWeather is loaded?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental commit, will roll back.

src/SpeedyTransforms/spectral_transform.jl Show resolved Hide resolved
@maximilian-gelbrecht
Copy link
Member

@milankl After my vacation, in early December, I could look into setting up CI for GPU. What do you think?

@maximilian-gelbrecht
Copy link
Member

I'll try to take care of it early in December when I'm back.

@jackleland When you are writing unit tests, just commit them to a separate file in the test folder, and don't include those into the standard test suite in the runtests.jl file.

In principle we are likely able to set up a separate CI pipeline on Builtkite that does the GPU only tests, so a reduced version of our tests just for the GPU functionality.

@jackleland
Copy link
Collaborator Author

@maximilian-gelbrecht Possibly not the place to discuss this, but how does GPU CI work? Do you pay for public cloud hosting or is there some free service available?

@maximilian-gelbrecht
Copy link
Member

I didn't look into this in all detail yet, but there seems to be a free service for open source Julia projects hosted by Builtkite.

ext/SpeedyWeatherCUDAExt/fourier.jl Show resolved Hide resolved

# range of the running indices lm in a l-column (degrees of spherical harmonics)
# given the column index m (order of harmonics)
get_lm_range(m, lmax) = ij2k(m, m, lmax):ij2k(lmax, m, lmax)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function assumes 1-based m, lmax inputs. I uses the ij2k function from LowerTriangularMatrices to transform between i, j indices of a matrix (here l, m) to the running index k as it's called in that module (here lm).

Comment on lines +23 to +24
# are m, lmax 0-based here or 1-based?
lm_range = get_lm_range(m, lmax) # assumes 1-based
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be adapted (+comment maybe!) for 0-based m, lmax

g_south .= 0

# INVERSE LEGENDRE TRANSFORM by looping over wavenumbers l, m and layer k
kernel = CUDA.@cuda launch=false phase_factor_kernel!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed this to kernel

Comment on lines 8 to 9
# (inverse) legendre transform kernel, called from _legendre!
function phase_factor_kernel!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change hasn't been pushed yet?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what change you're referring to?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where it is actually called from?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's invoked by the @cuda macro, so lines 95 (and 109)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is bad, I'll change it

@milankl milankl linked an issue Dec 3, 2024 that may be closed by this pull request
@milankl
Copy link
Member

milankl commented Dec 3, 2024

We talked about load balancing last week, which isn't trivial with the lower triangular matrices we have when mapping rows or columns to threads. However, one can two columns/rows, one long, one short together similar to how conceptually the triangle number can be explicitly computed, so instead of mapping row/column m=1,...,n to thread 1,...,n one could do

julia> m = 15
15

julia> for (r, i) in enumerate(0:m÷2)
           if r == 1 && isodd(m)
               println("thread $r: column $m")
           elseif r == 1
               println("thread $r: (idle)")
           else
               println("thread $r: column $i, $(m-i+iseven(m))")
           end
       end
thread 1: column 15
thread 2: column 1, 14
thread 3: column 2, 13
thread 4: column 3, 12
thread 5: column 4, 11
thread 6: column 5, 10
thread 7: column 6, 9
thread 8: column 7, 8

julia> m = 16
16

julia> for (r, i) in enumerate(0:m÷2)
           if r == 1 && isodd(m)
               println("thread $r: column $m")
           elseif r == 1
               println("thread $r: (idle)")
           else
               println("thread $r: column $i, $(m-i+iseven(m))")
           end
       end
thread 1: (idle)
thread 2: column 1, 16
thread 3: column 2, 15
thread 4: column 3, 14
thread 5: column 4, 13
thread 6: column 5, 12
thread 7: column 6, 11
thread 8: column 7, 10
thread 9: column 8, 9

(two examples one with even m one with odd m). Don't know whether this is actually helpful in practice might might be worth thinking about it

@milankl milankl changed the title Working progress implementation of CUDA-ised transform Implementation of CUDA-ised transform Dec 9, 2024
@milankl milankl added gpu 🖼️ Everthing GPU related transform ⬅️ ➡️ Our spherical harmonic transform, grid to spectral, spectral to grid labels Dec 14, 2024
@simone-silvestri
Copy link
Collaborator

excited to see SpeedyWeather running on GPUs!

If it might be of any help, nsys and ncu are perfect tools to visualize the bottlenecks.
The usage is quite simple

nsys profile --trace=cuda julia --project mytest.jl

and it is really great to see what the code does

@jackleland
Copy link
Collaborator Author

I'll look into CI.

For GPU CI, we'd just run those tests that absolutely need to run on GPU, on GPU. So, we need to define another test set. I'd suggest to just put all GPU tests in a new test/gpu folder.

@jackleland Do you already have some unit tests that you use locally for this?

Have been doing a lot of manual testing with notebooks, formalising into unit tests now

@maximilian-gelbrecht
Copy link
Member

GPU CI is on the way. We have access now, and I'll do a PR to set up the pipelines within the next days.

maximilian-gelbrecht and others added 3 commits January 14, 2025 14:53
* buildkite CUDA pipeline first draft

* remove codecov token

* gpu ci changelog

* add Adapt

* no coverage test

---------

Co-authored-by: Jack Leland <[email protected]>
@maximilian-gelbrecht
Copy link
Member

maximilian-gelbrecht commented Jan 14, 2025

The error in the regular CI is a bit odd, it doesn't seem to be related to the GPU versions of the transforms at all. If you don't find the cause of it, I can look into it early next week.

@jackleland
Copy link
Collaborator Author

The error in the regular CI is a bit odd, it doesn't seem to be related to the GPU versions of the transforms at all. If you don't find the cause of it, I can look into it early next week.

It was being caused by a fix I found last year for a problem I was having with mapreducedim! on LTA (Milan mentioned it in #558 I think), but seemingly it is causing ambiguity problems now. Not sure what changed between then and now that would cause that but I've removed the fix as I'm no longer using mapreducedim!, so no more error.

@jackleland
Copy link
Collaborator Author

jackleland commented Jan 15, 2025

We now have a fully working round-trip transform on GPUs, with the caveat that I'm not sure it works on other grids just yet. I still need to:

  • Add transform(grid) etc convenience functions for GPU
  • Make better scaling benchmark plots
  • Refactor redundant fft calling code
  • Test with other grids
  • Rewrite kernels in KA form (not sure this is necessary just yet?)
    Before merging across, also happy to hear from others obviously.

@milankl For the forward legendre kernel I ended up using an atomic add to deal with the race condition on reducing into the specs output, it seems to work pretty well but I had to allow some allocations in the kernel to get the @atomic call to work (it doesn't seem to like calculating indices or writing into views) but any suggestions on better ways of approaching this would be welcomed.

@maximilian-gelbrecht
Copy link
Member

Ah I see. Good to know. I'll give fixing this issue an attempt next week anyway. Seems important for other parts of the model as well.

@milankl
Copy link
Member

milankl commented Jan 16, 2025

Wooohoo tests are passing! Let me know if you're ready for a review from my side!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
gpu 🖼️ Everthing GPU related transform ⬅️ ➡️ Our spherical harmonic transform, grid to spectral, spectral to grid
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplified spectral transform towards GPU version
4 participants