Implementation of CUDA-ised transform #602
base: main
Conversation
Nice progress!
We should now really add GPU CI, then you can add CI tests for those things as well.
```julia
ilons = rings[j_north]              # in-ring indices northern ring

# FOURIER TRANSFORM in zonal direction, northern latitude
view(f_north, 1:nfreq, 1:nlayers, j) .= rfft_plan * grids.data[ilons, :]
```
@vchuravy is there a way to apply a cuFFT plan such that the result is directly written into some pre-allocated array, instead of allocating on the right-hand side here and then writing the result into the view of an array on the left?
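For reference, a minimal CPU-side sketch of what such an in-place application could look like, assuming the `LinearAlgebra.mul!` interface that AbstractFFTs defines for plans (CUFFT plans in CUDA.jl follow the same plan interface; the array names below are illustrative only):

```julia
using FFTW, LinearAlgebra

n, nlayers = 48, 8
ring = randn(Float64, n, nlayers)          # one ring of grid points, all layers
rfft_plan = plan_rfft(ring, 1)             # batched real FFT along dimension 1
f = zeros(ComplexF64, n÷2 + 1, nlayers)    # pre-allocated output

mul!(f, rfft_plan, ring)                   # writes the transform into f, no new allocation
```

Whether `mul!` also accepts a non-contiguous view as the output would need testing; FFTW is picky about output memory layout.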
@jackleland because at the moment it sounds like we would just need `rfft!`/`brfft!` methods that call this line, either written with `LinearAlgebra.mul!` for CPUs or with something else for GPUs, so the `_fourier_batched!` function is actually the same and we wouldn't need any code duplication here, just new methods for `rfft!`, `brfft!`.
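A minimal sketch of what that interface could look like; the method names come from the comment above, the `mul!`-based CPU path is an assumption, and the GPU extension would add its own methods dispatching on CUFFT plan types:

```julia
using LinearAlgebra

# CPU path (sketch): both wrappers forward to the plan's in-place application,
# so _fourier_batched! itself stays architecture-agnostic
rfft!(f_out, rfft_plan, grid_in) = mul!(f_out, rfft_plan, grid_in)
brfft!(grid_out, brfft_plan, f_in) = mul!(grid_out, brfft_plan, f_in)
```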
```julia
# SPEEDYWEATHER MODULES
using ..LowerTriangularMatrices
using ..RingGrids

# import SpeedyWeatherCUDAExt
using SpeedyWeather
Base.get_extension(SpeedyWeather, :SpeedyWeatherCUDAExt)
```
Ouh, why is that needed? Wouldn't that load CUDA every time SpeedyTransforms is loaded, i.e. every time SpeedyWeather is loaded?
Accidental commit, will roll back.
@milankl After my vacation, in early December, I could look into setting up CI for GPU. What do you think?
I'll try to take care of it early in December when I'm back.

@jackleland When you are writing unit tests, just commit them to a separate file in the test folder, and don't include them in the standard test suite. In principle we are likely able to set up a separate CI pipeline on Buildkite that does the GPU-only tests, so a reduced version of our tests just for the GPU functionality.
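A sketch of what such a separate file could look like; the file name, test set name, and `SpectralGrid` arguments are illustrative only:

```julia
# test/gpu_tests.jl — a separate entry point, deliberately not included
# from the standard test suite, so it only runs on a GPU pipeline
using Test
using SpeedyWeather
using CUDA

@testset "GPU spectral transform" begin
    spectral_grid = SpectralGrid(trunc=31, nlayers=8)
    # transform round-trip tests on CuArray data would go here
    @test true   # placeholder
end
```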
@maximilian-gelbrecht Possibly not the place to discuss this, but how does GPU CI work? Do you pay for public cloud hosting or is there some free service available?
I didn't look into this in all detail yet, but there seems to be a free service for open source Julia projects hosted by Buildkite.
ext/SpeedyWeatherCUDAExt/legendre.jl (outdated)

```julia
# range of the running indices lm in a l-column (degrees of spherical harmonics)
# given the column index m (order of harmonics)
get_lm_range(m, lmax) = ij2k(m, m, lmax):ij2k(lmax, m, lmax)
```
This function assumes 1-based `m`, `lmax` inputs. It uses the `ij2k` function from LowerTriangularMatrices to transform between the `i, j` indices of a matrix (here `l, m`) and the running index `k`, as it's called in that module (here `lm`).
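To make the mapping concrete, a small stand-alone sketch; the `ij2k` definition below is an illustrative stand-in for the column-major running index of a lower triangular matrix and may differ from the package's exact implementation:

```julia
# running index k of entry (i, j) in an lmax×lmax lower triangular matrix,
# counting down columns and skipping the zero upper triangle
ij2k(i, j, lmax) = i + (j-1)*lmax - j*(j-1)÷2

get_lm_range(m, lmax) = ij2k(m, m, lmax):ij2k(lmax, m, lmax)

get_lm_range(1, 5)   # 1:5   (column m=1 holds degrees l = 1..5)
get_lm_range(2, 5)   # 6:9
get_lm_range(5, 5)   # 15:15
```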
```julia
# are m, lmax 0-based here or 1-based?
lm_range = get_lm_range(m, lmax)    # assumes 1-based
```
This needs to be adapted (+ comment maybe!) for 0-based `m`, `lmax`.
ext/SpeedyWeatherCUDAExt/legendre.jl (outdated)

```julia
g_south .= 0

# INVERSE LEGENDRE TRANSFORM by looping over wavenumbers l, m and layer k
kernel = CUDA.@cuda launch=false phase_factor_kernel!(
```
Renamed this to `kernel`.
ext/SpeedyWeatherCUDAExt/legendre.jl (outdated)

```julia
# (inverse) legendre transform kernel, called from _legendre!
function phase_factor_kernel!(
```
this change hasn't been pushed yet?
Not sure what change you're referring to?
Where is it actually called from?
It's invoked by the `@cuda` macro, so lines 95 (and 109).
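For context, a self-contained sketch of the standard CUDA.jl compile-then-launch pattern used in the diff above, with a dummy kernel standing in for `phase_factor_kernel!`:

```julia
using CUDA

# dummy stand-in for phase_factor_kernel!; fills out[i] = i
function dummy_kernel!(out)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    i <= length(out) && (out[i] = i)
    return nothing
end

out = CUDA.zeros(Int32, 1024)
kernel = CUDA.@cuda launch=false dummy_kernel!(out)   # compile without launching
config = CUDA.launch_configuration(kernel.fun)        # occupancy-based heuristic
threads = min(length(out), config.threads)
blocks = cld(length(out), threads)
kernel(out; threads, blocks)                          # now launch explicitly
```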
The name is bad, I'll change it
We talked about load balancing last week, which isn't trivial with the lower triangular matrices we have when mapping rows or columns to threads. However, one can pair two columns/rows, one long and one short, similar to how the triangle number can be computed explicitly. So instead of mapping row/column m = 1, ..., n to thread 1, ..., n one could do

```julia
julia> m = 15
15

julia> for (r, i) in enumerate(0:m÷2)
           if r == 1 && isodd(m)
               println("thread $r: column $m")
           elseif r == 1
               println("thread $r: (idle)")
           else
               println("thread $r: column $i, $(m-i+iseven(m))")
           end
       end
thread 1: column 15
thread 2: column 1, 14
thread 3: column 2, 13
thread 4: column 3, 12
thread 5: column 4, 11
thread 6: column 5, 10
thread 7: column 6, 9
thread 8: column 7, 8

julia> m = 16
16

julia> for (r, i) in enumerate(0:m÷2)
           if r == 1 && isodd(m)
               println("thread $r: column $m")
           elseif r == 1
               println("thread $r: (idle)")
           else
               println("thread $r: column $i, $(m-i+iseven(m))")
           end
       end
thread 1: (idle)
thread 2: column 1, 16
thread 3: column 2, 15
thread 4: column 3, 14
thread 5: column 4, 13
thread 6: column 5, 12
thread 7: column 6, 11
thread 8: column 7, 10
thread 9: column 8, 9
```

(Two examples, one with odd m and one with even m.) Don't know whether this is actually helpful in practice but it might be worth thinking about.
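A per-thread version of the same pairing, as one might write it inside a kernel; `paired_columns` is a hypothetical helper, not part of the PR:

```julia
# columns handled by 1-based thread index r, for m columns in total;
# thread 1 takes the lone longest column when m is odd, else idles
function paired_columns(r, m)
    r == 1 && return isodd(m) ? (m,) : ()
    i = r - 1
    return (i, m - i + iseven(m))
end

paired_columns(2, 15)   # (1, 14)
paired_columns(1, 16)   # ()
paired_columns(9, 16)   # (8, 9)
```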
I'll look into CI. For GPU CI, we'd just run those tests that absolutely need to run on GPU, on GPU. So, we need to define another test set. I'd suggest to just put all GPU tests in a new file.

@jackleland Do you already have some unit tests that you use locally for this?
Excited to see SpeedyWeather running on GPUs! If it might be of any help, you can profile a run with `nsys profile --trace=cuda julia --project mytest.jl` and it is really great to see what the code does.
Have been doing a lot of manual testing with notebooks, formalising into unit tests now.
GPU CI is on the way. We have access now, and I'll do a PR to set up the pipelines within the next few days.
With apologies to @milankl for the delay on getting this PR made!

This is a work-in-progress implementation of the transform on CUDA, currently with a functioning forward Fourier transform component utilising CUFFT. A couple of things to note:

- Some code had to be duplicated between `fourier_batched!` and `fourier_serial`. There might be a way around this but I have not been able to find one!
- `fourier_batched` was done, again, with slices, which is slow and allocatey, but again I could not find a way around it. This doesn't appear to be a limitation for the serial implementation however...
- Speed-wise we get a significant speedup for larger grids (trunc=127, nlayers=64) compared to the CPU, but it is 10x slower than the CPU at the default grid size.

Feedback much appreciated as this is my first Julia PR. I'll be back on this next week to have a go at the Legendre transform.

Addresses #575