Overlapping computation and MPI halo communication #615
Don't you want the opposite? You want a kernel that computes source terms in the "deep interior", which can be performed without knowledge of halos and thus can run simultaneously with communication. After communication and deep-interior calculations are complete, you then perform calculations on the near-boundary elements.

Are communications restricted to

This is the part of our current algorithm that involves interior tendency computation (there are additional halo-filling calls associated with the fractional step):

```julia
function calculate_explicit_substep!(tendencies, velocities, tracers, pressures, diffusivities, model)
    time_step_precomputations!(diffusivities, pressures, velocities, tracers, model)
    calculate_tendencies!(tendencies, velocities, tracers, pressures, diffusivities, model)
    return nothing
end
```

The function `time_step_precomputations!` is:

```julia
function time_step_precomputations!(diffusivities, pressures, velocities, tracers, model)
    fill_halo_regions!(merge(velocities, tracers), model.boundary_conditions.solution, model.architecture,
                       model.grid, boundary_condition_function_arguments(model)...)

    calculate_diffusivities!(diffusivities, model.architecture, model.grid, model.closure, model.buoyancy,
                             velocities, tracers)

    fill_halo_regions!(diffusivities, model.boundary_conditions.diffusivities, model.architecture, model.grid)

    @launch(device(model.architecture), config=launch_config(model.grid, :xy),
            update_hydrostatic_pressure!(pressures.pHY′, model.grid, model.buoyancy, tracers))

    fill_halo_regions!(pressures.pHY′, model.boundary_conditions.pressure, model.architecture, model.grid)

    return nothing
end
```

To implement the optimizations discussed in this issue, we also need to consider the calculation of hydrostatic pressure and nonlinear diffusivities, so that their communication can be intertwined with interior tendency computation. Can this be done abstractly, perhaps via some combination of launch configurations and macro specifications?

Notice that the "pre-computation" of nonlinear diffusivities and the isolation of the hydrostatic pressure calculation both add communication steps. We should monitor whether these become significantly suboptimal in the presence of expensive communication. We can easily combine hydrostatic pressure with nonhydrostatic pressure with no loss of performance (probably a small performance increase, in fact). We can also in principle calculate nonlinear diffusivities "in-line", though when we tried this previously we were unable to achieve good performance. Also, "in-line" calculation of diffusivities makes the application of diffusivity boundary conditions much more difficult (or impossible).
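The ordering described above (deep interior first, near-boundary points only after halos are in place) can be sketched in a language-agnostic way. The following Python/NumPy example uses a 1-D diffusion-like stencil with halo width 1; `compute_tendency_split`, `stencil`, and `fill_halos` are hypothetical names for illustration, not Oceananigans API:

```python
import numpy as np

H = 1  # assumed halo width: the stencil reaches one point in each direction

def stencil(f, i):
    # Diffusion-like stencil: needs neighbors i-1 and i+1.
    return f[i - 1] - 2 * f[i] + f[i + 1]

def compute_tendency_split(f, fill_halos):
    """f carries H halo points on each end; interior indices are H..len(f)-H-1."""
    n = len(f)
    G = np.zeros(n)

    # 1. Deep interior: points whose stencils never touch halo data,
    #    so this loop is safe to run while halos are still stale.
    for i in range(2 * H, n - 2 * H):
        G[i] = stencil(f, i)

    # 2. Halo communication; in an MPI code this would overlap with step 1.
    fill_halos(f)

    # 3. Near-boundary points: these need the freshly filled halo values.
    for i in list(range(H, 2 * H)) + list(range(n - 2 * H, n - H)):
        G[i] = stencil(f, i)

    return G
```

Because step 1 never reads halo indices, the result is identical to filling halos first and then sweeping the whole interior in one pass.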
In PR #590 [WIP] I've prototyped my approach to adding support for distributed parallelism: a non-invasive
Distributed
MPI layer on top of Oceananigans that keeps the core code MPI-free. At last week's CliMA software meeting @lcw and @jkozdon pointed out a potential limitation of this approach: when running on many nodes, communication starts to eat up a lot of compute time, and it becomes beneficial to overlap computation and communication. Abstractions such as
CLIMA.MPIStateArray
help a lot with this but require MPI to be "baked in". Obviously this issue won't be tackled for a while, until we have a working distributed model and need more performance, so I'm just documenting it here for future discussion.
I think we can achieve this by splitting a kernel like
calculate_interior_source_terms!
into two kernels: one that computes source terms "near" the boundary (1-2? grid points from any boundary, as needed), so that halo communication can happen while a second, more compute-intensive kernel computes the source terms in the rest of the interior. But that only helps with one particular instance of halo communication. There will be other halo communications that may be impossible to overlap with compute-intensive kernels. Pursuing overlapping in this manner to the extreme and applying it to as many kernels as possible may be detrimental to code clarity.
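Assuming such a split, the overlap itself could look roughly like the following Python sketch, where a background thread stands in for nonblocking MPI progress (e.g. `MPI_Isend`/`MPI_Irecv` followed by a wait). All function names here are illustrative, not part of Oceananigans:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def interior_kernel(f, G, H):
    # Compute-intensive part: every point at least 2*H from the subdomain edge,
    # so it never reads halo values and can run while communication is in flight.
    for i in range(2 * H, len(f) - 2 * H):
        G[i] = f[i - 1] - 2 * f[i] + f[i + 1]

def boundary_kernel(f, G, H):
    # Cheap part: the near-boundary points that depend on up-to-date halos.
    n = len(f)
    for i in list(range(H, 2 * H)) + list(range(n - 2 * H, n - H)):
        G[i] = f[i - 1] - 2 * f[i] + f[i + 1]

def exchange_halos(f, H):
    # Stand-in for a nonblocking MPI halo exchange; here: periodic wrap.
    f[:H] = f[-2 * H:-H]
    f[-H:] = f[H:2 * H]

def step_with_overlap(f, H=1):
    G = np.zeros_like(f)
    with ThreadPoolExecutor(max_workers=1) as pool:
        comm = pool.submit(exchange_halos, f, H)  # "communication" in flight
        interior_kernel(f, G, H)                  # overlapped deep-interior compute
        comm.result()                             # wait for halos to arrive
    boundary_kernel(f, G, H)                      # finish near-boundary points
    return G
```

This is safe because the exchange only writes halo indices, which the interior kernel never reads; a real implementation would use nonblocking MPI requests rather than a host thread.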
Once we want more distributed performance we should go through the algorithm and minimize the number of halo communications (i.e. calls to
fill_halo_regions!
).

cc @leios @jm-c
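To make the payoff of merging halo fills concrete, here is a toy Python sketch (not Oceananigans code) that counts communication rounds: filling halos for `u`, `v`, and a tracer separately costs three rounds of per-message latency, while a merged fill, analogous to calling `fill_halo_regions!` on `merge(velocities, tracers)`, costs one:

```python
import numpy as np

class HaloExchanger:
    """Toy halo filler that counts exchange rounds, to make the cost of
    separate vs. merged fills visible. Purely illustrative."""
    def __init__(self):
        self.rounds = 0

    def fill(self, *fields, H=1):
        # One communication round can carry several fields packed together,
        # paying the per-message latency once instead of once per field.
        self.rounds += 1
        for f in fields:
            f[:H] = f[-2 * H:-H]   # periodic wrap as a stand-in for MPI
            f[-H:] = f[H:2 * H]

u, v, T = (np.arange(10, dtype=float) for _ in range(3))

separate = HaloExchanger()
for f in (u.copy(), v.copy(), T.copy()):
    separate.fill(f)          # three rounds of latency

merged = HaloExchanger()
merged.fill(u, v, T)          # one round carries all three fields
```

The filled halo values are identical either way; only the number of latency-bound rounds differs, which is what matters once communication dominates.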