Processor counts, load balancing and memory management on Cheyenne and Geyser
General rules for determining processor counts, memory requirements and mpi task distribution across nodes
Update 2018-02-07: geyser intra-node and inter-node communication now work correctly. However, inter-node communication is about 3x slower than intra-node communication. Consequently, the SLURM batch scripts for geyser still request only a single node (N=1).
All CESM postprocessing tools can be run either on cheyenne, on geyser directly or on geyser via cheyenne. create_postprocess creates a set of submission scripts for cheyenne (PBS) and for geyser (SLURM) in the $PP_CASE_PATH. Default settings with best guess processor and node counts and memory requirements for SLURM are defined in each submission script. These default settings are optimized for CMIP6 postprocessing and cylc workflow interaction.
There may be instances when these default settings are not optimal for the postprocessing tasks. For example, high-resolution ocean data, long sea ice timeseries diagnostics, or atmospheric data sets with many variables may require changes to the default settings in the SLURM batch submission stanza. Listed below are some general guidelines for modifying the batch submission settings.
For high-resolution ocean data, long ice timeseries or atm data sets with many variables, set the netcdf_format XML variable to netcdfLarge.
[compname]_averages_geyser has the SBATCH queue settings -n, -N, --ntasks-per-node, --mem and -t, which may need adjusting to maximize the amount of memory available per mpi task. Geyser shared nodes allow up to 16 mpi tasks per node and 1000 GB of memory shared across the tasks on that node.
The parallelism utilized in the PyAverager is a divide and rationing scheme based on the number of variables to be averaged and the number of different averages or climatology files to be computed.
See PyAverager README for details regarding what averages and climatology files are computed for each CESM component.
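The idea of spreading work over variables and averages can be illustrated with a minimal sketch. This is not the PyAverager implementation: the function `divide_work`, the round-robin assignment, and the variable and average names are all assumptions chosen for illustration. It only shows that the number of work items is the number of variables times the number of averages, and how those items might be shared among mpi tasks.

```python
from itertools import product


def divide_work(variables, averages, n_tasks):
    """Assign every (variable, average) pair to one of n_tasks ranks, round-robin.

    Hypothetical helper: a static partition used only to illustrate the idea.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    assignments = [[] for _ in range(n_tasks)]
    for index, item in enumerate(product(variables, averages)):
        assignments[index % n_tasks].append(item)
    return assignments


# Hypothetical variable names and average types, for illustration only.
variables = ["TS", "PRECT", "T", "U"]
averages = ["ann", "djf", "jja"]

for rank, items in enumerate(divide_work(variables, averages, n_tasks=5)):
    print(rank, len(items), items)
```

With 4 variables and 3 averages there are 12 work items, so requesting more than 12 tasks in this simplified picture would leave some tasks with nothing to do.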
If the variables are large (i.e. high resolution) or there are many years to be averaged, the best-performing layout uses as many shared geyser nodes as possible (-N) with fewer mpi tasks per node (-n and --ntasks-per-node), and increases both the memory available per task (--mem) and the wallclock time (-t) in the [compname]_averages_geyser submission script.
Be aware that there is a trade-off between requesting more resources in a shared environment and getting through the queue in a timely manner. Try to be considerate of other users when requesting a lot of memory on a node.
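The effect of reducing the tasks per node can be estimated with simple arithmetic. The sketch below uses the geyser shared-node figures quoted above (up to 16 mpi tasks and 1000 GB per node) and assumes, for illustration only, that a job requests the whole node's memory and that it is split evenly among the tasks. The function name `memory_per_task_gb` is hypothetical.

```python
# Figures quoted above for geyser shared nodes.
GEYSER_MAX_TASKS_PER_NODE = 16
GEYSER_NODE_MEMORY_GB = 1000


def memory_per_task_gb(tasks_per_node, node_memory_gb=GEYSER_NODE_MEMORY_GB):
    """Return the memory (GB) per mpi task, assuming an even split of node memory."""
    if not 1 <= tasks_per_node <= GEYSER_MAX_TASKS_PER_NODE:
        raise ValueError("tasks_per_node must be between 1 and 16")
    return node_memory_gb / tasks_per_node


for tasks in (16, 8, 4, 1):
    print(f"{tasks:2d} tasks per node: {memory_per_task_gb(tasks):.1f} GB per task")
```

Halving the tasks per node doubles the memory each task can use, at the cost of needing more nodes (or more wallclock time) for the same amount of work.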
The land diagnostics may require a separate regridding step for NE120 to 1 degree resolution of the averages and climatology files created by the PyAverager. This is because:
- unlike the atmosphere, where the regridding weights do not change, the land regridding weights need to be recalculated due to changing coastlines
- the underlying NCL scripts rely on 1 degree observation files for comparisons
The lnd_regrid and lnd_regrid_geyser scripts use a rationing scheme for parallelization based on the number of climatology files that need to be regridded. Depending on the size of the climatology input files, it may be necessary to use the geyser shared nodes with adjustments to the memory requirements (--mem) per task and possibly the wallclock (-t).
For NCL based diagnostics, the number of mpi tasks should not exceed the number of plot sets to be created as each plot set runs on its own mpi task. On geyser, this is set using -n and --ntasks-per-node. On cheyenne, the nodes are exclusive so there may be unused tasks on a node if the number of processors exceeds the number of plot sets.
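As a quick check before editing a submission script, the rule above can be written as a small calculation. This is a sketch under stated assumptions: `plan_ncl_tasks` is a hypothetical helper, not part of the postprocessing tools, and the numbers in the example are made up.

```python
def plan_ncl_tasks(n_plot_sets, requested_tasks):
    """Return (useful_tasks, idle_tasks) for an NCL diagnostics job.

    Each plot set runs on its own mpi task, so tasks beyond the number
    of plot sets have no work to do.
    """
    if n_plot_sets < 1 or requested_tasks < 1:
        raise ValueError("both counts must be at least 1")
    useful = min(requested_tasks, n_plot_sets)
    return useful, requested_tasks - useful


# Example with made-up numbers: 12 plot sets, 36 tasks requested.
useful, idle = plan_ncl_tasks(n_plot_sets=12, requested_tasks=36)
print(f"useful tasks: {useful}, idle tasks: {idle}")
```

On geyser the useful count is the value to use for -n and --ntasks-per-node; on cheyenne the idle count shows how many tasks on an exclusive node would go unused.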
Depending on the size of the climatology files created by the PyAverager, it may be necessary to run on geyser ([compname]_diagnostics_geyser) shared nodes with increased memory (--mem) and increased wallclock time.
The ILAMB and IOMB diagnostics require a 2-step process: first, an initialization step, which requires only 1 mpi task, and second, running the diagnostics. The parallelism is based on the number of different datasets to be compared.
The conversion of history slice files to single variable timeseries files uses 3 different levels of parallelization:
- number of history streams (e.g. cam.h0, cam.h1, clm2.h0, pop.h, cice.h, mosart.h0, etc.)
- number of variables in each history stream file
- number of "chunks" to be created (e.g. 10 year chunks of monthly history for a 100 year run would result in 10 "chunks" of single variable timeseries files)
The optimal number of mpi tasks is the sum of the 3 numbers listed above divided by 2 as some variables may be larger than others (e.g. 1-d vs. 2-d vs. 3-d vs. 4-d variables) and a rationing scheme is used to distribute the tasks by variable.
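The rule of thumb above can be worked through in a few lines. This is a sketch, not code from the postprocessing tools: the function names are hypothetical, rounding up to a whole task is an assumption, and the stream and variable counts in the example are made up. Only the 10-year chunks of a 100-year run come from the text.

```python
import math


def chunk_count(run_years, chunk_years):
    """Number of timeseries chunks for a run, counting a partial final chunk."""
    return math.ceil(run_years / chunk_years)


def suggested_timeseries_tasks(n_streams, n_variables, n_chunks):
    """Apply the rule of thumb: (streams + variables + chunks) / 2.

    Rounding up to a whole task is an assumption of this sketch.
    """
    return math.ceil((n_streams + n_variables + n_chunks) / 2)


# 10-year chunks of a 100-year run, as in the example above.
n_chunks = chunk_count(run_years=100, chunk_years=10)

# Made-up counts: 6 history streams, 200 variables in a stream file.
tasks = suggested_timeseries_tasks(n_streams=6, n_variables=200, n_chunks=n_chunks)
print(f"chunks: {n_chunks}, suggested mpi tasks: {tasks}")
```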
The only reason to run the timeseries on geyser is if the chunk sizes are very large, the data are high resolution, or there are project allocation constraints and more memory is required on the shared nodes.
By default, the output single variable timeseries files are compressed in netcdf4c format.
There is a hard constraint of at least 16 tasks per history stream.