CRM Testing Framework

Matt Norman edited this page Nov 22, 2019 · 2 revisions

Overview

The gist of testing the CRM is to create baselines in 2-D and 3-D with one-moment (1-mom) microphysics for full coverage. Each baseline is run at two optimization levels to approximate bit-level differences and how they amplify non-linearly over one model day of simulation. The difference between the -O0 and -O2 runs serves as an envelope for how much your refactored code may differ after one model day: the 3-way diff compares your difference from -O2 against the difference between -O0 and -O2. You shouldn't be much more than 2x outside that envelope. Note that some variables exist but are filled with junk (such as UNICON variables we'll probably never use), and I have no idea why.
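To make the envelope arithmetic concrete, here is a minimal sketch for a single scalar value. The function name `envelope_ratio` is hypothetical, and it is an assumption that a plain absolute difference is a fair stand-in for whatever norm nccmp3.py actually computes per variable.

```shell
#!/bin/bash
# Hypothetical helper (not part of E3SM or nccmp3.py).
# Arguments: value from the -O0 baseline, value from the -O2 baseline,
#            value from your refactored run.
# Prints |test - O2| / |O0 - O2|, i.e. how many envelopes away you are.
envelope_ratio() {
  awk -v o0="$1" -v o2="$2" -v t="$3" 'BEGIN {
    env = o0 - o2; if (env < 0) env = -env   # compiler envelope |O0 - O2|
    dif = t - o2;  if (dif < 0) dif = -dif   # your difference  |test - O2|
    if (env == 0) {
      # No envelope at all: only an exact match is acceptable.
      if (dif == 0) print "0.00"; else print "inf"
      exit
    }
    printf "%.2f\n", dif / env
  }'
}

envelope_ratio 1.0 1.5 2.25   # prints 1.50
```

A result of 1.50 means the refactored value sits one and a half envelopes away from the -O2 baseline, which is inside the rough 2x guideline.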

  • For PGI
    • Use -O2 -Mvect=nosimd (the default) and -O0
    • IMPORTANT: You must remove the -Mvect=nosimd for the -O0 case because that flag will force -O2, which you don't want.
  • For everything else
    • Use -O2 and -O0
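The PGI caveat is easy to forget when editing flags by hand, so a small helper may be useful. This is only a sketch: `to_O0_flags` is a hypothetical name, and it works on a flag string you pass in rather than assuming any particular layout of Macros.make or Macros.cmake.

```shell
#!/bin/bash
# Hypothetical helper: turn an optimized flag string into its -O0 counterpart.
# It drops -Mvect=nosimd (which would otherwise force -O2 under PGI) and
# rewrites the first -O<digit> flag to -O0.
to_O0_flags() {
  printf '%s\n' "$1" | sed -E \
    -e 's/(^|[[:space:]])-Mvect=nosimd([[:space:]]|$)/\1\2/g' \
    -e 's/(^|[[:space:]])-O[0-9]([[:space:]]|$)/\1-O0\2/' \
    -e 's/[[:space:]]+/ /g' \
    -e 's/^ | $//g'
}

to_O0_flags "-g -O2 -Mvect=nosimd"   # prints: -g -O0
```

For compilers other than PGI the first substitution simply matches nothing, so the same helper covers the "everything else" case.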

You will change the optimization levels in Macros.make and Macros.cmake. After each simulation, copy the restart file to the case directory under a name that records the optimization level, e.g.,

# Run the -O0 case
cp $RUNDIR/$CASE.cam.r.0001-01-02-00000.nc ./$CASE.cam.r.0001-01-02-00000.optO0.nc
# Run the -O2 case
cp $RUNDIR/$CASE.cam.r.0001-01-02-00000.nc ./$CASE.cam.r.0001-01-02-00000.optO2.nc
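The two copies can be wrapped in one function so the file naming stays consistent. A minimal sketch, assuming the restart file name pattern shown above; `save_baseline` is a hypothetical name, not an E3SM tool.

```shell
#!/bin/bash
# Hypothetical helper: copy the one-day restart file out of the run directory
# into the current directory, tagged with the optimization level it came from.
# Arguments: run directory, case name, tag (e.g. O0 or O2).
save_baseline() {
  local rundir="$1"
  local case_name="$2"
  local tag="$3"
  local src="$rundir/$case_name.cam.r.0001-01-02-00000.nc"
  if [ ! -f "$src" ]; then
    # Fail loudly rather than leave a stale baseline in place.
    echo "missing restart file: $src" >&2
    return 1
  fi
  cp "$src" "./$case_name.cam.r.0001-01-02-00000.opt$tag.nc"
}

# Usage from the case directory, after the -O0 run has finished:
#   save_baseline "$RUNDIR" "$CASE" O0
```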

With the -O0 and -O2 baselines in place for 2D and 3D, you can then run your regressions and compare against baseline.

The following script, run from $E3SM_HOME/cime/scripts, will set up your baselines using the PGI compiler on the CPU:

#!/bin/bash

E3SM_HOME=~/ACME-ECP
COMPILER=pgi
MACH=summit-cpu
PES=84x1
RES=ne4_ne4
PROJ=stf006

CASE=sp1vfast2d_baseline
./create_newcase -compset FSP1FAST -case $CASE -compiler $COMPILER -mach $MACH -project $PROJ -pecount $PES -res $RES --handle-preexisting-dirs r || exit -1
cd $CASE
./xmlchange ATM_NCPL=144,STOP_N=1
./xmlchange CHARGE_ACCOUNT=$PROJ
./xmlchange CAM_CONFIG_OPTS="-phys cam5 -use_SPCAM -crm_adv MPDATA -nlev 30 -crm_nz 28 -crm_dx 4000 -crm_dt 20  -microphys mg2 -cppdefs ' -DSP_DIR_NS ' -rad rrtmg       -crm_nx 4 -crm_ny 1 -crm_nx_rad 1 -crm_ny_rad 1 -SPCAM_microp_scheme sam1mom  -chem none  -bc_dep_to_snow_updates"
cat > user_nl_cam << 'eof'
prescribed_aero_cycle_yr = 2000
prescribed_aero_file = 'mam3_1.9x2.5_L30_2000clim_c130319.nc'
prescribed_aero_datapath = '/gpfs/alpine/world-shared/csc190/e3sm/cesm/inputdata/atm/cam/chem/trop_mam/aero'
use_hetfrz_classnuc = .false.
prescribed_aero_type = 'CYCLICAL'
aerodep_flx_type = 'CYCLICAL'
aerodep_flx_datapath = '/gpfs/alpine/world-shared/csc190/e3sm/cesm/inputdata/atm/cam/chem/trop_mam/aero'
aerodep_flx_file = 'mam3_1.9x2.5_L30_2000clim_c130319.nc'
aerodep_flx_cycle_yr = 2000
srf_flux_avg = 1
eof

cd ..

CASE=sp1vfast3d_baseline
./create_newcase -compset FSP1FAST -case $CASE -compiler $COMPILER -mach $MACH -project $PROJ -pecount $PES -res $RES --handle-preexisting-dirs r || exit -1
cd $CASE
./xmlchange ATM_NCPL=144,STOP_N=1
./xmlchange CHARGE_ACCOUNT=$PROJ
cat > user_nl_cam << 'eof'
prescribed_aero_cycle_yr = 2000
prescribed_aero_file = 'mam3_1.9x2.5_L30_2000clim_c130319.nc'
prescribed_aero_datapath = '/gpfs/alpine/world-shared/csc190/e3sm/cesm/inputdata/atm/cam/chem/trop_mam/aero'
use_hetfrz_classnuc = .false.
prescribed_aero_type = 'CYCLICAL'
aerodep_flx_type = 'CYCLICAL'
aerodep_flx_datapath = '/gpfs/alpine/world-shared/csc190/e3sm/cesm/inputdata/atm/cam/chem/trop_mam/aero'
aerodep_flx_file = 'mam3_1.9x2.5_L30_2000clim_c130319.nc'
aerodep_flx_cycle_yr = 2000
srf_flux_avg = 1
eof

From there, cd into sp1vfast[23]d_baseline, run ./case.setup, change Macros.[c]make as needed, then run ./case.build and ./case.submit to create your baseline netCDF files. You must be on a clean and updated master branch before you generate baselines, e.g.: git checkout master && git fetch origin && git reset --hard origin/master.

Regressions

Now you're ready to perform quick-running regression tests. The following script, run from $E3SM_HOME/cime/scripts, will set up your regression test cases:

#!/bin/bash

E3SM_HOME=~/ACME-ECP
COMPILER=pgigpu
MACH=summit
PES=18x1
RES=ne4_ne4
PROJ=stf006

CASE=sp1vfast2d_regression
./create_newcase -compset FSP1FAST -case $CASE -compiler $COMPILER -mach $MACH -project $PROJ -pecount $PES -res $RES --handle-preexisting-dirs r || exit -1
cd $CASE
./xmlchange ATM_NCPL=144,STOP_N=1
./xmlchange CHARGE_ACCOUNT=$PROJ
./xmlchange CAM_CONFIG_OPTS="-phys cam5 -use_SPCAM -crm_adv MPDATA -nlev 30 -crm_nz 28 -crm_dx 4000 -crm_dt 20  -microphys mg2 -cppdefs ' -DSP_DIR_NS ' -rad rrtmg       -crm_nx 4 -crm_ny 1 -crm_nx_rad 1 -crm_ny_rad 1 -SPCAM_microp_scheme sam1mom  -chem none  -bc_dep_to_snow_updates"
cat > user_nl_cam << 'eof'
prescribed_aero_cycle_yr = 2000
prescribed_aero_file = 'mam3_1.9x2.5_L30_2000clim_c130319.nc'
prescribed_aero_datapath = '/gpfs/alpine/world-shared/csc190/e3sm/cesm/inputdata/atm/cam/chem/trop_mam/aero'
use_hetfrz_classnuc = .false.
prescribed_aero_type = 'CYCLICAL'
aerodep_flx_type = 'CYCLICAL'
aerodep_flx_datapath = '/gpfs/alpine/world-shared/csc190/e3sm/cesm/inputdata/atm/cam/chem/trop_mam/aero'
aerodep_flx_file = 'mam3_1.9x2.5_L30_2000clim_c130319.nc'
aerodep_flx_cycle_yr = 2000
srf_flux_avg = 1
eof

cd ..

CASE=sp1vfast3d_regression
./create_newcase -compset FSP1FAST -case $CASE -compiler $COMPILER -mach $MACH -project $PROJ -pecount $PES -res $RES --handle-preexisting-dirs r || exit -1
cd $CASE
./xmlchange ATM_NCPL=144,STOP_N=1
./xmlchange CHARGE_ACCOUNT=$PROJ
cat > user_nl_cam << 'eof'
prescribed_aero_cycle_yr = 2000
prescribed_aero_file = 'mam3_1.9x2.5_L30_2000clim_c130319.nc'
prescribed_aero_datapath = '/gpfs/alpine/world-shared/csc190/e3sm/cesm/inputdata/atm/cam/chem/trop_mam/aero'
use_hetfrz_classnuc = .false.
prescribed_aero_type = 'CYCLICAL'
aerodep_flx_type = 'CYCLICAL'
aerodep_flx_datapath = '/gpfs/alpine/world-shared/csc190/e3sm/cesm/inputdata/atm/cam/chem/trop_mam/aero'
aerodep_flx_file = 'mam3_1.9x2.5_L30_2000clim_c130319.nc'
aerodep_flx_cycle_yr = 2000
srf_flux_avg = 1
eof

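Now, the following script will run the 2-D and 3-D regressions and perform a 3-way diff against baseline. The dim2 and dim3 switches select which cases run; as written, dim3=0 skips the 3-D case. Replace the /gpfs/alpine/scratch/imn/stf006 run directory path with your own: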

#!/bin/bash
#BSUB -P stf006
#BSUB -W 02:00
#BSUB -nnodes 1
#BSUB -J regression
#BSUB -o regdim23.%J
#BSUB -e regdim23.%J
#BSUB -alloc_flags gpumps

source $MODULESHOME/init/bash
ulimit -s unlimited

dim2=1
dim3=0

clean=0
build=1
submit=1

E3SM_HOME=~/ACME-ECP

cd $E3SM_HOME/cime/scripts

if [[ $dim2 -eq 1 ]]; then
  CASE=sp1vfast2d_regression
  if [ ! -d "$CASE" ]; then
    echo "************* ERROR: 2D CASE DOES NOT EXIST *************"
    exit 1
  fi
  cd $CASE
  if [[ $clean  -eq 1 ]]; then
    echo "************* CLEANING 2D CASE *************"
    ./case.build --clean-all
  fi
  if [[ $build  -eq 1 ]]; then
    echo "************* BUILDING 2D CASE *************"
    ./case.build  || exit -1
  fi
  if [[ $submit -eq 1 ]]; then
    echo "************* SUBMITTING 2D CASE *************"
    ./case.submit --no-batch || exit -1
    cp /gpfs/alpine/scratch/imn/stf006/e3sm/$CASE/run/$CASE.cam.r.0001-01-02-00000.nc .
  fi
  echo "************* DIFF'ING 2D *************"
  module add python/3.7.0-anaconda3-5.3.0
  source activate rrtmgp-env
  python $E3SM_HOME/cime/tools/nccmp/nccmp3.py $E3SM_HOME/cime/scripts/sp1vfast2d_baseline/sp1vfast2d_baseline.cam.r.0001-01-02-00000.optO0.nc \
                                               $E3SM_HOME/cime/scripts/sp1vfast2d_baseline/sp1vfast2d_baseline.cam.r.0001-01-02-00000.optO2.nc \
                                               $E3SM_HOME/cime/scripts/sp1vfast2d_regression/sp1vfast2d_regression.cam.r.0001-01-02-00000.nc
  source deactivate
  module rm python
  echo ""
  
  cd ..
fi

if [[ $dim3 -eq 1 ]]; then
  CASE=sp1vfast3d_regression
  if [ ! -d "$CASE" ]; then
    echo "************* ERROR: 3D CASE DOES NOT EXIST *************"
    exit 1
  fi
  cd $CASE
  if [[ $clean  -eq 1 ]]; then
    echo "************* CLEANING 3D CASE *************"
    ./case.build --clean-all
  fi
  if [[ $build  -eq 1 ]]; then
    echo "************* BUILDING 3D CASE *************"
    ./case.build || exit -1
  fi
  if [[ $submit -eq 1 ]]; then
    echo "************* SUBMITTING 3D CASE *************"
    ./case.submit --no-batch || exit -1
    cp /gpfs/alpine/scratch/imn/stf006/e3sm/$CASE/run/$CASE.cam.r.0001-01-02-00000.nc .
  fi
  echo "************* DIFF'ING 3D *************"
  module add python/3.7.0-anaconda3-5.3.0
  source activate rrtmgp-env
  python $E3SM_HOME/cime/tools/nccmp/nccmp3.py $E3SM_HOME/cime/scripts/sp1vfast3d_baseline/sp1vfast3d_baseline.cam.r.0001-01-02-00000.optO0.nc \
                                               $E3SM_HOME/cime/scripts/sp1vfast3d_baseline/sp1vfast3d_baseline.cam.r.0001-01-02-00000.optO2.nc \
                                               $E3SM_HOME/cime/scripts/sp1vfast3d_regression/sp1vfast3d_regression.cam.r.0001-01-02-00000.nc
  source deactivate
  module rm python
  echo ""
fi

Python Environment

You'll have to replace rrtmgp-env in source activate rrtmgp-env with the name of an Anaconda environment you've created that includes netCDF. As an example of how to do this:

module load python/3.7.0-anaconda3-5.3.0
conda create -n rrtmgp-env python=3.7 openssl=1.1.1b numpy netcdf4 xarray

You only need to create this environment once. From then on, you can just source activate it. Note that having an Anaconda environment loaded tends to screw up the E3SM python scripts, so it's best to source deactivate before you run any E3SM script.
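One way to avoid running E3SM scripts with an environment still loaded is a small guard. This is a sketch under one assumption: that your conda version sets the CONDA_DEFAULT_ENV variable on activation and clears it on deactivation. The function name `require_no_conda_env` is hypothetical.

```shell
#!/bin/bash
# Hypothetical guard: succeed only when no conda environment is active.
# Assumption: conda exports CONDA_DEFAULT_ENV while an environment is active.
require_no_conda_env() {
  if [ -n "${CONDA_DEFAULT_ENV:-}" ]; then
    echo "conda environment '$CONDA_DEFAULT_ENV' is active; deactivate it first" >&2
    return 1
  fi
}

# Usage: only build when the shell is clean.
#   require_no_conda_env && ./case.build
```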

Some GPU details

Currently we use FFLAGS in Depends.summit.cmake and Depends.summit.[compiler].cmake to apply the compiler flags for GPU offloading. The reason is that the PGI compiler gives wrong answers at run time if you use the offloading flags for all files, so we apply them only to the files that need them. Note that we also have to change LDFLAGS in Macros.[c]make for linking purposes.

I recommend for OpenMP offload porting of the CRM that you delete all !$acc statements in crm_module.F90 that are outside the "main time stepping loop". The reason is that they currently use class and derived type data, which causes compilers some issues. Inside the main time stepping loop, however, you'll find that there aren't any pointers or direct references to derived type data.

For OpenMP porting, I recommend not bothering with optimizing data movement up front. The XL compiler will move all data for you in Fortran, so you don't need to worry about data statements. I also recommend not using the depend(inout:asyncid) nowait clause up front, to avoid potential wrong answers from forgetting to put in !$omp taskwait. Finally, I don't recommend porting everything at once. Go, say, 10 kernels at a time, working your way through the code. You will encounter segfaults with OpenMP if you do everything at once. I tried it a few weeks ago.

To get a full traceback in XL, you'll need to specify -g -qtbtable=full. It actually does an admirable job doing a full traceback.

Also, the XL compiler does not recognize simd as a useful clause in OpenMP Offload. So just use !$omp target teams distribute parallel do collapse(N) private(...) to replace !$acc parallel loop collapse(N) private(...). Since the CRM is Fortran, the combined construct is spelled parallel do; parallel for is the C/C++ spelling.
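For a first pass over a file, the directive swap can be done mechanically. The sketch below is an assumption-heavy convenience, not a porting tool. The filter name `acc_to_omp` is hypothetical. It only rewrites the single-line, lowercase form `!$acc parallel loop`, and it uses the Fortran spelling `parallel do`. Continuation lines, other `!$acc` directives, and data clauses are left untouched for you to handle by hand.

```shell
#!/bin/bash
# Hypothetical filter: rewrite "!$acc parallel loop" at the start of a line
# into the OpenMP offload equivalent, keeping indentation and trailing clauses
# such as collapse(N) and private(...).
acc_to_omp() {
  sed -E 's/^([[:space:]]*)!\$acc[[:space:]]+parallel[[:space:]]+loop/\1!$omp target teams distribute parallel do/'
}

printf '%s\n' '  !$acc parallel loop collapse(2) private(tmp)' | acc_to_omp
# prints:   !$omp target teams distribute parallel do collapse(2) private(tmp)
```

Working through roughly 10 kernels at a time, you could pipe a copy of the file through the filter, review the diff, and rebuild before moving on.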