List of common problems
Solution: put code in a conditional that checks for 0, and handles 0 values specially
Solution: put code in a conditional that handles mathematically impossible cases specially
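As a minimal sketch of the two conditional guards described above: the program below protects a division against a zero denominator and a square root against a negative argument. The variable names, the fallback values, and the choice of sqrt as the "mathematically impossible" case are assumptions for illustration; `r8` is defined locally as a stand-in for the model's double-precision kind.

```fortran
program guard_special_cases
  implicit none
  ! Assumption: r8 stands in for the model's double-precision kind.
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8) :: numer, denom, ratio, arg, root

  numer = 5._r8
  denom = 0._r8

  ! Guard the division: handle a zero denominator explicitly.
  if (denom /= 0._r8) then
     ratio = numer / denom
  else
     ratio = 0._r8   ! hypothetical fallback; choose what is physically sensible
  end if

  arg = -1.e-3_r8

  ! Guard the square root: a negative argument has no real result.
  if (arg >= 0._r8) then
     root = sqrt(arg)
  else
     root = 0._r8    ! hypothetical fallback
  end if

  print *, 'ratio =', ratio, ' root =', root
end program guard_special_cases
```

The right fallback value depends on the science of the quantity involved; zero is used here only to keep the sketch short.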
Solutions:
foo = max(foo, 0._r8)
or use the truncate_small_values subroutine in NumericsMod.
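The first solution can be tried in isolation with a sketch like the one below. It assumes the problem being addressed is a quantity that should be non-negative but ends up slightly below zero through roundoff. The arithmetic is a contrived example, and `r8` is a local stand-in for the model's kind parameter. The truncate_small_values subroutine is not shown because its interface is not given here.

```fortran
program clamp_negative
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)  ! stand-in for the model's r8 kind
  real(r8) :: foo

  ! Mathematically zero, but roundoff can leave a tiny nonzero residual.
  foo = 0.3_r8
  foo = foo - 0.1_r8
  foo = foo - 0.2_r8
  print *, 'before clamp:', foo

  ! Clamp so that a quantity that should be non-negative stays non-negative.
  foo = max(foo, 0._r8)
  print *, 'after clamp: ', foo
end program clamp_negative
```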
This is one of the most common issues we encounter. It causes both maintainability problems and future bugs (when one block of code gets modified and thus becomes out of sync with the other block). Sometimes a block of code is copied and pasted entirely; other times this issue is more subtle, with logic effectively duplicated in two places even though the code looks somewhat different.
This problem can be identified by asking yourself: If I make a change in place A, will I need to make a corresponding change in place B?
Possible solutions:
- Introduce a subroutine that holds the common code, with parameters to allow for any differences in behavior between the different locations
- Save the result of a calculation in a variable that can be reused in multiple places
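A minimal sketch of the first option, assuming two call sites that differ only by a scaling factor. The module, function, and argument names are hypothetical and do not come from CTSM.

```fortran
module shared_calc
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
contains
  ! Hypothetical shared routine: one home for logic that used to be
  ! copied into two places. The scale argument covers the one difference
  ! between the two call sites.
  pure function scaled_flux(state, rate, scale) result(flux)
    real(r8), intent(in) :: state, rate, scale
    real(r8) :: flux
    flux = scale * rate * state
  end function scaled_flux
end module shared_calc

program use_shared_calc
  use shared_calc
  implicit none
  real(r8) :: flux_a, flux_b

  ! Place A and place B call the same code with different parameters.
  flux_a = scaled_flux(state=10._r8, rate=0.1_r8, scale=1._r8)
  flux_b = scaled_flux(state=10._r8, rate=0.1_r8, scale=0.5_r8)
  print *, flux_a, flux_b
end program use_shared_calc
```

With this layout, a change to the formula is made once in scaled_flux, so place A and place B cannot drift out of sync.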
Variables need to be written to and read from the restart file if their value persists from one time step to the next. This is mainly the case for the model's fundamental state variables. However, in order to minimize restart file size (for reasons of disk usage, performance, and understandability), we try to avoid adding variables to the restart file unless they're truly needed.
Some good rules of thumb to use are:
- If a variable's value at time t directly depends on its value at time t-1, often through evolution equations of the form x = x + flux*dtime, then it likely needs to be on the restart file.
- Imagine setting a variable to 0 at the start of each time step. Would this cause incorrect results? If so, it likely needs to be on the restart file.
- However, if a variable can be recalculated based on other variables, then it probably should not be on the restart file. Instead, its initial value can be calculated in model initialization, after reading the restart file.
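The rules of thumb above can be sketched with a toy model that compares a continuous four-step run against a run restarted after step 2. Everything here is hypothetical: `x` plays the role of a state variable, `diag` a quantity that can be recalculated from it, and the "restart file" is just a saved copy of `x`.

```fortran
program restart_rules
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), parameter :: flux = 0.5_r8, dtime = 2._r8
  real(r8) :: x, diag, x_saved, x_restart, diag_restart
  integer :: step

  ! Continuous run: four time steps.
  x = 1._r8
  do step = 1, 4
     x = x + flux*dtime      ! state: depends on its previous value
     diag = 2._r8*x          ! diagnostic: recomputed from x each step
     if (step == 2) x_saved = x   ! "write" only x to the restart file
  end do

  ! Restarted run: read x, recalculate diag in initialization, do steps 3-4.
  x_restart = x_saved
  diag_restart = 2._r8*x_restart
  do step = 3, 4
     x_restart = x_restart + flux*dtime
     diag_restart = 2._r8*x_restart
  end do

  print *, 'state matches:     ', x == x_restart
  print *, 'diagnostic matches:', diag == diag_restart
end program restart_rules
```

Both comparisons come out true even though only `x` was saved, which is the situation in which `diag` should stay off the restart file.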
There are also some cases where a variable is needed on the restart file because its value is referenced (i.e., it appears on the right-hand side of an equation) earlier in the driver loop than where it is set. This is a subtle issue, and needs to be kept in mind when developing and reviewing code. For example, the relevant parts of the driver loop could look like this:
foo = bar*2
(more code here)
bar = ...
In reality, these lines will be in subroutines called from the main driver loop, so an understanding is needed of the calling order of the model's subroutines.
An ideal solution in this case is to reorder the code so that bar is calculated before it is used. The next best solution is to recalculate bar from other variables in initialization after the restart file is read. However, if neither of these is possible, then bar needs to be added to the restart file.
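A minimal sketch of the reordering option, using the foo and bar names from the example above. The loop stands in for the driver loop and the assignment to bar is a placeholder. The original ordering is kept as comments for comparison.

```fortran
program use_before_set
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8) :: foo, bar
  integer :: step

  bar = 0._r8   ! value at cold start
  do step = 1, 3
     ! Original order: foo reads bar from the previous step, so bar would
     ! have to be on the restart file for a restarted run to match.
     ! foo = bar*2._r8
     ! bar = real(step, r8)

     ! Reordered: bar is set before it is used, so nothing carries over
     ! from the previous time step.
     bar = real(step, r8)   ! placeholder for the real calculation of bar
     foo = bar*2._r8
     print *, step, foo
  end do
end program use_before_set
```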
We have many restart tests in the automated test suite; these catch many problems with variables being absent from the restart file that should be there. However, these tests cannot catch all problems - particularly if a variable's value only needs to persist between time steps in certain circumstances (such as if the restart is done mid-day). In addition, these tests cannot catch problems with variables being added to the restart file unnecessarily.
Example:
bar = ...
foo = bar + 1._r8
(more code here)
bar = bar + 1._r8
In this case, it's possible that the foo assignment should really have happened after the increment to bar.
Solution: Check code carefully for assignments to variables that are used on the right-hand side of equations. Ideally, this search would only need to be done within the current subroutine. But in practice, variables in CTSM are sometimes updated in multiple subroutines, so you should extend this search to make sure your new code happens in the correct place in the driver loop (i.e., make sure that no subroutine called later in the driver updates the quantity that you're using on the right-hand side of the equation).
Possible sources of answer changes with changing processor count (PEM test failures, also seen in ERP tests)
PEM and ERP tests are designed to catch problems specific to parallelization. In CTSM, these aren't the most common sources of errors with these tests, but we'll start with a few parallel-specific reasons that these tests could fail.
Namelist variables are read on the master proc and then should be broadcast to all other processors. If a broadcast statement is missing or incorrect for a namelist variable, then the namelist value could be wrong on all other processors. This will lead to answer changes with changing processor count because changing processor count will change which grid cells are on the master proc (with the correct namelist value) vs. other processors.
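The read-then-broadcast pattern can be sketched with plain MPI calls, as below. This is an illustration under stated assumptions, not CTSM's actual code: the namelist group `my_nml` and variable `my_param` are hypothetical, the namelist is read from an internal string rather than a file to keep the example self-contained, and standard `MPI_Bcast` is used in place of whatever broadcast wrappers the model itself provides.

```fortran
program bcast_namelist
  use mpi
  implicit none
  integer :: ierr, rank
  double precision :: my_param            ! hypothetical namelist variable
  character(len=64) :: nml_text
  namelist /my_nml/ my_param

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  my_param = -999.d0                      ! default on every processor
  if (rank == 0) then
     ! Only the master proc reads the namelist (from an internal string
     ! here; a real model would read it from a file).
     nml_text = '&my_nml my_param = 2.5d0 /'
     read(nml_text, nml=my_nml)
  end if

  ! Without this broadcast, every rank other than 0 keeps the default.
  call MPI_Bcast(my_param, 1, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

  print *, 'rank', rank, 'my_param =', my_param
  call MPI_Finalize(ierr)
end program bcast_namelist
```

Commenting out the `MPI_Bcast` call and running on more than one rank reproduces the failure mode: rank 0 prints the namelist value while the other ranks print the default.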
An obvious source of answer changes with changing processor count is processor count dependence of a parallel algorithm. A common issue here is an MPI reduction that depends on the processor count – e.g., a sum across multiple processors, which could depend on the order in which the sum is taken.
However, we do not have many parallel algorithms in CTSM; these would mainly apply in cases where there is communication between grid cells.
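The order dependence of a floating-point sum can be seen without any MPI at all. The sketch below adds the same three values in two groupings, standing in for partial sums that are combined differently at different processor counts. The values are contrived to make the effect obvious.

```fortran
program sum_order
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8) :: a, b, c, sum_left, sum_right

  a = 1.e16_r8
  b = -1.e16_r8
  c = 1._r8

  ! The same three terms, combined in two different orders, as could
  ! happen when partial sums are grouped differently across processors.
  sum_left  = (a + b) + c
  sum_right = a + (b + c)

  print *, 'sum_left  =', sum_left
  print *, 'sum_right =', sum_right
  print *, 'identical?', sum_left == sum_right
end program sum_order
```

In IEEE double precision the first grouping gives 1 and the second gives 0, because adding 1 to -1.e16 is lost to rounding. In real model sums the differences are usually at roundoff level, but that is enough to fail a bit-for-bit comparison.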
One common cause of processor count dependence is the incorrect indexing of a subgrid variable. For example, if a patch variable is indexed by g instead of p, then it will access the wrong index. This will be picked up in a PEM (or ERP) test because exactly which point it accesses is processor count-dependent.
Here are some git grep commands that can help find this problem:
git grep -i '_patch *( *[glc] *)'
git grep -i '_col *( *[glp] *)'
git grep -i '_lun *( *[gcp] *)'
git grep -i '_grc *( *[lcp] *)'
However, since we often strip the suffix in associate statements, you cannot rely on these grep commands to detect this issue.
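The limitation of the grep commands can be seen in a small sketch. The derived type and variable names below are hypothetical stand-ins for CTSM-style data structures. Inside the associate block the patch-level array is known only as `temp`, so the buggy line contains no `_patch` suffix for the search to find.

```fortran
program associate_hides_suffix
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)

  ! Hypothetical stand-in for a CTSM-style derived type.
  type :: example_type
     real(r8) :: temp_patch(4)
  end type example_type

  type(example_type) :: inst
  integer :: g, p

  inst%temp_patch = [10._r8, 20._r8, 30._r8, 40._r8]
  g = 1
  p = 3

  associate (temp => inst%temp_patch)
     ! Bug: a patch-level array indexed by g. The _patch suffix was
     ! dropped in the associate, so a suffix-based search misses this line.
     print *, 'wrong index:  ', temp(g)
     print *, 'correct index:', temp(p)
  end associate
end program associate_hides_suffix
```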
Another common cause of answer changes with changing processor counts is a scalar variable being used before it is set in a given loop iteration. This means that its value depends on a previous loop iteration, or possibly the value that was set in an earlier loop in this subroutine. Changing processor counts changes which grid cell is operated on first in a loop for a given processor, and also which grid cell is the previous loop iteration for a given grid cell.
There are a few common specific ways that this appears in CTSM code, as noted below:
Often, a loop over one subgrid level will access variables in arrays at a coarser subgrid level. For example, a loop over patches will access column and gridcell-level variables. This requires settings like c = patch%column(p). Sometimes there is a bug where a given loop is missing one of these needed settings; instead its setting comes from the previous loop in that subroutine. In this case, all patches, on all grid cells, will use the same c value.
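A minimal sketch of this bug and its fix follows. The `patch_type` below is a hypothetical stand-in that mimics only the `patch%column(p)` mapping mentioned above; the array sizes and values are invented.

```fortran
program stale_column_index
  implicit none
  integer, parameter :: npatch = 4

  ! Hypothetical stand-in for the patch-to-column mapping.
  type :: patch_type
     integer :: column(npatch)
  end type patch_type

  type(patch_type) :: patch
  integer :: p, c

  patch%column = [1, 1, 2, 2]

  ! An earlier loop leaves c holding its last value.
  do p = 1, npatch
     c = patch%column(p)
  end do

  ! Buggy loop: c is never set here, so every patch sees the stale c.
  do p = 1, npatch
     print *, 'buggy: patch', p, 'uses column', c
  end do

  ! Fixed loop: c is set at the top of each iteration.
  do p = 1, npatch
     c = patch%column(p)
     print *, 'fixed: patch', p, 'uses column', c
  end do
end program stale_column_index
```

In the buggy loop every patch reports the last column from the earlier loop. Which column that is depends on which points the processor owns, which is why the answers change with processor count.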
Sometimes a scalar variable is set inside a conditional but is accessed outside that conditional. There may be multiple branches of the conditional with the intent that the scalar is set for all cases, but there may be a missing branch, so in some situations the scalar doesn't end up getting set for a particular point. The value will then be taken from the previous loop iteration.
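The missing-branch case can be sketched as follows. The point types and scale values are hypothetical. The buggy and fixed versions run side by side so the difference shows up at the point whose type has no branch.

```fortran
program missing_branch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: npts = 4
  integer :: itype(npts), i
  real(r8) :: scale_buggy, scale_fixed

  itype = [1, 2, 3, 1]   ! hypothetical point types; type 3 has no branch below

  do i = 1, npts
     ! Buggy: no branch for itype == 3, so scale_buggy keeps the value
     ! from the previous iteration.
     if (itype(i) == 1) then
        scale_buggy = 1._r8
     else if (itype(i) == 2) then
        scale_buggy = 2._r8
     end if

     ! Fixed: an else branch guarantees the scalar is set for every point.
     if (itype(i) == 1) then
        scale_fixed = 1._r8
     else if (itype(i) == 2) then
        scale_fixed = 2._r8
     else
        scale_fixed = 0._r8   ! hypothetical default
     end if

     print *, i, itype(i), scale_buggy, scale_fixed
  end do
end program missing_branch
```

At the third point the buggy scalar silently reuses the value set for the second point, while the fixed version uses its explicit default.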
foo(:) = 0._r8
should be replaced by the following, assuming foo is a column-level array:
foo(bounds%begc:bounds%endc) = 0._r8
or, better, initialize foo within a loop over the appropriate filter, or a loop over bounds%begc to bounds%endc, ideally subset by active points.
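Both alternatives can be sketched in a few lines. The `bounds_type`, the filter arrays `num_c` and `filter_c`, and all index values below are hypothetical stand-ins, not CTSM's actual definitions. The array is pre-filled with -1 so that the untouched elements are visible in the output.

```fortran
program init_within_bounds
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)

  ! Hypothetical stand-in for a bounds type with column limits.
  type :: bounds_type
     integer :: begc, endc
  end type bounds_type

  type(bounds_type) :: bounds
  real(r8) :: foo(1:10)
  integer :: num_c, filter_c(10)   ! hypothetical filter of active columns
  integer :: fc, c

  foo(:) = -1._r8                  ! pretend these elements already hold values
  bounds%begc = 4
  bounds%endc = 7

  ! Explicit bounds: only columns begc..endc are touched.
  foo(bounds%begc:bounds%endc) = 0._r8

  ! Filter loop: only the listed columns are touched.
  num_c = 2
  filter_c(1:2) = [5, 6]
  do fc = 1, num_c
     c = filter_c(fc)
     foo(c) = 1._r8
  end do

  print '(10f5.1)', foo
end program init_within_bounds
```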
One way to identify threaded loops with issues is to turn off threading for more and more loops until you find the ones that cause problems.
For example, consider a subroutine with a loop like this:
!$OMP PARALLEL DO PRIVATE (nc,bounds_clump)
do nc = 1,nclumps
call get_clump_bounds(nc, bounds_clump)
...
end do
!$OMP END PARALLEL DO
For threading, each thread needs to have its own copy of temporary and loop-indexing variables. For longer loops, the list of variables needed can be quite long. If the private list is missing any of them, different threads will share those variables, resulting in strange behavior.
So in the above example, if either nc or bounds_clump were missing from the private list, the loop would not function correctly.
Turn off OpenMP parallelism by removing the first line or adding an extra comment character (!) to comment it out.
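The role of the private list discussed above can be tried out in a self-contained sketch. The loop and variable names here are hypothetical and much simpler than a clump loop. The program also compiles without OpenMP enabled, in which case the directives are treated as ordinary comments.

```fortran
program omp_private_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: n = 8
  real(r8) :: vals(n), tmp
  integer :: i

  ! tmp is a per-iteration temporary, so each thread needs its own copy.
  ! Leaving tmp out of the PRIVATE list would let threads overwrite each
  ! other's value between the two statements in the loop body.
  !$OMP PARALLEL DO PRIVATE (i, tmp)
  do i = 1, n
     tmp = real(i, r8) * 2._r8
     vals(i) = tmp + 1._r8
  end do
  !$OMP END PARALLEL DO

  print '(8f6.1)', vals
end program omp_private_sketch
```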
Active points are (generally, but not entirely) ones with > 0 weight on the grid cell.
Solution: Use the filters, which only include active points
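A minimal sketch of what a filter is and how a loop uses it. The names `active`, `wt`, `filter_c`, and `num_c` are hypothetical, and "active" is simplified here to mean weight > 0, which the note above says is only generally true.

```fortran
program active_filter
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: ncol = 5
  logical :: active(ncol)          ! hypothetical active flag per column
  real(r8) :: wt(ncol), val(ncol)
  integer :: filter_c(ncol), num_c, fc, c

  wt = [0.5_r8, 0._r8, 0.3_r8, 0._r8, 0.2_r8]
  active = wt > 0._r8              ! simplification: active means weight > 0

  ! Build the filter: a packed list of the active column indices.
  num_c = 0
  do c = 1, ncol
     if (active(c)) then
        num_c = num_c + 1
        filter_c(num_c) = c
     end if
  end do

  ! Loop over the filter so inactive columns are never touched.
  val = 0._r8
  do fc = 1, num_c
     c = filter_c(fc)
     val(c) = 1._r8 / wt(c)        ! safe: wt(c) > 0 for every filtered point
  end do

  print *, 'active columns:', filter_c(1:num_c)
end program active_filter
```

Looping from 1 to ncol instead would divide by the zero weights of the inactive columns.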
If someone commits a binary file without git lfs enabled, it will actually be committed directly. The same thing will happen even if they have git lfs enabled if the file has an extension that isn't currently tracked by git lfs. Look for any such binary files when looking through the list of changed files. These will often appear in the doc directory.
See the .gitattributes file at the top level of the repository for the file types typically handled by git lfs.