To build miniQMC you will need:
- CMake
- C/C++ compilers with C++11 support
- BLAS/LAPACK numerical libraries; platform-optimized implementations are recommended
cd build
cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx ..
make -j 8
CMake provides a number of optional variables that can be set to control the build and configure steps. When passed to CMake, these variables take precedence over the environment and default variables. To set them, add -D FLAG=VALUE to the configure line between the cmake command and the path to the source directory. An example configure line combining several options is shown after the lists below.
- General build options
CMAKE_C_COMPILER Set the C compiler
CMAKE_CXX_COMPILER Set the C++ compiler
CMAKE_BUILD_TYPE A variable which controls the type of build (defaults to Release).
Possible values are:
None (Do not set debug/optimize flags, use CMAKE_C_FLAGS or CMAKE_CXX_FLAGS)
Debug (create a debug build)
Release (create a release/optimized build)
RelWithDebInfo (create a release/optimized build with debug info)
MinSizeRel (create an executable optimized for size)
CMAKE_C_FLAGS Set the C flags. Note: to prevent default debug/release flags
from being used, set the CMAKE_BUILD_TYPE=None
Also supported: CMAKE_C_FLAGS_DEBUG, CMAKE_C_FLAGS_RELEASE,
CMAKE_C_FLAGS_RELWITHDEBINFO
CMAKE_CXX_FLAGS Set the C++ flags. Note: to prevent default debug/release flags
from being used, set the CMAKE_BUILD_TYPE=None
Also supported: CMAKE_CXX_FLAGS_DEBUG, CMAKE_CXX_FLAGS_RELEASE,
CMAKE_CXX_FLAGS_RELWITHDEBINFO
- Key QMC build options
QMC_MPI Build with MPI (default 1:yes, 0:no)
QMC_COMPLEX Build the complex (general twist/k-point) version (1:yes, default 0:no)
QMC_MIXED_PRECISION Build the mixed precision (mixing double/float) version (1:yes, default 0:no).
Use float and double for base and full precision.
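For example, to configure a complex, mixed-precision build with MPI compiler wrappers (an illustrative combination of the options listed above):
cd build
cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DQMC_COMPLEX=1 -DQMC_MIXED_PRECISION=1 ..
make -j 8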
Executables are created in the ./bin folder. There are a few of them:
check_determinant # checks the determinant part of the wavefunction
check_wfc # checks Jastrow components of the wave function
check_spo # checks single particle orbitals including 3D-cubic splines.
miniqmc # runs a fake DMC and reports the time spent in each component.
It is recommended to run all the check_XXX routines. They report either "All checking pass!" or a failure message indicating where the failure is.
All the executables can run without any arguments or input files, using default settings. If more control is needed, use the -h option to print the available options.
This is an example of how miniqmc reports its timings. The default setting is a fake DMC run mimicking the NiO 32-atom cell simulation.
==================================
Stack timer profile
Timer Inclusive_time Exclusive_time Calls Time_per_call
Total 1.6640 0.0131 1 1.663990021
Diffusion 1.0446 0.0158 100 0.010445936
Current Gradient 0.0054 0.0036 38400 0.000000140
Jastrow 0.0017 0.0017 38400 0.000000045
Distance Tables 0.3296 0.3296 95899 0.000003437
New Gradient 0.6151 0.0069 38395 0.000016021
Jastrow 0.0973 0.0973 38395 0.000002534
Single-Particle Orbitals 0.5110 0.5110 38395 0.000013308
Update 0.0785 0.0785 18999 0.000004133
Wavefuntion GL 0.0002 0.0002 100 0.000002389
Pseudopotential 0.6063 0.0098 100 0.006063325
Distance Tables 0.2907 0.2907 74484 0.000003902
Value 0.3059 0.0116 74484 0.000004107
Jastrow 0.1221 0.1221 74484 0.000001640
Single-Particle Orbitals 0.1722 0.1722 74484 0.000002312
The basic task of quantum Monte Carlo is to obtain the total energy of a quantum mechanical system composed of mobile electrons and immobile ions. The system on which miniQMC is based is nickel oxide (NiO), in which nickel and oxygen atoms are arranged in a 3D checkerboard pattern (crystal). A fixed number of immobile atoms (e.g. 16 nickel and 16 oxygen) are placed inside a box with periodic boundary conditions, along with the physically relevant electrons: 18 per nickel and 6 per oxygen. The remaining 10 and 2 electrons, for nickel and oxygen respectively, are represented abstractly by effective force fields called "pseudopotentials".
Each electron in the system is represented explicitly as a point charge in 3-dimensional space. If there are $N$ electrons, we can list the 3D positions as $R = (r_1, r_2, \ldots, r_N)$. The quantum mechanical wavefunction $\Psi(R)$ describes the probability of finding the electrons at any given set of positions:

$$P(R) = |\Psi(R)|^2$$

The QMC simulation process amounts to performing an integral over the electronic coordinates. The total energy is just the sum of the local energy $E_L(R) = H\Psi(R)/\Psi(R)$ of each particular set of electronic coordinates, weighted by its probability of occurrence:

$$E = \frac{\int |\Psi(R)|^2 \, E_L(R) \, dR}{\int |\Psi(R)|^2 \, dR}$$

In Monte Carlo, a sequence of electron positions $(R_1, R_2, \ldots, R_M)$ is sampled randomly with probability $|\Psi(R)|^2$, and the integral is approximated statistically by the sum

$$E \approx \frac{1}{M} \sum_{m=1}^{M} E_L(R_m)$$

The computationally costly parts of the simulation are the generation of the sample positions $R_m$ and the evaluation of the local energy $E_L(R_m)$.
The sample positions are generated from each other in sequence in the following way:

$$R_{m+1} = \begin{cases} R_m + g & \text{if } u < A(R_m + g, R_m) \\ R_m & \text{otherwise} \end{cases}$$

Here, $g$ is a list of $3N$ Gaussian-distributed random numbers, $u$ is a uniformly distributed random number, and $A$ is an acceptance ratio that depends on the value and the gradient of the wavefunction $\Psi$.
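A minimal C++ sketch of one such sampling step follows. The acceptance ratio A is passed in as a callable here; in a real code it is built from the wavefunction value and gradient at the old and proposed positions, and the names below are illustrative rather than the miniQMC API.

```cpp
#include <array>
#include <random>
#include <vector>

using Pos = std::array<double, 3>;

// One sampling step, following the update rule above: propose R' = R + g with
// g a list of 3N Gaussian random numbers, then accept the move if u < A(R', R).
template <typename AcceptanceRatio>
void sample_step(std::vector<Pos>& R, AcceptanceRatio A,
                 double sigma, std::mt19937& rng)
{
  std::normal_distribution<double> gauss(0.0, sigma);
  std::uniform_real_distribution<double> uniform(0.0, 1.0);

  // Propose R' = R + g.
  std::vector<Pos> R_new = R;
  for (Pos& r : R_new)
    for (double& x : r)
      x += gauss(rng);

  // Accept with probability A(R', R); otherwise keep the old positions.
  if (uniform(rng) < A(R_new, R))
    R = R_new;
}
```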
The most important real-space approach is the diffusion Monte Carlo (DMC) algorithm, where an ensemble of walkers (population) is propagated stochastically from generation to generation; each walker is represented by a set of 3D particle positions, R. In each propagation step, the walkers are moved through position space by a drift-diffusion process. At the end of each step, each walker may reproduce itself, be killed, or continue unchanged (branching process), depending on the value of E_L, the local energy, at that point in space. The walkers in an ensemble are tightly coupled through a trial energy E_T and a feedback mechanism to keep the average population at the target of 10^5-10^6 walkers. This leads to a fluctuating population and necessitates shuffling of walkers among processing units at an appropriate interval to maintain load balance. Most of the computational time is spent on evaluations of ratios (L7), derivatives (L8), energies (L13), and state updates when Monte Carlo moves are accepted (L10). Profiling the QMCPACK code shows that the dominant kernels are the evaluation of the particle-based Jastrow functions and 3D B-splines called from lines 7, 8, and 13 of the pseudocode below, and the inverse update from line 10. The latter kernel in fact uses most of the computational resources in both the CPU and GPU versions of the code. For DMC and the real-space approach, the selected mini-apps will naturally focus on the inverse matrix update, the splines, the Jastrow functions, and the distance tables.
At a high level, the pseudocode for each step of diffusion Monte Carlo (DMC) is:
1. for MC generation = 1...M do
2. for walker = 1...Nw do
3. let R = {r1...rN}
4. for particle i = 1...N do
5. set r'_i = r_i + del
6. let R' = {r_1...r'_i...r_N}
7. ratio p = Psi_T(R')/Psi_T(R) (Jastrow factors, 3D B-Spline)
8. derivatives \nabla_i Psi_T, \nabla^2_i Psi_T (Jastrow factors, 3D B-Spline)
9. if r --> r' is accepted then
10. update state of a walker (Inverse Update)
11. end if
12. end for {particle}
13. local energy E_L = H Psi_T(R)/Psi_T(R) (Jastrow factors, 3D B-Spline)
14. reweight and branch walkers
15. end for {walker}
16. update E_T and load balance
17. endfor {MCgeneration}
miniQMC currently implements only a single walker in order to focus on single-node performance exploration, and so discards the two outermost loops shown above. miniQMC mimics QMCPACK in the abstraction of the wavefunction into its components, which are the 1-, 2-, and 3-body Jastrow functions and the determinant matrix; each of these pieces derives from a common wavefunction base class and has its own implementation of functions such as ratio computation and gradient evaluation.
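A rough sketch of that abstraction is shown below. The class and function names are illustrative, not the actual miniQMC declarations.

```cpp
#include <array>
#include <vector>

using Grad = std::array<double, 3>;

// Common interface shared by the wavefunction components.
struct WavefunctionComponent {
  virtual ~WavefunctionComponent() = default;
  // Ratio Psi(R')/Psi(R) for a proposed move of particle iat.
  virtual double ratio(int iat) = 0;
  // Same ratio, also returning the gradient at the proposed position.
  virtual double ratioGrad(int iat, Grad& grad) = 0;
};

// Concrete components sketched only as placeholders.
struct OneBodyJastrow : WavefunctionComponent { /* ... */ };
struct TwoBodyJastrow : WavefunctionComponent { /* ... */ };
struct Determinant    : WavefunctionComponent { /* ... */ };

// The full wavefunction is a product of its components, so the total ratio is
// the product of the component ratios.
struct Wavefunction {
  std::vector<WavefunctionComponent*> components;
  double ratio(int iat)
  {
    double r = 1.0;
    for (WavefunctionComponent* c : components)
      r *= c->ratio(iat);
    return r;
  }
};
```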
As electrons are moved in the Monte Carlo, DMC uses the Sherman-Morrison algorithm to perform column-by-column updates of the Slater determinant (matrix) component of the wavefunctions. Updating the Slater matrix is the source of the cubic scaling of our calculations. This algorithm relies on BLAS2 functions. In current calculations it accounts for 30-50% of the computation time and is therefore a major optimization target. In addition to searching for improved performance-portable implementations, we will investigate a delayed update method that reduces this bottleneck.
There are various algorithms and implementations that compute the updated
determinant matrix generally based on the Sherman-Morrison formula. The one
currently used by the main path of QMCPACK is as given in Eqn (26) of Fahy et
al. 1990. An alternate variation is described in K. P. Esler, J. Kim, D. M.
Ceperley, and L. Shulenburger: Accelerating Quantum Monte Carlo Simulations of
Real Materials on GPU Clusters. Computing in Science & Engineering, Vol. 14, No.
1, Jan/Feb 2012, pp. 40-51 (section "Inverses and Updates"). The reference
version implemented in miniQMC in src/QMCWaveFunctions/DeterminantRef.h uses the Fahy variant.
Given the following notation:
- D: the incoming determinant matrix
- Dinv: the inverse of D
- Dtil: \tilde{D}, the transpose of Dinv
- p: the index of the row (particle) in Dinv to be updated
- b: the new vector to update Dinv

we can describe the task of the determinant update kernel as
- replacing the p'th row of D with the vector b
- inverting the matrix D to get Dinv
Roughly, the pseudocode steps for computing the inverse update using the Fahy formula variant are:
- ratio = Dtil[p,:] . b (vector dot product)
- ratio_inv = 1.0 / ratio (scalar arithmetic)
- vtemp = ratio_inv * (Dtil . b) (matrix-vector product, DGEMV)
- vtemp[p] = 1.0 - ratio_inv (scalar update)
- rcopy = Dtil[p,:] (copy of the row before it is modified)
- Dtil = Dtil - vtemp X rcopy (rank-1 update, DGER outer product)
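A plain-loop C++ sketch of these steps is shown below. It is illustrative only; the reference implementation uses BLAS calls such as DGEMV and DGER rather than explicit loops, and the variable names are not the miniQMC ones.

```cpp
#include <vector>

// Update the inverse-transpose matrix Dtil (n x n, row-major) after replacing
// row p of D with the new vector b, following the Fahy-variant steps above.
void fahy_update_row(std::vector<double>& Dtil, const std::vector<double>& b,
                     int n, int p)
{
  // ratio = Dtil[p,:] . b   (DDOT)
  double ratio = 0.0;
  for (int j = 0; j < n; ++j)
    ratio += Dtil[p * n + j] * b[j];
  const double ratio_inv = 1.0 / ratio;

  // vtemp = ratio_inv * (Dtil . b)   (DGEMV)
  std::vector<double> vtemp(n, 0.0);
  for (int i = 0; i < n; ++i) {
    double s = 0.0;
    for (int j = 0; j < n; ++j)
      s += Dtil[i * n + j] * b[j];
    vtemp[i] = ratio_inv * s;
  }
  vtemp[p] = 1.0 - ratio_inv;  // scalar update

  // rcopy = Dtil[p,:]   (copy of the row before it is modified)
  std::vector<double> rcopy(Dtil.begin() + p * n, Dtil.begin() + (p + 1) * n);

  // Dtil = Dtil - vtemp X rcopy   (DGER rank-1 update)
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      Dtil[i * n + j] -= vtemp[i] * rcopy[j];
}
```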
For the Sherman-Morrison formulation, the steps are:
- delta_b = b - D[p,:] (vector subtraction)
- delDinv = delta_b . Dinv (vector-matrix product, DGEMV)
- ratio_inv = -1.0 / (1.0 + delDinv[p]) (scalar arithmetic)
- Dinv = Dinv + ratio_inv * (Dinv[:,p] X delDinv) (rank-1 update; the scaling of the column Dinv[:,p], the outer product with delDinv, and the addition to Dinv are combined in a single DGER call)
Be aware that this algorithm returns the updated inverse, which is different from the inverse transpose that the Fahy variant operates on and returns.
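For comparison, here is a plain-loop sketch of the Sherman-Morrison steps above (again illustrative): it needs the row D[p,:] of the original matrix to form delta_b, and it updates Dinv itself rather than Dtil.

```cpp
#include <vector>

// Update Dinv (n x n, row-major) after replacing row p of D with the vector b,
// using the Sherman-Morrison steps above.
void sherman_morrison_update_row(std::vector<double>& Dinv,
                                 const std::vector<double>& D,
                                 const std::vector<double>& b,
                                 int n, int p)
{
  // delta_b = b - D[p,:]   (vector subtraction)
  std::vector<double> delta_b(n);
  for (int j = 0; j < n; ++j)
    delta_b[j] = b[j] - D[p * n + j];

  // delDinv = delta_b . Dinv   (row vector times matrix, DGEMV)
  std::vector<double> delDinv(n, 0.0);
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      delDinv[j] += delta_b[i] * Dinv[i * n + j];

  // ratio_inv = -1 / (1 + delDinv[p])   (scalar arithmetic)
  const double ratio_inv = -1.0 / (1.0 + delDinv[p]);

  // Dinv = Dinv + ratio_inv * (Dinv[:,p] X delDinv)   (DGER rank-1 update)
  for (int i = 0; i < n; ++i) {
    const double col_i = Dinv[i * n + p];  // Dinv[:,p], saved before row i changes
    for (int j = 0; j < n; ++j)
      Dinv[i * n + j] += ratio_inv * col_i * delDinv[j];
  }
}
```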
For every proposed electron move the value, gradient and laplacian of N orbitals are evaluated at a random particle position, where N is the number of electrons. Additional evaluations are performed for the energy. For this reason, the choice of the orbital representation is crucial for efficiency. This mini-app will focus on evaluating 3D tri-cubic B-splines which have been shown to be among the most efficient choices in condensed phases. Because many splines are evaluated simultaneously, data layout and memory hierarchy considerations are critical for high performance.
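As a rough illustration of this evaluation pattern (the interface shown is hypothetical, not the miniQMC spline API): each proposed move asks for the value, gradient, and Laplacian of all N orbitals at one position, which favors a structure-of-arrays layout.

```cpp
#include <array>
#include <vector>

using Pos = std::array<double, 3>;

// Hypothetical single-particle-orbital set: one call fills the value, gradient,
// and Laplacian of all N orbitals at a given position (a "VGL" evaluation).
struct SPOSet {
  int num_orbitals = 0;
  // Structure-of-arrays outputs: each quantity is contiguous, which helps
  // vectorization when many splines are evaluated simultaneously.
  std::vector<double> value;  // N values
  std::vector<Pos>    grad;   // N gradients
  std::vector<double> lapl;   // N Laplacians

  void evaluate_vgl(const Pos& /*r*/)
  {
    // A 3D tri-cubic B-spline evaluation would fill value, grad, and lapl here.
  }
};
```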
The large memory footprint of the spline data also motivates the use of alternative representations and may be an important area for collaboration with mathematical efforts. A more compact representation may fit in faster/higher parts of the memory hierarchy and therefore be preferred even if its evaluation cost is greater, due to overall higher performance. Physics-based alternative representations are being explored in current BES-funded work, e.g. optimizing the grid spacing based on the nature of the orbitals.
Jastrow factors are simple functions of inter-particle distances, such as the distances between electrons and between electrons and atoms. Although their purpose is very different, they are very similar to the terms that occur in classical molecular dynamics force fields and related particle methods. For exascale, the challenge is to evaluate these functions efficiently for the expected problem size and to expose additional parallelism. For example, the one, two, and three body terms in the Jastrow can be coscheduled, either by a tasking or stream programming approach, or by a runtime. The amount of work in each term varies with physical system and chosen parameterization, so a fixed decomposition will not be optimal.
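To make the structure concrete, here is a minimal sketch of a two-body Jastrow evaluation. The pair function u(r) is a placeholder, not one of the parameterized forms used in production codes.

```cpp
#include <array>
#include <cmath>
#include <vector>

using Pos = std::array<double, 3>;

// Placeholder pair function u(r); real Jastrow parameterizations differ.
inline double u(double r) { return std::exp(-r) / (1.0 + r); }

// Two-body Jastrow exponent: a sum of the pair function over all electron
// pairs. In a periodic cell the distances would come from a distance table
// with the minimum image convention applied.
double two_body_jastrow(const std::vector<Pos>& R)
{
  double sum = 0.0;
  for (std::size_t i = 0; i < R.size(); ++i)
    for (std::size_t j = i + 1; j < R.size(); ++j) {
      const double dx = R[i][0] - R[j][0];
      const double dy = R[i][1] - R[j][1];
      const double dz = R[i][2] - R[j][2];
      sum += u(std::sqrt(dx * dx + dy * dy + dz * dz));
    }
  return sum;
}
```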
Because DMC is a particle-based method, many of its operations require knowledge of the distance between electrons or between electrons and atoms. This is quadratically scaling work, and as the system size increases it becomes time consuming. Under periodic boundary conditions, minimum image conventions must be applied to every distance computation. In this mini-app, which potentially can be fused with the Jastrow factor mini-app, we will explore the techniques required to perform this task efficiently and portably: the nature of the algorithms leads to a strong sensitivity to data layout. Currently the threaded CPU implementation maintains explicit lists of particle distances, while the GPU implementation computes distances on the fly. As noted for the Jastrow factor evaluation, many techniques have been developed for classical molecular dynamics codes, and we will evaluate their algorithms where pertinent. However, we note that our particle counts are typically much smaller and the minimum image convention is more important.
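A minimal sketch of the minimum-image distance computation for an orthorhombic cell is shown below (general cells require the full lattice vectors; the box-length parameter is illustrative).

```cpp
#include <array>
#include <cmath>

using Pos = std::array<double, 3>;

// Minimum-image distance between two particles in an orthorhombic periodic box
// with side lengths L[0..2]: each Cartesian displacement component is wrapped
// to the nearest periodic image before the norm is taken.
double minimum_image_distance(const Pos& a, const Pos& b, const Pos& L)
{
  double r2 = 0.0;
  for (int d = 0; d < 3; ++d) {
    double dx = a[d] - b[d];
    dx -= L[d] * std::round(dx / L[d]);  // wrap into [-L/2, L/2]
    r2 += dx * dx;
  }
  return std::sqrt(r2);
}
```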