Graham Lopez edited this page Jan 3, 2018 · 28 revisions

Welcome to the miniQMC wiki!

How to build and run miniQMC

Prerequisites

  • CMake
  • C/C++ compilers with C++11 support
  • BLAS/LAPACK numerical libraries; use platform-optimized implementations where available

Build miniQMC

cd build
cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx ..
make -j 8

CMake provides a number of optional variables that can be set to control the build and configure steps. When passed to CMake, these variables take precedence over environment and default variables. To set them, add -D FLAG=VALUE to the configure line between the cmake command and the path to the source directory.

  • General build options
CMAKE_C_COMPILER    Set the C compiler
CMAKE_CXX_COMPILER  Set the C++ compiler
CMAKE_BUILD_TYPE    A variable which controls the type of build (defaults to Release).
                    Possible values are:
                    None (Do not set debug/optimize flags, use CMAKE_C_FLAGS or CMAKE_CXX_FLAGS)
                    Debug (create a debug build)
                    Release (create a release/optimized build)
                    RelWithDebInfo (create a release/optimized build with debug info)
                    MinSizeRel (create an executable optimized for size)
CMAKE_C_FLAGS       Set the C flags.  Note: to prevent default debug/release flags
                    from being used, set the CMAKE_BUILD_TYPE=None
                    Also supported: CMAKE_C_FLAGS_DEBUG, CMAKE_C_FLAGS_RELEASE,
                                    CMAKE_C_FLAGS_RELWITHDEBINFO
CMAKE_CXX_FLAGS     Set the C++ flags.  Note: to prevent default debug/release flags
                    from being used, set the CMAKE_BUILD_TYPE=None
                    Also supported: CMAKE_CXX_FLAGS_DEBUG, CMAKE_CXX_FLAGS_RELEASE,
                                    CMAKE_CXX_FLAGS_RELWITHDEBINFO
  • Key QMC build options
QMC_MPI             Build with MPI (default 1:yes, 0:no)
QMC_COMPLEX         Build the complex (general twist/k-point) version (1:yes, default 0:no)
QMC_MIXED_PRECISION Build the mixed precision (mixing double/float) version (1:yes, default 0:no).
                    Float is used as the base precision and double as the full precision.

Check builds

Executables are created in the ./bin folder. There are a few of them:

check_determinant    # checks the determinant part of the wavefunction
check_wfc            # checks Jastrow components of the wave function
check_spo            # checks single particle orbitals including 3D-cubic splines.
miniqmc              # runs a fake DMC and reports the time spent in each component.

It is recommended to run all the check_XXX executables. Each reports either "All checking pass!" or a failure message indicating where the failure occurred.

Run executables

All the executables can run without any arguments or input files, using default settings. If more control is needed, pass the -h option to print the available options.

Benchmark

This is an example of how miniqmc reports timings. The default setting is a fake DMC run mimicking the NiO 32-atom cell simulation.

================================== 
Stack timer profile
Timer                           Inclusive_time  Exclusive_time  Calls   Time_per_call
Total                              1.6640     0.0131              1       1.663990021
  Diffusion                        1.0446     0.0158            100       0.010445936
    Current Gradient               0.0054     0.0036          38400       0.000000140
      Jastrow                      0.0017     0.0017          38400       0.000000045
    Distance Tables                0.3296     0.3296          95899       0.000003437
    New Gradient                   0.6151     0.0069          38395       0.000016021
      Jastrow                      0.0973     0.0973          38395       0.000002534
      Single-Particle Orbitals     0.5110     0.5110          38395       0.000013308
    Update                         0.0785     0.0785          18999       0.000004133
    Wavefuntion GL                 0.0002     0.0002            100       0.000002389
  Pseudopotential                  0.6063     0.0098            100       0.006063325
    Distance Tables                0.2907     0.2907          74484       0.000003902
    Value                          0.3059     0.0116          74484       0.000004107
      Jastrow                      0.1221     0.1221          74484       0.000001640
      Single-Particle Orbitals     0.1722     0.1722          74484       0.000002312

Basic quantum Monte Carlo and miniQMC

The basic task of quantum Monte Carlo is to obtain the total energy of a quantum mechanical system composed of mobile electrons and immobile ions. The system on which miniQMC is based is nickel oxide (NiO), in which nickel and oxygen atoms are arranged in a 3D checkerboard pattern (crystal). A fixed number of immobile atoms (e.g. 16 nickel and 16 oxygen) is set inside a box with periodic boundary conditions, along with the physically relevant electrons--18 per nickel and 6 per oxygen. The remaining 10 and 2 electrons, for nickel and oxygen respectively, are represented abstractly by effective force fields called "pseudopotentials".

Each electron in the system is represented explicitly as a point charge in 3-dimensional space. If there are N electrons, we can list the 3D positions as R = (r_1, r_2, ..., r_N). The quantum mechanical wavefunction \Psi(R) describes the probability P(R) of finding the electrons at any given set of positions:

P(R) = |\Psi(R)|^2 / \int |\Psi(R')|^2 dR'

The QMC simulation process amounts to performing an integral over the electronic coordinates. The total energy is just the sum of the local energy of each particular set of electronic coordinates weighted by its probability of occurrence:

E = \int E_L(R) P(R) dR,   where   E_L(R) = H\Psi(R) / \Psi(R)

In Monte Carlo, a sequence of electron positions (R_1, R_2, ..., R_M) is sampled randomly with probability P(R) and the integral is approximated statistically by the sum

E \approx (1/M) \sum_{m=1}^{M} E_L(R_m)

The computationally costly parts of the simulation are the generation of the sample positions R_m and the evaluation of the local energy E_L(R_m).

The sample positions are generated from each other in sequence in the following way:

R' = R_m + \chi

R_{m+1} = R' if \xi < A(R_m -> R'), otherwise R_{m+1} = R_m

Here, \chi is a list of 3N Gaussian distributed random numbers, \xi is a uniformly distributed random number in [0, 1), and A(R_m -> R') is an acceptance ratio that depends on the value and the gradient of the wavefunction \Psi.
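The sampling procedure above can be sketched for a toy 1D, one-electron problem (a minimal sketch, not miniQMC code; psi, local_energy, and metropolis are illustrative names). For the harmonic-oscillator ground state the local energy is constant, so the Monte Carlo average is exact:

```python
import math
import random

def psi(x):
    # toy 1D trial wavefunction: harmonic-oscillator ground state
    return math.exp(-0.5 * x * x)

def local_energy(x):
    # E_L = (H psi)/psi with H = -1/2 d^2/dx^2 + 1/2 x^2; identically 0.5 here
    return -0.5 * (x * x - 1.0) + 0.5 * x * x

def metropolis(steps, sigma=1.0, seed=7):
    rng = random.Random(seed)
    x = 0.0
    total = 0.0
    for _ in range(steps):
        x_new = x + rng.gauss(0.0, sigma)      # propose R' = R + chi (Gaussian move)
        accept = (psi(x_new) / psi(x)) ** 2    # Metropolis acceptance ratio |psi'/psi|^2
        if rng.random() < accept:              # accept with probability A
            x = x_new
        total += local_energy(x)               # accumulate E_L at each sample
    return total / steps
```

This drift-free Metropolis sketch uses only the wavefunction value in the acceptance; the drift-diffusion moves used in DMC also involve the gradient.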

miniQMC computational overview

The most important real-space approach is the diffusion Monte Carlo algorithm (DMC), where an ensemble of walkers (population) is propagated stochastically from generation to generation; each walker is represented by a set of 3D particle positions, R. In each propagation step, the walkers are moved through position space by a drift-diffusion process. At the end of each step, each walker may reproduce itself, be killed, or continue unchanged (branching process), depending on the value of E_L, the local energy, at that point in space. The walkers in an ensemble are tightly coupled through a trial energy E_T and a feedback mechanism to keep the average population at the target 10^5-10^6 walkers. This leads to a fluctuating population and necessitates shuffling of walkers among processing units at appropriate intervals to maintain load balance.

Most of the computational time is spent on evaluations of ratios (line 7 of the pseudocode below), derivatives (line 8), energies (line 13), and state updates when Monte Carlo moves are accepted (line 10). Profiling the QMCPACK code shows the dominant kernels are the evaluation of the particle-based Jastrow functions and 3D B-splines called from lines 7, 8, and 13, and the inverse update from line 10. The latter kernel in fact uses most of the computational resources in both the CPU and GPU versions of the code. For DMC and the real-space approach, the selected mini-apps therefore focus on the inverse matrix update, the splines, the Jastrow functions, and the distance tables.

The pseudocode for each step of diffusion Monte Carlo (DMC) at a high level is:

1.  for MC generation = 1...M do
2.    for walker = 1...Nw do
3.      let R = {r1...rN}
4.      for particle i = 1...N do
5.        set r'_i = r_i + del
6.        let R' = {r_1...r'_i...r_N}
7.        ratio p = Psi_T(R')/ Psi_T(R) (Jastrow factors, 3D B-Spline)
8.      derivatives \nabla_i Psi_T, \nabla^2_i Psi_T (Jastrow factors, 3D B-Spline)
9.        if r --> r' is accepted then
10.         update state of a walker (Inverse Update)
11.       end if
12.     end for {particle}
13.     local energy E_L = H Psi_T(R)/Psi_T(R) (Jastrow factors, 3D B-Spline)
14.     reweight and branch walkers
15.   end for {walker}
16.   update E_T and load balance
17. endfor {MCgeneration}

miniQMC currently implements only a single walker to focus on single-node performance exploration, and so discards the two outermost loops shown above. miniQMC mimics QMCPACK in abstracting the wavefunction into its components, which are the 1-, 2-, and 3-body Jastrow functions and the determinant matrix; each of these pieces derives from a common Wavefunction base class and has its own implementation of functions such as ratio computation and gradient evaluation.
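The component abstraction can be illustrated with a minimal sketch (class and method names here are illustrative, not the actual miniQMC interfaces): each component computes its own ratio for a proposed move, and because Psi_T is a product of components, the total ratio is the product of the component ratios.

```python
import math

class Component:
    """Illustrative stand-in for a wavefunction component (Jastrow or determinant)."""
    def ratio(self, r_old, r_new):
        raise NotImplementedError

class OneBodyJastrow(Component):
    # toy exponential Jastrow exp(-a * r); real Jastrows use inter-particle distances
    def __init__(self, a):
        self.a = a
    def ratio(self, r_old, r_new):
        return math.exp(-self.a * (r_new - r_old))

class Determinant(Component):
    # stand-in returning a fixed ratio; the real class performs an inverse update
    def __init__(self, fixed_ratio):
        self.fixed_ratio = fixed_ratio
    def ratio(self, r_old, r_new):
        return self.fixed_ratio

def total_ratio(components, r_old, r_new):
    # Psi_T is a product of components, so the move ratio is a product of ratios
    result = 1.0
    for c in components:
        result *= c.ratio(r_old, r_new)
    return result
```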

Important kernels

Determinant update (inverse matrix update)

As electrons are moved in the Monte Carlo process, DMC uses the Sherman-Morrison algorithm to perform column-by-column updates of the Slater determinant (matrix) component of the wavefunctions. Updating the Slater matrix is the source of the cubic scaling of our calculations. This algorithm relies on BLAS2 functions. In current calculations it accounts for 30-50% of the computation time and is therefore a major optimization target. In addition to searching for improved performance-portable implementations, we will investigate using a delayed update method that reduces this bottleneck.

There are various algorithms and implementations that compute the updated determinant matrix generally based on the Sherman-Morrison formula. The one currently used by the main path of QMCPACK is as given in Eqn (26) of Fahy et al. 1990. An alternate variation is described in K. P. Esler, J. Kim, D. M. Ceperley, and L. Shulenburger: Accelerating Quantum Monte Carlo Simulations of Real Materials on GPU Clusters. Computing in Science & Engineering, Vol. 14, No. 1, Jan/Feb 2012, pp. 40-51 (section "Inverses and Updates"). The reference version implemented in miniQMC uses the Fahy variant.

Given the following notation:

  • D: the incoming determinant matrix
  • Dinv: the inverse of 'D'
  • Dtil: \tilde{D}, the transpose of 'Dinv'
  • p: the index of the row (particle) in Dinv to be updated
  • b: the new vector to update Dinv

we can describe the task of the determinant update kernel as

  1. replacing the p'th row of D with the vector b
  2. inverting the matrix D to get Dinv

Roughly, the pseudocode steps for computing the inverse update using the Fahy formula variant are:

  1. ratio = Dtil[p,:].b (vector dot product)
  2. ratio_inv = 1.0 / ratio (scalar arithmetic)
  3. vtemp = ratio_inv * (Dtil.b) (DGEMV)
  4. vtemp[p] = 1.0 - ratio_inv (scalar update)
  5. rcopy = Dtil[p,:] (copy of the row about to change)
  6. Dtil = Dtil - vtemp X rcopy (DGER outer product update)
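The Fahy-style row update can be sketched in pure Python (an illustrative sketch, not the miniQMC implementation; it assumes Dtil is the transpose of the current inverse and that row p of D is replaced by the vector b):

```python
def fahy_row_update(Dtil, b, p):
    """Rank-1 update of Dtil = transpose(inverse(D)) after replacing row p of D
    with b; returns the ratio det(D')/det(D)."""
    n = len(Dtil)
    ratio = sum(Dtil[p][j] * b[j] for j in range(n))          # 1. dot product
    ratio_inv = 1.0 / ratio                                   # 2. scalar arithmetic
    vtemp = [ratio_inv * sum(Dtil[i][j] * b[j] for j in range(n))
             for i in range(n)]                               # 3. GEMV
    vtemp[p] = 1.0 - ratio_inv                                # 4. scalar update
    rcopy = list(Dtil[p])                                     # 5. save the old row
    for i in range(n):                                        # 6. GER rank-1 update
        for j in range(n):
            Dtil[i][j] -= vtemp[i] * rcopy[j]
    return ratio

# 2x2 check: D = [[2,0],[0,4]] has Dtil = [[0.5,0],[0,0.25]]; replacing row 0
# of D with [1,3] gives D' = [[1,3],[0,4]], whose inverse transpose is
# [[1,0],[-0.75,0.25]].
Dtil = [[0.5, 0.0], [0.0, 0.25]]
ratio = fahy_row_update(Dtil, [1.0, 3.0], 0)
```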

For Sherman-Morrison, the steps go as:

  1. delta_b = b - D[p,:] (vector subtraction)
  2. delDinv = delta_b * Dinv (DGEMV)
  3. ratio_inv = -1.0 / (1.0 + delDinv[p]) (scalar arithmetic)
  4. ccopy = Dinv[:,p] (vector copy of column p)
  5. Dinv = Dinv + ratio_inv * (ccopy X delDinv) (DGER outer product update)

Be aware that this algorithm returns the updated inverse, which is different from the inverse transpose that the Fahy variant operates on and returns.
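A matching pure-Python sketch of the Sherman-Morrison variant (illustrative names, not the miniQMC implementation; unlike the Fahy variant, this one updates Dinv itself rather than its transpose):

```python
def sherman_morrison_row_update(D, Dinv, b, p):
    """Update Dinv in place after replacing row p of D with the vector b."""
    n = len(D)
    delta_b = [b[j] - D[p][j] for j in range(n)]                 # vector subtraction
    delDinv = [sum(delta_b[i] * Dinv[i][j] for i in range(n))
               for j in range(n)]                                # GEMV (row vector)
    ratio_inv = -1.0 / (1.0 + delDinv[p])                        # scalar arithmetic
    ccopy = [Dinv[i][p] for i in range(n)]                       # copy of column p
    for i in range(n):                                           # GER rank-1 update
        for j in range(n):
            Dinv[i][j] += ratio_inv * ccopy[i] * delDinv[j]

# 2x2 check with the same matrices as above: after replacing row 0 of
# D = [[2,0],[0,4]] with [1,3], Dinv becomes the inverse of [[1,3],[0,4]],
# i.e. [[1,-0.75],[0,0.25]] (note: the inverse, not its transpose).
D = [[2.0, 0.0], [0.0, 4.0]]
Dinv = [[0.5, 0.0], [0.0, 0.25]]
sherman_morrison_row_update(D, Dinv, [1.0, 3.0], 0)
```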

Splines

For every proposed electron move the value, gradient and laplacian of N orbitals are evaluated at a random particle position, where N is the number of electrons. Additional evaluations are performed for the energy. For this reason, the choice of the orbital representation is crucial for efficiency. This mini-app will focus on evaluating 3D tri-cubic B-splines which have been shown to be among the most efficient choices in condensed phases. Because many splines are evaluated simultaneously, data layout and memory hierarchy considerations are critical for high performance.
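As a rough illustration of the evaluation pattern (a 1D sketch, not the miniQMC spline kernels), a uniform cubic B-spline blends four neighboring coefficients at each evaluation point; the tri-cubic 3D case does the same along each axis, touching 4x4x4 = 64 coefficients per evaluation, which is why data layout dominates performance:

```python
def bspline_basis(t):
    # the four uniform cubic B-spline basis functions at fractional position t in [0, 1)
    it = 1.0 - t
    b0 = it * it * it / 6.0
    b1 = (3.0 * t * t * t - 6.0 * t * t + 4.0) / 6.0
    b2 = (-3.0 * t * t * t + 3.0 * t * t + 3.0 * t + 1.0) / 6.0
    b3 = t * t * t / 6.0
    return (b0, b1, b2, b3)

def eval_spline_1d(coefs, x, start, delta):
    # locate the knot interval, then blend the four surrounding coefficients
    u = (x - start) / delta
    i = int(u)
    t = u - i
    b = bspline_basis(t)
    return sum(b[k] * coefs[i + k] for k in range(4))
```

The basis functions sum to one at every t (partition of unity), so a spline with constant coefficients reproduces that constant exactly.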

The large memory footprint of the spline data also motivates use of alternative representations and may be an important area for collaboration with mathematical efforts. A more compact representation may fit in faster/higher levels of the memory hierarchy, and may therefore be preferred even if the evaluation cost is greater, due to overall higher performance. Physics-based alternative representations are being explored in current BES-funded work, e.g. optimizing the grid spacing based on the nature of the orbitals.

Jastrow factors (1-,2-, and 3-body)

Jastrow factors are simple functions of inter-particle distances, such as the distances between electrons and between electrons and atoms. Although their purpose is very different, they are very similar to the terms that occur in classical molecular dynamics force fields and related particle methods. For exascale, the challenge is to evaluate these functions efficiently for the expected problem size and to expose additional parallelism. For example, the one, two, and three body terms in the Jastrow can be coscheduled, either by a tasking or stream programming approach, or by a runtime. The amount of work in each term varies with physical system and chosen parameterization, so a fixed decomposition will not be optimal.
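For concreteness, a minimal sketch of a two-body term (the Pade form and parameter values here are illustrative, not miniQMC's actual parameterization): the log of the Jastrow factor is a sum of a simple function of each inter-particle distance.

```python
import math

def u_pade(r, a, b):
    # a simple Pade form u(r) = a*r / (1 + b*r) for a pair correlation term
    return a * r / (1.0 + b * r)

def two_body_log_jastrow(positions, a, b):
    # J = exp(-sum over unique pairs of u(r_ij)); return log(J)
    n = len(positions)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            total += u_pade(r, a, b)
    return -total
```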

Distance tables

As a particle-based method, many operations in DMC require knowledge of the distance between electrons or between electrons and atoms. This is quadratically scaling work, and as the system size increases this becomes time consuming. Under periodic boundary conditions, minimum image conventions must be applied to every distance computation. In this mini-app, which potentially can be fused with the Jastrow factor mini-app, we will explore the required technique to perform this task efficiently and portably: the nature of the algorithms leads to a strong sensitivity to data layout. Currently the threaded CPU implementation maintains explicit lists of particle distances, while the GPU implementation computes distances on the fly. As noted for the Jastrow factor evaluation, many techniques have been developed for classical molecular dynamics codes, and we will evaluate their algorithms where pertinent. However, we note that our particle counts are typically much smaller and minimum image convention is more important.
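For a cubic box of side L, the minimum image convention folds each displacement component into [-L/2, L/2]; a minimal sketch (general simulation cells require a full lattice reduction, which this does not attempt):

```python
def min_image_distance(r1, r2, box):
    # distance between two particles in a cubic periodic box of side `box`
    d2 = 0.0
    for a, b in zip(r1, r2):
        d = b - a
        d -= box * round(d / box)   # fold the component into the nearest image
        d2 += d * d
    return d2 ** 0.5
```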
