diff --git a/README.md b/README.md index 58b5f3e406..02d481b4f5 100644 --- a/README.md +++ b/README.md @@ -77,7 +77,7 @@ Key features include: ## Installation The easiest way to try Devito is through Docker using the following commands: -``` +```bash # get the code git clone https://github.com/devitocodes/devito.git cd devito diff --git a/benchmarks/user/README.md b/benchmarks/user/README.md index 5c21698a84..21fd2c6275 100644 --- a/benchmarks/user/README.md +++ b/benchmarks/user/README.md @@ -77,24 +77,28 @@ DEVITO_LANGUAGE=openmp ``` One has two options: either set it explicitly or prepend it to the Python command. In the former case, assuming a bash shell: -``` +```bash export DEVITO_LANGUAGE=openmp ``` In the latter case: -``` +```bash DEVITO_LANGUAGE=openmp python benchmark.py ... ``` ## Enabling MPI To switch on MPI, one should set -``` +```bash DEVITO_MPI=1 ``` and run with `mpirun -n number_of_processes python benchmark.py ...` -Devito supports multiple MPI schemes for halo exchange. See the `Tips` section -below. +Devito supports multiple MPI schemes for halo exchange. + +* Devito's three most prevalent MPI modes are `basic`, `diag2` and `full`, +each activated by name, e.g., via `DEVITO_MPI=basic`. +Which mode performs best depends on factors such as arithmetic intensity +or the number of fields used in the computation. ## The optimization level @@ -109,7 +113,7 @@ lines a few sections below. Auto-tuning can significantly improve the run-time performance of an Operator. It can be enabled on an Operator basis: -``` +```python op = Operator(...) op.apply(autotune=True) ``` @@ -162,52 +166,41 @@ Run with `DEVITO_LOGGING=DEBUG` to find out the specific performance optimizations applied by an Operator, how auto-tuning is getting along, and to emit more performance metrics. -## Tips - -* The most powerful MPI mode is called "full", and is activated setting - `DEVITO_MPI=full` instead of `DEVITO_MPI=1`. 
The "full" mode supports - computation/communication overlap. -* When auto-tuning is enabled, one should always run in performance mode: - ``` - from devito import mode_performance - mode_perfomance() - ``` - This is automatically turned on by `benchmark.py` ## Example commands The isotropic acoustic wave forward Operator in a `512**3` grid, space order 12, and a simulation time of 100ms: -``` +```bash python benchmark.py run -P acoustic -d 512 512 512 -so 12 --tn 100 ``` Like before, but with auto-tuning in `basic` mode: -``` +```bash python benchmark.py run -P acoustic -d 512 512 512 -so 12 -a basic --tn 100 ``` It is also possible to run a TTI forward operator -- here in a 512x402x890 grid: -``` +```bash python benchmark.py run -P tti -d 512 402 890 -so 12 -a basic --tn 100 ``` Same as before, but telling devito not to use temporaries to store the intermediate values which stem from mixed derivatives: -``` +```bash python benchmark.py run -P tti -d 512 402 890 -so 12 -a basic --tn 100 --opt "('advanced', {'cire-mingain: 1000000'})" ``` Do not forget to pin processes, especially on NUMA systems; below, we use `numactl` to pin processes and threads to one specific NUMA domain. -``` +```bash numactl --cpubind=0 --membind=0 python benchmark.py ... ``` While a benchmark is running, you can have some useful programs running in background in other shells. For example, to monitor pinning: -``` +```bash htop ``` or to keep the memory footprint under control: -``` +```bash watch numastat -m ``` @@ -218,7 +211,7 @@ This is often referred to as the ["JIT backdoor" mode](https://github.com/devitocodes/devito/wiki/FAQ#can-i-manually-modify-the-c-code-generated-by-devito-and-test-these-modifications). With ``benchmark.py`` we can exploit this feature to manually hack and test the code generated for a given benchmark. 
So, we first run a problem, for example -``` +```bash python benchmark.py run-jit-backdoor -P acoustic -d 512 512 512 -so 12 --tn 100 ``` As you may expect, the ``run-jit-backdoor`` mode accepts exactly the same arguments @@ -235,7 +228,7 @@ you will see the performance impact of your changes. ## Running on HPC clusters `benchmark.py` can be used to evaluate MPI on multi-node systems: -``` +```bash mpiexec python benchmark.py ... ``` In `bench` mode, each MPI rank will produce a different `.json` file diff --git a/examples/mpi/overview.ipynb b/examples/mpi/overview.ipynb index 256a0e09ec..b2da4ba6e8 100644 --- a/examples/mpi/overview.ipynb +++ b/examples/mpi/overview.ipynb @@ -11,7 +11,7 @@ "* Install an MPI distribution on your system, such as OpenMPI, MPICH, or Intel MPI (if not already available).\n", "* Install some optional dependencies, including `mpi4py` and `ipyparallel`; from the root Devito directory, run\n", "```bash\n", - "pip install -r requirements-optional.txt\n", + "pip install -r requirements-mpi.txt\n", "```\n", "* Create an `ipyparallel` MPI profile, by running our simple setup script. From the root directory, run\n", "```bash\n", @@ -119,7 +119,8 @@ "%%px\n", "# Keep generated code as simple as possible\n", "configuration['language'] = 'C'\n", - "# Fix platform so that this notebook can be tested by py.test --nbval\n", + "# Fix platform so that this notebook's output can be asserted\n", + "# when tested by `py.test --nbval` on any platform\n", "configuration['platform'] = 'knl7210'" ] }, @@ -831,10 +832,32 @@ "The Devito compiler applies several optimizations before generating code.\n", "\n", "* Redundant halo exchanges are identified and removed. 
A halo exchange is redundant if a prior halo exchange carries out the same `Function` update and the data is not “dirty” yet.\n", - "* Computation/communication overlap, with explicit prodding of the asynchronous progress engine to make sure that non-blocking communications execute in background during the compute part.\n", + "* Halo exchanges that can be \"fired\" together are grouped rather than scattered throughout the code.\n", "* Halo exchanges could also be reshuffled to maximize the extension of the computation/communication overlap region.\n", "\n", - "To run with all these optimizations enabled, instead of `DEVITO_MPI=1`, users should set `DEVITO_MPI=full`, or, equivalently" + "## Computation/communication patterns\n", + "\n", + "\n", + "\n", + "Additionally, the Devito compiler offers several computation/communication modes, each performing best under specific kernel conditions, such as operational intensity, memory footprint, the number of ranks used, and the characteristics of the cluster’s interconnect. The most effective patterns are `basic`, `diag2`, and `full`. These have proven effective in improving the efficiency and scalability of computations in several scenarios.\n", + "\n", + "- `basic`: The `basic` pattern is the simplest of the methods presented in this section and targets both CPUs and GPUs. This mode, illustrated in Figure 5a, involves blocking point-to-point (P2P) data exchanges perpendicular to the 2D and 3D planes of the Cartesian topology between MPI ranks. For\n", + "example, each rank issues 4 communications in 2D and 6 in 3D. While this mode benefits from fewer communications, it may encounter synchronization bottlenecks during grid updates before computing the next timestep. 
This method allocates the memory needed to exchange halos in C-land before every communication, adding only negligible overhead.\n", + "\n", + "- `diag2`: Compared to `basic`, this pattern also performs diagonal data exchanges, so the corner points of our domains are communicated in a single step. This results in more communications: 8 in 2D and 26 in 3D. However, they are issued\n", + "in a single step, and the messages are smaller than in `basic`. This mode also benefits slightly from buffers preallocated in Python-land, eliminating the need to allocate data in C-land before every communication. This is also why this mode is not supported on GPUs: the\n", + "mechanism for pre-allocating buffers in device memory is not yet supported.\n", + "\n", + "- `full`: This pattern leverages communication/computation overlap. The local-per-rank domain is logically decomposed into an inner (CORE) and an outer (OWNED/remainder) area. In a 3D example, the remainder areas take the form of faces and vector-like areas along the decomposed dimensions. The number of communications is the same as in the `diag2` mode. This mode benefits from overlapping\n", + "two steps: halo updating and the stencil computations in the CORE area. After this step, stencil updates are computed in the \"remainder\" areas. In the ideal case, assuming that communication is perfectly hidden, the execution time should converge to the time needed to compute the CORE plus the time needed to compute the remainder. An important drawback of this mode is the lower GPts/s achieved in the remainder areas. The elements in the remainder are not contiguous; therefore,\n", + "we have less efficient memory access patterns (strides) along vectorizable dimensions. These areas have lower cache utilization and vectorization efficiency."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's see the `diag2` mode:" ] }, { @@ -844,14 +867,31 @@ "outputs": [], "source": [ "%%px\n", - "configuration['mpi'] = 'full'" + "configuration['mpi'] = 'diag2'\n", + "\n", + "op = Operator(Eq(u.forward, u.dx + 1))\n", + "# Uncomment below to show code (it's quite verbose)\n", + "# print(op)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We could now peek at the generated code to see that things now look differently." + "The body of the time-stepping loop has changed slightly compared to `basic`.\n", + "\n", + "Some differences are:\n", + "\n", + "* The communication buffers `bufg` and `bufs` are not allocated in C-land, as this already happens in Python-land.\n", + "* We now fire `ncomms` communications, which are not only vertical or horizontal but also diagonal.\n", + "This leads to more messages, each slightly smaller than in `basic`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now peek at the code generated in `full` mode and see that things look different." ] }, { @@ -863,6 +903,8 @@ "outputs": [], "source": [ "%%px\n", + "configuration['mpi'] = 'full'\n", + "\n", "op = Operator(Eq(u.forward, u.dx + 1))\n", "# Uncomment below to show code (it's quite verbose)\n", "# print(op)" @@ -879,6 +921,15 @@ "* `halowait0` wait and terminates the non-blocking communications;\n", "* `remainder0`, which internally calls `compute0`, computes the boundary region requiring the now up-to-date halo data." ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "More information on Devito's MPI support can be found in this preprint:\n", + "[Automated MPI-X code generation for scalable finite-difference solvers](https://arxiv.org/abs/2312.13094)" + ] } ], "metadata": {