Merge pull request #1703 from devitocodes/openacc-tile-clause

gpu: Enable tile clause in place of collapse with OpenACC
devitocodes · Jul 13, 2021 · 78d3237
2 parents eb01a9e + ca3918f
commit 78d3237

Showing 16 changed files with 141 additions and 1,183 deletions.
diff --git a/benchmarks/user/README.md b/benchmarks/user/README.md
@@ -98,17 +98,12 @@ below.
 
 ## The optimization level
 
-In Devito, an Operator has two preset optimization levels: `noop` and
-`advanced`.  With `noop`, no performance optimizations are introduced by the
-compiler. With `advanced`, several flop-reducing and data locality
-optimizations are applied. Examples of flop-reducing optimizations are common
-sub-expressions elimination and factorization; examples of data locality
-optimizations are loop fusion and cache blocking. SIMD vectorization is also
-applied through compiler auto-vectorization.
-
-`benchmark.py` has two preset optimization modes, that for historical reasons
-are called `O1` and `O2`. Basically, `O1` corresponds to `noop`, while `O2`
-corresponds to `advanced`.
+`benchmark.py` lets you set the optimization mode, as well as several
+optimization options, via the `--opt` argument. Please refer to
+[this notebook](https://github.com/devitocodes/devito/blob/master/examples/performance/00_overview.ipynb)
+for a comprehensive list of all optimization modes and options available in
+Devito. You may also want to take a look at the example command lines a few
+sections below.
 
 ## Auto-tuning
 
@@ -195,6 +190,12 @@ grid:
 ```
 python benchmark.py run -P tti -d 512 402 890 -so 12 -a basic --tn 100
 ```
+Same as before, but telling Devito not to use temporaries to store the
+intermediate values which stem from mixed derivatives:
+```
+python benchmark.py run -P tti -d 512 402 890 -so 12 -a basic --tn 100 --opt \
+"('advanced', {'cire-mingain': 1000000})"
+```
 Do not forget to pin processes, especially on NUMA systems; below, we use
 `numactl` to pin processes and threads to one specific NUMA domain.
 ```
@@ -212,8 +213,9 @@ watch numastat -m
 
 ## The run-jit-backdoor mode
 
-As of Devito v3.5 it is possible to customize the code generated by Devito. This
-is often referred to as the ["JIT backdoor" mode](https://github.com/devitocodes/devito/wiki/FAQ#can-i-manually-modify-the-c-code-generated-by-devito-and-test-these-modifications).
+As of Devito v3.5 it is possible to customize the code generated by Devito.
+This is often referred to as the ["JIT backdoor"
+mode](https://github.com/devitocodes/devito/wiki/FAQ#can-i-manually-modify-the-c-code-generated-by-devito-and-test-these-modifications).
 With ``benchmark.py`` we can exploit this feature to manually hack and test the
 code generated for a given benchmark. So, we first run a problem, for example
 ```
@@ -248,57 +250,7 @@ experiments.
 ## Benchmark output
 
 The GFlops/s and GPoints/s performance, Operational Intensity (OI) and
-execution time are emitted to standard output at the end of each run.
-Further, when running in `bench` mode, a `.json` file is produced
-(see `python benchmark.py bench --help` for more info) in a folder named
-`results` except if otherwise specified with the `-r` option.
-
-## Generating a roofline model
-
-To generate a roofline model from the results obtained in `bench` mode,
-one can execute `benchmark.py` in `plot` mode. For example, the command
-
-```
-python benchmark.py plot -P acoustic -d 512 512 512 -so 12 --tn 100 -a aggressive --max-bw 12.8 --flop-ceil 80 linpack
-```
-
-will generate a roofline model for the results obtained from
-
-```
-python benchmark.py bench -P acoustic -d 512 512 512 -so 12 --tn 100 -a
-```
-
-The `plot` mode expects the same arguments used in `bench` mode plus
-two additional arguments to generate the roofline:
-
-*    `--max-bw <float>`: DRAM bandwidth (GB/s).
-*    `--flop-ceil <float, str>`: CPU machine peak. The CPU performance ceil
-        (GFlops/s) and how the ceil was obtained (ideal peak, linpack, ...).
-
-There also are two optional arguments:
-
-*   `--point-runtime` (bool switch): Annotate points with the runtime value.
-*   `--section <str>`:  The code section for which the roofline is produced.
-        An Operator consists of multiple sections. Each section typically
-        comprises a loop nest and a sequence of equations. Different sections
-        are created for logically-distinct parts of the computation
-        (finite-difference stencils, boundary conditions, interpolation, etc.).
-        The naming convention is `sectionX`, where `X` is a progress id (`section0`,
-        `section1`, ...). In the generated code the beginning and the end
-        of a section are marked with suitable comments. Currently, there is
-        no way other than looking at the generated code to understand which
-        section the user-provided equations belong to.
-
-To obtain the DRAM bandwidth of a system, we advise to use
- [STREAM](http://www.cs.virginia.edu/stream/ref.html).
-
-To obtain the ideal CPU peak, one should instantiate this formula
-
-#[cores] · #[avx units] · #[vector lanes] · #[FMA ports] · [ISA base frequency]
-
-More details in this [paper](https://arxiv.org/pdf/1807.03032.pdf).
-
-## Do not hesitate to contact us
-
-Should you encounter any issues, do not hesitate to
-[get in touch with the development team](https://join.slack.com/t/devitocodes/shared_invite/zt-gtd2yxj9-Y31YKk_7lr9AwfXeL2iMFg)
+execution time are emitted to standard output at the end of each run.  You may
+find this
+[FAQ](https://github.com/devitocodes/devito/wiki/FAQ#how-does-devito-compute-the-performance-of-an-operator)
+useful.
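
Note on the `--opt` example in the README hunk above: the value is passed as a quoted Python literal, a tuple of a mode name and an options mapping. The sketch below is only an illustration of how such a string can be checked before use. It assumes the string is parsed with `ast.literal_eval`, and `parse_opt` is a hypothetical helper, not a function from `benchmark.py`, which may parse the argument differently. It shows that where the quotes sit decides whether the second element is a dict or a one-element set.

```python
import ast


def parse_opt(value):
    """Parse an --opt style string into (mode, options).

    Hypothetical helper for illustration; not part of benchmark.py.
    """
    parsed = ast.literal_eval(value)
    # A bare mode name such as "'advanced'" carries no options.
    if isinstance(parsed, str):
        return parsed, {}
    mode, options = parsed
    # {'key: value'} is a set holding one string, not a dict.
    if not isinstance(options, dict):
        raise TypeError("options must be a dict, got %s" % type(options).__name__)
    return mode, options


# Key and value quoted separately: the options parse as a dict.
mode, options = parse_opt("('advanced', {'cire-mingain': 1000000})")
```

With the closing quote placed after the value instead, as in `{'cire-mingain: 1000000'}`, the same helper raises `TypeError` because the literal is a set.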