Merge pull request #1703 from devitocodes/openacc-tile-clause
gpu: Enable tile clause in place of collapse with OpenACC
FabioLuporini authored Jul 13, 2021
2 parents eb01a9e + ca3918f commit 78d3237
Showing 16 changed files with 141 additions and 1,183 deletions.
86 changes: 19 additions & 67 deletions benchmarks/user/README.md
@@ -98,17 +98,12 @@ below.

## The optimization level

In Devito, an Operator has two preset optimization levels: `noop` and
`advanced`. With `noop`, no performance optimizations are introduced by the
compiler. With `advanced`, several flop-reducing and data locality
optimizations are applied. Examples of flop-reducing optimizations are common
sub-expression elimination and factorization; examples of data locality
optimizations are loop fusion and cache blocking. SIMD vectorization is also
applied through compiler auto-vectorization.

`benchmark.py` has two preset optimization modes which, for historical reasons,
are called `O1` and `O2`. `O1` corresponds to `noop`, while `O2` corresponds to
`advanced`.
`benchmark.py` lets you set the optimization mode, as well as several optimization
options, via the `--opt` argument. Please refer to
[this](https://github.com/devitocodes/devito/blob/master/examples/performance/00_overview.ipynb)
notebook for a comprehensive list of all optimization modes and options
available in Devito. You may also want to take a look at the example command
lines a few sections below.

## Auto-tuning

@@ -195,6 +190,12 @@ grid:
```
python benchmark.py run -P tti -d 512 402 890 -so 12 -a basic --tn 100
```
Same as before, but telling Devito not to use temporaries to store the
intermediate values which stem from mixed derivatives:
```
python benchmark.py run -P tti -d 512 402 890 -so 12 -a basic --tn 100 --opt "('advanced', {'cire-mingain': 1000000})"
```
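The value given to `--opt` is written as a Python literal: a tuple holding the mode name and a dictionary of options. How `benchmark.py` parses this string internally is not shown here; the sketch below only assumes a `literal_eval`-style parse, and the helper name `parse_opt` is hypothetical. It can be useful for checking the quoting of the argument before launching a long run.

```python
import ast


def parse_opt(value):
    """Parse an --opt style string into (mode, options).

    Hypothetical helper, not part of benchmark.py.
    """
    parsed = ast.literal_eval(value)
    if isinstance(parsed, str):
        # A bare quoted mode name, e.g. "'advanced'", carries no options
        return parsed, {}
    mode, options = parsed
    if not isinstance(options, dict):
        raise TypeError("options must be a dict, got %s" % type(options).__name__)
    return mode, options


mode, options = parse_opt("('advanced', {'cire-mingain': 1000000})")
print(mode, options)
```

Under this assumption, a misquoted value such as `{'cire-mingain: 1000000'}` (colon inside the quotes) evaluates to a set rather than a dictionary, so the helper rejects it with a `TypeError`.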
Do not forget to pin processes, especially on NUMA systems; below, we use
`numactl` to pin processes and threads to one specific NUMA domain.
```
@@ -212,8 +213,9 @@ watch numastat -m

## The run-jit-backdoor mode

As of Devito v3.5 it is possible to customize the code generated by Devito. This
is often referred to as the ["JIT backdoor" mode](https://github.com/devitocodes/devito/wiki/FAQ#can-i-manually-modify-the-c-code-generated-by-devito-and-test-these-modifications).
As of Devito v3.5 it is possible to customize the code generated by Devito.
This is often referred to as the ["JIT backdoor"
mode](https://github.com/devitocodes/devito/wiki/FAQ#can-i-manually-modify-the-c-code-generated-by-devito-and-test-these-modifications).
With ``benchmark.py`` we can exploit this feature to manually hack and test the
code generated for a given benchmark. So, we first run a problem, for example
```
@@ -248,57 +250,7 @@ experiments.
## Benchmark output

The GFlops/s and GPoints/s performance, Operational Intensity (OI) and
execution time are emitted to standard output at the end of each run.
Further, when running in `bench` mode, a `.json` file is produced
(see `python benchmark.py bench --help` for more info) in a folder named
`results` except if otherwise specified with the `-r` option.

## Generating a roofline model

To generate a roofline model from the results obtained in `bench` mode,
one can execute `benchmark.py` in `plot` mode. For example, the command

```
python benchmark.py plot -P acoustic -d 512 512 512 -so 12 --tn 100 -a aggressive --max-bw 12.8 --flop-ceil 80 linpack
```

will generate a roofline model for the results obtained from

```
python benchmark.py bench -P acoustic -d 512 512 512 -so 12 --tn 100 -a
```

The `plot` mode expects the same arguments used in `bench` mode plus
two additional arguments to generate the roofline:

* `--max-bw <float>`: DRAM bandwidth (GB/s).
* `--flop-ceil <float, str>`: CPU machine peak. The CPU performance ceiling
(GFlops/s) and how the ceiling was obtained (ideal peak, linpack, ...).
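These two values define the roofline itself. In the standard roofline model, the attainable performance at a given operational intensity is the smaller of the flop ceiling and the bandwidth multiplied by the operational intensity. The sketch below is independent of `benchmark.py`: the function name and the sample OI values are illustrative assumptions, while `12.8` and `80` are the values used in the `plot` command line above.

```python
def roofline(oi, max_bw, flop_ceil):
    """Attainable GFlops/s at operational intensity `oi` (flops/byte)."""
    return min(flop_ceil, max_bw * oi)


max_bw = 12.8     # DRAM bandwidth in GB/s, as passed to --max-bw
flop_ceil = 80.0  # CPU ceiling in GFlops/s, as passed to --flop-ceil

# OI at which the memory roof meets the compute roof
ridge = flop_ceil / max_bw

for oi in (1.0, ridge, 20.0):  # illustrative OI values
    regime = "memory-bound" if oi < ridge else "compute-bound"
    print("OI=%.2f -> %.1f GFlops/s (%s)" % (oi, roofline(oi, max_bw, flop_ceil), regime))
```

A point well below the roof at its OI indicates headroom in the generated code; a point on the sloped part of the roof is limited by DRAM bandwidth rather than by floating-point throughput.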

There are also two optional arguments:

* `--point-runtime` (bool switch): Annotate points with the runtime value.
* `--section <str>`: The code section for which the roofline is produced.
An Operator consists of multiple sections. Each section typically
comprises a loop nest and a sequence of equations. Different sections
are created for logically-distinct parts of the computation
(finite-difference stencils, boundary conditions, interpolation, etc.).
The naming convention is `sectionX`, where `X` is a progressive id (`section0`,
`section1`, ...). In the generated code the beginning and the end
of a section are marked with suitable comments. Currently, there is
no way other than looking at the generated code to understand which
section the user-provided equations belong to.

To obtain the DRAM bandwidth of a system, we advise using
[STREAM](http://www.cs.virginia.edu/stream/ref.html).

To obtain the ideal CPU peak, one should instantiate this formula

#[cores] · #[avx units] · #[vector lanes] · #[FMA ports] · [ISA base frequency]

More details in this [paper](https://arxiv.org/pdf/1807.03032.pdf).
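As a worked instance of the formula above, here is a minimal sketch. The function name is hypothetical and the numbers describe a made-up machine, not a real CPU; substitute the values from your processor's specification sheet.

```python
def ideal_peak_gflops(cores, avx_units, vector_lanes, fma_ports, base_freq_ghz):
    """Product of the five factors in the formula; GFlops/s when the frequency is in GHz."""
    return cores * avx_units * vector_lanes * fma_ports * base_freq_ghz


# Made-up machine, for illustration only
peak = ideal_peak_gflops(cores=8, avx_units=2, vector_lanes=8,
                         fma_ports=2, base_freq_ghz=2.0)
print("Ideal peak: %.1f GFlops/s" % peak)
```

The result is the kind of value one would pass to `--flop-ceil`, with `ideal peak` as the accompanying description.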

## Do not hesitate to contact us

Should you encounter any issues, do not hesitate to
[get in touch with the development team](https://join.slack.com/t/devitocodes/shared_invite/zt-gtd2yxj9-Y31YKk_7lr9AwfXeL2iMFg)
execution time are emitted to standard output at the end of each run. You may
find this
[FAQ](https://github.com/devitocodes/devito/wiki/FAQ#how-does-devito-compute-the-performance-of-an-operator)
useful.