
logo

trinity is a C++ library and command-line tool for anisotropic mesh adaptation.
It is targeted at non-uniform memory access (NUMA) multicore and manycore processors.
It was primarily designed for performance, and hence for HPC applications.
It is intended to be used within a numerical simulation loop.

adaptive-loop



Building the library


trinity is completely standalone.
It can be built on Linux or macOS using CMake.
It only requires a C++14 compiler with OpenMP support.
It can optionally build medit to render meshes.
It supports hwloc to retrieve and print more information about the host machine.

mkdir build                                          # out-of-source build recommended
cd build                                             #
cmake ..                                             # see build options below
make -j4                                             # use multiple jobs for compilation
make install                                         # optional, can use a prefix
| Option         | Description                                             | Default |
| -------------- | ------------------------------------------------------- | ------- |
| Build_Medit    | Build the medit mesh renderer                           | ON      |
| Build_GTest    | Build googletest for future unit tests                  | OFF     |
| Build_Main     | Build the command-line tool                             | ON      |
| Build_Examples | Build the provided examples                             | ON      |
| Use_Deferred   | Use the deferred topology updates scheme from pragmatic | OFF     |
Linking to your project


trinity is exported as a CMake package.
To use the library, update your CMakeLists.txt with:

find_package(trinity)                                # in build or install trees
target_link_libraries(target PRIVATE trinity)        # replace 'target'

Then include trinity.h in your application.
Take a look at the examples folder for basic usage.
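
A minimal consuming program, just to check that the exported target resolves and the header is found, could look like this (a sketch; it exercises none of the API):

```cpp
// Minimal sanity check: the exported trinity target provides the trinity.h header
// and the required link flags; this program only verifies that it compiles and links.
#include "trinity.h"

int main() { return 0; }
```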

Using the tool

The list of command-line arguments is printed by the -h option.

host:~$ bin/trinity -h
Usage: trinity [options]

Options:
  -h, --help            show this help message and exit
  -m CHOICE             select mode [release|benchmark|debug]
  -a CHOICE             cpu architecture [skl|knl|kbl]
  -i STRING             initial mesh file
  -o STRING             result mesh file
  -s STRING             solution field .bb file
  -c INT                number of threads
  -b INT                vertex bucket capacity [64-256]
  -t FLOAT              target resolution factor [0.5-1.0]
  -p INT                metric field L^p norm [0-4]
  -r INT                remeshing rounds [1-5]
  -d INT                max refinement/smoothing depth [1-3]
  -v INT                verbosity level [0-2]
  -P CHOICE             enable papi [cache|cycles|tlb|branch]

For now, only the .mesh format used by medit is supported.
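
For example, a typical run could look like the line below (file names and values are placeholders; see the option list above):

host:~$ bin/trinity -i input.mesh -s solut.bb -o adapted.mesh -c 16 -p 2 -r 3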

Setting thread-core affinity

For performance reasons, I recommend explicitly setting thread-core affinity before any run.
Indeed, threads should be statically bound to cores to prevent the OS from migrating them.
Besides, simultaneous multithreading (hyperthreading on Intel) should be:

  • enabled to hide memory latency penalties, especially on Intel KNL.
  • disabled to reduce shared-cache saturation on faster nodes.

This can be done by setting a few environment variables:

export OMP_PLACES=[cores|threads] OMP_PROC_BIND=close  # with GNU or clang/LLVM
export KMP_AFFINITY=granularity=[core|fine],compact    # with Intel compiler  

Overview

trinity aims to reduce and equidistribute the interpolation error of a computed physical field u on a triangulated
planar domain M by adapting its discretization with respect to a target number of points n.
Basically, it takes (u, M, n) and outputs a mesh adapted to the variation of the gradient of u on M using n points.
It uses metric tensors to encode the desired point distribution with respect to the estimated error.

principle

It resamples and regularizes a planar triangular mesh M.
It aims to reduce and equidistribute the error of a solution field u on M using n points.
For that, it uses five kernels:

  • metric recovery: compute a tensor field which encodes the desired point density.
  • refinement: add points in areas where the error of u is large.
  • coarsening: remove points in areas where the error of u is small.
  • swapping: flip edges to locally improve cell quality.
  • smoothing: relocate points to locally improve cell quality.

Error estimate

trinity uses metric tensors to link the error of u with the mesh point distribution.
A tensor encodes the desired length of edges incident to a point, which may be direction-dependent.
trinity lets you tune the sensitivity of the error estimate according to the simulation needs.
For that, it provides a multi-scale estimate in L^p norm:

  • a small p will distribute points to capture all scales of the error of u.
  • a large p will distribute points mainly in areas of large variation (shocks).

It actually implements the continuous metric defined in:

📄 Frédéric Alauzet, Adrien Loseille, Alain Dervieux and Pascal Frey (2006).
"Multi-Dimensional Continuous Metric for Mesh Adaptation".
In Proceedings of the 15th International Meshing Roundtable, pp. 191-214, Springer, Berlin.
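
For reference, in two dimensions this L^p metric is commonly written as below (a sketch of the usual statement from the continuous-mesh literature; here |H_u| denotes the Hessian of u with its eigenvalues taken in absolute value, and N the target number of points):

```latex
\mathcal{M}_{L^p}(\mathbf{x}) \;=\;
  N \left( \int_{\Omega} \det\!\big(|H_u|\big)^{\frac{p}{2p+2}} \,\mathrm{d}\mathbf{x} \right)^{-1}
  \det\!\big(|H_u(\mathbf{x})|\big)^{-\frac{1}{2p+2}} \; |H_u(\mathbf{x})|
```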

To obtain a good mesh, trinity needs an accurate metric tensor field.
The latter relies on computing the variations of the gradient of u, which are given by its local Hessian matrices.
In trinity, the Hessian is recovered through an L^2 projection.
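
The sketch below illustrates the general recovery technique on a toy data layout: a lumped L^2 projection of piecewise-constant cell gradients onto the vertices, applied once for the gradient and once more per component for the Hessian. The structures and function names are hypothetical and do not reflect trinity's internals:

```cpp
// Sketch of gradient/Hessian recovery by lumped L^2 projection on a 2D triangular mesh.
// Triangles are assumed positively oriented; data layout is hypothetical.
#include <array>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Gradient of a piecewise-linear (P1) field on one triangle (p0, p1, p2).
static std::array<double, 2> cellGradient(const Point& p0, const Point& p1, const Point& p2,
                                          double u0, double u1, double u2, double area) {
  // grad(phi_i) is the 90-degree rotation of the opposite edge, divided by twice the area.
  auto perp = [](double dx, double dy) { return std::array<double, 2>{dy, -dx}; };
  const auto g0 = perp(p1.x - p2.x, p1.y - p2.y);
  const auto g1 = perp(p2.x - p0.x, p2.y - p0.y);
  const auto g2 = perp(p0.x - p1.x, p0.y - p1.y);
  const double s = 1.0 / (2.0 * area);
  return {s * (u0 * g0[0] + u1 * g1[0] + u2 * g2[0]),
          s * (u0 * g0[1] + u1 * g1[1] + u2 * g2[1])};
}

// Lumped L^2 projection: average the cell gradients onto each vertex, weighted by cell area.
std::vector<std::array<double, 2>> recoverGradient(const std::vector<Point>& points,
                                                   const std::vector<std::array<int, 3>>& cells,
                                                   const std::vector<double>& u) {
  std::vector<std::array<double, 2>> grad(points.size(), {0.0, 0.0});
  std::vector<double> weight(points.size(), 0.0);
  for (const auto& c : cells) {
    const Point &p0 = points[c[0]], &p1 = points[c[1]], &p2 = points[c[2]];
    const double area = 0.5 * ((p1.x - p0.x) * (p2.y - p0.y) - (p2.x - p0.x) * (p1.y - p0.y));
    const auto g = cellGradient(p0, p1, p2, u[c[0]], u[c[1]], u[c[2]], area);
    for (int v : c) {  // accumulate area-weighted contributions on each incident vertex
      grad[v][0] += area * g[0];
      grad[v][1] += area * g[1];
      weight[v]  += area;
    }
  }
  for (std::size_t v = 0; v < points.size(); ++v) {
    grad[v][0] /= weight[v];
    grad[v][1] /= weight[v];
  }
  return grad;
}
// The nodal Hessian of u follows by applying recoverGradient to each component of
// the recovered gradient, then symmetrizing the resulting 2x2 matrix at every vertex.
```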

multiscale_meshes.png

Fine-grained parallelism

trinity exploits intra-node parallelism through multithreading.
It relies on a fork-join model built on OpenMP.
All kernels are structured into synchronous stages.
A stage consists of local computation, a reduction in shared memory, and a barrier.

algo_structure
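
As a generic illustration of this stage structure (a sketch, not trinity's actual kernels), each parallel region alternates thread-local work, a shared-memory reduction, and a barrier:

```cpp
// Sketch of one synchronous stage inside a fork-join OpenMP region (compile with -fopenmp).
// Names and the filtering predicate are placeholders, not trinity's API.
#include <cstddef>
#include <vector>

void runStage(const std::vector<int>& tasks, std::vector<int>& accepted) {
  #pragma omp parallel
  {
    // 1. local computation: each thread fills a private buffer, no locking involved
    std::vector<int> local;
    #pragma omp for nowait
    for (std::size_t i = 0; i < tasks.size(); ++i) {
      if (tasks[i] % 2 == 0)            // stand-in for the kernel's per-task test
        local.push_back(tasks[i]);
    }
    // 2. reduction: merge the private buffers into the shared result
    #pragma omp critical
    accepted.insert(accepted.end(), local.begin(), local.end());
    // 3. barrier: all threads synchronize before the next stage starts
    #pragma omp barrier
  }
}
```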

Unlike coarse-grained parallel remeshers, it does not rely on domain partitioning.
Nor does it rely on task parallelism and runtime systems such as Cilk, TBB or StarPU.

In fact, manycore machines have many slow cores with small caches.
To scale, one needs plenty of very fine-grained, local tasks to keep them busy.
In trinity, remeshing kernels (except refinement) are expressed on a graph.
Runnable tasks are then extracted using multithreaded graph heuristics:

graph_matching.png

trinity repairs incidence data only at the end of each kernel round.
It uses an explicit synchronization scheme to do so, relying on low-level atomic primitives.
The scheme was designed to minimize data movement penalties, especially on NUMA machines.
For further details, please take a look at:

📄 Hoby Rakotoarivelo, Franck Ledoux, Franck Pommereau and Nicolas Le-Goff (2017).
"Scalable fine-grained metric-based remeshing algorithm for manycore/NUMA architectures".
In Proceedings of the 23rd International European Conference on Parallel and Distributed Computing (Euro-Par), Springer.
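
As a generic illustration of this kind of explicit, lock-free synchronization (again a sketch, not trinity's actual code), a thread can claim a vertex with a compare-and-swap before touching its incidence data:

```cpp
#include <atomic>

// Try to claim a vertex for 'thread_id'; -1 encodes "unclaimed".
// Returns true if the calling thread now owns the vertex and may update its
// incidence data; losers simply skip the task and retry in a later round.
bool tryClaim(std::atomic<int>& owner, int thread_id) {
  int expected = -1;
  return owner.compare_exchange_strong(expected, thread_id);
}
```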


Profiling

trinity is natively instrumented.
It prints runtime stats at three verbosity levels.
Here is an output example at the medium level.

screenshot

Stats are exported as tab-separated values and can easily be plotted with gnuplot or matplotlib.
You can use wrappi to profile on-core events such as cycles, cache misses and branch predictions.

Deployment on a cluster

Preparing a benchmark campaign can be tedious 😩.
I included some Python scripts to help set one up on a node, enabling you to:

  • compute a synthetic solution field.
  • rebuild the sources and set thread-core affinity.
  • set memory affinity through numactl, which is useful on an Intel KNL node.
  • compact profiling data and generate gnuplot scripts for plots.
  • profile the memory bandwidth of the host machine using STREAM.
  • plot the sparsity pattern of the mesh incidence graph.
They are somewhat outdated, so adapt them to your needs.


logo
Copyright 2016, Hoby Rakotoarivelo.


trinity is free and intended for research purposes.
It was written during my doctorate, so improvements are welcome.
To get involved, you can:

  • report bugs or request features by submitting an issue.
  • submit code contributions using feature branches and pull requests.

Enjoy! 😊