Simplification of EventTimings #1410

fsimonis · 2022-08-09T09:49:25Z

I open this issue in this repo to preserve the information and make it easier to find from people running into issues regarding this. It also may impact the preCICE lib in the future.

Please describe the problem you are trying to solve.

The EventTimings provide a system to:

measure named sections in the code
attach data to these sections (used in PETSc RBF mappings)
synchronize the communicator prior to the recording using a barrier on requested aka syncmode
aggregate and normalize these measurements across ranks
write a summary of the results to a file as a text table
write the aggregate results to a file as json

The additional events2trace script formats and merges multiple of these outputs into a single eventstracing json file that can be visualized with various tools.

Concerns of this approach:

The EventsTimings need a way to synchronize all ranks. This currently requires passing a custom MPI comm. Not using MPI only supports a single rank.
The data aggregation happens during the finalization of preCICE, which is a collective operation on the communicator. If any issue occurs with this communicator during the lifetime of preCICE, then the collective will fail, resulting in a crash/error.
If preCICE runs into any error, then the events won't be aggregated nor written to a file.
preCICE requires an additional dependency for the sole purpose of writing the aggregated data to disk. We are currently using a checked-in version of the json library, which will at some point collide with other versions on the system leading to strange problems such as Eigen::VectorXd not working for in-house code #527 or Using solids4foam with preCICE leads to Eigen-related runtime errors openfoam-adapter#238 .

Describe the solution you propose.

Simplify the Events in preCICE as much as possible.

The synchronization can be handled fully by preCICE, as IntraComm provides a barrier method. Its implementation also works if preCICE is compiled without MPI.
Use independent rank files, essentially removing the aggregation from the preCICE core. This allows to output events on error.
Optionally write these files continuously during the lifetime of the program, allowing to inspect the events on a crash. We could even implement a block-wise write to reduce the memory overhead.
This serialization is so simple that it doesn't require a special library. ( Similar to the clang time-tracing implementation. )
The above results in additional IO. So, a configuration option to disable the tracing could be beneficial.

Move the functionality to normalize, aggregate and format to a separate script (or precice-tools)

Use python pandas or similar for normalizing and aggregating the data.
Ship this as an extra tool in /usr/share/precice or similar.
This allows us to remove the json dependency from the project.
Alternatively, move the existing C++ implementation from the preCICE code into a separate executable, or into precice-tools.
The python version would allow everyone to easily add custom functionality such as
- plotting given timings over time
- analyse the comm establishment to detect filesystem issues on some nodes
- find an imbalance of mapping cost over ranks of a participant

Describe alternatives you've considered

Move the events2trace script into the preCICE library, essentially completely integrating the external project.
Reimplement the events2trace as a subcommand of precice-tools.

Additional context

The text was updated successfully, but these errors were encountered:

MakisH · 2022-08-13T08:56:11Z

To the extent I understand the situation, I think such a simplification would be useful. I particularly like that we could drop a dependency, and I think that moving to precice-tools sounds natural.

I also like the concept of using this for debugging a crashing simulation: instead of (or, in addition to) trying to understand the rather complex trace log, we could just visualize the events before the crash. Sounds a bit like an airplane's "black box" for preCICE.

As for the implementation details, I am sure you have thought this several steps ahead already, so I will not try to comment.

uekerman · 2022-08-14T12:49:37Z

I agree on all concerns and the proposed solution. Thanks for opening this 👍

The most common problem is not even mentioned yet: Often, coupled simulation get manually killed by users (e.g. user decides that they have enough statistics or that the simulation is in steady state, connected to #1118) or they get killed by the scheduler on a cluster (max time reached). In both cases, no event data is produced as finalize is not called.

I think it will be nice if we integrate all functionality into precice-tools:

the aggregation,
the summary,
conversion to the trace format.

An open question / challenge in my opinion: Do we need that this tool runs on distributed memory? Could be for performance reasons, but also for memory reasons.

Minor comments:

So, a configuration option to disable the tracing could be beneficial.

By tracing, you mean the events in general, right? Not the logging. If we touch this, I even could imagine that we distinguish three levels of events:

off: no events at all
user (the default option): the main routines (initialization, data mapping, advance, ...)
developer: all events

I did not think much about the names yet.

I also like the concept of using this for debugging a crashing simulation: instead of (or, in addition to) trying to understand the rather complex trace log, we could just visualize the events before the crash. Sounds a bit like an airplane's "black box" for preCICE.

I believe that this would be complicated as we do not have the very last events. Only those from the last checkpoint.

fsimonis added enhancement A new feature, a new functionality of preCICE (from user perspective) maintainability Working on this will make our lives easier in the long run as preCICE gets easier to maintain. dependencies Pull requests that update a dependency file labels Aug 9, 2022

fsimonis mentioned this issue Aug 19, 2022

Overhaul events #1419

Merged

9 tasks

This was referenced Sep 13, 2022

Use Eventtimings submodule #1160

Closed

Add link to events2trace documentation precice/precice.github.io#184

Closed

fsimonis mentioned this issue Nov 4, 2022

Confusing solver-interface XML tag and impact on geometric multiscale #1416

Closed

fsimonis closed this as completed in #1419 May 11, 2023

fsimonis mentioned this issue Aug 1, 2023

Remove nlohmann JSON dependency #1759

Merged

7 tasks

fsimonis added this to the Version 3.0.0 milestone Dec 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplification of EventTimings #1410

Simplification of EventTimings #1410

fsimonis commented Aug 9, 2022

MakisH commented Aug 13, 2022

uekerman commented Aug 14, 2022

Simplification of EventTimings #1410

Simplification of EventTimings #1410

Comments

fsimonis commented Aug 9, 2022

MakisH commented Aug 13, 2022

uekerman commented Aug 14, 2022