Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplification of EventTimings #1410

Closed
fsimonis opened this issue Aug 9, 2022 · 2 comments · Fixed by #1419
Closed

Simplification of EventTimings #1410

fsimonis opened this issue Aug 9, 2022 · 2 comments · Fixed by #1419
Labels
dependencies Pull requests that update a dependency file enhancement A new feature, a new functionality of preCICE (from user perspective) maintainability Working on this will make our lives easier in the long run as preCICE gets easier to maintain.
Milestone

Comments

@fsimonis
Copy link
Member

fsimonis commented Aug 9, 2022

I open this issue in this repo to preserve the information and make it easier to find from people running into issues regarding this. It also may impact the preCICE lib in the future.

Please describe the problem you are trying to solve.

The EventTimings provide a system to:

  • measure named sections in the code
  • attach data to these sections (used in PETSc RBF mappings)
  • synchronize the communicator prior to the recording using a barrier on requested aka syncmode
  • aggregate and normalize these measurements across ranks
  • write a summary of the results to a file as a text table
  • write the aggregate results to a file as json

The additional events2trace script formats and merges multiple of these outputs into a single eventstracing json file that can be visualized with various tools.

Concerns of this approach:

  • The EventsTimings need a way to synchronize all ranks. This currently requires passing a custom MPI comm. Not using MPI only supports a single rank.
  • The data aggregation happens during the finalization of preCICE, which is a collective operation on the communicator. If any issue occurs with this communicator during the lifetime of preCICE, then the collective will fail, resulting in a crash/error.
  • If preCICE runs into any error, then the events won't be aggregated nor written to a file.
  • preCICE requires an additional dependency for the sole purpose of writing the aggregated data to disk. We are currently using a checked-in version of the json library, which will at some point collide with other versions on the system leading to strange problems such as Eigen::VectorXd not working for in-house code #527 or Using solids4foam with preCICE leads to Eigen-related runtime errors openfoam-adapter#238 .

Describe the solution you propose.

  1. Simplify the Events in preCICE as much as possible.
  • The synchronization can be handled fully by preCICE, as IntraComm provides a barrier method. Its implementation also works if preCICE is compiled without MPI.
  • Use independent rank files, essentially removing the aggregation from the preCICE core. This allows to output events on error.
  • Optionally write these files continuously during the lifetime of the program, allowing to inspect the events on a crash. We could even implement a block-wise write to reduce the memory overhead.
  • This serialization is so simple that it doesn't require a special library. ( Similar to the clang time-tracing implementation. )
  • The above results in additional IO. So, a configuration option to disable the tracing could be beneficial.
  1. Move the functionality to normalize, aggregate and format to a separate script (or precice-tools)
  • Use python pandas or similar for normalizing and aggregating the data.

  • Ship this as an extra tool in /usr/share/precice or similar.

  • This allows us to remove the json dependency from the project.

  • Alternatively, move the existing C++ implementation from the preCICE code into a separate executable, or into precice-tools.

  • The python version would allow everyone to easily add custom functionality such as

    • plotting given timings over time
    • analyse the comm establishment to detect filesystem issues on some nodes
    • find an imbalance of mapping cost over ranks of a participant

Describe alternatives you've considered

  • Move the events2trace script into the preCICE library, essentially completely integrating the external project.
  • Reimplement the events2trace as a subcommand of precice-tools.

Additional context

@fsimonis fsimonis added enhancement A new feature, a new functionality of preCICE (from user perspective) maintainability Working on this will make our lives easier in the long run as preCICE gets easier to maintain. dependencies Pull requests that update a dependency file labels Aug 9, 2022
@MakisH
Copy link
Member

MakisH commented Aug 13, 2022

To the extent I understand the situation, I think such a simplification would be useful. I particularly like that we could drop a dependency, and I think that moving to precice-tools sounds natural.

I also like the concept of using this for debugging a crashing simulation: instead of (or, in addition to) trying to understand the rather complex trace log, we could just visualize the events before the crash. Sounds a bit like an airplane's "black box" for preCICE.

As for the implementation details, I am sure you have thought this several steps ahead already, so I will not try to comment.

@uekerman
Copy link
Member

I agree on all concerns and the proposed solution. Thanks for opening this 👍

The most common problem is not even mentioned yet: Often, coupled simulation get manually killed by users (e.g. user decides that they have enough statistics or that the simulation is in steady state, connected to #1118) or they get killed by the scheduler on a cluster (max time reached). In both cases, no event data is produced as finalize is not called.

I think it will be nice if we integrate all functionality into precice-tools:

  • the aggregation,
  • the summary,
  • conversion to the trace format.

An open question / challenge in my opinion: Do we need that this tool runs on distributed memory? Could be for performance reasons, but also for memory reasons.

Minor comments:

So, a configuration option to disable the tracing could be beneficial.

By tracing, you mean the events in general, right? Not the logging. If we touch this, I even could imagine that we distinguish three levels of events:

  • off: no events at all
  • user (the default option): the main routines (initialization, data mapping, advance, ...)
  • developer: all events

I did not think much about the names yet.

I also like the concept of using this for debugging a crashing simulation: instead of (or, in addition to) trying to understand the rather complex trace log, we could just visualize the events before the crash. Sounds a bit like an airplane's "black box" for preCICE.

I believe that this would be complicated as we do not have the very last events. Only those from the last checkpoint.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file enhancement A new feature, a new functionality of preCICE (from user perspective) maintainability Working on this will make our lives easier in the long run as preCICE gets easier to maintain.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants