Release-2022.04.15

@mwkrentel released this 27 Apr 03:57

HPCToolkit Release Notes

Enhancements:

  • Hpcrun includes initial support for using the OpenMP OMPT interface for
    profiling and tracing of OpenMP TARGET operations on AMD GPUs in code
    generated by ROCm 5.1's clang-based AOMP compiler.
  • Hpcrun supports profiling of kernels on AMD GPUs using publicly available
    hardware counters with AMD's rocprofiler API.
  • Hpcrun obtains binaries for code that executes on AMD GPUs using AMD's
    Roctracer API instead of ROCm Debug API.
  • Hpcrun emits a better error message when an application unexpectedly closes
    hpcrun's log file.
  • Hpcrun now uses an embedded implementation of an MD5 hash function for
    naming CPU and GPU binaries revealed in memory.
  • Hpcstruct now supports caching of structure files from binaries it analyzes.
    A cache greatly reduces the time to analyze binaries for executions because
    it will almost always contain up-to-date analysis results for commonly used
    shared platform libraries, e.g. libc and libm, as well as MPI libraries.
    When a binary changes, results in the cache are updated as needed.
  • Hpcstruct no longer pretty-prints its output by default. Omitting the
    leading blanks added by pretty-printing reduces the output size by over
    15%, which is significant when analyzing multi-gigabyte binaries.
  • When applied to a measurements directory, hpcstruct uses a mix of
    parallelism and concurrency to analyze only the CPU and GPU binaries that
    were measured in the execution. Binaries that did not get any profile hits
    are not analyzed.
  • Hpcstruct's parallel efficiency has been improved. Changes that contributed
    to that improvement include enhancements to parallelism in Dyninst’s
    finalization of binary analysis and parallel assembly of hpcstruct's output
    file.
  • Hpcstruct now supports analysis of CUDA binaries from CUDA 11.5+,
    accommodating a change to NVIDIA's nvdisasm output format.
  • When measuring hardware counter metrics for kernels on AMD GPUs, hpcrun
    disables kernel measurement with Roctracer because Roctracer gives an
    incorrect timestamp for the first kernel. The error is large enough to
    destroy the accuracy of kernel profiles and traces.
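
The hash-based naming and structure-file caching described above can be illustrated with a small sketch. It is not HPCToolkit's implementation: the function `binary_name`, the class `StructureCache`, the `.bin` suffix, and the choice to key the cache on an MD5 digest of the binary's contents are all assumptions made for illustration. The notes above say only that MD5 is used for naming binaries and that cached results are updated when a binary changes.

```python
import hashlib


def binary_name(contents: bytes, suffix: str = ".bin") -> str:
    """Derive a stable file name from a binary's bytes (hypothetical scheme)."""
    return hashlib.md5(contents).hexdigest() + suffix


class StructureCache:
    """Toy in-memory cache of analysis results, keyed by content hash.

    Assumption: keying on the content hash means a changed binary gets a
    new key, so it is re-analyzed instead of reusing a stale result.
    """

    def __init__(self, analyze):
        self._analyze = analyze   # callable: bytes -> analysis result
        self._results = {}        # content hash -> cached result
        self.misses = 0           # number of times analysis actually ran

    def lookup(self, contents: bytes):
        key = hashlib.md5(contents).hexdigest()
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._analyze(contents)
        return self._results[key]


if __name__ == "__main__":
    # Stand-in "analysis" that just measures the input size.
    cache = StructureCache(lambda data: len(data))
    cache.lookup(b"libm contents")
    cache.lookup(b"libm contents")        # served from the cache
    cache.lookup(b"libm contents, v2")    # changed binary, analyzed again
    print(binary_name(b"libm contents"), cache.misses)
```

In this sketch an unchanged library, such as one shared by many executions, is analyzed once; only a binary whose bytes differ triggers a new analysis.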

Bug fixes:

  • Adjust tracing for ROCm GPU activities to correct alignment between CPU and
    GPU timelines.
  • Fix use of Dyninst by hpcstruct so that it sees inlining info in Intel GPU
    binaries.

Infrastructure improvements:

  • Code for hpcrun's use of LD_AUDIT has been streamlined.
  • Fixed recording of program path names as part of metadata in hpcrun's
    output files.

Dependency changes:

  • Deletions
    • Mbedtls - superseded by internal MD5 hash implementation
    • ROCm Debug API - obtain GPU binaries using Roctracer API instead
    • Gotcha - unused and removed
  • Additions
    • Rocprofiler API - included for a spack '+rocm' install to provide access
      to hardware counters on AMD GPUs
    • HSA - included for a spack '+rocm' install to support rocprofiler

Known Issues:

  • Profile measurements and traces for AMD GPUs, which are new for ROCm 5.1,
    should be viewed with some skepticism.
    The elapsed time for copies seems too large in executions that we've
    measured. For a 96-thread run of miniqmc, the aggregate time for copies
    reported by AMD's OMPT implementation for its GPUs was almost 100x longer
    than the real time of the execution. If timestamps are incorrect for
    OpenMP events on AMD GPUs, this will affect the accuracy of both profile
    and trace views.
    Furthermore, trace items for OpenMP events on AMD GPUs are known to
    overlap, yet hpcviewer renders them on a single trace line. As a result,
    overlapping trace items will cause incorrect statistics in the trace
    view. In such cases, the profile view will still accurately represent the
    aggregate values reported by OMPT for AMD GPUs.
  • In some cases, attribution of exclusive metrics for BLOCKTIME and
    CTXT SWTCH to call paths within the Linux kernel may be missing
    even though inclusive costs for these metrics are attributed properly.
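
To see why overlapping trace items on a single trace line can skew statistics, consider the following sketch. It is only an illustration, not how hpcviewer or OMPT compute anything: the interval data and the helper names `summed_duration` and `covered_duration` are hypothetical. It compares the sum of the item durations with the wall-clock time the items actually cover.

```python
def summed_duration(items):
    """Sum of individual item durations; overlapping time is counted twice."""
    return sum(end - start for start, end in items)


def covered_duration(items):
    """Wall-clock time covered by at least one item (union of intervals)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(items):
        if cur_end is None or start > cur_end:
            # Gap before this item: close out the previous merged interval.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps the current merged interval: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


if __name__ == "__main__":
    # Hypothetical (start, end) timestamps for three trace items.
    items = [(0, 10), (5, 15), (20, 25)]
    print(summed_duration(items), covered_duration(items))
```

When items overlap, the two numbers diverge, so any per-line statistic that assumes one item at a time on a trace line will not match the aggregate values.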

HPCViewer Release Notes

Enhancements:

  • Improved call site icons
  • Double buffering of the x and y axes in the trace view
  • Simplify metric number in derived metrics
  • Set maximum database history to 20
  • Set the default GPU trace exposure to true