19 May 15:46

Bugfixes since 2022.04.15

hide the symbols from XED; this avoids interference with
symbols from Intel gtpin
fix a typo in renamestruct.sh script
fix a problem with fork() on Cray

27 Apr 03:57

mwkrentel

release-2022.04

a92fdad

Release-2022.04.15

HPCToolkit Release Notes

Enhancements:

Hpcrun includes initial support for using the OpenMP OMPT interface for
profiling and tracing of OpenMP TARGET operations on AMD GPUs in code
generated by ROCM 5.1's clang-based AOMP compiler.
Hpcrun supports profiling of kernels on AMD GPUs using publicly available
hardware counters with AMD's rocprofiler API.
Hpcrun obtains binaries for code that executes on AMD GPUs using AMD's
Roctracer API instead of ROCm Debug API.
Hpcrun emits a better error message when an application unexpectedly closes
hpcrun's log file.
Hpcrun now uses an embedded implementation of an MD5 hash function for
naming CPU and GPU binaries revealed in memory.
Hpcstruct now supports caching of structure files from binaries it analyzes.
A cache greatly reduces the time to analyze binaries for executions as the
cache will almost always contain up to date analysis results for commonly
used shared platform libraries, e.g. libc, libm, as well as libraries for MPI.
When a binary changes, results in the cache are updated as needed.
Hpcstruct no longer pretty-prints its output by default. Omitting leading
blanks due to pretty printing reduced the output size by over 15%, which was
quite significant when analyzing multi-gigabyte binaries.
When applied to a measurements directory, hpcstruct will use a mix of
parallelism and concurrency to analyze only the CPU and GPU binaries that were
measured in the execution. Binaries that did not get any profile hits are not
analyzed.
Hpcstruct's parallel efficiency has been improved. Changes that contributed
to that improvement include enhancements to parallelism in Dyninst’s
finalization of binary analysis and parallel assembly of hpcstruct's output
file.
Hpcstruct now supports analysis of CUDA binaries from 11.5+, accommodating a
change to NVIDIA's nvdisasm output format.
When measuring hardware counter metrics for kernels on AMD GPUs, hpcrun
disables kernel measurement with Roctracer because Roctracer gives an
incorrect timestamp for the first kernel. The error in that timestamp is large
enough to destroy the accuracy of kernel profiles and traces.
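
As an illustration of the two ideas above (naming binaries by an MD5 hash of their contents, and caching structure files so unchanged binaries are not re-analyzed), here is a minimal sketch. It is not hpcstruct's implementation: the class StructureCache, the ".struct" suffix, the cache layout, and fake_analyze are all hypothetical names chosen for the example.

```python
import hashlib
import os
import tempfile


def content_key(binary_bytes: bytes) -> str:
    """Return the MD5 hex digest of the bytes, used here as a content-derived name."""
    return hashlib.md5(binary_bytes).hexdigest()


class StructureCache:
    """Toy on-disk cache that maps a binary's content hash to analysis output."""

    def __init__(self, cache_dir: str):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
        self.misses = 0  # number of times the analysis actually ran

    def lookup_or_analyze(self, binary_bytes: bytes, analyze) -> str:
        # Hypothetical layout: one file per content hash.
        path = os.path.join(self.cache_dir, content_key(binary_bytes) + ".struct")
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                return f.read()
        # Cache miss: run the (expensive) analysis and store the result.
        self.misses += 1
        result = analyze(binary_bytes)
        with open(path, "w", encoding="utf-8") as f:
            f.write(result)
        return result


def fake_analyze(binary_bytes: bytes) -> str:
    """Stand-in for real binary analysis; returns a short description."""
    return f"structure for {len(binary_bytes)} bytes"


if __name__ == "__main__":
    cache = StructureCache(tempfile.mkdtemp())
    lib = b"pretend shared library contents"
    cache.lookup_or_analyze(lib, fake_analyze)   # analyzed
    cache.lookup_or_analyze(lib, fake_analyze)   # served from the cache
    print("analyses run:", cache.misses)
```

Because the key is derived from the contents, a changed binary hashes to a new key and is analyzed again, while an unchanged shared library is served from the cache.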

Bug fixes:

Adjust tracing for ROCm GPU activities to correct alignment between CPU and
GPU timelines.
Fix use of Dyninst by hpcstruct so that it sees inlining info in Intel GPU
binaries.

Infrastructure improvements:

Code for hpcrun's use of LD_AUDIT has been streamlined.
Fixed recording of program path names as part of metadata in hpcrun's
output files.

Dependency changes:

Deletions
- Mbedtls - superseded by internal MD5 hash implementation
- ROCm Debug API - obtain GPU binaries using Roctracer API instead
- Gotcha - unused and removed
Additions
- Rocprofiler API - included for a spack '+rocm' install to provide access
  to hardware counters on AMD GPUs
- HSA - included for a spack '+rocm' install to support rocprofiler

Known Issues:

Profile measurements and traces for AMD GPUs, which are new for ROCm 5.1,
should be viewed with some skepticism.
Also, elapsed times for copies seem too large for executions that we've
measured. For a 96-thread run of miniqmc, the aggregate time for copies
reported by AMD's OMPT implementation for its GPUs was almost 100x longer
than the real time of the execution. If timestamps are incorrect for
OpenMP events on AMD GPUs, this will affect the accuracy of both profile
and trace views.
Furthermore, trace items for OpenMP events on AMD GPUs are known to
overlap. For that reason, having hpcviewer render them on a single
trace line, which it does, is problematic. As a result, overlapping
trace items will cause incorrect statistics in trace view. In such
cases, the profile view will accurately represent the aggregate values
reported by OMPT for AMD GPUs.
In some cases, attribution of exclusive metrics for BLOCKTIME and
CTXT SWTCH to call paths within the Linux kernel may be missing
even though inclusive costs for these metrics are attributed properly.

HPCViewer Release Notes

Enhancements:

Improved call site icons
Double buffering x and y axis in the trace view
Simplify metric number in derived metrics
Set maximum database history to 20
Set the default GPU trace exposure to true

09 Jun 22:44

mwkrentel

release-2021.05

8a415de

Release-2021.05.15

Primarily a bug-fix release on top of 2021.03.01. For a full
description, see the README.ReleaseNotes file.

Improvements for hpcviewer

- Improve the performance of hot-path operation by not re-revealing the tree path.
- Default window size is 1400x1000 or the screen size
- Trace view: Move depth field into a separate pane so users can change the depth easily 
  even when call stack view is not visible.
- Reduce memory consumption.
- Use Java XML parser to slightly improve XML parsing performance and avoid using 
  the old Apache xerces. 
- Code clean-up, remove dead code and remove unused variables
- Issue 77: Add support for different color mapping policy in the trace view. 
  Default: procedure-name color instead of random color.
- Warn users when filtering is enabled 
- Default is to build with Eclipse 4.19 (2021.03) except for Linux 
  ppc64le (built with Eclipse 4.16). Some fixes include improved dark color theme.

Bug fixes

hpcrun
CPU issues
- avoid deadlock by not sampling an OpenMP thread before it finishes
setting up TLS
- avoid having the UCX communication library used by MPI terminate
a program when an unwind fails rather than just dropping a
sample
- fix initialization of control knobs when a process forks but
does not exec
- add a timeout to interrupt a hung cuptiActivityFlushAll so that a program
can terminate and write out all performance data already collected.
Intel GPUs
- always dump Intel GPU binaries so we can extract kernel names
even if not using GTPin binary instrumentation
NVIDIA GPUs
- avoid introducing kernel serialization while using coarse-grain
measurement by monitoring CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL rather
than CUPTI_ACTIVITY_KIND_KERNEL
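
The timeout fix listed under CPU issues above follows a common pattern: bound the wait on an operation that may hang so that shutdown can still proceed. The sketch below shows only that pattern. hpcrun is not written in Python, and the function name flush_with_timeout and its parameters are assumptions made for this example; it does not call CUPTI.

```python
import threading


def flush_with_timeout(flush, timeout_s: float) -> bool:
    """Run flush() in a helper thread and wait at most timeout_s seconds.

    Returns True if flush() finished in time, False if it is still running.
    The helper is a daemon thread so a hung flush cannot block process exit.
    """
    worker = threading.Thread(target=flush, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return not worker.is_alive()


if __name__ == "__main__":
    finished = flush_with_timeout(lambda: None, timeout_s=1.0)
    print("flush finished in time:", finished)
    # Whether or not the flush finished, the caller can go on to write out
    # the data it has already collected.
```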

hpcstruct
- correct reconstruction of loop nests for Intel GPU binaries

hpcviewer
- Fix issues #80 and #81 (null pointer exception for empty databases)
- Fix issue #79 (CCT filter on the trace view, preserve tree expansion)
- Fix issue #73 (sort direction is not shown on Linux for the first appearance)
- Fix issue #75 (closing only a window in multiple windows mode)
- Fix issue #74 (no sort direction on Linux/GTK)
- Fix issue #85 (keyboard shortcut to minimize the window)
- Fix filtering CCT nodes for thread views
- Fix hot path to select the child node instead of the parent
- Fix merging GPU databases which contain aggregate and derived metrics
by deep copying the metric descriptors.
- Fix build script to include notarization for mac
- Fix storing recent open database: store the absolute path, not the relative one.
- Fix SWT resource leaks
- Fix flickering issue on Windows when splitting the hpcviewer window.
- Fix trace view’s color map changes to also refresh other panes and windows
- Fix Find dialog layout on Linux/GTK
- Fix merging GPU databases
- Fix a procedure-color mapping bug in the trace view
- Partial fix for issue #42: Fix a performance bug when sorting a table

10 Aug 17:45

mwkrentel

release-2020.08

d9d13c7

Release-2020.08

Support for NVIDIA GPUs and CUDA, including PC sampling on GPU.

New fnbounds server that is faster and has a much smaller memory
footprint.

New format for hpcviewer databases. Note: the new viewer supports
reading old databases, but the current hpcprof requires the latest
hpcviewer.

Hpcstruct supports thread-level parallelism; use '-j ' to
run with multiple threads.

Important bug fixes to improve powerpc unwinding.

Bug fixes to better handle DOE applications: adagio, kull, pytorch

Updates to the manual and man pages.

01 Oct 02:38

jmellorcrummey

release-2018.09

7b9a814

Release 2018.09

Visible Improvements

hpcrun

Significantly enhanced robustness of HPCToolkit's measurement infrastructure to better support profiling of highly multithreaded applications.

Overhauled initialization to support profiling of applications that create threads in constructors that must synchronize before main is entered.
Improved call stack unwinding on all platforms.
Improved support for collecting call path samples that include frames in the Linux kernel.
Refined handling for precise hardware events in the Linux perf sample source.
Refined scripts to avoid interactions with Darshan that can cause deadlock.
Note jobid in jsrun jobs.

hpcstruct

Improved program structure recovery of loops, inlined code, and outlined code using binary analysis of highly-optimized code.

Improved attribution of loops to source lines.
Made hpcstruct's output deterministic for irreducible loops.
Improved attribution for PLT stubs.
Improved name demangling.

hpcprof/hpcprof-mpi

Significantly enhanced robustness of hpcprof-mpi.

Emit a warning and proceed with analysis if measurement data is salvageable rather than aborting with a fatal error.
Tolerate missing load modules. Generate a placeholder if necessary and emit a warning rather than triggering a fatal error.
Tolerate cases where some ranks in hpcprof-mpi are not assigned any profiles to analyze.
Avoid unnecessary per-rank duplication of informational messages.
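
The "warn and continue" behavior described above for missing load modules can be sketched as follows. This is an illustration of the policy only, not hpcprof-mpi's code; the function load_module_or_placeholder, the PLACEHOLDER string, and the injectable exists parameter are hypothetical.

```python
import os
import warnings

# Hypothetical placeholder name used when a load module cannot be found.
PLACEHOLDER = "<unknown load module>"


def load_module_or_placeholder(path: str, exists=os.path.exists) -> str:
    """Return path if the load module is present, otherwise warn and
    return a placeholder instead of raising a fatal error."""
    if exists(path):
        return path
    warnings.warn(f"load module not found: {path}; using a placeholder")
    return PLACEHOLDER


if __name__ == "__main__":
    for module in ["/no/such/dir/libexample.so"]:
        print(load_module_or_placeholder(module))
```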

hpcviewer

Calculate costs for inlined functions in bottom-up view and flat view as one would if they were actual functions.

Documentation

Updated man pages and manuals.

Streamline user view

Migrated developer-centric functionality out of HPCToolkit's bin directory.

Migrated hpcsummary to libexec/hpctoolkit.
Removed support for creating DOT files from hpcstruct. Created a separate executable for developer use in libexec/hpctoolkit.
Updated hpcproftt to remove stale command-line options. Migrated hpcproftt to libexec/hpctoolkit.

Bug Fixes

Updated build system to automake 1.5.1 to handle newer Linux software stacks.
Fixed latex2man script for perl 5.26.1.
Fixed configuration to skip kernel sampling and disable support for BLOCKTIME for older Linux kernels.
Fixed bugs related to handling Linux perf_events at runtime.
Fixed race conditions that arise where samples arrive after shutting down a sample source or when monitoring ends while processing a sample.
Corrected handling in HPCToolkit's measurement infrastructure for dlclose, which is frequently used by OpenMPI.
Corrected support for libunwind to properly terminate unwinds on ARM when compilers put DWARF FDEs in .debug_frame rather than .eh_frame segments.
Adjusted unwinder support for Power architectures to avoid libunwind.
Adjusted support for integrating libunwind and binary analysis on x86_64 architectures.
When measuring an execution, if hpcfnbounds quits, wait for it to finish to avoid zombies.
Corrected hpcprof-mpi to handle the corner case where an MPI rank is assigned no profiles to analyze.
Added comprehensive error handling in hpcprof-mpi when writing files, especially to handle disk full or quota exceeded errors.
Fixed selection of an alternate output directory in hpcprof-mpi.
Report an error if hpcstruct is run on anything other than an ELF binary.
Corrected handling for pseudo-roots such as in hpcviewer's flat view.

11 Oct 13:57

laksono

release-2017.10

c779ab0

Release 2017.10

Principal Technical Improvements

Support for Linux perf events.

Linux perf events provides a powerful interface that supports measurement of both application execution and kernel activity. Using perf events, one can measure both hardware and software events. Using a processor’s hardware performance monitoring unit (PMU), the perf events interface can measure an execution using any hardware counter supported by the PMU.

Frequency-based sampling.

Rather than picking a sample period for a hardware counter, the Linux perf events interface enables one to specify the desired sampling frequency and have the kernel automatically select and adjust the period to try to achieve the desired sampling frequency.
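
To make the relationship between frequency and period concrete, here is a simplified sketch of how a period could be chosen from an observed event rate. It is not the kernel's actual adjustment algorithm; the function next_period and its parameters are assumptions for this example.

```python
def next_period(events_in_interval: int, interval_s: float, target_hz: float) -> int:
    """Pick a sample period (events per sample) so that the observed event
    rate would yield roughly target_hz samples per second."""
    event_rate = events_in_interval / interval_s  # events per second
    # A period is at least 1 event, even if no events were observed.
    return max(1, round(event_rate / target_hz))


if __name__ == "__main__":
    # 2,000,000 events observed in 1 second, 100 samples/second requested.
    print(next_period(2_000_000, 1.0, 100))
```

Repeating this calculation as the event rate changes is what lets the period track program phases without the user picking a fixed value.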

Multiplexing.

Using multiplexing enables one to monitor more events in a single execution than the number of hardware counters a processor can support for each thread. The number of events that can be monitored in a single execution is only limited by the maximum number of concurrent events that the kernel will allow a user to multiplex using the perf events interface.
When more events are specified than can be monitored simultaneously using a thread’s hardware counters, the kernel will employ multiplexing and divide the set of events to be monitored into groups, monitor only one group of events at a time, and cycle repeatedly through the groups as a program executes.
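
The grouping described above, and the usual way a multiplexed count is extrapolated to the whole run, can be sketched as follows. The grouping function is a simplification rather than the kernel's scheduling policy, and the event names are placeholders. The scaling formula (raw count multiplied by time enabled over time running) is the standard estimate for counters that were active only part of the time.

```python
def make_groups(events, num_counters):
    """Split the event list into groups of at most num_counters events,
    so that one group fits on the hardware counters at a time."""
    return [events[i:i + num_counters] for i in range(0, len(events), num_counters)]


def scale_count(raw_count: int, time_enabled: int, time_running: int) -> float:
    """Estimate the full-run count for an event that was only counted
    during time_running out of time_enabled."""
    if time_running == 0:
        return 0.0  # the event was never scheduled, so there is no estimate
    return raw_count * time_enabled / time_running


if __name__ == "__main__":
    groups = make_groups(["ev_a", "ev_b", "ev_c", "ev_d", "ev_e"], num_counters=2)
    print(groups)
    # An event counted 1000 times while running for 1/3 of the enabled time.
    print(scale_count(1000, time_enabled=300, time_running=100))
```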

Kernel sampling

Collects calling contexts within the Linux kernel using perf_events, extending user-level program contexts with kernel calling contexts. Interpreting kernel call chains requires /proc/sys/kernel/kptr_restrict to be 0 and /proc/sys/kernel/perf_event_paranoid to be 1 or 0.
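
A quick way to check the two settings named above before measuring is sketched below. The helper names read_int and settings_allow_kernel_chains are hypothetical, and the accepted values simply mirror this note (kptr_restrict of 0, perf_event_paranoid of 0 or 1).

```python
from pathlib import Path


def read_int(path: str):
    """Return the integer stored in a /proc file, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def settings_allow_kernel_chains(kptr_restrict, perf_event_paranoid) -> bool:
    """Apply the requirement stated in this release note."""
    return kptr_restrict == 0 and perf_event_paranoid in (0, 1)


if __name__ == "__main__":
    kptr = read_int("/proc/sys/kernel/kptr_restrict")
    paranoid = read_int("/proc/sys/kernel/perf_event_paranoid")
    print("kptr_restrict =", kptr, "perf_event_paranoid =", paranoid)
    print("kernel call chains usable:", settings_allow_kernel_chains(kptr, paranoid))
```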

Thread blocking.

When a program executes, a thread may block waiting for the kernel to complete some operation on its behalf. Example operations include waiting for a read operation to complete or having the kernel service a page fault or zero-fill a page.
On systems running Linux 4.3 or newer, one can use the perf events sample source to monitor how much time a thread is blocked and where the blocking occurs.
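
Whether a system meets the Linux 4.3 requirement can be checked from the kernel release string. The sketch below assumes release strings that begin with "major.minor"; the helper kernel_at_least is a hypothetical name for this example.

```python
import platform
import re


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Return True if a release string such as '5.15.0-91-generic'
    is at least major.minor; unparsable strings return False."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        return False
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


if __name__ == "__main__":
    release = platform.release()
    print(release, "supports blocking-time measurement:",
          kernel_at_least(release, 4, 3))
```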

Improvements to call stack unwinding

Members of the project team fixed bugs identified by our testing of libunwind in the context of HPCToolkit's measurement infrastructure and helped refine libunwind to enable an external tool, e.g., HPCToolkit's hpcrun, to cache libunwind recipes for a procedure to avoid the need to recompute them on demand later.

hpctoolkit-externals includes a snapshot of libunwind as of 2 October 2017.

Improved binary analysis

This release of HPCToolkit benefits from refinements to Dyninst that improve hpcstruct's ability to reconstruct control flow graphs for procedures in the presence of jump tables.

hpctoolkit-externals includes Dyninst 9.3.2 supplemented with patches that include important but unreleased improvements.

01 Jul 04:13

mwkrentel

release-2017.06

1d908a3

Release 2017.06

Technical Improvements

Updated the ompt branch to provide better scalability to large thread counts as found on KNL and Power8. This branch, together with the LLVM OpenMP runtime library, provides the OMP_IDLE metric to unify the presentation of worker and main threads in OpenMP regions.
Updated Dyninst to version 9.3.2 in hpctoolkit-externals, plus a patch for better binary analysis of functions that use jump tables.
Updated the use of atomic operations in hpcrun with C11 atomics.
Updated hpcstruct to handle a new ABI on Power/LE architectures with both internal and external interfaces for functions.

Bug Fixes

Improved analysis for call stack unwinding on x86-64, including a bug fix to track stack frame allocation and deallocation using the load effective address (LEA) instruction and an enhancement that improves call stack unwinding for procedures that realign their stack pointer upon function entry.
Fixed bug in hpcrun to correct data reinitialization after fork(). This bug prevented using hpcrun to profile programs launched with shell scripts.
Fixed bug in hpcstruct in getRealPath() that caused hpcstruct to sometimes report incorrect file names.

Known Problems

Some types of applications on x86-64 architectures generate a significant number of 'partial unwinds,' making it harder to use the top-down view in hpcviewer. A partial workaround is to use the bottom-up and flat views in hpcviewer.

02 Jan 05:38

jmellorcrummey

release-2016.12

7d2e5bb

Release 2016.12

Notable Platform Changes

Added support for Intel Knights Landing (KNL), treated as an x86-64 flavor.
Added preliminary measurement, analysis, attribution, and GUI support for Power8/LE.
Added preliminary measurement, analysis, and attribution support for ARM64.