Merge branch 'main' of github.com:lanl/benchmarks
Galen M. Shipman authored and Galen M. Shipman committed Oct 6, 2023
2 parents 4b90e24 + 43e3cc0 commit d69e6fb
Showing 24 changed files with 3,176 additions and 46 deletions.
14 changes: 9 additions & 5 deletions doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst
Adjustments to ``GOMP_CPU_AFFINITY`` may be necessary.

The ``STREAM_ARRAY_SIZE`` value is a critical parameter set at compile time and controls the size of the array used to measure bandwidth. STREAM requires different amounts of memory to run on different systems, depending on both the system cache size(s) and the granularity of the system timer.

You should adjust the value of ``STREAM_ARRAY_SIZE`` to meet ALL of the following criteria:

1. Each array must be at least 4 times the size of the available cache memory. In practice, the minimum array size is about 3.8 times the cache size.

   a. Example 1: For one Xeon E3 with an 8 MB L3 cache, ``STREAM_ARRAY_SIZE`` should be ``>= 4 million``, giving an array size of 30.5 MB and a total memory requirement of 91.5 MB.
   b. Example 2: For two Xeon E5s with a 20 MB L3 cache each (using OpenMP), ``STREAM_ARRAY_SIZE`` should be ``>= 20 million``, giving an array size of 153 MB and a total memory requirement of 458 MB.

2. The size should be large enough that the 'timing calibration' output by the program is at least 20 clock ticks. For example, most versions of Windows have a 10 millisecond timer granularity; 20 ticks at 10 ms/tick is 200 milliseconds. If the chip is capable of 10 GB/s, it moves 2 GB in 200 ms, so each array must be at least 1 GB, or 128M elements.
3. The value ``24 x STREAM_ARRAY_SIZE x RANKS_PER_NODE`` must be less than the amount of RAM on a node. STREAM allocates 3 arrays of doubles per rank, so each rank needs 3 x 8 = 24 bytes per array element. For example, 40 million elements per rank at 16 ranks per node requires about 15.4 GB.

Set ``STREAM_ARRAY_SIZE`` using the -D flag on your compile line.
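
For example (a sketch; the compiler and flags are placeholders that will vary by system, and 40 million elements corresponds to roughly 0.9 GB across the three arrays):

.. code-block:: bash

   cc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=40000000 -DNTIMES=20 stream.c -o stream_c.exe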

The formula for ``STREAM_ARRAY_SIZE`` is::

    ARRAY_SIZE ~= 4 x (last_level_cache_size x num_sockets) / size_of_double = last_level_cache_size

This reduces to a number of elements equal to the size of the last level cache of a single socket in bytes, assuming a node has two sockets.
This is the minimum size unless other system attributes constrain it.
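
As a worked example, assume a hypothetical two-socket node whose processors each have a 40 MB last-level cache::

    ARRAY_SIZE ~= 4 x (40,000,000 x 2) / 8 = 40,000,000 elements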

The array size only affects whether STREAM can fully load the memory bus.
Once the bus is fully loaded, the measured values should reach a steady state in which further increases to ``STREAM_ARRAY_SIZE`` do not change the measurement for a given number of processors.
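
One way to confirm the steady state (a sketch; the compiler, flags, and sizes are placeholders):

.. code-block:: bash

   # Rebuild with increasing array sizes; the Triad rate should level off
   for SIZE in 20000000 40000000 80000000; do
       cc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=${SIZE} stream.c -o stream_${SIZE}.exe
       ./stream_${SIZE}.exe | grep Triad
   done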

Running
=======

Crossroads
----------
These results were obtained using the cce v15.0.1 compiler and cray-mpich v8.1.25.
Results using the intel-oneapi and intel-classic v2023.1.0 compilers with the same cray-mpich were also collected; cce performed the best.

``STREAM_ARRAY_SIZE=40 NTIMES=20``

.. csv-table:: STREAM microbenchmark bandwidth measurement
   :file: stream-xrds_ats5cce-cray-mpich.csv
63 changes: 54 additions & 9 deletions doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst
MDTEST
======

Purpose
=======

The intent of this benchmark is to measure the performance of file metadata operations on the platform storage.
MDtest is an MPI-based application for evaluating the metadata performance of a file system.
It can be run on any POSIX-compliant file system but is designed to test the performance of parallel file systems.

Characteristics
===============

Problem
-------

MDtest measures the performance of various metadata operations using MPI to coordinate execution and collect the results.
In this case, the operations in question are file creation, stat, and removal.

Run Rules
---------

Observed benchmark performance shall be obtained from a storage system configured as closely as possible to the proposed platform storage.
If the proposed solution includes multiple file access protocols (e.g., pNFS and NFS) or multiple tiers accessible by applications, benchmark results for mdtest shall be provided for each protocol and/or tier.

Performance projections are permissible if they are derived from a similar system that is considered an earlier generation of the proposed system.

Modifications to the benchmark application code are only permissible to enable correct compilation and execution on the target platform.
Any modifications must be fully documented (e.g., as a diff or patch file) and reported with the benchmark results.

Figure of Merit
---------------

Building
========
After extracting the tar file, ensure that the MPI module is loaded and that the relevant compiler wrappers are available.

Running
=======

The results for the three operations (create, stat, and remove) should be obtained for three different file configurations:

1) ``2^20`` files in a single directory.
2) ``2^20`` files in separate directories, 1 per MPI process.
3) 1 file accessed by multiple MPI processes.

These configurations are launched as follows.

.. code-block:: bash

   # Shared Directory
   srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16
   # Unique Directories
   srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16 -u
   # One File Multi-Proc
   srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16 -S

The following command-line flags MUST be changed:

* ``-n`` - the number of files **each MPI process** should manipulate. For a test run with 64 MPI processes, specifying ``-n 16384`` will produce the required ``2^20`` files (``2^6`` MPI processes x ``2^14`` files each). This parameter must be changed for each level of concurrency; see the sketch after this list.
* ``-d /scratch`` - the **absolute path** to the directory in which this test should be run.
* ``-N`` - MPI rank offset for each separate phase of the test. This parameter must be equal to the number of MPI processes per node in use (e.g., ``-N 16`` for a test with 16 processes per node) to ensure that each test phase (read, stat, and delete) is performed on a different node.
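
For example, the per-process file count for a given concurrency can be computed as follows (a sketch; the 64-process values mirror the shared-directory example above):

.. code-block:: bash

   NPROCS=64                        # total MPI processes for this run
   NFILES=$(( 1048576 / NPROCS ))   # 2^20 files total -> 16384 per process
   srun -n ${NPROCS} ./mdtest -F -C -T -r -n ${NFILES} -d /scratch/$USER -N 16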

The following command-line flags MUST NOT be changed or omitted:

* ``-F`` - only operate on files, not directories
* ``-C`` - perform file creation test
* ``-T`` - perform file stat test
* ``-r`` - perform file remove test

Example Results
===============

These nine tests (three operations across three file configurations) should be performed under four different launch conditions, for a total of 36 results:

1) A single MPI process.
2) The optimal number of MPI processes on a single compute node.
3) The minimal number of MPI processes on multiple compute nodes that achieves the peak result for the proposed system.
4) The maximum possible MPI-level concurrency on the proposed system. This could mean:

   1) using one MPI process per CPU core across the entire system;
   2) using the maximum number of MPI processes possible, if one MPI process per core is not possible on the proposed architecture; or
   3) using more than ``2^20`` files, if the system is capable of launching more than ``2^20`` MPI processes.

Crossroads
----------

.. csv-table:: MDTEST Microbenchmark Crossroads
   :file: ats3_mdtest.csv
   :align: center
   :widths: 10, 10, 10, 10, 10
3 changes: 1 addition & 2 deletions microbenchmarks/mdtest/README.XROADS.md

The Offeror shall run the following tests:

* creating, statting, and removing at least 1,048,576 files in a single directory.
* creating, statting, and removing at least 1,048,576 files in separate directories (one directory per MPI process).
* creating, statting, and removing one file by multiple MPI processes.
30 changes: 30 additions & 0 deletions microbenchmarks/stream/Makefile
#
# Makefile for the STREAM memory bandwidth benchmarks.
# The PAV_* variables come from the Pavilion test harness.
MPICC ?= $(PAV_MPICC)
CC ?= $(PAV_CC)
CFLAGS ?= $(PAV_CFLAGS)

FF ?= $(PAV_FC)
FFLAGS ?= $(PAV_FFLAGS)

# Fall back to the serial C compiler if no MPI compiler wrapper is set.
ifeq ($(MPICC),)
MPICC = $(CC)
endif

all: xrds-stream.exe stream_f.exe stream_c.exe stream_mpi.exe

# The Fortran version links against the C timer routine mysecond().
stream_f.exe: stream.f mysecond.c
	$(CC) $(CFLAGS) -c mysecond.c
	$(FF) $(FFLAGS) -c stream.f
	$(FF) $(FFLAGS) stream.o mysecond.o -o stream_f.exe

stream_c.exe: stream.c
	$(CC) $(CFLAGS) stream.c -o stream_c.exe

stream_mpi.exe: stream_mpi.c
	$(MPICC) $(CFLAGS) stream_mpi.c -o stream_mpi.exe

xrds-stream.exe: xrds-stream.c
	$(CC) $(CFLAGS) xrds-stream.c -o xrds-stream.exe

clean:
	rm -f *stream*.exe *.o

.PHONY: all clean
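
A usage sketch (the compiler names are placeholders; under the Pavilion harness the PAV_* variables are intended to supply the defaults):

    $ make CC=cc FF=ftn MPICC=mpicc CFLAGS=-O3 FFLAGS=-O3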
74 changes: 74 additions & 0 deletions microbenchmarks/stream/README.ACES
========================================================================================
Crossroads Memory Bandwidth Benchmark
========================================================================================
Benchmark Version: 1.0.0

========================================================================================
Benchmark Description:
========================================================================================
The Crossroads Memory Bandwidth benchmark is a modified version of the STREAM
benchmark originally written by John D. McCalpin. The modifications have been made to
simplify the code. All memory bandwidth projections/results provided in the vendor
response should be measured using the kernels in this benchmark implementation and
not the original STREAM code.

========================================================================================
Permitted Modifications:
========================================================================================
Offerors are permitted to modify the benchmark in the following ways:

OpenMP Pragmas - the Offeror may modify the OpenMP pragmas in the benchmark as required
to permit execution on the proposed system provided: (1) all modified sources and build
scripts are included in the RFP response; (2) any modified code used for the response
must continue to be a valid OpenMP program (compliant with the standard proposed in
the Offeror's response).

Memory Allocation Routines - memory allocation routines, including modified allocations
that specify the level of the memory hierarchy or the placement of the data, are
permitted provided: (1) all modified sources and build scripts are included in the RFP
response; (2) any specific libraries used to provide allocation services are included
in the proposed system.

Array/Allocation Sizes - the sizes of the allocated arrays may be modified to exercise
the appropriate size and level of the memory hierarchy provided the benchmark correctly
exercises the memory system being targeted.

Index Type - the range of the index type is configured for a 32-bit signed integer ("int")
via the preprocessor define STREAM_INDEX_TYPE. If very large memories are benchmarked,
the Offeror is permitted to change to a larger integer type. The Offeror should indicate
in their response that this modification has been made.
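
For example, a sketch of a compile line making this change (the define name comes from
the text above; the compiler and flags are placeholders):

    $ cc -O3 "-DSTREAM_INDEX_TYPE=long long" xrds-stream.c -o xrds-stream.exe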

Accumulation Error Checking Type - the basic accumulation error type is configured for a
64-bit double precision value via the STREAM_CHECK_TYPE preprocessor define. This may be
modified to a higher precision type ("long double") if a large memory (requiring a 64-bit
integer for STREAM_INDEX_TYPE) is used. The Offeror should indicate in their response
that this modification has been made.

========================================================================================
Run Rules:
========================================================================================
The Offeror may utilize any number of threads, affinity and memory binding options for
execution of the benchmark, provided that details of all command line parameters,
environment variables and binding tools are included in the response.

The Offeror is expected to provide memory bandwidth projections/benchmarked results
using the Crossroads memory bandwidth benchmark for each level of the memory hierarchy
accessible from each compute resource.

========================================================================================
How to Compile, Run and Verify:
========================================================================================
To build, modify the file Makefile for your compiler and type make. To run, execute
the file xroads-stream; it performs self-verification.

$ make
<lots of make output>
$ export OMP_NUM_THREADS=12
$ ./xroads-stream
<xroads-stream output>

========================================================================================
How to Report:
========================================================================================
The primary FOM is the Triad rate (MB/s). Report all data printed to stdout.
