Fixed STREAM docs, previous advice was wrong. MDTEST doc finished except results.
dmageeLANL committed Oct 6, 2023
1 parent dfd13ac commit 1ec48e6
Showing 8 changed files with 78 additions and 30 deletions.
14 changes: 9 additions & 5 deletions doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst
@@ -73,12 +73,13 @@ Adjustments to ``GOMP_CPU_AFFINITY`` may be necessary.

The ``STREAM_ARRAY_SIZE`` value is a critical parameter set at compile time and controls the size of the array used to measure bandwidth. STREAM requires different amounts of memory to run on different systems, depending on both the system cache size(s) and the granularity of the system timer.

-You should adjust the value of ``STREAM_ARRAY_SIZE`` to meet BOTH of the following criteria:
+You should adjust the value of ``STREAM_ARRAY_SIZE`` to meet ALL of the following criteria:

1. Each array must be at least 4 times the size of the available cache memory. In practice the minimum array size is about 3.8 times the cache size.

   1. Example 1: One Xeon E3 with 8 MB L3 cache: ``STREAM_ARRAY_SIZE`` should be ``>= 4 million``, giving an array size of 30.5 MB and a total memory requirement of 91.5 MB.
   2. Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP): ``STREAM_ARRAY_SIZE`` should be ``>= 20 million``, giving an array size of 153 MB and a total memory requirement of 458 MB.

2. The size should be large enough so that the 'timing calibration' output by the program is at least 20 clock-ticks. For example, most versions of Windows have a 10 millisecond timer granularity. 20 "ticks" at 10 ms/tick is 200 milliseconds. If the chip is capable of 10 GB/s, it moves 2 GB in 200 ms. This means each array must be at least 1 GB, or 128M elements.
+3. The value ``24 x STREAM_ARRAY_SIZE x RANKS_PER_NODE`` must be less than the amount of RAM on a node. STREAM creates 3 arrays of doubles per rank, so each element accounts for 3 arrays x 8 bytes = 24 bytes of memory per rank.

Set ``STREAM_ARRAY_SIZE`` using the -D flag on your compile line.
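
For instance, a minimal sketch of a compile line, assuming GCC and the reference ``stream.c`` source (the array size and iteration count shown are illustrative):

.. code-block:: bash

   # 40M elements per array (~305 MiB of doubles each), 20 repetitions per kernel
   gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=40000000 -DNTIMES=20 stream.c -o stream.exe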

@@ -88,8 +89,11 @@ The formula for ``STREAM_ARRAY_SIZE`` is:

ARRAY_SIZE ~= 4 x (last_level_cache_size x num_sockets) / size_of_double = last_level_cache_size

-This reduces to the same number of elements as bytes in the last level cache of a single processor for two socket nodes.
-This is the minimum size.
+This reduces to a number of elements equal to the size of the last level cache of a single socket in bytes, assuming a node has two sockets.
+This is the minimum size unless other system attributes constrain it.

+The array size influences only whether STREAM can fully load the memory bus.
+Once the bus is fully loaded, the measured values should reach a steady state in which further increases to ``STREAM_ARRAY_SIZE`` do not change the measurement for a given number of processors.
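
As a worked illustration of these constraints, the sketch below assumes Crossroads-like values (105 MiB of last level cache per socket, two sockets, 112 ranks per node, 128 GiB of node RAM); all of these numbers are assumptions for illustration:

.. code-block:: bash

   LLC_BYTES=$((105 * 1024 * 1024))          # last level cache per socket (assumed)
   SOCKETS=2
   RANKS_PER_NODE=112                        # assumed
   RAM_BYTES=$((128 * 1024 * 1024 * 1024))   # assumed node RAM

   # minimum elements per array: 4 x (cache x sockets) / 8 bytes per double
   MIN_ELEMENTS=$((4 * LLC_BYTES * SOCKETS / 8))
   echo "minimum STREAM_ARRAY_SIZE: ${MIN_ELEMENTS}"

   # criterion 3: 24 x STREAM_ARRAY_SIZE x RANKS_PER_NODE must fit in node RAM
   NEEDED=$((24 * MIN_ELEMENTS * RANKS_PER_NODE))
   if [ "${NEEDED}" -lt "${RAM_BYTES}" ]; then
       echo "fits in node RAM"
   else
       echo "exceeds node RAM: reduce STREAM_ARRAY_SIZE or ranks per node"
   fi

With these assumed values the RAM constraint dominates at full-node rank counts, which is consistent with the smaller ``STREAM_ARRAY_SIZE`` used for the Crossroads runs below.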

Running
=======
@@ -117,7 +121,7 @@ Crossroads
These results were obtained using the cce v15.0.1 compiler and cray-mpich v8.1.25.
Results using the intel-oneapi and intel-classic v2023.1.0 and the same cray-mpich were also collected; cce performed the best.

-``STREAM_ARRAY_SIZE=105 NTIMES=20``
+``STREAM_ARRAY_SIZE=40 NTIMES=20``

.. csv-table:: STREAM microbenchmark bandwidth measurement
:file: stream-xrds_ats5cce-cray-mpich.csv
63 changes: 54 additions & 9 deletions doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst
@@ -5,6 +5,7 @@ MDTEST
Purpose
=======

+The intent of this benchmark is to measure the performance of file metadata operations on the platform storage.
MDtest is an MPI-based application for evaluating the metadata performance of a file system.
It can be run on any POSIX-compliant file system but is designed to test the performance of parallel file systems.

@@ -16,11 +17,19 @@ Characteristics
Problem
-------

+MDtest measures the performance of various metadata operations, using MPI to coordinate execution and to collect the results.
+In this case, the operations in question are file creation, stat, and removal.

Run Rules
---------

Figure of Merit
---------------
+Observed benchmark performance shall be obtained from a storage system configured as closely as possible to the proposed platform storage.
+If the proposed solution includes multiple file access protocols (e.g., pNFS and NFS) or multiple tiers accessible by applications, benchmark results for mdtest shall be provided for each protocol and/or tier.

+Performance projections are permissible if they are derived from a similar system that is considered an earlier generation of the proposed system.

+Modifications to the benchmark application code are only permissible to enable correct compilation and execution on the target platform.
+Any modifications must be fully documented (e.g., as a diff or patch file) and reported with the benchmark results.

Building
========
@@ -35,17 +44,53 @@
Running
=======

-.. .. csv-table:: MDTEST Microbenchmark
-.. :file: ats3_mdtest_sow.csv
-.. :align: center
-.. :widths: 10, 10, 10, 10, 10
-.. :header-rows: 1
-.. :stub-columns: 2
+The results for the three operations (create, stat, remove) should be obtained for three different file configurations:

+1) ``2^20`` files in a single directory.
+2) ``2^20`` files in separate directories, 1 per MPI process.
+3) 1 file accessed by multiple MPI processes.

+These configurations are launched as follows.

+.. code-block:: bash
+
+   # Shared Directory
+   srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16
+   # Unique Directories
+   srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16 -u
+   # One File Multi-Proc
+   srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16 -S
+The following command-line flags MUST be changed:

+* ``-n`` - the number of files **each MPI process** should manipulate. For a test run with 64 MPI processes, specifying ``-n 16384`` will produce the required ``2^20`` files (``2^6`` MPI processes x ``2^14`` files each). This parameter must be changed for each level of concurrency, as sketched after these lists.
+* ``-d /scratch`` - the **absolute path** to the directory in which this test should be run.
+* ``-N`` - MPI rank offset for each separate phase of the test. This parameter must be equal to the number of MPI processes per node in use (e.g., ``-N 16`` for a test with 16 processes per node) to ensure that each test phase (create, stat, and remove) is performed on a different node.

+The following command-line flags MUST NOT be changed or omitted:

+* ``-F`` - only operate on files, not directories
+* ``-C`` - perform file creation test
+* ``-T`` - perform file stat test
+* ``-r`` - perform file remove test
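
A minimal sketch of scaling ``-n`` with the process count (the 64-process run and 16-per-node layout are assumptions for illustration):

.. code-block:: bash

   # keep the total file count fixed at 2^20 while varying concurrency
   NPROCS=64                          # assumed total MPI processes
   NFILES=$(( (1 << 20) / NPROCS ))   # 16384 files per process
   srun -n ${NPROCS} ./mdtest -F -C -T -r -n ${NFILES} -d /scratch/$USER -N 16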

Example Results
===============

-.. csv-table:: MDTEST Microbenchmark Xrds
+These nine tests (three operations across each of the three file configurations) should be performed under 4 different launch conditions, for a total of 36 results (see the launch sketch after this list):

+1) A single MPI process.
+2) The optimal number of MPI processes on a single compute node.
+3) The minimal number of MPI processes on multiple compute nodes that achieves the peak results for the proposed system.
+4) The maximum possible MPI-level concurrency on the proposed system. This could mean:

+   1) Using one MPI process per CPU core across the entire system.
+   2) Using the maximum number of MPI processes possible if one MPI process per core will not be possible on the proposed architecture.
+   3) Using more than ``2^20`` files if the system is capable of launching more than ``2^20`` MPI processes.
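
A minimal sketch of these launch scales (node and rank counts are assumptions for illustration, not requirements):

.. code-block:: bash

   # 1) a single MPI process creates, stats, and removes all 2^20 files
   srun -N 1 -n 1 ./mdtest -F -C -T -r -n 1048576 -d /scratch/$USER
   # 2) one full node, assuming 64 ranks per node: 2^20 / 64 = 16384 files per rank
   srun -N 1 -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 64
   # 3, 4) multiple nodes: scale -n down as ranks grow, keeping ranks x files = 2^20
   srun -N 16 -n 1024 ./mdtest -F -C -T -r -n 1024 -d /scratch/$USER -N 64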

Crossroads
----------

+.. csv-table:: MDTEST Microbenchmark Crossroads
:file: ats3_mdtest.csv
:align: center
:widths: 10, 10, 10, 10, 10
3 changes: 1 addition & 2 deletions microbenchmarks/mdtest/README.XROADS.md
@@ -57,8 +57,7 @@ node memory.

The Offeror shall run the following tests:

-* creating, statting, and removing at least 1,048,576 files in a single
-  directory
+* creating, statting, and removing at least 1,048,576 files in a single directory.
* creating, statting, and removing at least 1,048,576 files in separate
directories (one directory per MPI process)
* creating, statting, and removing one file by multiple MPI processes
2 changes: 1 addition & 1 deletion sparta
Submodule sparta updated 169 files
2 changes: 1 addition & 1 deletion trilinos
Submodule trilinos updated 1557 files
20 changes: 10 additions & 10 deletions utils/pav_config/tests/stream.yaml
@@ -75,23 +75,23 @@ _base:
STREAM ARRAY SIZE CALCULATIONS:
###############
STREAM
-XRDS DOCUMENTATION: 4 x (45 MiB cache / processor) x (2 processors) / (3 arrays) / (8 bytes / element) = 15 Mi elements = 15000000
+FORMULA:
+4 x ((cache / socket) x (num sockets)) / (num arrays) / 8 (size of double) = number of elements
*****************************************************************************************************
HASWELL: Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
CACHE: 40M
SOCKETS: 2
-4 * ( 40M * 2 ) / 3 ARRAYS / 8 Bytes/element = 13.4 Mi elements = 13400000
+4 * ( 40M * 2 ) / 3 ARRAYS / 8 = 13.4 Mi elements = 13.4e6
*****************************************************************************************************
BROADWELL: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
CACHE: 45M
SOCKETS: 2
-4 * ( 45M * 2 ) / 3 ARRAYS / 8 BYTES/ELEMENT = 15.0 Mi elements = 15000000
+4 * ( 45M * 2 ) / 3 ARRAYS / 8 = 15.0 Mi elements = 15e6
*****************************************************************************************************
SAPPHIRE RAPIDS: Intel(R) Xeon(R) Platinum 8480+
-CACHE: 105
+CACHE: 105M
SOCKETS: 2
-4 x (105M * 2 ) / 3 ARRAYS / 8 BYTES/ELEMENT = 35 Mi elements = 35000000
+4 x ( 105M * 2 ) / 3 ARRAYS / 8 = 35 Mi elements = 35e6
scheduler: slurm
schedule:
@@ -295,7 +295,7 @@ spr_ddr5_xrds:
"{{sys_name}}": [ darwin ]
variables:
arch: "spr"
-stream_array_size: '105'
+stream_array_size: '40'
target: "xrds-stream.exe"
omp_num_threads: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
omp_places: [cores, sockets]
@@ -324,7 +324,7 @@ spr_hbm_xrds:
"{{sys_name}}": [ darwin ]
variables:
arch: "spr"
-stream_array_size: '105'
+stream_array_size: '40'
target: "xrds-stream.exe"
omp_num_threads: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
omp_places: [cores, sockets]
@@ -400,8 +400,8 @@ xrds_ats5:
tpn: [8, 32, 56, 88, 112]
arch: "spr"
target: "xrds-stream.exe"
-stream_array_size: '105'
-ntimes: 20
+stream_array_size: '40'
+#ntimes: 20
#omp_places: [cores, sockets]
#omp_proc_bind: [true]
numnodes: '1'
2 changes: 1 addition & 1 deletion utils/pavilion
Submodule pavilion updated 81 files
+30 −0 .github/workflows/demo.yml
+24 −1 .github/workflows/unittests.yml
+2 −0 .gitignore
+3 −0 .gitmodules
+24 −0 INSTALLING.md
+8 −4 RELEASE.txt
+14 −2 bin/pav
+57 −61 bin/setup_pav_deps
+26 −2 docs/advanced.rst
+1 −1 docs/basics.rst
+2 −19 docs/install.rst
+1 −1 docs/plugins/basics.rst
+3 −3 docs/plugins/sys_vars.rst
+12 −0 examples/README.md
+3 −1 examples/demo/.gitignore
+88 −4 examples/demo/README.md
+1 −0 examples/demo/demo_github_workflow.yml
+45 −0 examples/demo/hosts/demo_host.yaml
+19 −0 examples/demo/modes/sample_10_perc.yaml
+7 −0 examples/demo/os/README.md
+5 −3 examples/demo/pavilion.yaml
+7 −0 examples/demo/plugins/README.md
+20 −0 examples/demo/plugins/sys_name.py
+3 −0 examples/demo/plugins/sys_name.yapsy-plugin
+30 −0 examples/demo/series/all_tests.yaml
+6 −0 examples/demo/test_src/README.md
+13 −0 examples/demo/test_src/built_example/buildit.yaml
+7 −0 examples/demo/test_src/built_example/hello_world.c
+88 −0 examples/demo/tests/advanced.yaml
+1 −0 examples/demo/tests/buildit.yaml
+1 −1 examples/demo/tests/demo.yaml
+1 −0 lib/hostlist.py
+1 −2 lib/pavilion/arguments.py
+3 −3 lib/pavilion/builder.py
+2 −2 lib/pavilion/commands/_run.py
+1 −1 lib/pavilion/commands/build.py
+3 −3 lib/pavilion/commands/config.py
+2 −2 lib/pavilion/commands/graph.py
+2 −2 lib/pavilion/commands/list_cmd.py
+43 −0 lib/pavilion/commands/log.py
+1 −1 lib/pavilion/commands/ls.py
+16 −9 lib/pavilion/commands/result.py
+2 −2 lib/pavilion/commands/run.py
+3 −3 lib/pavilion/commands/show.py
+1 −1 lib/pavilion/commands/view.py
+8 −5 lib/pavilion/config.py
+17 −8 lib/pavilion/errors.py
+2 −2 lib/pavilion/expression_functions/base.py
+16 −0 lib/pavilion/expression_functions/core.py
+2 −2 lib/pavilion/filters.py
+34 −17 lib/pavilion/parsers/expressions.py
+1 −1 lib/pavilion/resolver/proto_test.py
+4 −4 lib/pavilion/resolver/request.py
+3 −2 lib/pavilion/resolver/resolver.py
+0 −1 lib/pavilion/result/evaluations.py
+1 −1 lib/pavilion/result/parse.py
+5 −5 lib/pavilion/result_parsers/base_classes.py
+1 −1 lib/pavilion/result_parsers/filecheck.py
+7 −7 lib/pavilion/result_parsers/json.py
+4 −4 lib/pavilion/schedulers/advanced.py
+34 −72 lib/pavilion/schedulers/config.py
+26 −6 lib/pavilion/schedulers/plugins/flux.py
+9 −84 lib/pavilion/schedulers/plugins/slurm.py
+4 −4 lib/pavilion/schedulers/scheduler.py
+4 −2 lib/pavilion/series/series.py
+2 −2 lib/pavilion/series/test_set.py
+1 −1 lib/pavilion/sys_vars/sys_name.py
+10 −10 lib/pavilion/test_config/file_format.py
+1 −1 lib/pavilion/test_run/test_run.py
+8 −8 lib/pavilion/variables.py
+0 −1 lib/pavilion/wget.py
+1 −0 lib/sub_repos/python-hostlist
+3 −3 lib/yaml_config/structures.py
+3 −0 requirements.txt
+1 −0 test/tests/expression_function_tests.py
+28 −7 test/tests/log_cmd_tests.py
+76 −3 test/tests/result_tests.py
+4 −4 test/tests/sched_tests.py
+5 −4 test/tests/slurm_tests.py
+71 −5 test/tests/style_tests.py
+9 −0 test/utils/check_pav_deps.py
