Fixed STREAM docs, previous advice was wrong. MDTEST doc finished except results.
dmageeLANL committed Oct 6, 2023
1 parent dfd13ac commit 1ec48e6
Showing 8 changed files with 78 additions and 30 deletions.
14 changes: 9 additions & 5 deletions doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst
@@ -73,12 +73,13 @@ Adjustments to ``GOMP_CPU_AFFINITY`` may be necessary.

The ``STREAM_ARRAY_SIZE`` value is a critical parameter set at compile time and controls the size of the array used to measure bandwidth. STREAM requires different amounts of memory to run on different systems, depending on both the system cache size(s) and the granularity of the system timer.

-You should adjust the value of ``STREAM_ARRAY_SIZE`` to meet BOTH of the following criteria:
+You should adjust the value of ``STREAM_ARRAY_SIZE`` to meet ALL of the following criteria:

1. Each array must be at least 4 times the size of the available cache memory. In practice the minimum array size is about 3.8 times the cache size.

   1. Example 1: One Xeon E3 with 8 MB L3 cache: ``STREAM_ARRAY_SIZE`` should be ``>= 4 million``, giving an array size of 30.5 MB and a total memory requirement of 91.5 MB.
   2. Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP): ``STREAM_ARRAY_SIZE`` should be ``>= 20 million``, giving an array size of 153 MB and a total memory requirement of 458 MB.

2. The size should be large enough so that the 'timing calibration' output by the program is at least 20 clock-ticks. For example, most versions of Windows have a 10 millisecond timer granularity. 20 "ticks" at 10 ms/tick is 200 milliseconds. If the chip is capable of 10 GB/s, it moves 2 GB in 200 ms. This means each array must be at least 1 GB, or 128M elements.
+3. The value ``24 x STREAM_ARRAY_SIZE x RANKS_PER_NODE`` must be less than the amount of RAM on a node. STREAM creates 3 arrays of doubles per rank, so each element accounts for 3 arrays x 8 bytes = 24 bytes of memory per rank.

Set ``STREAM_ARRAY_SIZE`` using the -D flag on your compile line.
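
For instance, a minimal sketch of a compile line, assuming GCC and the reference ``stream.c`` source (the array size and iteration count shown are illustrative):

.. code-block:: bash

   # 40M elements per array (~305 MiB of doubles each), 20 repetitions per kernel
   gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=40000000 -DNTIMES=20 stream.c -o stream.exe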

@@ -88,8 +89,11 @@ The formula for ``STREAM_ARRAY_SIZE`` is:

ARRAY_SIZE ~= 4 x (last_level_cache_size x num_sockets) / size_of_double = last_level_cache_size

-This reduces to the same number of elements as bytes in the last level cache of a single processor for two socket nodes.
-This is the minimum size.
+This reduces to a number of elements equal to the size of the last level cache of a single socket in bytes, assuming a node has two sockets.
+This is the minimum size unless other system attributes constrain it.

+The array size influences only whether STREAM can fully load the memory bus.
+Once the bus is fully loaded, the measured values should reach a steady state in which further increases to ``STREAM_ARRAY_SIZE`` do not change the measurement for a given number of processors.
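
As a worked illustration of these constraints, the sketch below assumes Crossroads-like values (105 MiB of last level cache per socket, two sockets, 112 ranks per node, 128 GiB of node RAM); all of these numbers are assumptions for illustration:

.. code-block:: bash

   LLC_BYTES=$((105 * 1024 * 1024))          # last level cache per socket (assumed)
   SOCKETS=2
   RANKS_PER_NODE=112                        # assumed
   RAM_BYTES=$((128 * 1024 * 1024 * 1024))   # assumed node RAM

   # minimum elements per array: 4 x (cache x sockets) / 8 bytes per double
   MIN_ELEMENTS=$((4 * LLC_BYTES * SOCKETS / 8))
   echo "minimum STREAM_ARRAY_SIZE: ${MIN_ELEMENTS}"

   # criterion 3: 24 x STREAM_ARRAY_SIZE x RANKS_PER_NODE must fit in node RAM
   NEEDED=$((24 * MIN_ELEMENTS * RANKS_PER_NODE))
   if [ "${NEEDED}" -lt "${RAM_BYTES}" ]; then
       echo "fits in node RAM"
   else
       echo "exceeds node RAM: reduce STREAM_ARRAY_SIZE or ranks per node"
   fi

With these assumed values the RAM constraint dominates at full-node rank counts, which is consistent with the smaller ``STREAM_ARRAY_SIZE`` used for the Crossroads runs below.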

Running
=======
@@ -117,7 +121,7 @@ Crossroads
These results were obtained using the cce v15.0.1 compiler and cray-mpich v8.1.25.
Results using the intel-oneapi and intel-classic v2023.1.0 and the same cray-mpich were also collected; cce performed the best.

-``STREAM_ARRAY_SIZE=105 NTIMES=20``
+``STREAM_ARRAY_SIZE=40 NTIMES=20``

.. csv-table:: STREAM microbenchmark bandwidth measurement
:file: stream-xrds_ats5cce-cray-mpich.csv
63 changes: 54 additions & 9 deletions doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst
@@ -5,6 +5,7 @@ MDTEST
Purpose
=======

+The intent of this benchmark is to measure the performance of file metadata operations on the platform storage.
MDtest is an MPI-based application for evaluating the metadata performance of a file system.
It can be run on any POSIX-compliant file system but is designed to test the performance of parallel file systems.

@@ -16,11 +17,19 @@ Characteristics
Problem
-------

+MDtest measures the performance of various metadata operations, using MPI to coordinate execution and to collect the results.
+In this case, the operations in question are file creation, stat, and removal.

Run Rules
---------

Figure of Merit
---------------
+Observed benchmark performance shall be obtained from a storage system configured as closely as possible to the proposed platform storage.
+If the proposed solution includes multiple file access protocols (e.g., pNFS and NFS) or multiple tiers accessible by applications, benchmark results for mdtest shall be provided for each protocol and/or tier.

+Performance projections are permissible if they are derived from a similar system that is considered an earlier generation of the proposed system.

+Modifications to the benchmark application code are only permissible to enable correct compilation and execution on the target platform.
+Any modifications must be fully documented (e.g., as a diff or patch file) and reported with the benchmark results.

Building
========
@@ -35,17 +44,53 @@
Running
=======

-.. .. csv-table:: MDTEST Microbenchmark
-.. :file: ats3_mdtest_sow.csv
-.. :align: center
-.. :widths: 10, 10, 10, 10, 10
-.. :header-rows: 1
-.. :stub-columns: 2
+The results for the three operations (create, stat, remove) should be obtained for three different file configurations:

+1) ``2^20`` files in a single directory.
+2) ``2^20`` files in separate directories, 1 per MPI process.
+3) 1 file accessed by multiple MPI processes.

+These configurations are launched as follows.

+.. code-block:: bash
+
+   # Shared Directory
+   srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16
+   # Unique Directories
+   srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16 -u
+   # One File Multi-Proc
+   srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16 -S
+The following command-line flags MUST be changed:

+* ``-n`` - the number of files **each MPI process** should manipulate. For a test run with 64 MPI processes, specifying ``-n 16384`` will produce the required ``2^20`` files (``2^6`` MPI processes x ``2^14`` files each). This parameter must be changed for each level of concurrency, as sketched after these lists.
+* ``-d /scratch`` - the **absolute path** to the directory in which this test should be run.
+* ``-N`` - MPI rank offset for each separate phase of the test. This parameter must be equal to the number of MPI processes per node in use (e.g., ``-N 16`` for a test with 16 processes per node) to ensure that each test phase (create, stat, and remove) is performed on a different node.

+The following command-line flags MUST NOT be changed or omitted:

+* ``-F`` - only operate on files, not directories
+* ``-C`` - perform file creation test
+* ``-T`` - perform file stat test
+* ``-r`` - perform file remove test
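
A minimal sketch of scaling ``-n`` with the process count (the 64-process run and 16-per-node layout are assumptions for illustration):

.. code-block:: bash

   # keep the total file count fixed at 2^20 while varying concurrency
   NPROCS=64                          # assumed total MPI processes
   NFILES=$(( (1 << 20) / NPROCS ))   # 16384 files per process
   srun -n ${NPROCS} ./mdtest -F -C -T -r -n ${NFILES} -d /scratch/$USER -N 16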

Example Results
===============

-.. csv-table:: MDTEST Microbenchmark Xrds
+These nine tests (three operations across each of the three file configurations) should be performed under 4 different launch conditions, for a total of 36 results (see the launch sketch after this list):

+1) A single MPI process.
+2) The optimal number of MPI processes on a single compute node.
+3) The minimal number of MPI processes on multiple compute nodes that achieves the peak results for the proposed system.
+4) The maximum possible MPI-level concurrency on the proposed system. This could mean:

+   1) Using one MPI process per CPU core across the entire system.
+   2) Using the maximum number of MPI processes possible if one MPI process per core will not be possible on the proposed architecture.
+   3) Using more than ``2^20`` files if the system is capable of launching more than ``2^20`` MPI processes.
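
A minimal sketch of these launch scales (node and rank counts are assumptions for illustration, not requirements):

.. code-block:: bash

   # 1) a single MPI process creates, stats, and removes all 2^20 files
   srun -N 1 -n 1 ./mdtest -F -C -T -r -n 1048576 -d /scratch/$USER
   # 2) one full node, assuming 64 ranks per node: 2^20 / 64 = 16384 files per rank
   srun -N 1 -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 64
   # 3, 4) multiple nodes: scale -n down as ranks grow, keeping ranks x files = 2^20
   srun -N 16 -n 1024 ./mdtest -F -C -T -r -n 1024 -d /scratch/$USER -N 64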

Crossroads
----------

+.. csv-table:: MDTEST Microbenchmark Crossroads
:file: ats3_mdtest.csv
:align: center
:widths: 10, 10, 10, 10, 10
3 changes: 1 addition & 2 deletions microbenchmarks/mdtest/README.XROADS.md
@@ -57,8 +57,7 @@ node memory.

The Offeror shall run the following tests:

-* creating, statting, and removing at least 1,048,576 files in a single
-  directory
+* creating, statting, and removing at least 1,048,576 files in a single directory.
* creating, statting, and removing at least 1,048,576 files in separate
directories (one directory per MPI process)
* creating, statting, and removing one file by multiple MPI processes
2 changes: 1 addition & 1 deletion sparta
Submodule sparta updated 169 files
2 changes: 1 addition & 1 deletion trilinos
Submodule trilinos updated 1557 files
20 changes: 10 additions & 10 deletions utils/pav_config/tests/stream.yaml
@@ -75,23 +75,23 @@ _base:
STREAM ARRAY SIZE CALCULATIONS:
###############
STREAM
-XRDS DOCUMENTATION: 4 x (45 MiB cache / processor) x (2 processors) / (3 arrays) / (8 bytes / element) = 15 Mi elements = 15000000
+FORMULA:
+4 x ((cache / socket) x (num sockets)) / (num arrays) / 8 (size of double) = number of elements
*****************************************************************************************************
HASWELL: Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
CACHE: 40M
SOCKETS: 2
-4 * ( 40M * 2 ) / 3 ARRAYS / 8 Bytes/element = 13.4 Mi elements = 13400000
+4 * ( 40M * 2 ) / 3 ARRAYS / 8 = 13.4 Mi elements = 13.4e6
*****************************************************************************************************
BROADWELL: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
CACHE: 45M
SOCKETS: 2
-4 * ( 45M * 2 ) / 3 ARRAYS / 8 BYTES/ELEMENT = 15.0 Mi elements = 15000000
+4 * ( 45M * 2 ) / 3 ARRAYS / 8 = 15.0 Mi elements = 15e6
*****************************************************************************************************
SAPPHIRE RAPIDS: Intel(R) Xeon(R) Platinum 8480+
-CACHE: 105
+CACHE: 105M
SOCKETS: 2
-4 x (105M * 2 ) / 3 ARRAYS / 8 BYTES/ELEMENT = 35 Mi elements = 35000000
+4 x ( 105M * 2 ) / 3 ARRAYS / 8 = 35 Mi elements = 35e6
scheduler: slurm
schedule:
@@ -295,7 +295,7 @@ spr_ddr5_xrds:
"{{sys_name}}": [ darwin ]
variables:
arch: "spr"
-stream_array_size: '105'
+stream_array_size: '40'
target: "xrds-stream.exe"
omp_num_threads: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
omp_places: [cores, sockets]
@@ -324,7 +324,7 @@ spr_hbm_xrds:
"{{sys_name}}": [ darwin ]
variables:
arch: "spr"
-stream_array_size: '105'
+stream_array_size: '40'
target: "xrds-stream.exe"
omp_num_threads: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
omp_places: [cores, sockets]
@@ -400,8 +400,8 @@ xrds_ats5:
tpn: [8, 32, 56, 88, 112]
arch: "spr"
target: "xrds-stream.exe"
-stream_array_size: '105'
-ntimes: 20
+stream_array_size: '40'
+#ntimes: 20
#omp_places: [cores, sockets]
#omp_proc_bind: [true]
numnodes: '1'
2 changes: 1 addition & 1 deletion utils/pavilion
Submodule pavilion updated 81 files
+30 −0 .github/workflows/demo.yml
+24 −1 .github/workflows/unittests.yml
+2 −0 .gitignore
+3 −0 .gitmodules
+24 −0 INSTALLING.md
+8 −4 RELEASE.txt
+14 −2 bin/pav
+57 −61 bin/setup_pav_deps
+26 −2 docs/advanced.rst
+1 −1 docs/basics.rst
+2 −19 docs/install.rst
+1 −1 docs/plugins/basics.rst
+3 −3 docs/plugins/sys_vars.rst
+12 −0 examples/README.md
+3 −1 examples/demo/.gitignore
+88 −4 examples/demo/README.md
+1 −0 examples/demo/demo_github_workflow.yml
+45 −0 examples/demo/hosts/demo_host.yaml
+19 −0 examples/demo/modes/sample_10_perc.yaml
+7 −0 examples/demo/os/README.md
+5 −3 examples/demo/pavilion.yaml
+7 −0 examples/demo/plugins/README.md
+20 −0 examples/demo/plugins/sys_name.py
+3 −0 examples/demo/plugins/sys_name.yapsy-plugin
+30 −0 examples/demo/series/all_tests.yaml
+6 −0 examples/demo/test_src/README.md
+13 −0 examples/demo/test_src/built_example/buildit.yaml
+7 −0 examples/demo/test_src/built_example/hello_world.c
+88 −0 examples/demo/tests/advanced.yaml
+1 −0 examples/demo/tests/buildit.yaml
+1 −1 examples/demo/tests/demo.yaml
+1 −0 lib/hostlist.py
+1 −2 lib/pavilion/arguments.py
+3 −3 lib/pavilion/builder.py
+2 −2 lib/pavilion/commands/_run.py
+1 −1 lib/pavilion/commands/build.py
+3 −3 lib/pavilion/commands/config.py
+2 −2 lib/pavilion/commands/graph.py
+2 −2 lib/pavilion/commands/list_cmd.py
+43 −0 lib/pavilion/commands/log.py
+1 −1 lib/pavilion/commands/ls.py
+16 −9 lib/pavilion/commands/result.py
+2 −2 lib/pavilion/commands/run.py
+3 −3 lib/pavilion/commands/show.py
+1 −1 lib/pavilion/commands/view.py
+8 −5 lib/pavilion/config.py
+17 −8 lib/pavilion/errors.py
+2 −2 lib/pavilion/expression_functions/base.py
+16 −0 lib/pavilion/expression_functions/core.py
+2 −2 lib/pavilion/filters.py
+34 −17 lib/pavilion/parsers/expressions.py
+1 −1 lib/pavilion/resolver/proto_test.py
+4 −4 lib/pavilion/resolver/request.py
+3 −2 lib/pavilion/resolver/resolver.py
+0 −1 lib/pavilion/result/evaluations.py
+1 −1 lib/pavilion/result/parse.py
+5 −5 lib/pavilion/result_parsers/base_classes.py
+1 −1 lib/pavilion/result_parsers/filecheck.py
+7 −7 lib/pavilion/result_parsers/json.py
+4 −4 lib/pavilion/schedulers/advanced.py
+34 −72 lib/pavilion/schedulers/config.py
+26 −6 lib/pavilion/schedulers/plugins/flux.py
+9 −84 lib/pavilion/schedulers/plugins/slurm.py
+4 −4 lib/pavilion/schedulers/scheduler.py
+4 −2 lib/pavilion/series/series.py
+2 −2 lib/pavilion/series/test_set.py
+1 −1 lib/pavilion/sys_vars/sys_name.py
+10 −10 lib/pavilion/test_config/file_format.py
+1 −1 lib/pavilion/test_run/test_run.py
+8 −8 lib/pavilion/variables.py
+0 −1 lib/pavilion/wget.py
+1 −0 lib/sub_repos/python-hostlist
+3 −3 lib/yaml_config/structures.py
+3 −0 requirements.txt
+1 −0 test/tests/expression_function_tests.py
+28 −7 test/tests/log_cmd_tests.py
+76 −3 test/tests/result_tests.py
+4 −4 test/tests/sched_tests.py
+5 −4 test/tests/slurm_tests.py
+71 −5 test/tests/style_tests.py
+9 −0 test/utils/check_pav_deps.py
