diff --git a/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst b/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst
index 88199844..136bb2bc 100644
--- a/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst
+++ b/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst
@@ -73,12 +73,13 @@ Adjustments to ``GOMP_CPU_AFFINITY`` may be necessary.
 The ``STREAM_ARRAY_SIZE`` value is a critical parameter set at compile time and controls the size of the array used to measure bandwidth.
 STREAM requires different amounts of memory to run on different systems, depending on both the system cache size(s) and the granularity of the system timer.
-You should adjust the value of ``STREAM_ARRAY_SIZE`` to meet BOTH of the following criteria:
+You should adjust the value of ``STREAM_ARRAY_SIZE`` to meet ALL of the following criteria:
 
 1. Each array must be at least 4 times the size of the available cache memory. In practice the minimum array size is about 3.8 times the cache size.
 
    1. Example 1: One Xeon E3 with 8 MB L3 cache ``STREAM_ARRAY_SIZE`` should be ``>= 4 million``, giving an array size of 30.5 MB and a total memory requirement of 91.5 MB.
    2. Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP) ``STREAM_ARRAY_SIZE`` should be ``>= 20 million``, giving an array size of 153 MB and a total memory requirement of 458 MB.
 
-2. The size should be large enough so that the 'timing calibration' output by the program is at least 20 clock-ticks. For example, most versions of Windows have a 10 millisecond timer granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds. If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec. This means the each array must be at least 1 GB, or 128M elements.
+2. The size should be large enough so that the 'timing calibration' output by the program is at least 20 clock-ticks. For example, most versions of Windows have a 10 millisecond timer granularity. 20 "ticks" at 10 ms/tick is 200 milliseconds. If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec. This means each array must be at least 1 GB, or 128M elements.
+3. The value ``24 x STREAM_ARRAY_SIZE x RANKS_PER_NODE`` must be less than the amount of RAM on a node. STREAM allocates 3 arrays of 8-byte doubles per rank, so each array element costs 24 bytes (3 arrays x 8 bytes) of node memory per rank.
 
 Set ``STREAM_ARRAY_SIZE`` using the -D flag on your compile line.
@@ -88,8 +89,11 @@ The formula for ``STREAM_ARRAY_SIZE`` is:
 
 ARRAY_SIZE ~= 4 x (last_level_cache_size x num_sockets) / size_of_double = last_level_cache_size
 
-This reduces to the same number of elements as bytes in the last level cache of a single processor for two socket nodes.
-This is the minimum size.
+This reduces to a number of elements equal to the size in bytes of a single socket's last-level cache, assuming a node has two sockets.
+This is the minimum size unless other system attributes constrain it.
+
+The array size influences only STREAM's ability to fully load the memory bus.
+Once the bus is saturated, the measurements should reach a steady state in which further increases to ``STREAM_ARRAY_SIZE`` do not change the measured bandwidth for a given number of processors.
 
 Running
 =======
@@ -117,7 +121,7 @@ Crossroads
 These results were obtained using the cce v15.0.1 compiler and cray-mpich v 8.1.25.
 Results using the intel-oneapi and intel-classic v2023.1.0 and the same cray-mpich were also collected; cce performed the best.
 
-``STREAM_ARRAY_SIZE=105 NTIMES=20``
+``STREAM_ARRAY_SIZE=40 NTIMES=20``
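+
+As an illustrative sketch, these settings would be passed at build time roughly as follows; the compiler, optimization flags, and the expansion of ``40`` into a literal element count are assumptions, not the documented build recipe:
+
+.. code-block:: bash
+
+   # Hypothetical compile line: pass the array size and iteration count via -D.
+   # 40000000 assumes the documented value of 40 denotes millions of elements.
+   cc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=40000000 -DNTIMES=20 stream.c -o xrds-stream.exe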
 
 .. csv-table:: STREAM microbenchmark bandwidth measurement
    :file: stream-xrds_ats5cce-cray-mpich.csv
diff --git a/doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst b/doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst
index 74029613..3bd11eae 100644
--- a/doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst
+++ b/doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst
@@ -5,6 +5,7 @@ MDTEST
 Purpose
 =======
 
+The intent of this benchmark is to measure the performance of file metadata operations on the platform storage.
 MDtest is an MPI-based application for evaluating the metadata performance of a file system and has been designed to test parallel file systems.
 It can be run on any type of POSIX-compliant file system but has been designed to test the performance of parallel file systems.
@@ -16,11 +17,19 @@ Characteristics
 
 Problem
 -------
 
+MDtest measures the performance of various metadata operations, using MPI to coordinate execution and collect the results.
+In this case, the operations in question are file creation, stat, and removal.
+
 Run Rules
 ---------
 
-Figure of Merit
----------------
+Observed benchmark performance shall be obtained from a storage system configured as closely as possible to the proposed platform storage.
+If the proposed solution includes multiple file access protocols (e.g., pNFS and NFS) or multiple tiers accessible by applications, benchmark results for mdtest shall be provided for each protocol and/or tier.
+
+Performance projections are permissible if they are derived from a similar system that is considered an earlier generation of the proposed system.
+
+Modifications to the benchmark application code are permissible only to enable correct compilation and execution on the target platform.
+Any modifications must be fully documented (e.g., as a diff or patch file) and reported with the benchmark results.
 
 Building
 ========
@@ -35,17 +44,53 @@ After extracting the tar file, ensure that the MPI is loaded and that the releva
 
 Running
 =======
 
-.. .. csv-table:: MDTEST Microbenchmark
-.. :file: ats3_mdtest_sow.csv
-.. :align: center
-.. :widths: 10, 10, 10, 10, 10
-.. :header-rows: 1
-.. :stub-columns: 2
+Results for the three operations (create, stat, and remove) should be obtained for three different file configurations:
+
+1) ``2^20`` files in a single directory.
+2) ``2^20`` files in separate directories, one directory per MPI process.
+3) One file accessed by multiple MPI processes.
+
+These configurations are launched as follows.
+
+.. code-block:: bash
+
+   # Shared Directory
+   srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16
+   # Unique Directories
+   srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16 -u
+   # One File Multi-Proc
+   srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16 -S
+
+The following command-line flags MUST be changed:
+
+* ``-n`` - the number of files **each MPI process** should manipulate. For a test run with 64 MPI processes, specifying ``-n 16384`` will produce the required ``2^20`` files (``2^6`` MPI processes x ``2^14`` files each). This parameter must be changed for each level of concurrency; see the sketch following this list.
+* ``-d /scratch`` - the **absolute path** to the directory in which this test should be run.
+* ``-N`` - MPI rank offset for each separate phase of the test. This parameter must be equal to the number of MPI processes per node in use (e.g., ``-N 16`` for a test with 16 processes per node) to ensure that each test phase (create, stat, and remove) is performed on a different node.
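+
+As a minimal sketch of how ``-n`` scales with concurrency (the process count and variable names are illustrative; ``-N 16`` assumes 16 processes per node as in the examples above):
+
+.. code-block:: bash
+
+   # Keep the total file count fixed at 2^20 regardless of concurrency.
+   NPROCS=64                                  # total MPI processes for this run
+   FILES_PER_PROC=$(( (1 << 20) / NPROCS ))   # 1048576 / 64 = 16384
+   srun -n ${NPROCS} ./mdtest -F -C -T -r -n ${FILES_PER_PROC} -d /scratch/$USER -N 16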
+
+The following command-line flags MUST NOT be changed or omitted:
+
+* ``-F`` - only operate on files, not directories
+* ``-C`` - perform file creation test
+* ``-T`` - perform file stat test
+* ``-r`` - perform file remove test
 
 Example Results
 ===============
 
-.. csv-table:: MDTEST Microbenchmark Xrds
+These nine tests (three operations x three file configurations) should be performed under four different launch conditions, for a total of 36 results (see the sketch following this list):
+
+1) A single MPI process.
+2) The optimal number of MPI processes on a single compute node.
+3) The minimum number of MPI processes on multiple compute nodes that achieves the peak result for the proposed system.
+4) The maximum possible MPI-level concurrency on the proposed system. This could mean:
+
+   1) Using one MPI process per CPU core across the entire system.
+   2) Using the maximum number of MPI processes possible if one MPI process per core will not be possible on the proposed architecture.
+   3) Using more than ``2^20`` files if the system is capable of launching more than ``2^20`` MPI processes.
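+
+As an illustrative sketch of the four launch conditions (node counts, processes per node, and the power-of-two concurrencies are placeholder assumptions chosen so that ``2^20`` divides evenly):
+
+.. code-block:: bash
+
+   # 1) A single MPI process creates, stats, and removes all 2^20 files.
+   srun -N 1 -n 1 ./mdtest -F -C -T -r -n 1048576 -d /scratch/$USER -N 1
+   # 2) One full compute node, assuming 64 processes per node.
+   srun -N 1 -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 64
+   # 3) Multiple nodes, assuming 4 nodes x 64 processes.
+   srun -N 4 -n 256 ./mdtest -F -C -T -r -n 4096 -d /scratch/$USER -N 64
+   # 4) Maximum concurrency, assuming 16 nodes x 64 processes.
+   srun -N 16 -n 1024 ./mdtest -F -C -T -r -n 1024 -d /scratch/$USER -N 64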
+
+Crossroads
+----------
+
+.. csv-table:: MDTEST Microbenchmark Crossroads
    :file: ats3_mdtest.csv
    :align: center
    :widths: 10, 10, 10, 10, 10
diff --git a/microbenchmarks/mdtest/README.XROADS.md b/microbenchmarks/mdtest/README.XROADS.md
index 2783b547..90f065a6 100644
--- a/microbenchmarks/mdtest/README.XROADS.md
+++ b/microbenchmarks/mdtest/README.XROADS.md
@@ -57,8 +57,7 @@ node memory.
 
 The Offeror shall run the following tests:
 
-* creating, statting, and removing at least 1,048,576 files in a single
-  directory
+* creating, statting, and removing at least 1,048,576 files in a single directory.
 * creating, statting, and removing at least 1,048,576 files in separate
   directories (one directory per MPI process)
 * creating, statting, and removing one file by multiple MPI processes
diff --git a/parthenon b/parthenon
index 11c53d1c..c75ce20f 160000
--- a/parthenon
+++ b/parthenon
@@ -1 +1 @@
-Subproject commit 11c53d1cd4ada0629e06d069b70b410234ed0bde
+Subproject commit c75ce20f938a4adaedb4425584954c3e74d56868
diff --git a/sparta b/sparta
index 83d5f3a9..ca0ce28f 160000
--- a/sparta
+++ b/sparta
@@ -1 +1 @@
-Subproject commit 83d5f3a92c5fc0b59d4d973c6b1dddc4d77a7147
+Subproject commit ca0ce28fd76080d8b2828db77adde14fdc382c76
diff --git a/trilinos b/trilinos
index f3ff0b54..5aaae1ad 160000
--- a/trilinos
+++ b/trilinos
@@ -1 +1 @@
-Subproject commit f3ff0b54c5158790295daff089ff0d286bda3c2c
+Subproject commit 5aaae1ada6fe1ce777e671a0ff84fdc4f0779406
diff --git a/utils/pav_config/tests/stream.yaml b/utils/pav_config/tests/stream.yaml
index a6d813e9..ba2c09d0 100644
--- a/utils/pav_config/tests/stream.yaml
+++ b/utils/pav_config/tests/stream.yaml
@@ -75,23 +75,23 @@ _base:
     STREAM ARRAY SIZE CALCULATIONS:
     ###############
-    STREAM - XRDS DOCUMENTATION: 4 x (45 MiB cache / processor) x (2 processors) / (3 arrays) / (8 bytes / element) = 15 Mi elements = 15000000
+    FORMULA:
+    4 x ((cache / socket) x (num sockets)) / (num arrays) / 8 (size of double) = array size in elements
     *****************************************************************************************************
     HASWELL: Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
     CACHE: 40M
     SOCKETS: 2
-    4 * ( 40M * 2 ) / 3 ARRAYS / 8 Bytes/element = 13.4 Mi elements = 13400000
+    4 * ( 40M * 2 ) / 3 ARRAYS / 8 = 13.4 Mi elements = 13.4e6
     *****************************************************************************************************
     BROADWELL: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
     CACHE: 45M
     SOCKETS: 2
-    4 * ( 45M * 2 ) / 3 ARRAYS / 8 BYTES/ELEMENT = 15.0 Mi elements = 15000000
+    4 * ( 45M * 2 ) / 3 ARRAYS / 8 = 15.0 Mi elements = 15e6
     *****************************************************************************************************
     SAPPHIRE RAPIDS: Intel(R) Xeon(R) Platinum 8480+
-    CACHE: 105
+    CACHE: 105M
     SOCKETS: 2
-    4 x (105M * 2 ) / 3 ARRAYS / 8 BYTES/ELEMENT = 35 Mi elements = 35000000
+    4 x ( 105M * 2 ) / 3 ARRAYS / 8 = 35 Mi elements = 35e6
 
 scheduler: slurm
 schedule:
@@ -295,7 +295,7 @@ spr_ddr5_xrds:
     "{{sys_name}}": [ darwin ]
   variables:
     arch: "spr"
-    stream_array_size: '105'
+    stream_array_size: '40'
     target: "xrds-stream.exe"
     omp_num_threads: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
     omp_places: [cores, sockets]
@@ -324,7 +324,7 @@ spr_hbm_xrds:
     "{{sys_name}}": [ darwin ]
   variables:
     arch: "spr"
-    stream_array_size: '105'
+    stream_array_size: '40'
     target: "xrds-stream.exe"
     omp_num_threads: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
     omp_places: [cores, sockets]
@@ -400,8 +400,8 @@ xrds_ats5:
     tpn: [8, 32, 56, 88, 112]
     arch: "spr"
     target: "xrds-stream.exe"
-    stream_array_size: '105'
-    ntimes: 20
+    stream_array_size: '40'
+    #ntimes: 20
     #omp_places: [cores, sockets]
     #omp_proc_bind: [true]
     numnodes: '1'
diff --git a/utils/pavilion b/utils/pavilion
index 69b2d45d..f502ca86 160000
--- a/utils/pavilion
+++ b/utils/pavilion
@@ -1 +1 @@
-Subproject commit 69b2d45d696e623127c106b50525ba65daa23d76
+Subproject commit f502ca86fa27f4bc894aa19232c9f1f42361e269
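
As a cross-check of the array-size arithmetic in the stream.yaml comments above, a minimal sketch in bash (assuming a Linux node where getconf LEVEL3_CACHE_SIZE reports the last-level cache in bytes and the node has two sockets; the variable names are illustrative):

    # Minimum STREAM_ARRAY_SIZE = 4 x (cache x sockets) / 3 arrays / 8 bytes per element
    CACHE_BYTES=$(getconf LEVEL3_CACHE_SIZE)
    SOCKETS=2
    echo $(( 4 * CACHE_BYTES * SOCKETS / 3 / 8 ))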