diff --git a/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst b/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst index 88199844..136bb2bc 100644 --- a/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst +++ b/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst @@ -73,12 +73,13 @@ Adjustments to ``GOMP_CPU_AFFINITY`` may be necessary. The ``STREAM_ARRAY_SIZE`` value is a critical parameter set at compile time and controls the size of the array used to measure bandwidth. STREAM requires different amounts of memory to run on different systems, depending on both the system cache size(s) and the granularity of the system timer. -You should adjust the value of ``STREAM_ARRAY_SIZE`` to meet BOTH of the following criteria: +You should adjust the value of ``STREAM_ARRAY_SIZE`` to meet ALL of the following criteria: 1. Each array must be at least 4 times the size of the available cache memory. In practice the minimum array size is about 3.8 times the cache size. 1. Example 1: One Xeon E3 with 8 MB L3 cache ``STREAM_ARRAY_SIZE`` should be ``>= 4 million``, giving an array size of 30.5 MB and a total memory requirement of 91.5 MB. 2. Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP) ``STREAM_ARRAY_SIZE`` should be ``>= 20 million``, giving an array size of 153 MB and a total memory requirement of 458 MB. -2. The size should be large enough so that the 'timing calibration' output by the program is at least 20 clock-ticks. For example, most versions of Windows have a 10 millisecond timer granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds. If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec. This means the each array must be at least 1 GB, or 128M elements. +2. The size should be large enough so that the 'timing calibration' output by the program is at least 20 clock-ticks. For example, most versions of Windows have a 10 millisecond timer granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds. If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec. This means the each array must be at least 1 GB, or 128M elements. +3. The value ``24xSTREAM_ARRAY_SIZExRANKS_PER_NODE`` must be less than the amount of RAM on a node. STREAM creates 3 arrays of doubles; that is where 24 comes from. Each rank has 3 of these arrays. Set ``STREAM_ARRAY_SIZE`` using the -D flag on your compile line. @@ -88,8 +89,11 @@ The formula for ``STREAM_ARRAY_SIZE`` is: ARRAY_SIZE ~= 4 x (last_level_cache_size x num_sockets) / size_of_double = last_level_cache_size -This reduces to the same number of elements as bytes in the last level cache of a single processor for two socket nodes. -This is the minimum size. +This reduces to a number of elements equal to the size of the last level cache of a single socket in bytes, assuming a node has two sockets. +This is the minimum size unless other system attributes constrain it. + +The array size only influences the capacity of STREAM to fully load the memory bus. +At capacity, the measured values should reach a steady state where increasing the value of ``STREAM_ARRAY_SIZE`` doesn't influence the measurement for a certain number of processors. Running ======= @@ -117,7 +121,7 @@ Crossroads These results were obtained using the cce v15.0.1 compiler and cray-mpich v 8.1.25. Results using the intel-oneapi and intel-classic v2023.1.0 and the same cray-mpich were also collected; cce performed the best. -``STREAM_ARRAY_SIZE=105 NTIMES=20`` +``STREAM_ARRAY_SIZE=40 NTIMES=20`` .. 
csv-table:: STREAM microbenchmark bandwidth measurement :file: stream-xrds_ats5cce-cray-mpich.csv diff --git a/doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst b/doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst index 74029613..3bd11eae 100644 --- a/doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst +++ b/doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst @@ -5,6 +5,7 @@ MDTEST Purpose ======= +The intent of this benchmark is to measure the performance of file metadata operations on the platform storage. MDtest is an MPI-based application for evaluating the metadata performance of a file system and has been designed to test parallel file systems. It can be run on any type of POSIX-compliant file system but has been designed to test the performance of parallel file systems. @@ -16,11 +17,19 @@ Characteristics Problem ------- +MDtest measures the performance of various metadata operations using MPI to coordinate execution and collect the results. +In this case, the operations in question are file creation, stat, and removal. + Run Rules --------- -Figure of Merit ---------------- +Observed benchmark performance shall be obtained from a storage system configured as closely as possible to the proposed platform storage. +If the proposed solution includes multiple file access protocols (e.g., pNFS and NFS) or multiple tiers accessible by applications, benchmark results for mdtest shall be provided for each protocol and/or tier. + +Performance projections are permissible if they are derived from a similar system that is considered an earlier generation of the proposed system. + +Modifications to the benchmark application code are only permissible to enable correct compilation and execution on the target platform. +Any modifications must be fully documented (e.g., as a diff or patch file) and reported with the benchmark results. Building ======== @@ -35,17 +44,53 @@ After extracting the tar file, ensure that the MPI is loaded and that the releva Running ======= -.. .. csv-table:: MDTEST Microbenchmark -.. :file: ats3_mdtest_sow.csv -.. :align: center -.. :widths: 10, 10, 10, 10, 10 -.. :header-rows: 1 -.. :stub-columns: 2 +The results for the three operations, create, stat, remove, should be obtained for three different file configurations: + +1) ``2^20`` files in a single directory. +2) ``2^20`` files in separate directories, 1 per MPI process. +3) 1 file accessed by multiple MPI processes. + +These configurations are launched as follows. + +.. code-block:: bash + + # Shared Directory + srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16 + # Unique Directories + srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16 -u + # One File Multi-Proc + srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16 -S + +The following command-line flags MUST be changed: + +* ``-n`` - the number of files **each MPI process** should manipulate. For a test run with 64 MPI processes, specifying ``-n 16384`` will produce the equired ``2^20`` files (``2^6`` MPI processes x ``2^14`` files each). This parameter must be changed for each level of concurrency. +* ``-d /scratch`` - the **absolute path** to the directory in which this test should be run. +* ``-N`` - MPI rank offset for each separate phase of the test. This parameter must be equal to the number of MPI processes per node in use (e.g., ``-N 16`` for a test with 16 processes per node) to ensure that each test phase (read, stat, and delete) is performed on a different node. 
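For other levels of concurrency, the value passed to ``-n`` is obtained by dividing the required ``2^20`` total files by the number of MPI processes.
A minimal sketch, assuming a Slurm launcher (the 256-rank count below is only an illustration):

.. code-block:: bash

   # Choose the rank count for this level of concurrency (illustrative value).
   RANKS=256
   # Files per MPI process so that RANKS x FILES_PER_RANK = 2^20 total files.
   FILES_PER_RANK=$(( (1 << 20) / RANKS ))
   srun -n $RANKS ./mdtest -F -C -T -r -n $FILES_PER_RANK -d /scratch/$USER -N 16

Remember that ``-N`` must also be adjusted to match the number of MPI processes per node for the chosen launch configuration.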
+ +The following command-line flags MUST NOT be changed or omitted: + +* ``-F`` - only operate on files, not directories +* ``-C`` - perform file creation test +* ``-T`` - perform file stat test +* ``-r`` - perform file remove test Example Results =============== -.. csv-table:: MDTEST Microbenchmark Xrds +These nine tests: three operations, three file conditions should be performed under 4 different launch conditions, for a total of 36 results: + +1) A single MPI process +2) The optimal number of MPI processes on a single compute node +3) The minimal number of MPI processes on multiple compute nodes that achieves the peak results for the proposed system. +4) The maximum possible MPI-level concurrency on the proposed system. This could mean: + 1) Using one MPI process per CPU core across the entire system. + 2) Using the maximum number of MPI processes possible if one MPI process per core will not be possible on the proposed architecture. + 3) Using more than ``2^20`` files if the system is capable of launching more than ``2^20`` MPI processes. + +Crossroads +---------- + +.. csv-table:: MDTEST Microbenchmark Crossroads :file: ats3_mdtest.csv :align: center :widths: 10, 10, 10, 10, 10 diff --git a/microbenchmarks/mdtest/README.XROADS.md b/microbenchmarks/mdtest/README.XROADS.md index 2783b547..90f065a6 100644 --- a/microbenchmarks/mdtest/README.XROADS.md +++ b/microbenchmarks/mdtest/README.XROADS.md @@ -57,8 +57,7 @@ node memory. The Offeror shall run the following tests: -* creating, statting, and removing at least 1,048,576 files in a single - directory +* creating, statting, and removing at least 1,048,576 files in a single directory. * creating, statting, and removing at least 1,048,576 files in separate directories (one directory per MPI process) * creating, statting, and removing one file by multiple MPI processes diff --git a/microbenchmarks/stream/Makefile b/microbenchmarks/stream/Makefile new file mode 100644 index 00000000..e98efebf --- /dev/null +++ b/microbenchmarks/stream/Makefile @@ -0,0 +1,30 @@ +# +MPICC ?= $(PAV_MPICC) +CC ?= $(PAV_CC) +CFLAGS ?= $(PAV_CFLAGS) + +FF ?= $(PAV_FC) +FFLAGS ?= $(PAV_FFLAGS) + +ifeq ($(MPICC),) + MPICC=$(CC) +endif + +all: xrds-stream.exe stream_f.exe stream_c.exe stream_mpi.exe + +stream_f.exe: stream.f mysecond.o + $(CC) $(CFLAGS) -c mysecond.c + $(FF) $(FFLAGS) -c stream.f + $(FF) $(FFLAGS) stream.o mysecond.o -o stream_f.exe + +clean: + rm -f *stream*.exe *.o + +stream_c.exe: stream.c + $(CC) $(CFLAGS) stream.c -o stream_c.exe + +stream_mpi.exe: stream_mpi.c + $(MPICC) $(CFLAGS) stream_mpi.c -o stream_mpi.exe + +xrds-stream.exe: xrds-stream.c + $(CC) $(CFLAGS) xrds-stream.c -o xrds-stream.exe diff --git a/microbenchmarks/stream/README.ACES b/microbenchmarks/stream/README.ACES new file mode 100644 index 00000000..01cb258e --- /dev/null +++ b/microbenchmarks/stream/README.ACES @@ -0,0 +1,74 @@ +======================================================================================== +Crossroads Memory Bandwidth Benchmark +======================================================================================== +Benchmark Version: 1.0.0 + +======================================================================================== +Benchmark Description: +======================================================================================== +The Crossroads Memory Bandwidth benchmark is a modified version of the STREAM +benchmark originally written by John D. McCalpin. The modifications have been made to +simplify the code. 
All memory bandwidth projections/results provided in the vendor +response should be measured using the kernels in this benchmark implementation and +not the original STREAM code. + +======================================================================================== +Permitted Modifications: +======================================================================================== +Offerers are permitted to modify the benchmark in the following ways: + +OpenMP Pragmas - the Offeror may modify the OpenMP pragmas in the benchmark as required +to permit execution on the proposed system provided: (1) all modified sources and build +scripts are included in the RFP response; (2) any modified code used for the response +must continue to be a valid OpenMP program (compliant to the standard being proposed in +the Offeror's response). + +Memory Allocation Routines - memory allocation routines including modified allocations +to specify the level of the memory hierarchy or placement of the data are permitted +provided: (1) all modified sources and build scripts are included in the RFP response; +(2) the use of any specific libraries to provide allocation services must be provided +in the proposed system. + +Array/Allocation Sizes - the sizes of the allocated arrays may be modified to exercise +the appropriate size and level of the memory hierarchy provided the benchmark correctly +exercises the memory system being targeted. + +Index Type - the range of the index type is configured for a 32-bit signed integer ("int") +via the preprocessor define STREAM_INDEX_TYPE. If very large memories are benchmarked +the Offeror is permitted to change to a larger integer type. The Offeror should indicate +in their response that this modification has been made. + +Accumulation Error Checking Type - the basic accumulation error type is configured for a +64-bit double precision value via the STREAM_CHECK_TYPE preprocessor define. This may be +modified to a higher precision type ("long double") if a large memory (requiring a 64-bit +integer for STREAM_INDEX_TYPE) is used. The Offeror should indicate in their response +that this modification has been made. + +======================================================================================== +Run Rules: +======================================================================================== +The Offeror may utilize any number of threads, affinity and memory binding options for +execution of the benchmark provided: (1) details of all command line parameters, +environment variables and binding tools are included in the response. + +The vendor is expected to provide memory bandwidth projections/benchmarked results using +the Crossroads memory bandwidth benchmark to each level of the memory hierarchy +accessible from each compute resource. + +======================================================================================== +How to Compile, Run and Verify: +======================================================================================== +To build simply type modify the file Makefile for your compiler and type make. To run, +execute the file xroads-stream. xroads-stream performs self verification. + +$ make + +$ export OMP_NUM_THREADS=12 +$ ./xroads-stream + + +======================================================================================== +How to report +======================================================================================== +The primary FOM is the Triad rate (MB/s). Report all data printed to stdout. 
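A minimal sketch of one way to capture the run configuration required by the run rules
together with the full program output (the file names below are arbitrary):

$ env | grep -E 'OMP|KMP' > run-environment.txt
$ ./xroads-stream | tee xroads-stream-output.txt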
+ diff --git a/microbenchmarks/stream/README.md b/microbenchmarks/stream/README.md new file mode 100644 index 00000000..731676c9 --- /dev/null +++ b/microbenchmarks/stream/README.md @@ -0,0 +1,561 @@ +# Crossroads Acceptance + +- PI: Douglas M. Pase [dmpase@sandia.gov](mailto:dmpase@sandia.gov) +- Erik A. Illescas [eailles@sandia.gov](mailto:eailles@sandia.gov) +- Anthony M. Agelastos [amagela@sandia.gov](mailto:amagela@sandia.gov) + +This project tracks data and analysis from the Crossroads Acceptance effort. + + +## STREAM +STREAM, like DGEMM, is also a microbenchmark that measures a single fundamental +aspect of a system. +But while DGEMM measures floating-point vector performance, STREAM +measures memory bandwidth. +More specifically, it measures the performance of two 64-bit loads and a +store operation to a third location. +The operation looks like this: +```C +for (int i=0; i < size; i++) { + x[i] = y[i] + constant * z[i]; +} +``` + +Two 64-bit words are loaded into cache and registers, combined arithmetically, +and the result is stored in a third location. +In cache-based microprocessors, this typically means that all three +locations, ```x[i]```, ```y[i]```, and ```z[i]```, must be read into +cache. +Values ```x[i]``` and ```y[i]``` must be read because their values are needed +for the computation, but ```z[i]``` must also be loaded into cache in order to +maintain cache coherency. +With no other hardware support, this would mean the best throughput one could +hope for is about 75% of peak, because the processor must execute three loads +(one for each of ```x[i]```, ```y[i]```, and ```z[i]```) and one store (```z[i]```), +but it only gets credit for two loads (```x[i]``` and ```y[i]```) and one store (```z[i]```). +And, (2+1)/(3+1)=75%. + +But most most current architectures support an optimization feature, called a +Microarchitectural Store Buffer (MSB), that speeds up the write operation in +some cases. +When a processor executes a cacheable write smaller than a cache line, the cache +controller first checks whether the cache line is already resident in cache. +When it is not, it caches the partial line in the MSB and sends the request on +to the memory controller to fetch the cache line. +If the rest of the cache line is overwritten before the data comes back from +the memory controller, the read request is cancelled and the new data is +moved from the MSB to the cache as a dirty (modified) cache line. +In this way the cache controller avoids the inefficiency of retrieving the +extra line from memory. + +Another architectural feature that affects STREAM performance is +Non-Uniform Memory Access (NUMA). +This is best illustrated with a picture. +The following block diagram shows the processor, core, cache, and memory +structure of a typical Intel processor-based compute node. + +![DGEMM Results](img/broadwell-cache-structure.png) +***Processor, Core, Cache, and Memory Structure for Sandia Compute Nodes*** + +The illustration shows two processors, each with their own local memory (DIMMs). +High-speed communication links (QPI links) connect the two processors. +The two processors share a physical memory address space, so both processors +can access not only their own local memory, but also memory that is attached +to the other processor. +A program running on one of the cores in a processor requests to load or store +an address. +In the event the address misses cache, the request is forwarded to the System +Agent, which then sorts out whether the address is local or remote. 
+The System Agent then forwards the request to the appropriate memory controller +to be handled. +The important concept to understand is that access to memory that is local is +faster than access to memory that is remote. + +The next concept to be concerned with is how memory is mapped to a program. +Programs do not reference physical memory directly. +Instead each program uses virtual addresses to reference data it uses, and +relies upon the hardware and operating system to translate those virtual +addresses into physical, or hardware, addresses. +Most modern NUMA-aware operating systems, Linux included, use a "first touch" +policy for deciding where memory is allocated. +When a program allocates memory it is only given new virtual addresses to use. +The physical memory backing the virtual address is allocated the first time +the memory is read or written. +Whenever possible, Linux allocates physical memory that is local to the +processor that is running the program at the moment the memory is first +referenced. +Linux then works very hard to keep the process running on the same core so +caches remain loaded and memory stays local. +For most programs this policy works very well, but for some threading +models, such as OpenMP, obtaining good performance in a NUMA environment +requires some care. + +Because we are running an OpenMP-enabled version of STREAM, we must be +sure to schedule the parallel loops statically so the same iterations +go to the same threads (and therefore the same NUMA domains) every time. +This will ensure that memory references are always local, and therefore +of highest performance. + +Note that the raw bandwidth of the node is determined by the number, width, +and speed of the memory channels. +In the above illustration there are four channels per processor, but the +number will vary from one architecture to the next. +Each memory channel is the width of the data path of a DIMM, which is always +64-bits, or 8 bytes. +The frequency will vary with the DIMM, and it's measured in mega-transfers +per second, or MT/s. +(DIMMs use a differential clock, so the clock frequency of DDR memory is +always half of the transfer rate.) +Most systems at Sandia use memory speeds ranging from 2133 MT/s to 4800 MT/s. +A Haswell node, with four channels per processor, two processors per node, +and 2133 MT/s memory will have a hardware (or peak) bandwidth of: + +``` +4 channels/processor x 2 processors x 8 bytes/channel x 2133 MT/s = 136,512 MB/s +``` + +The peak bandwidth represents a least upper bound, a limit you are guaranteed +never to exceed. +The performance you measure is the throughput, and it will always be a value +below the peak bandwidth. +The ratio of the two values, throughput over bandwidth, is known as the +memory efficiency, and is a measure of how efficiently the memory transfers +data to and from the processor. +Modern processors typically have a memory efficiency of around 75% to 85%. +New memory architectures may have somewhat less, and rarely, an exceptionally +efficient system may have more than 90%. + +Yet another consideration is how the instructions are generated by the compiler. +In this case there are two issues, using vector units and how addressing is +handled. +Vector instructions reduce the number of instructions that need to be executed, +which reduces memory traffic. +AVX instructions use wider vectors than SSE4.3 vectors, so AVX instructions +provide better performance than using SSE4.3 instructions. 
+Similarly, AVX2 is faster than AVX, and AVX-512 is faster than AVX2. + +Furthermore, how array indexes are generated and used can also have an impact +on STREAM performance. +For example, using statically allocated arrays allows the code generation to +use a constant array base address for references to ```x```, ```y```, and ```z```. +But using dynamically allocated arrays forces the program to load the base +addresses into registers and index the arrays indirectly. +On the Haswell architecture, at least, this impacts performance by roughly 25%. +Fortunately, on other architectures the impact is less severe, or not at all. + +We also noticed on some systems a drop in STREAM performance with intel 18.0 +and later compilers. +We don't know with certainty what the cause is, but it is consistent, and we +suspect it might have had something to do with the way arrays are indexed. + +The clock speed of the processor will also impact STREAM performance by a small +amount. +Faster processor clocks allow the processor to generate new memory requests +more rapidly. +And because the compute nodes in each of the clusters are not perfectly +identical, this means some nodes will have faster STREAM throughput than +others. + +Our STREAM measurements are laid out in the following table. + +| Cluster | Family | Processor | Ch./Node | DIMM MT/s | Bandwidth | STREAM Static | Eff. | STREAM Dynamic | +| :-----: | :----: | :-------: | :------: | :-------: | --------: | ----------: | ---: | ------: | +| Mutrino | Haswell | E5-2698 v3 | 8 | 2133 | 136.5 GB/s | 119361.4101 | 87.4% | 90583.2 | +| Eclipse | Broadwell | E5-2695 v4 | 8 | 2400 | 153.6 GB/s | 130660.2840 | 85.1% | 131046.4 | +| Attaway | Skylake | Gold 6140 | 12 | 2666 | 256.0 GB/s | 186956.2034 | 73.0% | 185199.4 | +| Manzano | Cascade Lake | Platinum 8268 | 12 | 2933 | 281.6 GB/s | 221318.2715 | 78.6% | 221242.9 | +| cxsr | Sapphire Rapids | Sample | 16 | 4800 | 614.4 GB/s | 373103.3951 | 60.7% | 367872.5 | + +Worthy of note here is the low memory efficiency of the cxsr system. +The explanation for this is simple -- it is that these processors are +early pre-production samples, and not fully performance enabled. +The understanding is that these systems are only able to achieve about +80% of the bandwidth. +Taking that into account the memory efficiency would rise to almost 76%, +which is within the expected range. + +OpenMP has environment variables that allow you to control how threads are +distributed around the node. +Two such environment variables are ```OMP_PROC_BIND``` and ```OMP_PLACES```. +Their allowed values are: + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Variable      | Value   | Meaning                                                                         |
| :------------ | :------ | :------------------------------------------------------------------------------ |
| OMP_PROC_BIND | true    | threads should not be moved                                                     |
| OMP_PROC_BIND | false   | threads may be moved                                                            |
| OMP_PROC_BIND | spread  | threads are distributed across partitions (i.e., NUMA domains)                 |
| OMP_PROC_BIND | close   | threads are kept close to the master thread                                    |
| OMP_PLACES    | sockets | OpenMP threads are placed on successive sockets                                |
| OMP_PLACES    | cores   | OpenMP threads are placed on successive hardware cores                         |
| OMP_PLACES    | threads | OpenMP threads are placed on successive hardware threads (i.e., virtual cores) |
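As a minimal sketch of how these variables are typically combined for a STREAM run
(the thread count and binary name here are assumptions and should be adjusted for the node being measured):

```bash
export OMP_NUM_THREADS=32      # assumption: one thread per physical core on the node
export OMP_PROC_BIND=spread    # pin threads and spread them across the NUMA domains
export OMP_PLACES=cores        # one place per hardware core
./stream_c.exe                 # or xrds-stream.exe, depending on which build is being measured
```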
A complete table of the STREAM results for various settings on Mutrino,
using the intel/19.0.4 and intel/19.1.3 icc compilers, is laid out below.
***Mutrino (Haswell) STREAM throughput in MB/s; columns are OMP_PROC_BIND settings, and each cell lists xroads-stream.c / stream_d_omp.c***

| OMP_PLACES | Compiler     | undefined             | close                 | false                 | spread                | true                  |
| :--------- | :----------- | --------------------: | --------------------: | --------------------: | --------------------: | --------------------: |
| undefined  | intel/19.0.4 | 52020.2 / 72461.6026  | 90637.8 / 118811.0970 | 53463.0 / 76455.0476  | 90631.3 / 118860.3734 | 90583.2 / 118852.3040 |
| undefined  | intel/19.1.3 | 53153.9 / 73036.5731  | 90596.2 / 119063.8735 | 54403.5 / 72259.5228  | 90579.9 / 119138.5591 | 90587.3 / 119028.3251 |
| sockets    | intel/19.0.4 | 90610.1 / 118807.4161 | 90668.9 / 118843.3588 | 90576.7 / 118814.0770 | 90575.9 / 118831.7846 | 90644.4 / 118886.8698 |
| sockets    | intel/19.1.3 | 90616.6 / 118996.6661 | 90607.7 / 119004.9310 | 90620.7 / 119029.9084 | 90583.2 / 119013.0211 | 90641.9 / 118989.4573 |
| cores      | intel/19.0.4 | 90620.7 / 118831.7846 | 90647.6 / 118850.9008 | 90613.4 / 118808.9936 | 90696.6 / 118866.1630 | 90589.7 / 118855.2861 |
| cores      | intel/19.1.3 | 90627.2 / 119081.4796 | 90620.7 / 119058.5927 | 90620.7 / 119089.0518 | 90511.5 / 119087.4669 | 90645.2 / 119097.1534 |
| threads    | intel/19.0.4 | 90593.0 / 118792.6946 | 45169.7 / 57789.0339  | 45171.3 / 57786.6287  | 90600.3 / 118754.5059 | 90627.2 / 118831.7846 |
| threads    | intel/19.1.3 | 90641.1 / 119050.6723 | 45160.3 / 58005.8595  | 45168.0 / 58011.8348  | 90576.7 / 119040.9934 | 90597.1 / 118928.1295 |
As can be easily seen from the table, certain combinations yield
significantly lower performance results than others.
For example, ```OMP_PROC_BIND``` set to ```false``` combined with
```OMP_PLACES``` set to ```threads``` or left undefined causes the
memory throughput to drop significantly.
Furthermore, setting ```OMP_PROC_BIND=close``` with ```OMP_PLACES=threads```,
or leaving both ```OMP_PROC_BIND``` and ```OMP_PLACES``` undefined,
also severely impacts performance.

Notice that setting ```OMP_PROC_BIND=false``` or ```OMP_PROC_BIND=close```
combined with ```OMP_PLACES=threads``` causes OpenMP threads to be allocated
all on the same NUMA domain, which cuts the memory throughput by half.
Leaving ```OMP_PLACES``` undefined while also leaving ```OMP_PROC_BIND```
undefined, or setting it to ```false```, causes a different problem:
it allows threads to migrate between NUMA domains, which makes some
references non-local and slows overall performance.
Note that these effects are specific to the Intel compiler in this environment.
This behavior, especially when these variables are undefined, can vary with
the compiler vendor (e.g., GNU or Intel) and environment (e.g., Cray or TOSS).

It is also clear from the table that there is a significant performance difference,
about 25% to 30%, between ```xroads-stream.c``` and ```stream_d_omp.c```.
The only significant difference in the source code between these two versions is
that memory is allocated dynamically in ```xroads-stream.c```
and statically in ```stream_d_omp.c```.
This illustrates how minor changes to code generation, in this case how the
memory is addressed, can result in significant changes in performance.

It also shows that this is a limitation of the architecture, though, and
not necessarily an indication that the code generation is done incorrectly.
It is a common requirement that data references be generated from addresses
that are specified dynamically; two examples are memory that is dynamically
allocated, as is the case here, and an address that is passed into a
subroutine or function.
In both cases, building the code with a static base address is simply not
an option, so it is up to the architecture to minimize the cost if it can.
Haswell imposes a penalty for referencing an address generated from
a base address in a register, while later architectures
(e.g., Intel Broadwell, Skylake, and Canon Lake) may not.
***Canon Lake STREAM throughput in MB/s; columns are OMP_PROC_BIND settings, and each cell lists xroads-stream.c / stream_d_omp.c***

| OMP_PLACES | Compiler     | undefined              | close                  | false                  | spread                 | true                   |
| :--------- | :----------- | ---------------------: | ---------------------: | ---------------------: | ---------------------: | ---------------------: |
| undefined  | intel/16.0   | 184007.8 / 220570.2428 | 219376.9 / 220631.2772 | 219018.9 / 205437.9183 | 219439.1 / 220537.6245 | 219281.3 / 220634.2996 |
| undefined  | intel/21.3.0 | 163099.4 / 195830.6060 | 163276.6 / 207675.2236 | 148348.4 / 196058.5001 | 163276.6 / 207625.9637 | 163311.0 / 207601.3426 |
| sockets    | intel/16.0   | 219314.8 / 220580.5136 | 219539.6 / 220717.7499 | 189887.8 / 186992.2348 | 219381.7 / 220570.8470 | 219314.8 / 220587.7641 |
| sockets    | intel/21.3.0 | 163266.0 / 207473.5135 | 163231.6 / 207417.4041 | 163189.3 / 207229.5250 | 163255.4 / 207397.1053 | 163199.8 / 207325.5570 |
| cores      | intel/16.0   | 219276.6 / 220613.7490 | 219415.2 / 220588.3683 | 219415.2 / 204518.6163 | 219357.8 / 220626.4415 | 219314.8 / 220643.9717 |
| cores      | intel/21.3.0 | 163242.2 / 207684.3285 | 163276.6 / 207621.1461 | 163244.8 / 207610.4410 | 163244.8 / 207711.1123 | 163287.2 / 207711.1123 |
| threads    | intel/16.0   | 219458.2 / 220666.9465 | 107706.2 / 108841.0698 | 219095.2 / 220471.8131 | 219357.8 / 220613.7490 | 219520.4 / 220717.7499 |
| threads    | intel/21.3.0 | 163252.8 / 207404.0492 | 80412.3 / 100545.1554  | 80415.5 / 100557.8360  | 163231.6 / 207397.1053 | 163242.2 / 207363.4608 |
+ +It is interesting to note that (generally) the Intel 16.0 icc compiler +outperforms the 21.3.0 compiler, even though the 21.3.0 compiler has +had more development time to mature and improve. +It would be easy to expect that later is better, but in this case it +appears otherwise. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
***Sapphire Rapids STREAM throughput in MB/s; columns are OMP_PROC_BIND settings, and each cell lists xroads-stream.c / stream_d_omp.c***

| OMP_PLACES | Compiler | undefined              | close                  | false                  | spread                 | true                   |
| :--------- | :------- | ---------------------: | ---------------------: | ---------------------: | ---------------------: | ---------------------: |
| undefined  | icc      | 216079.9 / 274053.5539 | 371114.9 / 371880.1053 | 215905.0 / 214983.4133 | 372093.0 / 371588.3942 | 373540.9 / 372396.0083 |
| undefined  | icx      | 218043.1 / 219178.0822 | 371229.7 / 370799.5365 | 218678.8 / 215439.8564 | 373715.4 / 371157.9354 | 373192.3 / 370942.8130 |
| sockets    | icc      | 365352.4 / 368156.8840 | 366524.1 / 370221.7580 | 246305.4 / 277893.0840 | 368154.6 / 368595.0055 | 368154.6 / 369813.7252 |
| sockets    | icx      | 367928.9 / 367393.8002 | 367309.5 / 368451.3529 | 215265.9 / 219729.9153 | 364686.2 / 366902.3505 | 368154.6 / 368451.3529 |
| cores      | icc      | 372266.2 / 373536.0490 | 372902.4 / 372447.6774 | 218878.2 / 208423.4091 | 372497.3 / 372740.7396 | 372960.4 / 371245.7902 |
| cores      | icx      | 372555.1 / 369017.8743 | 371689.6 / 371014.4928 | 213808.5 / 217539.0891 | 370828.2 / 369728.4806 | 371920.0 / 372670.8075 |
| threads    | icc      | 372612.9 / 372533.8243 | 186234.2 / 186898.0616 | 218142.2 / 214736.9122 | 372381.7 / 372309.9251 | 371747.2 / 372447.6774 |
| threads    | icx      | 372381.7 / 370013.4901 | 186538.2 / 186552.6623 | 218479.7 / 214765.1007 | 372381.7 / 370441.8291 | 372497.3 / 370584.8292 |
+ +Size also plays a role in determining STREAM performance. +As with any program, the act of measuring the performance subtly alters the +behavior, and performance, of that program. +Especially on small microbenchmarks like STREAM, setting up timers and loops +takes time away from the activity being measured. +That set-up time needs to be amortized across the activity, and amortization +is more effective when the time spent in set-up is very small relative to the +activity being measured. +This all suggests that larger array sizes will show better performance than +smaller sizes. +But it is also the case that very large memory may require the system to use +more expensive measures to fetch the data, such as deeper page table walks. +So, tuning the benchmark for size, large enough to amortize the cost of timers +and loops, but not so large as to slow access down, needs to be done. +For the version that uses statically allocated memory (```stream_d_omp.c```), +the ideal memory size is the maximum the compiler allows, or 80,000,000 elements. +For the version that uses dynamically allocated memory (```xroads-stream.c```), +we saw improvements in performance up to around 100,000,000 elements. +After that, it took more time to run the benchmark but did not significantly +improve our results. + +It must also be noted that going with *very* small memory also gives faster +results, but doing so violates the benchmark run rules. +The STREAM run rules require that the memory size be at least 4x the size of +cache, or the results measure cache performance instead of memory performance, +which is not what is wanted here. +The actual size of STREAM memory is calculated as +``` +memory size in bytes = 3 arrays x 8 bytes / element x number of elements / array +``` +This value must be at least 4 times larger than the total available cache. +On our Broadwell processors there is 45 MiB per processor, or 90 MiB per node. +That implies we need a size of at least 15 million (15 x 2**20) elements. +``` +4 x (45 MiB cache / processor) x (2 processors) / (3 arrays) / (8 bytes / element) = 15 Mi elements +``` + +And finally, STREAM results are very sensitive to other activities going on +within the system, so STREAM needs to be run multiple times for an accurate +picture to emerge. +Even when the system is otherwise "quiet", the operating system services +interrupts and schedules maintenance tasks that steal minor amounts of +memory throughput from user programs. +Furthermore, this activity may force user threads to migrate from one +core to another, or from one NUMA domain to another, which may also affect +performance. +Running STREAM multiple times is the only way to see what the memory +throughput really is -- a statistical population of values rather than a +single value. diff --git a/microbenchmarks/stream/img/broadwell-cache-structure.png b/microbenchmarks/stream/img/broadwell-cache-structure.png new file mode 100644 index 00000000..a7230470 Binary files /dev/null and b/microbenchmarks/stream/img/broadwell-cache-structure.png differ diff --git a/microbenchmarks/stream/mysecond.c b/microbenchmarks/stream/mysecond.c new file mode 100644 index 00000000..f6778b77 --- /dev/null +++ b/microbenchmarks/stream/mysecond.c @@ -0,0 +1,28 @@ +/* A gettimeofday routine to give access to the wall + * clock timer on most UNIX-like systems. 
+ * + * This version defines two entry points -- with + * and without appended underscores, so it *should* + * automagically link with FORTRAN */ + +#include + +double mysecond() +{ +/* struct timeval { long tv_sec; + * long tv_usec; }; + * + * struct timezone { int tz_minuteswest; + * int tz_dsttime; }; */ + + struct timeval tp; + struct timezone tzp; + int i; + + i = gettimeofday(&tp,&tzp); + return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 ); +} + +double mysecond_() {return mysecond();} + + diff --git a/microbenchmarks/stream/splunk-stream.xml b/microbenchmarks/stream/splunk-stream.xml new file mode 100644 index 00000000..57c5f479 --- /dev/null +++ b/microbenchmarks/stream/splunk-stream.xml @@ -0,0 +1,98 @@ + + + Stream Benchmarks + + + OMP_NUM_THREADS + OMP_NUM_THREADS + * + + index=hpctest sourcetype=pavilion2 sys_name=$machine$ name=stream* omp_num_threads=* | +table omp_num_threads | +dedup omp_num_threads + $field1.earliest$ + $field1.latest$ + + + + + triad + copy + add + triad + + + + node + max(triad) + min(triad) + avg(triad) + stdev(triad) + max(copy) + min(copy) + avg(copy) + stdev(copy) + max(add) + min(add) + avg(add) + stdev(add) + |sort node + + + + |sort node + + + Stream Triad Performance by OMP configuration $machine$ + + index=hpctest sourcetype=pavilion2 sys_name=$machine$ name=stream* omp_num_threads=$OMP_NUM_THREADS$ | +rex field=file "(?<node>\w+\d+)-stream.out" | +chart max($calc$) min($calc$) avg($calc$) stdev($calc$) by node| +sort max($calc$) + $field1.earliest$ + $field1.latest$ + + + + + + + + + Stream Single Node $machine$ $calc$ Rate + + index=hpctest sourcetype=pavilion2 sys_name=$machine$ name=stream* omp_num_threads=$OMP_NUM_THREADS$ | +rex field=file "(?<node>\w+\d+)-stream.out" | +chart max($calc$) min($calc$) avg($calc$) stdev($calc$) by node $streamsort$ + $field1.earliest$ + $field1.latest$ + + + + + + + + + + HPL GFLOP/s Full System + + HPL Full System Performance $machine$ + + index=hpctest sourcetype=pavilion2 sys_name=$machine$ name=hpl-full* gflops!=null| +chart max(gflops) as "MAX GFLOP/s" min(gflops) as "MIN GFLOP/s" avg(gflops) as "AVG GFLOP/s" by valn| +sort max(gflops) + $field1.earliest$ + $field1.latest$ + + + + + + + + + + + + diff --git a/microbenchmarks/stream/stream.c b/microbenchmarks/stream/stream.c new file mode 100644 index 00000000..24edd556 --- /dev/null +++ b/microbenchmarks/stream/stream.c @@ -0,0 +1,586 @@ +/*-----------------------------------------------------------------------*/ +/* Program: STREAM */ +/* Revision: $Id: stream.c,v 5.10 2013/01/17 16:01:06 mccalpin Exp mccalpin $ */ +/* Original code developed by John D. McCalpin */ +/* Programmers: John D. McCalpin */ +/* Joe R. Zagar */ +/* */ +/* This program measures memory transfer rates in MB/s for simple */ +/* computational kernels coded in C. */ +/*-----------------------------------------------------------------------*/ +/* Copyright 1991-2013: John D. McCalpin */ +/*-----------------------------------------------------------------------*/ +/* License: */ +/* 1. You are free to use this program and/or to redistribute */ +/* this program. */ +/* 2. You are free to modify this program for your own use, */ +/* including commercial use, subject to the publication */ +/* restrictions in item 3. */ +/* 3. You are free to publish results obtained from running this */ +/* program, or from works that you derive from this program, */ +/* with the following limitations: */ +/* 3a. 
In order to be referred to as "STREAM benchmark results", */ +/* published results must be in conformance to the STREAM */ +/* Run Rules, (briefly reviewed below) published at */ +/* http://www.cs.virginia.edu/stream/ref.html */ +/* and incorporated herein by reference. */ +/* As the copyright holder, John McCalpin retains the */ +/* right to determine conformity with the Run Rules. */ +/* 3b. Results based on modified source code or on runs not in */ +/* accordance with the STREAM Run Rules must be clearly */ +/* labelled whenever they are published. Examples of */ +/* proper labelling include: */ +/* "tuned STREAM benchmark results" */ +/* "based on a variant of the STREAM benchmark code" */ +/* Other comparable, clear, and reasonable labelling is */ +/* acceptable. */ +/* 3c. Submission of results to the STREAM benchmark web site */ +/* is encouraged, but not required. */ +/* 4. Use of this program or creation of derived works based on this */ +/* program constitutes acceptance of these licensing restrictions. */ +/* 5. Absolutely no warranty is expressed or implied. */ +/*-----------------------------------------------------------------------*/ +# include +# include +# include +# include +# include +# include + +/*----------------------------------------------------------------------- + * INSTRUCTIONS: + * + * 1) STREAM requires different amounts of memory to run on different + * systems, depending on both the system cache size(s) and the + * granularity of the system timer. + * You should adjust the value of 'STREAM_ARRAY_SIZE' (below) + * to meet *both* of the following criteria: + * (a) Each array must be at least 4 times the size of the + * available cache memory. I don't worry about the difference + * between 10^6 and 2^20, so in practice the minimum array size + * is about 3.8 times the cache size. + * Example 1: One Xeon E3 with 8 MB L3 cache + * STREAM_ARRAY_SIZE should be >= 4 million, giving + * an array size of 30.5 MB and a total memory requirement + * of 91.5 MB. + * Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP) + * STREAM_ARRAY_SIZE should be >= 20 million, giving + * an array size of 153 MB and a total memory requirement + * of 458 MB. + * (b) The size should be large enough so that the 'timing calibration' + * output by the program is at least 20 clock-ticks. + * Example: most versions of Windows have a 10 millisecond timer + * granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds. + * If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec. + * This means the each array must be at least 1 GB, or 128M elements. + * + * Version 5.10 increases the default array size from 2 million + * elements to 10 million elements in response to the increasing + * size of L3 caches. The new default size is large enough for caches + * up to 20 MB. + * Version 5.10 changes the loop index variables from "register int" + * to "ssize_t", which allows array indices >2^32 (4 billion) + * on properly configured 64-bit systems. Additional compiler options + * (such as "-mcmodel=medium") may be required for large memory runs. + * + * Array size can be set at compile time without modifying the source + * code for the (many) compilers that support preprocessor definitions + * on the compile line. E.g., + * gcc -O -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream.100M + * will override the default size of 10M with a new size of 100M elements + * per array. 
+ */ +#ifndef STREAM_ARRAY_SIZE +# define STREAM_ARRAY_SIZE 10000000 +#endif + +/* 2) STREAM runs each kernel "NTIMES" times and reports the *best* result + * for any iteration after the first, therefore the minimum value + * for NTIMES is 2. + * There are no rules on maximum allowable values for NTIMES, but + * values larger than the default are unlikely to noticeably + * increase the reported performance. + * NTIMES can also be set on the compile line without changing the source + * code using, for example, "-DNTIMES=7". + */ +#ifdef NTIMES +#if NTIMES<=1 +# define NTIMES 10 +#endif +#endif +#ifndef NTIMES +# define NTIMES 10 +#endif + +/* Users are allowed to modify the "OFFSET" variable, which *may* change the + * relative alignment of the arrays (though compilers may change the + * effective offset by making the arrays non-contiguous on some systems). + * Use of non-zero values for OFFSET can be especially helpful if the + * STREAM_ARRAY_SIZE is set to a value close to a large power of 2. + * OFFSET can also be set on the compile line without changing the source + * code using, for example, "-DOFFSET=56". + */ +#ifndef OFFSET +# define OFFSET 0 +#endif + +/* + * 3) Compile the code with optimization. Many compilers generate + * unreasonably bad code before the optimizer tightens things up. + * If the results are unreasonably good, on the other hand, the + * optimizer might be too smart for me! + * + * For a simple single-core version, try compiling with: + * cc -O stream.c -o stream + * This is known to work on many, many systems.... + * + * To use multiple cores, you need to tell the compiler to obey the OpenMP + * directives in the code. This varies by compiler, but a common example is + * gcc -O -fopenmp stream.c -o stream_omp + * The environment variable OMP_NUM_THREADS allows runtime control of the + * number of threads/cores used when the resulting "stream_omp" program + * is executed. + * + * To run with single-precision variables and arithmetic, simply add + * -DSTREAM_TYPE=float + * to the compile line. + * Note that this changes the minimum array sizes required --- see (1) above. + * + * The preprocessor directive "TUNED" does not do much -- it simply causes the + * code to call separate functions to execute each kernel. Trivial versions + * of these functions are provided, but they are *not* tuned -- they just + * provide predefined interfaces to be replaced with tuned code. + * + * + * 4) Optional: Mail the results to mccalpin@cs.virginia.edu + * Be sure to include info that will help me understand: + * a) the computer hardware configuration (e.g., processor model, memory type) + * b) the compiler name/version and compilation flags + * c) any run-time information (such as OMP_NUM_THREADS) + * d) all of the output from the test case. + * + * Thanks! 
+ * + *-----------------------------------------------------------------------*/ + +# define HLINE "-------------------------------------------------------------\n" + +# ifndef MIN +# define MIN(x,y) ((x)<(y)?(x):(y)) +# endif +# ifndef MAX +# define MAX(x,y) ((x)>(y)?(x):(y)) +# endif + +#ifndef STREAM_TYPE +#define STREAM_TYPE double +#endif + +static STREAM_TYPE a[STREAM_ARRAY_SIZE+OFFSET], + b[STREAM_ARRAY_SIZE+OFFSET], + c[STREAM_ARRAY_SIZE+OFFSET]; + +static double avgtime[4] = {0}, maxtime[4] = {0}, + mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX}; + +static char *label[4] = {"Copy: ", "Scale: ", + "Add: ", "Triad: "}; + +static double bytes[4] = { + 2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE + }; + +extern double mysecond(); +extern void checkSTREAMresults(); +#ifdef TUNED +extern void tuned_STREAM_Copy(); +extern void tuned_STREAM_Scale(STREAM_TYPE scalar); +extern void tuned_STREAM_Add(); +extern void tuned_STREAM_Triad(STREAM_TYPE scalar); +#endif +#ifdef _OPENMP +extern int omp_get_num_threads(); +#endif +int +main() + { + int quantum, checktick(); + int BytesPerWord; + int k; + ssize_t j; + STREAM_TYPE scalar; + double t, times[4][NTIMES]; + + /* --- SETUP --- determine precision and check timing --- */ + + printf(HLINE); + printf("STREAM version $Revision: 5.10 $\n"); + printf(HLINE); + BytesPerWord = sizeof(STREAM_TYPE); + printf("This system uses %d bytes per array element.\n", + BytesPerWord); + + printf(HLINE); +#ifdef N + printf("***** WARNING: ******\n"); + printf(" It appears that you set the preprocessor variable N when compiling this code.\n"); + printf(" This version of the code uses the preprocesor variable STREAM_ARRAY_SIZE to control the array size\n"); + printf(" Reverting to default value of STREAM_ARRAY_SIZE=%llu\n",(unsigned long long) STREAM_ARRAY_SIZE); + printf("***** WARNING: ******\n"); +#endif + + printf("Array size = %llu (elements), Offset = %d (elements)\n" , (unsigned long long) STREAM_ARRAY_SIZE, OFFSET); + printf("Memory per array = %.1f MiB (= %.1f GiB).\n", + BytesPerWord * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.0), + BytesPerWord * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.0/1024.0)); + printf("Total memory required = %.1f MiB (= %.1f GiB).\n", + (3.0 * BytesPerWord) * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.), + (3.0 * BytesPerWord) * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024./1024.)); + printf("Each kernel will be executed %d times.\n", NTIMES); + printf(" The *best* time for each kernel (excluding the first iteration)\n"); + printf(" will be used to compute the reported bandwidth.\n"); + +#ifdef _OPENMP + printf(HLINE); +#pragma omp parallel + { +#pragma omp master + { + k = omp_get_num_threads(); + printf ("Number of Threads requested = %i\n",k); + } + } +#endif + +#ifdef _OPENMP + k = 0; +#pragma omp parallel +#pragma omp atomic + k++; + printf ("Number of Threads counted = %i\n",k); +#endif + + /* Get initial value for system clock. 
*/ +#pragma omp parallel for + for (j=0; j= 1) + printf("Your clock granularity/precision appears to be " + "%d microseconds.\n", quantum); + else { + printf("Your clock granularity appears to be " + "less than one microsecond.\n"); + quantum = 1; + } + + t = mysecond(); +#pragma omp parallel for + for (j = 0; j < STREAM_ARRAY_SIZE; j++) + a[j] = 2.0E0 * a[j]; + t = 1.0E6 * (mysecond() - t); + + printf("Each test below will take on the order" + " of %d microseconds.\n", (int) t ); + printf(" (= %d clock ticks)\n", (int) (t/quantum) ); + printf("Increase the size of the arrays if this shows that\n"); + printf("you are not getting at least 20 clock ticks per test.\n"); + + printf(HLINE); + + printf("WARNING -- The above is only a rough guideline.\n"); + printf("For best results, please be sure you know the\n"); + printf("precision of your system timer.\n"); + printf(HLINE); + + /* --- MAIN LOOP --- repeat test cases NTIMES times --- */ + + scalar = 3.0; + for (k=0; k + +double mysecond() +{ + struct timeval tp; + struct timezone tzp; + int i; + + i = gettimeofday(&tp,&tzp); + return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 ); +} + +#ifndef abs +#define abs(a) ((a) >= 0 ? (a) : -(a)) +#endif +void checkSTREAMresults () +{ + STREAM_TYPE aj,bj,cj,scalar; + STREAM_TYPE aSumErr,bSumErr,cSumErr; + STREAM_TYPE aAvgErr,bAvgErr,cAvgErr; + double epsilon; + ssize_t j; + int k,ierr,err; + + /* reproduce initialization */ + aj = 1.0; + bj = 2.0; + cj = 0.0; + /* a[] is modified during timing check */ + aj = 2.0E0 * aj; + /* now execute timing loop */ + scalar = 3.0; + for (k=0; k epsilon) { + err++; + printf ("Failed Validation on array a[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",aj,aAvgErr,abs(aAvgErr)/aj); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array a: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,aj,a[j],abs((aj-a[j])/aAvgErr)); + } +#endif + } + } + printf(" For array a[], %d errors were found.\n",ierr); + } + if (abs(bAvgErr/bj) > epsilon) { + err++; + printf ("Failed Validation on array b[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",bj,bAvgErr,abs(bAvgErr)/bj); + printf (" AvgRelAbsErr > Epsilon (%e)\n",epsilon); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array b: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,bj,b[j],abs((bj-b[j])/bAvgErr)); + } +#endif + } + } + printf(" For array b[], %d errors were found.\n",ierr); + } + if (abs(cAvgErr/cj) > epsilon) { + err++; + printf ("Failed Validation on array c[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",cj,cAvgErr,abs(cAvgErr)/cj); + printf (" AvgRelAbsErr > Epsilon (%e)\n",epsilon); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array c: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,cj,c[j],abs((cj-c[j])/cAvgErr)); + } +#endif + } + } + printf(" For array c[], %d errors were found.\n",ierr); + } + if (err == 0) { + printf ("Solution Validates: avg error less than %e on all three arrays\n",epsilon); + } +#ifdef VERBOSE + printf ("Results Validation Verbose Results: \n"); + printf (" Expected a(1), b(1), c(1): %f %f %f \n",aj,bj,cj); + printf (" Observed a(1), b(1), c(1): %f %f %f \n",a[1],b[1],c[1]); + printf (" Rel Errors on a, b, c: 
%e %e %e \n",abs(aAvgErr/aj),abs(bAvgErr/bj),abs(cAvgErr/cj)); +#endif +} + +#ifdef TUNED +/* stubs for "tuned" versions of the kernels */ +void tuned_STREAM_Copy() +{ + ssize_t j; +#pragma omp parallel for + for (j=0; j +# include +# include +# include +# include +# include +# include +# include +# include "mpi.h" + +/*----------------------------------------------------------------------- + * INSTRUCTIONS: + * + * 1) STREAM requires different amounts of memory to run on different + * systems, depending on both the system cache size(s) and the + * granularity of the system timer. + * You should adjust the value of 'STREAM_ARRAY_SIZE' (below) + * to meet *both* of the following criteria: + * (a) Each array must be at least 4 times the size of the + * available cache memory. I don't worry about the difference + * between 10^6 and 2^20, so in practice the minimum array size + * is about 3.8 times the cache size. + * Example 1: One Xeon E3 with 8 MB L3 cache + * STREAM_ARRAY_SIZE should be >= 4 million, giving + * an array size of 30.5 MB and a total memory requirement + * of 91.5 MB. + * Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP) + * STREAM_ARRAY_SIZE should be >= 20 million, giving + * an array size of 153 MB and a total memory requirement + * of 458 MB. + * (b) The size should be large enough so that the 'timing calibration' + * output by the program is at least 20 clock-ticks. + * Example: most versions of Windows have a 10 millisecond timer + * granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds. + * If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec. + * This means the each array must be at least 1 GB, or 128M elements. + * + * Version 5.10 increases the default array size from 2 million + * elements to 10 million elements in response to the increasing + * size of L3 caches. The new default size is large enough for caches + * up to 20 MB. + * Version 5.10 changes the loop index variables from "register int" + * to "ssize_t", which allows array indices >2^32 (4 billion) + * on properly configured 64-bit systems. Additional compiler options + * (such as "-mcmodel=medium") may be required for large memory runs. + * + * Array size can be set at compile time without modifying the source + * code for the (many) compilers that support preprocessor definitions + * on the compile line. E.g., + * gcc -O -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream.100M + * will override the default size of 10M with a new size of 100M elements + * per array. + */ + +// ----------------------- !!! NOTE CHANGE IN DEFINITION !!! ------------------ +// For the MPI version of STREAM, the three arrays with this many elements +// each will be *distributed* across the MPI ranks. +// +// Be careful when computing the array size needed for a particular target +// system to meet the minimum size requirement to ensure overflowing the caches. +// +// Example: +// Assume 4 nodes with two Intel Xeon E5-2680 processors (20 MiB L3) each. +// The *total* L3 cache size is 4*2*20 = 160 MiB, so each array must be +// at least 640 MiB, or at least 80 million 8 Byte elements. +// Note that it does not matter whether you use one MPI rank per node or +// 16 MPI ranks per node -- only the total array size and the total +// cache size matter. +// +#ifndef STREAM_ARRAY_SIZE +# define STREAM_ARRAY_SIZE 10000000 +#endif + +/* 2) STREAM runs each kernel "NTIMES" times and reports the *best* result + * for any iteration after the first, therefore the minimum value + * for NTIMES is 2. 
+ * There are no rules on maximum allowable values for NTIMES, but + * values larger than the default are unlikely to noticeably + * increase the reported performance. + * NTIMES can also be set on the compile line without changing the source + * code using, for example, "-DNTIMES=7". + */ +#ifdef NTIMES +#if NTIMES<=1 +# define NTIMES 10 +#endif +#endif +#ifndef NTIMES +# define NTIMES 10 +#endif + +// Make the scalar coefficient modifiable at compile time. +// The old value of 3.0 cause floating-point overflows after a relatively small +// number of iterations. The new default of 0.42 allows over 2000 iterations for +// 32-bit IEEE arithmetic and over 18000 iterations for 64-bit IEEE arithmetic. +// The growth in the solution can be eliminated (almost) completely by setting +// the scalar value to 0.41421445, but this also means that the error checking +// code no longer triggers an error if the code does not actually execute the +// correct number of iterations! +#ifndef SCALAR +#define SCALAR 0.42 +#endif + + +// ----------------------- !!! NOTE CHANGE IN DEFINITION !!! ------------------ +// The OFFSET preprocessor variable is not used in this version of the benchmark. +// The user must change the code at or after the "posix_memalign" array allocations +// to change the relative alignment of the pointers. +// ----------------------- !!! NOTE CHANGE IN DEFINITION !!! ------------------ +#ifndef OFFSET +# define OFFSET 0 +#endif + + +/* + * 3) Compile the code with optimization. Many compilers generate + * unreasonably bad code before the optimizer tightens things up. + * If the results are unreasonably good, on the other hand, the + * optimizer might be too smart for me! + * + * For a simple single-core version, try compiling with: + * cc -O stream.c -o stream + * This is known to work on many, many systems.... + * + * To use multiple cores, you need to tell the compiler to obey the OpenMP + * directives in the code. This varies by compiler, but a common example is + * gcc -O -fopenmp stream.c -o stream_omp + * The environment variable OMP_NUM_THREADS allows runtime control of the + * number of threads/cores used when the resulting "stream_omp" program + * is executed. + * + * To run with single-precision variables and arithmetic, simply add + * -DSTREAM_TYPE=float + * to the compile line. + * Note that this changes the minimum array sizes required --- see (1) above. + * + * The preprocessor directive "TUNED" does not do much -- it simply causes the + * code to call separate functions to execute each kernel. Trivial versions + * of these functions are provided, but they are *not* tuned -- they just + * provide predefined interfaces to be replaced with tuned code. + * + * + * 4) Optional: Mail the results to mccalpin@cs.virginia.edu + * Be sure to include info that will help me understand: + * a) the computer hardware configuration (e.g., processor model, memory type) + * b) the compiler name/version and compilation flags + * c) any run-time information (such as OMP_NUM_THREADS) + * d) all of the output from the test case. + * + * Thanks! 
+ * + *-----------------------------------------------------------------------*/ + +# define HLINE "-------------------------------------------------------------\n" + +# ifndef MIN +# define MIN(x,y) ((x)<(y)?(x):(y)) +# endif +# ifndef MAX +# define MAX(x,y) ((x)>(y)?(x):(y)) +# endif + +#ifndef STREAM_TYPE +#define STREAM_TYPE double +#endif + +//static STREAM_TYPE a[STREAM_ARRAY_SIZE+OFFSET], +// b[STREAM_ARRAY_SIZE+OFFSET], +// c[STREAM_ARRAY_SIZE+OFFSET]; + +// Some compilers require an extra keyword to recognize the "restrict" qualifier. +double * restrict a, * restrict b, * restrict c; + +size_t array_elements, array_bytes, array_alignment; +static double avgtime[4] = {0}, maxtime[4] = {0}, + mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX}; + +static char *label[4] = {"Copy: ", "Scale: ", + "Add: ", "Triad: "}; + +static double bytes[4] = { + 2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE + }; + +extern void checkSTREAMresults(STREAM_TYPE *AvgErrByRank, int numranks); +extern void computeSTREAMerrors(STREAM_TYPE *aAvgErr, STREAM_TYPE *bAvgErr, STREAM_TYPE *cAvgErr); +#ifdef TUNED +extern void tuned_STREAM_Copy(); +extern void tuned_STREAM_Scale(STREAM_TYPE scalar); +extern void tuned_STREAM_Add(); +extern void tuned_STREAM_Triad(STREAM_TYPE scalar); +#endif +#ifdef _OPENMP +extern int omp_get_num_threads(); +#endif +int +main() + { + int quantum, checktick(); + int BytesPerWord; + int i,k; + ssize_t j; + STREAM_TYPE scalar; + double t, times[4][NTIMES]; + double *TimesByRank; + double t0,t1,tmin; + int rc, numranks, myrank; + STREAM_TYPE AvgError[3] = {0.0,0.0,0.0}; + STREAM_TYPE *AvgErrByRank; + + /* --- SETUP --- call MPI_Init() before anything else! --- */ + + rc = MPI_Init(NULL, NULL); + t0 = MPI_Wtime(); + if (rc != MPI_SUCCESS) { + printf("ERROR: MPI Initialization failed with return code %d\n",rc); + exit(1); + } + // if either of these fail there is something really screwed up! + MPI_Comm_size(MPI_COMM_WORLD, &numranks); + MPI_Comm_rank(MPI_COMM_WORLD, &myrank); + + /* --- NEW FEATURE --- distribute requested storage across MPI ranks --- */ + array_elements = STREAM_ARRAY_SIZE / numranks; // don't worry about rounding vs truncation + array_alignment = 64; // Can be modified -- provides partial support for adjusting relative alignment + + // Dynamically allocate the three arrays using "posix_memalign()" + // NOTE that the OFFSET parameter is not used in this version of the code! 
+ array_bytes = array_elements * sizeof(STREAM_TYPE); + k = posix_memalign((void **)&a, array_alignment, array_bytes); + if (k != 0) { + printf("Rank %d: Allocation of array a failed, return code is %d\n",myrank,k); + MPI_Abort(MPI_COMM_WORLD, 2); + exit(1); + } + k = posix_memalign((void **)&b, array_alignment, array_bytes); + if (k != 0) { + printf("Rank %d: Allocation of array b failed, return code is %d\n",myrank,k); + MPI_Abort(MPI_COMM_WORLD, 2); + exit(1); + } + k = posix_memalign((void **)&c, array_alignment, array_bytes); + if (k != 0) { + printf("Rank %d: Allocation of array c failed, return code is %d\n",myrank,k); + MPI_Abort(MPI_COMM_WORLD, 2); + exit(1); + } + + // Initial informational printouts -- rank 0 handles all the output + if (myrank == 0) { + printf(HLINE); + printf("STREAM version $Revision: 1.8 $\n"); + printf(HLINE); + BytesPerWord = sizeof(STREAM_TYPE); + printf("This system uses %d bytes per array element.\n", + BytesPerWord); + + printf(HLINE); +#ifdef N + printf("***** WARNING: ******\n"); + printf(" It appears that you set the preprocessor variable N when compiling this code.\n"); + printf(" This version of the code uses the preprocesor variable STREAM_ARRAY_SIZE to control the array size\n"); + printf(" Reverting to default value of STREAM_ARRAY_SIZE=%llu\n",(unsigned long long) STREAM_ARRAY_SIZE); + printf("***** WARNING: ******\n"); +#endif + if (OFFSET != 0) { + printf("***** WARNING: ******\n"); + printf(" This version ignores the OFFSET parameter.\n"); + printf("***** WARNING: ******\n"); + } + + printf("Total Aggregate Array size = %llu (elements)\n" , (unsigned long long) STREAM_ARRAY_SIZE); + printf("Total Aggregate Memory per array = %.1f MiB (= %.1f GiB).\n", + BytesPerWord * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.0), + BytesPerWord * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.0/1024.0)); + printf("Total Aggregate memory required = %.1f MiB (= %.1f GiB).\n", + (3.0 * BytesPerWord) * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.), + (3.0 * BytesPerWord) * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024./1024.)); + printf("Data is distributed across %d MPI ranks\n",numranks); + printf(" Array size per MPI rank = %llu (elements)\n" , (unsigned long long) array_elements); + printf(" Memory per array per MPI rank = %.1f MiB (= %.1f GiB).\n", + BytesPerWord * ( (double) array_elements / 1024.0/1024.0), + BytesPerWord * ( (double) array_elements / 1024.0/1024.0/1024.0)); + printf(" Total memory per MPI rank = %.1f MiB (= %.1f GiB).\n", + (3.0 * BytesPerWord) * ( (double) array_elements / 1024.0/1024.), + (3.0 * BytesPerWord) * ( (double) array_elements / 1024.0/1024./1024.)); + + printf(HLINE); + printf("Each kernel will be executed %d times.\n", NTIMES); + printf(" The *best* time for each kernel (excluding the first iteration)\n"); + printf(" will be used to compute the reported bandwidth.\n"); + printf("The SCALAR value used for this run is %f\n",SCALAR); + +#ifdef _OPENMP + printf(HLINE); +#pragma omp parallel + { +#pragma omp master + { + k = omp_get_num_threads(); + printf ("Number of Threads requested for each MPI rank = %i\n",k); + } + } +#endif + +#ifdef _OPENMP + k = 0; +#pragma omp parallel +#pragma omp atomic + k++; + printf ("Number of Threads counted for rank 0 = %i\n",k); +#endif + + } + + /* --- SETUP --- initialize arrays and estimate precision of timer --- */ + +#pragma omp parallel for + for (j=0; j= 1) + printf("Your timer granularity/precision appears to be " + "%d microseconds.\n", quantum); + else { + printf("Your timer granularity 
appears to be " + "less than one microsecond.\n"); + quantum = 1; + } + } + + /* Get initial timing estimate to compare to timer granularity. */ + /* All ranks need to run this code since it changes the values in array a */ + t = MPI_Wtime(); +#pragma omp parallel for + for (j = 0; j < array_elements; j++) + a[j] = 2.0E0 * a[j]; + t = 1.0E6 * (MPI_Wtime() - t); + + if (myrank == 0) { + printf("Each test below will take on the order" + " of %d microseconds.\n", (int) t ); + printf(" (= %d timer ticks)\n", (int) (t/quantum) ); + printf("Increase the size of the arrays if this shows that\n"); + printf("you are not getting at least 20 timer ticks per test.\n"); + + printf(HLINE); + + printf("WARNING -- The above is only a rough guideline.\n"); + printf("For best results, please be sure you know the\n"); + printf("precision of your system timer.\n"); + printf(HLINE); +#ifdef VERBOSE + t1 = MPI_Wtime(); + printf("VERBOSE: total setup time for rank 0 = %f seconds\n",t1-t0); + printf(HLINE); +#endif + } + + /* --- MAIN LOOP --- repeat test cases NTIMES times --- */ + + // This code has more barriers and timing calls than are actually needed, but + // this should not cause a problem for arrays that are large enough to satisfy + // the STREAM run rules. + // MAJOR FIX!!! Version 1.7 had the start timer for each loop *after* the + // MPI_Barrier(), when it should have been *before* the MPI_Barrier(). + // + + scalar = SCALAR; + for (k=0; k= 0 ? (a) : -(a)) +#endif +void computeSTREAMerrors(STREAM_TYPE *aAvgErr, STREAM_TYPE *bAvgErr, STREAM_TYPE *cAvgErr) +{ + STREAM_TYPE aj,bj,cj,scalar; + STREAM_TYPE aSumErr,bSumErr,cSumErr; + ssize_t j; + int k; + + /* reproduce initialization */ + aj = 1.0; + bj = 2.0; + cj = 0.0; + /* a[] is modified during timing check */ + aj = 2.0E0 * aj; + /* now execute timing loop */ + scalar = SCALAR; + for (k=0; k epsilon) { + err++; + printf ("Failed Validation on array a[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",aj,aAvgErr,abs(aAvgErr)/aj); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array a: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,aj,a[j],abs((aj-a[j])/aAvgErr)); + } +#endif + } + } + printf(" For array a[], %d errors were found.\n",ierr); + } + if (abs(bAvgErr/bj) > epsilon) { + err++; + printf ("Failed Validation on array b[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",bj,bAvgErr,abs(bAvgErr)/bj); + printf (" AvgRelAbsErr > Epsilon (%e)\n",epsilon); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array b: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,bj,b[j],abs((bj-b[j])/bAvgErr)); + } +#endif + } + } + printf(" For array b[], %d errors were found.\n",ierr); + } + if (abs(cAvgErr/cj) > epsilon) { + err++; + printf ("Failed Validation on array c[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",cj,cAvgErr,abs(cAvgErr)/cj); + printf (" AvgRelAbsErr > Epsilon (%e)\n",epsilon); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array c: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,cj,c[j],abs((cj-c[j])/cAvgErr)); + } +#endif + } + } + printf(" For array c[], %d errors were found.\n",ierr); + } + if (err == 0) { + printf ("Solution Validates: avg error less than %e on all three 
arrays\n",epsilon); + } +#ifdef VERBOSE + printf ("Results Validation Verbose Results: \n"); + printf (" Expected a(1), b(1), c(1): %f %f %f \n",aj,bj,cj); + printf (" Observed a(1), b(1), c(1): %f %f %f \n",a[1],b[1],c[1]); + printf (" Rel Errors on a, b, c: %e %e %e \n",abs(aAvgErr/aj),abs(bAvgErr/bj),abs(cAvgErr/cj)); +#endif +} + +#ifdef TUNED +/* stubs for "tuned" versions of the kernels */ +void tuned_STREAM_Copy() +{ + ssize_t j; +#pragma omp parallel for + for (j=0; j +#include +#include +#include +#include +#include +#include + +#ifdef _OPENMP +#include +#endif + +#ifndef STREAM_ARRAY_SIZE +#define STREAM_ARRAY_SIZE 100000000 +#endif + +#define NTIMES 10 +#define OFFSET 0 + +# define HLINE "-------------------------------------------------------------\n" + +#ifndef MIN +#define MIN(x,y) ((x)<(y)?(x):(y)) +#endif +#ifndef MAX +#define MAX(x,y) ((x)>(y)?(x):(y)) +#endif + +#ifndef STREAM_TYPE +#define STREAM_TYPE double +#endif + +#define STREAM_RESTRICT __restrict__ + +static STREAM_TYPE* STREAM_RESTRICT a; +static STREAM_TYPE* STREAM_RESTRICT b; +static STREAM_TYPE* STREAM_RESTRICT c; + +static double avgtime[4] = {0}, maxtime[4] = {0}, + mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX}; + +static char *label[4] = {"Copy: ", "Scale: ", + "Add: ", "Triad: "}; + +static double bytes[4] = { + 2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE + }; + +double mysecond(); +void checkSTREAMresults(); + + +int main() { + int quantum, checktick(); + int BytesPerWord; + int k; + ssize_t j; + STREAM_TYPE scalar; + double t, times[4][NTIMES]; + + /* --- SETUP --- determine precision and check timing --- */ + + printf(HLINE); + printf("STREAM Crossroads Memory Bandwidth\n"); + printf("(Based on the original STREAM benchmark by John D. McCalpin)\n"); + printf(HLINE); + BytesPerWord = sizeof(STREAM_TYPE); + printf("This system uses %d bytes per array element.\n", + BytesPerWord); + + printf(HLINE); + + const unsigned long long int array_size = (sizeof(STREAM_TYPE) * (STREAM_ARRAY_SIZE + OFFSET)); + + printf("Array size = %llu (elements), Offset = %d (elements)\n" , (unsigned long long) array_size, OFFSET); + printf("Memory per array = %.1f MiB (= %.1f GiB).\n", + BytesPerWord * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.0), + BytesPerWord * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.0/1024.0)); + printf("Total memory required = %.1f MiB (= %.1f GiB).\n", + (3.0 * BytesPerWord) * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.), + (3.0 * BytesPerWord) * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024./1024.)); + printf("Each kernel will be executed %d times.\n", NTIMES); + printf(" The *best* time for each kernel (excluding the first iteration)\n"); + printf(" will be used to compute the reported bandwidth.\n"); + + printf("Allocating arrays ...\n"); + + /////////////////////////////////////////////////////////////////////////////// + // VENDOR NOTIFICATION + // + // Memory allocation routines should be changed to reflect allocation in memory + // hierarchy level. 
All modifications should be reported in benchmark response + // + /////////////////////////////////////////////////////////////////////////////// + + posix_memalign((void**) &a, 64, array_size ); + posix_memalign((void**) &b, 64, array_size ); + posix_memalign((void**) &c, 64, array_size ); + + /////////////////////////////////////////////////////////////////////////////// + // END VENDOR NOTIFICATION + /////////////////////////////////////////////////////////////////////////////// + +#ifdef _OPENMP + k = 0; +#pragma omp parallel +#pragma omp atomic + k++; + printf ("Number of Threads counted = %i\n",k); +#endif + + printf("Populating values and performing first touch ... \n"); + /* Get initial value for system clock. */ +#pragma omp parallel for + for (j=0; j= 1) + printf("Your clock granularity/precision appears to be " + "%d microseconds.\n", quantum); + else { + printf("Your clock granularity appears to be " + "less than one microsecond.\n"); + quantum = 1; + } + + t = mysecond(); +#pragma omp parallel for + for (j = 0; j < STREAM_ARRAY_SIZE; j++) + a[j] = 2.0E0 * a[j]; + t = 1.0E6 * (mysecond() - t); + + printf("Each test below will take on the order" + " of %d microseconds.\n", (int) t ); + printf(" (= %d clock ticks)\n", (int) (t/quantum) ); + printf("Increase the size of the arrays if this shows that\n"); + printf("you are not getting at least 20 clock ticks per test.\n"); + + printf(HLINE); + + printf("WARNING -- The above is only a rough guideline.\n"); + printf("For best results, please be sure you know the\n"); + printf("precision of your system timer.\n"); + printf(HLINE); + + /* --- MAIN LOOP --- repeat test cases NTIMES times --- */ + + scalar = 3.0; + for (k=0; k < NTIMES; k++) + { + times[0][k] = mysecond(); + #pragma omp parallel for + for (j=0; j + +double mysecond() +{ + struct timeval tp; + struct timezone tzp; + int i; + + i = gettimeofday(&tp,&tzp); + return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 ); +} + +#ifndef abs +#define abs(a) ((a) >= 0 ? 
(a) : -(a)) +#endif +void checkSTREAMresults () +{ + STREAM_TYPE aj,bj,cj,scalar; + STREAM_TYPE aSumErr,bSumErr,cSumErr; + STREAM_TYPE aAvgErr,bAvgErr,cAvgErr; + double epsilon; + ssize_t j; + int k,ierr,err; + + /* reproduce initialization */ + aj = 1.0; + bj = 2.0; + cj = 0.0; + /* a[] is modified during timing check */ + aj = 2.0E0 * aj; + /* now execute timing loop */ + scalar = 3.0; + for (k=0; k epsilon) { + err++; + printf ("Failed Validation on array a[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",aj,aAvgErr,abs(aAvgErr)/aj); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array a: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,aj,a[j],abs((aj-a[j])/aAvgErr)); + } +#endif + } + } + printf(" For array a[], %d errors were found.\n",ierr); + } + if (abs(bAvgErr/bj) > epsilon) { + err++; + printf ("Failed Validation on array b[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",bj,bAvgErr,abs(bAvgErr)/bj); + printf (" AvgRelAbsErr > Epsilon (%e)\n",epsilon); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array b: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,bj,b[j],abs((bj-b[j])/bAvgErr)); + } +#endif + } + } + printf(" For array b[], %d errors were found.\n",ierr); + } + if (abs(cAvgErr/cj) > epsilon) { + err++; + printf ("Failed Validation on array c[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",cj,cAvgErr,abs(cAvgErr)/cj); + printf (" AvgRelAbsErr > Epsilon (%e)\n",epsilon); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array c: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,cj,c[j],abs((cj-c[j])/cAvgErr)); + } +#endif + } + } + printf(" For array c[], %d errors were found.\n",ierr); + } + if (err == 0) { + printf ("Solution Validates: avg error less than %e on all three arrays\n",epsilon); + } +#ifdef VERBOSE + printf ("Results Validation Verbose Results: \n"); + printf (" Expected a(1), b(1), c(1): %f %f %f \n",aj,bj,cj); + printf (" Observed a(1), b(1), c(1): %f %f %f \n",a[1],b[1],c[1]); + printf (" Rel Errors on a, b, c: %e %e %e \n",abs(aAvgErr/aj),abs(bAvgErr/bj),abs(cAvgErr/cj)); +#endif +} diff --git a/parthenon b/parthenon index 11c53d1c..c75ce20f 160000 --- a/parthenon +++ b/parthenon @@ -1 +1 @@ -Subproject commit 11c53d1cd4ada0629e06d069b70b410234ed0bde +Subproject commit c75ce20f938a4adaedb4425584954c3e74d56868 diff --git a/sparta b/sparta index 83d5f3a9..ca0ce28f 160000 --- a/sparta +++ b/sparta @@ -1 +1 @@ -Subproject commit 83d5f3a92c5fc0b59d4d973c6b1dddc4d77a7147 +Subproject commit ca0ce28fd76080d8b2828db77adde14fdc382c76 diff --git a/trilinos b/trilinos index f3ff0b54..5aaae1ad 160000 --- a/trilinos +++ b/trilinos @@ -1 +1 @@ -Subproject commit f3ff0b54c5158790295daff089ff0d286bda3c2c +Subproject commit 5aaae1ada6fe1ce777e671a0ff84fdc4f0779406 diff --git a/utils/pav_config/hosts/crossroads.yaml b/utils/pav_config/hosts/crossroads.yaml index 4e8b115c..e94ebcf9 100755 --- a/utils/pav_config/hosts/crossroads.yaml +++ b/utils/pav_config/hosts/crossroads.yaml @@ -6,14 +6,15 @@ variables: crayversion: "16.0.0" craympichversion: "8.1.26" partn: 'standard' + mpis: - { name: "cray-mpich", version: "{{craympichversion}}", mpicc: "cc", mpicxx: "CC", mpifc: "ftn", mpival: "cray"} compilers: - - 
{ name: "intel-classic", version: "{{intelversion}}", cc: "cc", cxx: "CC", pe_env: intel, fc: "ftn" } - - { name: "intel-oneapi", version: "{{intelversion}}", cc: "cc", cxx: "CC", pe_env: intel, fc: "ftn" } - - { name: "intel", version: "{{intelversion}}", cc: "cc", cxx: "CC", pe_env: intel, fc: "ftn" } - - { name: "cce", version: "{{crayversion}}", cc: "cc", cxx: "CC", pe_env: cray, fc: "ftn" } - + - { name: "intel-classic", version: "{{intelversion}}", cc: "cc", cxx: "CC", pe_env: intel, fc: "ftn", blas_cflags: '-DUSE_CBLAS=1 -O3 -qopenmp', arch_opt: ''} + - { name: "intel-oneapi", version: "{{intelversion}}", cc: "cc", cxx: "CC", pe_env: intel, fc: "ftn", blas_cflags: '-DUSE_CBLAS=1 -O3 -fopenmp', arch_opt: ''} + - { name: "intel", version: "{{intelversion}}", cc: "cc", cxx: "CC", pe_env: intel, fc: "ftn", blas_cflags: '-DUSE_CBLAS=1 -O3 -qopenmp', arch_opt: ''} + - { name: "cce", version: "{{crayversion}}", cc: "cc", cxx: "CC", pe_env: cray, fc: "ftn", blas_cflags: '-DUSE_CBLAS=1 -O3 -openmp', arch_opt: ''} + # - { name: "gcc", version: "12.2.0", pe_env: "PrgEnv-gnu", cc: "cc", cxx: "CC", fc: "ftn", blas_cflags: '-DUSE_CBLAS=1 -O3 -fopenmp', arch_opt: ''} scratch: - name: xrscratch path: "/lustre/xrscratch1/{{pav.user}}" diff --git a/utils/pav_config/test_src/dgemm b/utils/pav_config/test_src/dgemm new file mode 120000 index 00000000..8c5d18b9 --- /dev/null +++ b/utils/pav_config/test_src/dgemm @@ -0,0 +1 @@ +../../../microbenchmarks/dgemm \ No newline at end of file diff --git a/utils/pav_config/test_src/ior b/utils/pav_config/test_src/ior new file mode 120000 index 00000000..06a5443f --- /dev/null +++ b/utils/pav_config/test_src/ior @@ -0,0 +1 @@ +../../../microbenchmarks/ior \ No newline at end of file diff --git a/utils/pav_config/test_src/mdtest b/utils/pav_config/test_src/mdtest new file mode 120000 index 00000000..e7d2e746 --- /dev/null +++ b/utils/pav_config/test_src/mdtest @@ -0,0 +1 @@ +../../../microbenchmarks/mdtest \ No newline at end of file diff --git a/utils/pav_config/test_src/osumb b/utils/pav_config/test_src/osumb new file mode 120000 index 00000000..1315ac3b --- /dev/null +++ b/utils/pav_config/test_src/osumb @@ -0,0 +1 @@ +../../../microbenchmarks/osumb \ No newline at end of file diff --git a/utils/pav_config/test_src/stream b/utils/pav_config/test_src/stream new file mode 120000 index 00000000..55789187 --- /dev/null +++ b/utils/pav_config/test_src/stream @@ -0,0 +1 @@ +../../../microbenchmarks/stream \ No newline at end of file diff --git a/utils/pav_config/tests/stream.yaml b/utils/pav_config/tests/stream.yaml index fd438cc8..ed62f732 100644 --- a/utils/pav_config/tests/stream.yaml +++ b/utils/pav_config/tests/stream.yaml @@ -75,30 +75,33 @@ _base: STREAM ARRAY SIZE CALCULATIONS: ############### - STREAM - XRDS DOCUMENTATION: 4 x (45 MiB cache / processor) x (2 processors) / (3 arrays) / (8 bytes / element) = 15 Mi elements = 15000000 + FORMULA: + 4 x ((cache / socket) x (num sockets)) / (num arrays) / 8 (size of double) = 15 Mi elements = 15e6 ***************************************************************************************************** HASWELL: Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz CACHE: 40M SOCKETS: 2 - 4 * ( 40M * 2 ) / 3 ARRAYS / 8 Bytes/element = 13.4 Mi elements = 13400000 + 4 * ( 40M * 2 ) / 3 ARRAYS / 8 = 13.4 Mi elements = 13.4e6 ***************************************************************************************************** BROADWELL: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz CACHE: 45M SOCKETS: 2 - 4 * ( 45M * 2 ) / 3 ARRAYS / 8 
BYTES/ELEMENT = 15.0 Mi elements = 15000000 + 4 * ( 45M * 2 ) / 3 ARRAYS / 8 = 15.0 Mi elements = 15e6 ***************************************************************************************************** SAPPHIRE RAPIDS: Intel(R) Xeon(R) Platinum 8480+ - CACHE: 105 + CACHE: 105M SOCKETS: 2 - 4 x (105M * 2 ) / 3 ARRAYS / 8 BYTES/ELEMENT = 35 Mi elements = 35000000 + 4 x ( 105M * 2 ) / 3 ARRAYS / 8 = 35 Mi elements = 35e6 scheduler: slurm schedule: nodes: '10' # 'ALL' tasks_per_node: 1 share_allocation: false + + variables: ntimes: '10' + stream_array_size: '40' permute_on: - compilers @@ -129,14 +132,14 @@ _base: - 'fi' - 'export PAV_CC PAV_FC PAV_CFLAGS PAV_FFLAGS' - 'make clean' - - 'make {{target}} || exit 1' + - 'make all || exit 1' - '[ -x {{target}} ] || exit 1' run: env: CC: '{{compilers.cc}}' OMP_NUM_THREADS: '{{omp_num_threads}}' - GOMP_CPU_AFFINITY: '0-{{omp_num_threads-1}}' +# GOMP_CPU_AFFINITY: '0-{{omp_num_threads-1}}' preamble: - 'module load friendly-testing' - 'module load {{compilers.name}}/{{compilers.version}}' @@ -144,7 +147,7 @@ _base: cmds: - 'NTIMES={{ntimes}}' - 'N={{stream_array_size}}000000' - - 'echo "GOMP_CPU_AFFINITY: $GOMP_CPU_AFFINITY"' + # - 'echo "GOMP_CPU_AFFINITY: $GOMP_CPU_AFFINITY"' - 'echo "OMP_NUM_THREADS: $OMP_NUM_THREADS"' - 'echo "NTIMES=$NTIMES"' - 'echo "N=${N}"' @@ -292,7 +295,7 @@ spr_ddr5_xrds: "{{sys_name}}": [ darwin ] variables: arch: "spr" - stream_array_size: '105' + stream_array_size: '40' target: "xrds-stream.exe" omp_num_threads: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] omp_places: [cores, sockets] @@ -321,7 +324,7 @@ spr_hbm_xrds: "{{sys_name}}": [ darwin ] variables: arch: "spr" - stream_array_size: '105' + stream_array_size: '40' target: "xrds-stream.exe" omp_num_threads: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] omp_places: [cores, sockets] @@ -360,16 +363,13 @@ cts1_ats5: numnodes: '1' omp_num_threads: '1' stream_array_size: '40' + target: "stream_mpi.exe" schedule: nodes: "{{numnodes}}" share_allocation: true tasks_per_node: "{{tpn}}" - run: - env: - GOMP_CPU_AFFINITY: '' - result_parse: regex: triad_once: @@ -396,8 +396,8 @@ xrds_ats5: variables: tpn: [8, 32, 56, 88, 112] arch: "spr" - target: "xrds-stream.exe" - stream_array_size: '105' + target: "xrds_stream.exe" + stream_array_size: '40' ntimes: 20 #omp_places: [cores, sockets] #omp_proc_bind: [true] @@ -407,7 +407,6 @@ xrds_ats5: chunk: '{{chunk_ids.0}}' schedule: - partition: 'hbm' nodes: "{{numnodes}}" share_allocation: true tasks_per_node: "{{tpn}}" @@ -430,9 +429,6 @@ xrds_ats5: - 'module load {{compilers.name}}/{{compilers.version}}' - 'module load {{mpis.name}}/{{mpis.version}}' - env: - GOMP_CPU_AFFINITY: '' - result_parse: regex: triad_once: @@ -444,3 +440,9 @@ xrds_ats5: result_evaluate: per_proc_bw: 'sum(triad_once)/len(triad_once)' total_bw: 'sum(triad_once)' + +roci_ats5: + inherits_from: xrds_ats5 + + schedule: + partition: 'hbm' diff --git a/utils/pavilion b/utils/pavilion index 69b2d45d..f502ca86 160000 --- a/utils/pavilion +++ b/utils/pavilion @@ -1 +1 @@ -Subproject commit 69b2d45d696e623127c106b50525ba65daa23d76 +Subproject commit f502ca86fa27f4bc894aa19232c9f1f42361e269