diff --git a/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst b/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst index 88199844..136bb2bc 100644 --- a/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst +++ b/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst @@ -73,12 +73,13 @@ Adjustments to ``GOMP_CPU_AFFINITY`` may be necessary. The ``STREAM_ARRAY_SIZE`` value is a critical parameter set at compile time and controls the size of the array used to measure bandwidth. STREAM requires different amounts of memory to run on different systems, depending on both the system cache size(s) and the granularity of the system timer. -You should adjust the value of ``STREAM_ARRAY_SIZE`` to meet BOTH of the following criteria: +You should adjust the value of ``STREAM_ARRAY_SIZE`` to meet ALL of the following criteria: 1. Each array must be at least 4 times the size of the available cache memory. In practice the minimum array size is about 3.8 times the cache size. 1. Example 1: One Xeon E3 with 8 MB L3 cache ``STREAM_ARRAY_SIZE`` should be ``>= 4 million``, giving an array size of 30.5 MB and a total memory requirement of 91.5 MB. 2. Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP) ``STREAM_ARRAY_SIZE`` should be ``>= 20 million``, giving an array size of 153 MB and a total memory requirement of 458 MB. -2. The size should be large enough so that the 'timing calibration' output by the program is at least 20 clock-ticks. For example, most versions of Windows have a 10 millisecond timer granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds. If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec. This means the each array must be at least 1 GB, or 128M elements. +2. The size should be large enough so that the 'timing calibration' output by the program is at least 20 clock-ticks. For example, most versions of Windows have a 10 millisecond timer granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds. If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec. This means the each array must be at least 1 GB, or 128M elements. +3. The value ``24xSTREAM_ARRAY_SIZExRANKS_PER_NODE`` must be less than the amount of RAM on a node. STREAM creates 3 arrays of doubles; that is where 24 comes from. Each rank has 3 of these arrays. Set ``STREAM_ARRAY_SIZE`` using the -D flag on your compile line. @@ -88,8 +89,11 @@ The formula for ``STREAM_ARRAY_SIZE`` is: ARRAY_SIZE ~= 4 x (last_level_cache_size x num_sockets) / size_of_double = last_level_cache_size -This reduces to the same number of elements as bytes in the last level cache of a single processor for two socket nodes. -This is the minimum size. +This reduces to a number of elements equal to the size of the last level cache of a single socket in bytes, assuming a node has two sockets. +This is the minimum size unless other system attributes constrain it. + +The array size only influences the capacity of STREAM to fully load the memory bus. +At capacity, the measured values should reach a steady state where increasing the value of ``STREAM_ARRAY_SIZE`` doesn't influence the measurement for a certain number of processors. Running ======= @@ -117,7 +121,7 @@ Crossroads These results were obtained using the cce v15.0.1 compiler and cray-mpich v 8.1.25. Results using the intel-oneapi and intel-classic v2023.1.0 and the same cray-mpich were also collected; cce performed the best. -``STREAM_ARRAY_SIZE=105 NTIMES=20`` +``STREAM_ARRAY_SIZE=40 NTIMES=20`` .. 
csv-table:: STREAM microbenchmark bandwidth measurement :file: stream-xrds_ats5cce-cray-mpich.csv diff --git a/doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst b/doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst index 74029613..3bd11eae 100644 --- a/doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst +++ b/doc/sphinx/09_Microbenchmarks/M8_MDTEST/MDTEST.rst @@ -5,6 +5,7 @@ MDTEST Purpose ======= +The intent of this benchmark is to measure the performance of file metadata operations on the platform storage. MDtest is an MPI-based application for evaluating the metadata performance of a file system and has been designed to test parallel file systems. It can be run on any type of POSIX-compliant file system but has been designed to test the performance of parallel file systems. @@ -16,11 +17,19 @@ Characteristics Problem ------- +MDtest measures the performance of various metadata operations using MPI to coordinate execution and collect the results. +In this case, the operations in question are file creation, stat, and removal. + Run Rules --------- -Figure of Merit ---------------- +Observed benchmark performance shall be obtained from a storage system configured as closely as possible to the proposed platform storage. +If the proposed solution includes multiple file access protocols (e.g., pNFS and NFS) or multiple tiers accessible by applications, benchmark results for mdtest shall be provided for each protocol and/or tier. + +Performance projections are permissible if they are derived from a similar system that is considered an earlier generation of the proposed system. + +Modifications to the benchmark application code are only permissible to enable correct compilation and execution on the target platform. +Any modifications must be fully documented (e.g., as a diff or patch file) and reported with the benchmark results. Building ======== @@ -35,17 +44,53 @@ After extracting the tar file, ensure that the MPI is loaded and that the releva Running ======= -.. .. csv-table:: MDTEST Microbenchmark -.. :file: ats3_mdtest_sow.csv -.. :align: center -.. :widths: 10, 10, 10, 10, 10 -.. :header-rows: 1 -.. :stub-columns: 2 +The results for the three operations, create, stat, remove, should be obtained for three different file configurations: + +1) ``2^20`` files in a single directory. +2) ``2^20`` files in separate directories, 1 per MPI process. +3) 1 file accessed by multiple MPI processes. + +These configurations are launched as follows. + +.. code-block:: bash + + # Shared Directory + srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16 + # Unique Directories + srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16 -u + # One File Multi-Proc + srun -n 64 ./mdtest -F -C -T -r -n 16384 -d /scratch/$USER -N 16 -S + +The following command-line flags MUST be changed: + +* ``-n`` - the number of files **each MPI process** should manipulate. For a test run with 64 MPI processes, specifying ``-n 16384`` will produce the equired ``2^20`` files (``2^6`` MPI processes x ``2^14`` files each). This parameter must be changed for each level of concurrency. +* ``-d /scratch`` - the **absolute path** to the directory in which this test should be run. +* ``-N`` - MPI rank offset for each separate phase of the test. This parameter must be equal to the number of MPI processes per node in use (e.g., ``-N 16`` for a test with 16 processes per node) to ensure that each test phase (read, stat, and delete) is performed on a different node. 
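For other levels of concurrency, the value passed to ``-n`` is obtained by dividing the required ``2^20`` total files by the number of MPI processes.
A minimal sketch, assuming a Slurm launcher (the 256-rank count below is only an illustration):

.. code-block:: bash

   # Choose the rank count for this level of concurrency (illustrative value).
   RANKS=256
   # Files per MPI process so that RANKS x FILES_PER_RANK = 2^20 total files.
   FILES_PER_RANK=$(( (1 << 20) / RANKS ))
   srun -n $RANKS ./mdtest -F -C -T -r -n $FILES_PER_RANK -d /scratch/$USER -N 16

Remember that ``-N`` must also be adjusted to match the number of MPI processes per node for the chosen launch configuration.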
+ +The following command-line flags MUST NOT be changed or omitted: + +* ``-F`` - only operate on files, not directories +* ``-C`` - perform file creation test +* ``-T`` - perform file stat test +* ``-r`` - perform file remove test Example Results =============== -.. csv-table:: MDTEST Microbenchmark Xrds +These nine tests: three operations, three file conditions should be performed under 4 different launch conditions, for a total of 36 results: + +1) A single MPI process +2) The optimal number of MPI processes on a single compute node +3) The minimal number of MPI processes on multiple compute nodes that achieves the peak results for the proposed system. +4) The maximum possible MPI-level concurrency on the proposed system. This could mean: + 1) Using one MPI process per CPU core across the entire system. + 2) Using the maximum number of MPI processes possible if one MPI process per core will not be possible on the proposed architecture. + 3) Using more than ``2^20`` files if the system is capable of launching more than ``2^20`` MPI processes. + +Crossroads +---------- + +.. csv-table:: MDTEST Microbenchmark Crossroads :file: ats3_mdtest.csv :align: center :widths: 10, 10, 10, 10, 10 diff --git a/microbenchmarks/mdtest/README.XROADS.md b/microbenchmarks/mdtest/README.XROADS.md index 2783b547..90f065a6 100644 --- a/microbenchmarks/mdtest/README.XROADS.md +++ b/microbenchmarks/mdtest/README.XROADS.md @@ -57,8 +57,7 @@ node memory. The Offeror shall run the following tests: -* creating, statting, and removing at least 1,048,576 files in a single - directory +* creating, statting, and removing at least 1,048,576 files in a single directory. * creating, statting, and removing at least 1,048,576 files in separate directories (one directory per MPI process) * creating, statting, and removing one file by multiple MPI processes diff --git a/microbenchmarks/stream/Makefile b/microbenchmarks/stream/Makefile new file mode 100644 index 00000000..e98efebf --- /dev/null +++ b/microbenchmarks/stream/Makefile @@ -0,0 +1,30 @@ +# +MPICC ?= $(PAV_MPICC) +CC ?= $(PAV_CC) +CFLAGS ?= $(PAV_CFLAGS) + +FF ?= $(PAV_FC) +FFLAGS ?= $(PAV_FFLAGS) + +ifeq ($(MPICC),) + MPICC=$(CC) +endif + +all: xrds-stream.exe stream_f.exe stream_c.exe stream_mpi.exe + +stream_f.exe: stream.f mysecond.o + $(CC) $(CFLAGS) -c mysecond.c + $(FF) $(FFLAGS) -c stream.f + $(FF) $(FFLAGS) stream.o mysecond.o -o stream_f.exe + +clean: + rm -f *stream*.exe *.o + +stream_c.exe: stream.c + $(CC) $(CFLAGS) stream.c -o stream_c.exe + +stream_mpi.exe: stream_mpi.c + $(MPICC) $(CFLAGS) stream_mpi.c -o stream_mpi.exe + +xrds-stream.exe: xrds-stream.c + $(CC) $(CFLAGS) xrds-stream.c -o xrds-stream.exe diff --git a/microbenchmarks/stream/README.ACES b/microbenchmarks/stream/README.ACES new file mode 100644 index 00000000..01cb258e --- /dev/null +++ b/microbenchmarks/stream/README.ACES @@ -0,0 +1,74 @@ +======================================================================================== +Crossroads Memory Bandwidth Benchmark +======================================================================================== +Benchmark Version: 1.0.0 + +======================================================================================== +Benchmark Description: +======================================================================================== +The Crossroads Memory Bandwidth benchmark is a modified version of the STREAM +benchmark originally written by John D. McCalpin. The modifications have been made to +simplify the code. 
All memory bandwidth projections/results provided in the vendor +response should be measured using the kernels in this benchmark implementation and +not the original STREAM code. + +======================================================================================== +Permitted Modifications: +======================================================================================== +Offerers are permitted to modify the benchmark in the following ways: + +OpenMP Pragmas - the Offeror may modify the OpenMP pragmas in the benchmark as required +to permit execution on the proposed system provided: (1) all modified sources and build +scripts are included in the RFP response; (2) any modified code used for the response +must continue to be a valid OpenMP program (compliant to the standard being proposed in +the Offeror's response). + +Memory Allocation Routines - memory allocation routines including modified allocations +to specify the level of the memory hierarchy or placement of the data are permitted +provided: (1) all modified sources and build scripts are included in the RFP response; +(2) the use of any specific libraries to provide allocation services must be provided +in the proposed system. + +Array/Allocation Sizes - the sizes of the allocated arrays may be modified to exercise +the appropriate size and level of the memory hierarchy provided the benchmark correctly +exercises the memory system being targeted. + +Index Type - the range of the index type is configured for a 32-bit signed integer ("int") +via the preprocessor define STREAM_INDEX_TYPE. If very large memories are benchmarked +the Offeror is permitted to change to a larger integer type. The Offeror should indicate +in their response that this modification has been made. + +Accumulation Error Checking Type - the basic accumulation error type is configured for a +64-bit double precision value via the STREAM_CHECK_TYPE preprocessor define. This may be +modified to a higher precision type ("long double") if a large memory (requiring a 64-bit +integer for STREAM_INDEX_TYPE) is used. The Offeror should indicate in their response +that this modification has been made. + +======================================================================================== +Run Rules: +======================================================================================== +The Offeror may utilize any number of threads, affinity and memory binding options for +execution of the benchmark provided: (1) details of all command line parameters, +environment variables and binding tools are included in the response. + +The vendor is expected to provide memory bandwidth projections/benchmarked results using +the Crossroads memory bandwidth benchmark to each level of the memory hierarchy +accessible from each compute resource. + +======================================================================================== +How to Compile, Run and Verify: +======================================================================================== +To build simply type modify the file Makefile for your compiler and type make. To run, +execute the file xroads-stream. xroads-stream performs self verification. + +$ make + +$ export OMP_NUM_THREADS=12 +$ ./xroads-stream + + +======================================================================================== +How to report +======================================================================================== +The primary FOM is the Triad rate (MB/s). Report all data printed to stdout. 
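A minimal sketch of one way to capture the run configuration required by the run rules
together with the full program output (the file names below are arbitrary):

$ env | grep -E 'OMP|KMP' > run-environment.txt
$ ./xroads-stream | tee xroads-stream-output.txt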
+ diff --git a/microbenchmarks/stream/README.md b/microbenchmarks/stream/README.md new file mode 100644 index 00000000..731676c9 --- /dev/null +++ b/microbenchmarks/stream/README.md @@ -0,0 +1,561 @@ +# Crossroads Acceptance + +- PI: Douglas M. Pase [dmpase@sandia.gov](mailto:dmpase@sandia.gov) +- Erik A. Illescas [eailles@sandia.gov](mailto:eailles@sandia.gov) +- Anthony M. Agelastos [amagela@sandia.gov](mailto:amagela@sandia.gov) + +This project tracks data and analysis from the Crossroads Acceptance effort. + + +## STREAM +STREAM, like DGEMM, is also a microbenchmark that measures a single fundamental +aspect of a system. +But while DGEMM measures floating-point vector performance, STREAM +measures memory bandwidth. +More specifically, it measures the performance of two 64-bit loads and a +store operation to a third location. +The operation looks like this: +```C +for (int i=0; i < size; i++) { + x[i] = y[i] + constant * z[i]; +} +``` + +Two 64-bit words are loaded into cache and registers, combined arithmetically, +and the result is stored in a third location. +In cache-based microprocessors, this typically means that all three +locations, ```x[i]```, ```y[i]```, and ```z[i]```, must be read into +cache. +Values ```x[i]``` and ```y[i]``` must be read because their values are needed +for the computation, but ```z[i]``` must also be loaded into cache in order to +maintain cache coherency. +With no other hardware support, this would mean the best throughput one could +hope for is about 75% of peak, because the processor must execute three loads +(one for each of ```x[i]```, ```y[i]```, and ```z[i]```) and one store (```z[i]```), +but it only gets credit for two loads (```x[i]``` and ```y[i]```) and one store (```z[i]```). +And, (2+1)/(3+1)=75%. + +But most most current architectures support an optimization feature, called a +Microarchitectural Store Buffer (MSB), that speeds up the write operation in +some cases. +When a processor executes a cacheable write smaller than a cache line, the cache +controller first checks whether the cache line is already resident in cache. +When it is not, it caches the partial line in the MSB and sends the request on +to the memory controller to fetch the cache line. +If the rest of the cache line is overwritten before the data comes back from +the memory controller, the read request is cancelled and the new data is +moved from the MSB to the cache as a dirty (modified) cache line. +In this way the cache controller avoids the inefficiency of retrieving the +extra line from memory. + +Another architectural feature that affects STREAM performance is +Non-Uniform Memory Access (NUMA). +This is best illustrated with a picture. +The following block diagram shows the processor, core, cache, and memory +structure of a typical Intel processor-based compute node. + +![DGEMM Results](img/broadwell-cache-structure.png) +***Processor, Core, Cache, and Memory Structure for Sandia Compute Nodes*** + +The illustration shows two processors, each with their own local memory (DIMMs). +High-speed communication links (QPI links) connect the two processors. +The two processors share a physical memory address space, so both processors +can access not only their own local memory, but also memory that is attached +to the other processor. +A program running on one of the cores in a processor requests to load or store +an address. +In the event the address misses cache, the request is forwarded to the System +Agent, which then sorts out whether the address is local or remote. 
+The System Agent then forwards the request to the appropriate memory controller +to be handled. +The important concept to understand is that access to memory that is local is +faster than access to memory that is remote. + +The next concept to be concerned with is how memory is mapped to a program. +Programs do not reference physical memory directly. +Instead each program uses virtual addresses to reference data it uses, and +relies upon the hardware and operating system to translate those virtual +addresses into physical, or hardware, addresses. +Most modern NUMA-aware operating systems, Linux included, use a "first touch" +policy for deciding where memory is allocated. +When a program allocates memory it is only given new virtual addresses to use. +The physical memory backing the virtual address is allocated the first time +the memory is read or written. +Whenever possible, Linux allocates physical memory that is local to the +processor that is running the program at the moment the memory is first +referenced. +Linux then works very hard to keep the process running on the same core so +caches remain loaded and memory stays local. +For most programs this policy works very well, but for some threading +models, such as OpenMP, obtaining good performance in a NUMA environment +requires some care. + +Because we are running an OpenMP-enabled version of STREAM, we must be +sure to schedule the parallel loops statically so the same iterations +go to the same threads (and therefore the same NUMA domains) every time. +This will ensure that memory references are always local, and therefore +of highest performance. + +Note that the raw bandwidth of the node is determined by the number, width, +and speed of the memory channels. +In the above illustration there are four channels per processor, but the +number will vary from one architecture to the next. +Each memory channel is the width of the data path of a DIMM, which is always +64-bits, or 8 bytes. +The frequency will vary with the DIMM, and it's measured in mega-transfers +per second, or MT/s. +(DIMMs use a differential clock, so the clock frequency of DDR memory is +always half of the transfer rate.) +Most systems at Sandia use memory speeds ranging from 2133 MT/s to 4800 MT/s. +A Haswell node, with four channels per processor, two processors per node, +and 2133 MT/s memory will have a hardware (or peak) bandwidth of: + +``` +4 channels/processor x 2 processors x 8 bytes/channel x 2133 MT/s = 136,512 MB/s +``` + +The peak bandwidth represents a least upper bound, a limit you are guaranteed +never to exceed. +The performance you measure is the throughput, and it will always be a value +below the peak bandwidth. +The ratio of the two values, throughput over bandwidth, is known as the +memory efficiency, and is a measure of how efficiently the memory transfers +data to and from the processor. +Modern processors typically have a memory efficiency of around 75% to 85%. +New memory architectures may have somewhat less, and rarely, an exceptionally +efficient system may have more than 90%. + +Yet another consideration is how the instructions are generated by the compiler. +In this case there are two issues, using vector units and how addressing is +handled. +Vector instructions reduce the number of instructions that need to be executed, +which reduces memory traffic. +AVX instructions use wider vectors than SSE4.3 vectors, so AVX instructions +provide better performance than using SSE4.3 instructions. 
+Similarly, AVX2 is faster than AVX, and AVX-512 is faster than AVX2. + +Furthermore, how array indexes are generated and used can also have an impact +on STREAM performance. +For example, using statically allocated arrays allows the code generation to +use a constant array base address for references to ```x```, ```y```, and ```z```. +But using dynamically allocated arrays forces the program to load the base +addresses into registers and index the arrays indirectly. +On the Haswell architecture, at least, this impacts performance by roughly 25%. +Fortunately, on other architectures the impact is less severe, or not at all. + +We also noticed on some systems a drop in STREAM performance with intel 18.0 +and later compilers. +We don't know with certainty what the cause is, but it is consistent, and we +suspect it might have had something to do with the way arrays are indexed. + +The clock speed of the processor will also impact STREAM performance by a small +amount. +Faster processor clocks allow the processor to generate new memory requests +more rapidly. +And because the compute nodes in each of the clusters are not perfectly +identical, this means some nodes will have faster STREAM throughput than +others. + +Our STREAM measurements are laid out in the following table. + +| Cluster | Family | Processor | Ch./Node | DIMM MT/s | Bandwidth | STREAM Static | Eff. | STREAM Dynamic | +| :-----: | :----: | :-------: | :------: | :-------: | --------: | ----------: | ---: | ------: | +| Mutrino | Haswell | E5-2698 v3 | 8 | 2133 | 136.5 GB/s | 119361.4101 | 87.4% | 90583.2 | +| Eclipse | Broadwell | E5-2695 v4 | 8 | 2400 | 153.6 GB/s | 130660.2840 | 85.1% | 131046.4 | +| Attaway | Skylake | Gold 6140 | 12 | 2666 | 256.0 GB/s | 186956.2034 | 73.0% | 185199.4 | +| Manzano | Cascade Lake | Platinum 8268 | 12 | 2933 | 281.6 GB/s | 221318.2715 | 78.6% | 221242.9 | +| cxsr | Sapphire Rapids | Sample | 16 | 4800 | 614.4 GB/s | 373103.3951 | 60.7% | 367872.5 | + +Worthy of note here is the low memory efficiency of the cxsr system. +The explanation for this is simple -- it is that these processors are +early pre-production samples, and not fully performance enabled. +The understanding is that these systems are only able to achieve about +80% of the bandwidth. +Taking that into account the memory efficiency would rise to almost 76%, +which is within the expected range. + +OpenMP has environment variables that allow you to control how threads are +distributed around the node. +Two such environment variables are ```OMP_PROC_BIND``` and ```OMP_PLACES```. +Their allowed values are: + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Variable      | Value   | Meaning                                                                         |
| :------------ | :------ | :------------------------------------------------------------------------------ |
| OMP_PROC_BIND | true    | threads should not be moved                                                     |
| OMP_PROC_BIND | false   | threads may be moved                                                            |
| OMP_PROC_BIND | spread  | threads are distributed across partitions (i.e., NUMA domains)                 |
| OMP_PROC_BIND | close   | threads are kept close to the master thread                                    |
| OMP_PLACES    | sockets | OpenMP threads are placed on successive sockets                                |
| OMP_PLACES    | cores   | OpenMP threads are placed on successive hardware cores                         |
| OMP_PLACES    | threads | OpenMP threads are placed on successive hardware threads (i.e., virtual cores) |
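As a minimal sketch of how these variables are typically combined for a STREAM run
(the thread count and binary name here are assumptions and should be adjusted for the node being measured):

```bash
export OMP_NUM_THREADS=32      # assumption: one thread per physical core on the node
export OMP_PROC_BIND=spread    # pin threads and spread them across the NUMA domains
export OMP_PLACES=cores        # one place per hardware core
./stream_c.exe                 # or xrds-stream.exe, depending on which build is being measured
```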
A complete table of the STREAM results for various settings on Mutrino,
using the intel/19.0.4 and intel/19.1.3 icc compilers, is laid out below.
***Mutrino (Haswell) STREAM throughput in MB/s; columns are OMP_PROC_BIND settings, and each cell lists xroads-stream.c / stream_d_omp.c***

| OMP_PLACES | Compiler     | undefined             | close                 | false                 | spread                | true                  |
| :--------- | :----------- | --------------------: | --------------------: | --------------------: | --------------------: | --------------------: |
| undefined  | intel/19.0.4 | 52020.2 / 72461.6026  | 90637.8 / 118811.0970 | 53463.0 / 76455.0476  | 90631.3 / 118860.3734 | 90583.2 / 118852.3040 |
| undefined  | intel/19.1.3 | 53153.9 / 73036.5731  | 90596.2 / 119063.8735 | 54403.5 / 72259.5228  | 90579.9 / 119138.5591 | 90587.3 / 119028.3251 |
| sockets    | intel/19.0.4 | 90610.1 / 118807.4161 | 90668.9 / 118843.3588 | 90576.7 / 118814.0770 | 90575.9 / 118831.7846 | 90644.4 / 118886.8698 |
| sockets    | intel/19.1.3 | 90616.6 / 118996.6661 | 90607.7 / 119004.9310 | 90620.7 / 119029.9084 | 90583.2 / 119013.0211 | 90641.9 / 118989.4573 |
| cores      | intel/19.0.4 | 90620.7 / 118831.7846 | 90647.6 / 118850.9008 | 90613.4 / 118808.9936 | 90696.6 / 118866.1630 | 90589.7 / 118855.2861 |
| cores      | intel/19.1.3 | 90627.2 / 119081.4796 | 90620.7 / 119058.5927 | 90620.7 / 119089.0518 | 90511.5 / 119087.4669 | 90645.2 / 119097.1534 |
| threads    | intel/19.0.4 | 90593.0 / 118792.6946 | 45169.7 / 57789.0339  | 45171.3 / 57786.6287  | 90600.3 / 118754.5059 | 90627.2 / 118831.7846 |
| threads    | intel/19.1.3 | 90641.1 / 119050.6723 | 45160.3 / 58005.8595  | 45168.0 / 58011.8348  | 90576.7 / 119040.9934 | 90597.1 / 118928.1295 |
As can be easily seen from the table, certain combinations yield
significantly lower performance results than others.
For example, ```OMP_PROC_BIND``` set to ```false``` combined with
```OMP_PLACES``` set to ```threads``` or left undefined causes the
memory throughput to drop significantly.
Furthermore, setting ```OMP_PROC_BIND=close``` with ```OMP_PLACES=threads```,
or leaving both ```OMP_PROC_BIND``` and ```OMP_PLACES``` undefined,
also severely impacts performance.

Notice that setting ```OMP_PROC_BIND=false``` or ```OMP_PROC_BIND=close```
combined with ```OMP_PLACES=threads``` causes OpenMP threads to be allocated
all on the same NUMA domain, which cuts the memory throughput by half.
Leaving ```OMP_PLACES``` undefined while also leaving ```OMP_PROC_BIND```
undefined, or setting it to ```false```, causes a different problem:
it allows threads to migrate between NUMA domains, which makes some
references non-local and slows overall performance.
Note that these effects are specific to the Intel compiler in this environment.
This behavior, especially when these variables are undefined, can vary with
the compiler vendor (e.g., GNU or Intel) and environment (e.g., Cray or TOSS).

It is also clear from the table that there is a significant performance difference,
about 25% to 30%, between ```xroads-stream.c``` and ```stream_d_omp.c```.
The only significant difference in the source code between these two versions is
that memory is allocated dynamically in ```xroads-stream.c```
and statically in ```stream_d_omp.c```.
This illustrates how minor changes to code generation, in this case how the
memory is addressed, can result in significant changes in performance.

It also shows that this is a limitation of the architecture, though, and
not necessarily an indication that the code generation is done incorrectly.
It is a common requirement that data references be generated from addresses
that are specified dynamically; two examples are memory that is dynamically
allocated, as is the case here, and an address that is passed into a
subroutine or function.
In both cases, building the code with a static base address is simply not
an option, so it is up to the architecture to minimize the cost if it can.
Haswell imposes a penalty for referencing an address generated from
a base address in a register, while later architectures
(e.g., Intel Broadwell, Skylake, and Canon Lake) may not.
***Canon Lake STREAM throughput in MB/s; columns are OMP_PROC_BIND settings, and each cell lists xroads-stream.c / stream_d_omp.c***

| OMP_PLACES | Compiler     | undefined              | close                  | false                  | spread                 | true                   |
| :--------- | :----------- | ---------------------: | ---------------------: | ---------------------: | ---------------------: | ---------------------: |
| undefined  | intel/16.0   | 184007.8 / 220570.2428 | 219376.9 / 220631.2772 | 219018.9 / 205437.9183 | 219439.1 / 220537.6245 | 219281.3 / 220634.2996 |
| undefined  | intel/21.3.0 | 163099.4 / 195830.6060 | 163276.6 / 207675.2236 | 148348.4 / 196058.5001 | 163276.6 / 207625.9637 | 163311.0 / 207601.3426 |
| sockets    | intel/16.0   | 219314.8 / 220580.5136 | 219539.6 / 220717.7499 | 189887.8 / 186992.2348 | 219381.7 / 220570.8470 | 219314.8 / 220587.7641 |
| sockets    | intel/21.3.0 | 163266.0 / 207473.5135 | 163231.6 / 207417.4041 | 163189.3 / 207229.5250 | 163255.4 / 207397.1053 | 163199.8 / 207325.5570 |
| cores      | intel/16.0   | 219276.6 / 220613.7490 | 219415.2 / 220588.3683 | 219415.2 / 204518.6163 | 219357.8 / 220626.4415 | 219314.8 / 220643.9717 |
| cores      | intel/21.3.0 | 163242.2 / 207684.3285 | 163276.6 / 207621.1461 | 163244.8 / 207610.4410 | 163244.8 / 207711.1123 | 163287.2 / 207711.1123 |
| threads    | intel/16.0   | 219458.2 / 220666.9465 | 107706.2 / 108841.0698 | 219095.2 / 220471.8131 | 219357.8 / 220613.7490 | 219520.4 / 220717.7499 |
| threads    | intel/21.3.0 | 163252.8 / 207404.0492 | 80412.3 / 100545.1554  | 80415.5 / 100557.8360  | 163231.6 / 207397.1053 | 163242.2 / 207363.4608 |
+ +It is interesting to note that (generally) the Intel 16.0 icc compiler +outperforms the 21.3.0 compiler, even though the 21.3.0 compiler has +had more development time to mature and improve. +It would be easy to expect that later is better, but in this case it +appears otherwise. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
***Sapphire Rapids STREAM throughput in MB/s; columns are OMP_PROC_BIND settings, and each cell lists xroads-stream.c / stream_d_omp.c***

| OMP_PLACES | Compiler | undefined              | close                  | false                  | spread                 | true                   |
| :--------- | :------- | ---------------------: | ---------------------: | ---------------------: | ---------------------: | ---------------------: |
| undefined  | icc      | 216079.9 / 274053.5539 | 371114.9 / 371880.1053 | 215905.0 / 214983.4133 | 372093.0 / 371588.3942 | 373540.9 / 372396.0083 |
| undefined  | icx      | 218043.1 / 219178.0822 | 371229.7 / 370799.5365 | 218678.8 / 215439.8564 | 373715.4 / 371157.9354 | 373192.3 / 370942.8130 |
| sockets    | icc      | 365352.4 / 368156.8840 | 366524.1 / 370221.7580 | 246305.4 / 277893.0840 | 368154.6 / 368595.0055 | 368154.6 / 369813.7252 |
| sockets    | icx      | 367928.9 / 367393.8002 | 367309.5 / 368451.3529 | 215265.9 / 219729.9153 | 364686.2 / 366902.3505 | 368154.6 / 368451.3529 |
| cores      | icc      | 372266.2 / 373536.0490 | 372902.4 / 372447.6774 | 218878.2 / 208423.4091 | 372497.3 / 372740.7396 | 372960.4 / 371245.7902 |
| cores      | icx      | 372555.1 / 369017.8743 | 371689.6 / 371014.4928 | 213808.5 / 217539.0891 | 370828.2 / 369728.4806 | 371920.0 / 372670.8075 |
| threads    | icc      | 372612.9 / 372533.8243 | 186234.2 / 186898.0616 | 218142.2 / 214736.9122 | 372381.7 / 372309.9251 | 371747.2 / 372447.6774 |
| threads    | icx      | 372381.7 / 370013.4901 | 186538.2 / 186552.6623 | 218479.7 / 214765.1007 | 372381.7 / 370441.8291 | 372497.3 / 370584.8292 |
+ +Size also plays a role in determining STREAM performance. +As with any program, the act of measuring the performance subtly alters the +behavior, and performance, of that program. +Especially on small microbenchmarks like STREAM, setting up timers and loops +takes time away from the activity being measured. +That set-up time needs to be amortized across the activity, and amortization +is more effective when the time spent in set-up is very small relative to the +activity being measured. +This all suggests that larger array sizes will show better performance than +smaller sizes. +But it is also the case that very large memory may require the system to use +more expensive measures to fetch the data, such as deeper page table walks. +So, tuning the benchmark for size, large enough to amortize the cost of timers +and loops, but not so large as to slow access down, needs to be done. +For the version that uses statically allocated memory (```stream_d_omp.c```), +the ideal memory size is the maximum the compiler allows, or 80,000,000 elements. +For the version that uses dynamically allocated memory (```xroads-stream.c```), +we saw improvements in performance up to around 100,000,000 elements. +After that, it took more time to run the benchmark but did not significantly +improve our results. + +It must also be noted that going with *very* small memory also gives faster +results, but doing so violates the benchmark run rules. +The STREAM run rules require that the memory size be at least 4x the size of +cache, or the results measure cache performance instead of memory performance, +which is not what is wanted here. +The actual size of STREAM memory is calculated as +``` +memory size in bytes = 3 arrays x 8 bytes / element x number of elements / array +``` +This value must be at least 4 times larger than the total available cache. +On our Broadwell processors there is 45 MiB per processor, or 90 MiB per node. +That implies we need a size of at least 15 million (15 x 2**20) elements. +``` +4 x (45 MiB cache / processor) x (2 processors) / (3 arrays) / (8 bytes / element) = 15 Mi elements +``` + +And finally, STREAM results are very sensitive to other activities going on +within the system, so STREAM needs to be run multiple times for an accurate +picture to emerge. +Even when the system is otherwise "quiet", the operating system services +interrupts and schedules maintenance tasks that steal minor amounts of +memory throughput from user programs. +Furthermore, this activity may force user threads to migrate from one +core to another, or from one NUMA domain to another, which may also affect +performance. +Running STREAM multiple times is the only way to see what the memory +throughput really is -- a statistical population of values rather than a +single value. diff --git a/microbenchmarks/stream/img/broadwell-cache-structure.png b/microbenchmarks/stream/img/broadwell-cache-structure.png new file mode 100644 index 00000000..a7230470 Binary files /dev/null and b/microbenchmarks/stream/img/broadwell-cache-structure.png differ diff --git a/microbenchmarks/stream/mysecond.c b/microbenchmarks/stream/mysecond.c new file mode 100644 index 00000000..f6778b77 --- /dev/null +++ b/microbenchmarks/stream/mysecond.c @@ -0,0 +1,28 @@ +/* A gettimeofday routine to give access to the wall + * clock timer on most UNIX-like systems. 
+ * + * This version defines two entry points -- with + * and without appended underscores, so it *should* + * automagically link with FORTRAN */ + +#include + +double mysecond() +{ +/* struct timeval { long tv_sec; + * long tv_usec; }; + * + * struct timezone { int tz_minuteswest; + * int tz_dsttime; }; */ + + struct timeval tp; + struct timezone tzp; + int i; + + i = gettimeofday(&tp,&tzp); + return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 ); +} + +double mysecond_() {return mysecond();} + + diff --git a/microbenchmarks/stream/splunk-stream.xml b/microbenchmarks/stream/splunk-stream.xml new file mode 100644 index 00000000..57c5f479 --- /dev/null +++ b/microbenchmarks/stream/splunk-stream.xml @@ -0,0 +1,98 @@ + + + Stream Benchmarks + + + OMP_NUM_THREADS + OMP_NUM_THREADS + * + + index=hpctest sourcetype=pavilion2 sys_name=$machine$ name=stream* omp_num_threads=* | +table omp_num_threads | +dedup omp_num_threads + $field1.earliest$ + $field1.latest$ + + + + + triad + copy + add + triad + + + + node + max(triad) + min(triad) + avg(triad) + stdev(triad) + max(copy) + min(copy) + avg(copy) + stdev(copy) + max(add) + min(add) + avg(add) + stdev(add) + |sort node + + + + |sort node + + + Stream Triad Performance by OMP configuration $machine$ + + index=hpctest sourcetype=pavilion2 sys_name=$machine$ name=stream* omp_num_threads=$OMP_NUM_THREADS$ | +rex field=file "(?<node>\w+\d+)-stream.out" | +chart max($calc$) min($calc$) avg($calc$) stdev($calc$) by node| +sort max($calc$) + $field1.earliest$ + $field1.latest$ + + + + + + + + + Stream Single Node $machine$ $calc$ Rate + + index=hpctest sourcetype=pavilion2 sys_name=$machine$ name=stream* omp_num_threads=$OMP_NUM_THREADS$ | +rex field=file "(?<node>\w+\d+)-stream.out" | +chart max($calc$) min($calc$) avg($calc$) stdev($calc$) by node $streamsort$ + $field1.earliest$ + $field1.latest$ + + + + + + + + + + HPL GFLOP/s Full System + + HPL Full System Performance $machine$ + + index=hpctest sourcetype=pavilion2 sys_name=$machine$ name=hpl-full* gflops!=null| +chart max(gflops) as "MAX GFLOP/s" min(gflops) as "MIN GFLOP/s" avg(gflops) as "AVG GFLOP/s" by valn| +sort max(gflops) + $field1.earliest$ + $field1.latest$ + + + + + + + + + + + + diff --git a/microbenchmarks/stream/stream.c b/microbenchmarks/stream/stream.c new file mode 100644 index 00000000..24edd556 --- /dev/null +++ b/microbenchmarks/stream/stream.c @@ -0,0 +1,586 @@ +/*-----------------------------------------------------------------------*/ +/* Program: STREAM */ +/* Revision: $Id: stream.c,v 5.10 2013/01/17 16:01:06 mccalpin Exp mccalpin $ */ +/* Original code developed by John D. McCalpin */ +/* Programmers: John D. McCalpin */ +/* Joe R. Zagar */ +/* */ +/* This program measures memory transfer rates in MB/s for simple */ +/* computational kernels coded in C. */ +/*-----------------------------------------------------------------------*/ +/* Copyright 1991-2013: John D. McCalpin */ +/*-----------------------------------------------------------------------*/ +/* License: */ +/* 1. You are free to use this program and/or to redistribute */ +/* this program. */ +/* 2. You are free to modify this program for your own use, */ +/* including commercial use, subject to the publication */ +/* restrictions in item 3. */ +/* 3. You are free to publish results obtained from running this */ +/* program, or from works that you derive from this program, */ +/* with the following limitations: */ +/* 3a. 
In order to be referred to as "STREAM benchmark results", */ +/* published results must be in conformance to the STREAM */ +/* Run Rules, (briefly reviewed below) published at */ +/* http://www.cs.virginia.edu/stream/ref.html */ +/* and incorporated herein by reference. */ +/* As the copyright holder, John McCalpin retains the */ +/* right to determine conformity with the Run Rules. */ +/* 3b. Results based on modified source code or on runs not in */ +/* accordance with the STREAM Run Rules must be clearly */ +/* labelled whenever they are published. Examples of */ +/* proper labelling include: */ +/* "tuned STREAM benchmark results" */ +/* "based on a variant of the STREAM benchmark code" */ +/* Other comparable, clear, and reasonable labelling is */ +/* acceptable. */ +/* 3c. Submission of results to the STREAM benchmark web site */ +/* is encouraged, but not required. */ +/* 4. Use of this program or creation of derived works based on this */ +/* program constitutes acceptance of these licensing restrictions. */ +/* 5. Absolutely no warranty is expressed or implied. */ +/*-----------------------------------------------------------------------*/ +# include +# include +# include +# include +# include +# include + +/*----------------------------------------------------------------------- + * INSTRUCTIONS: + * + * 1) STREAM requires different amounts of memory to run on different + * systems, depending on both the system cache size(s) and the + * granularity of the system timer. + * You should adjust the value of 'STREAM_ARRAY_SIZE' (below) + * to meet *both* of the following criteria: + * (a) Each array must be at least 4 times the size of the + * available cache memory. I don't worry about the difference + * between 10^6 and 2^20, so in practice the minimum array size + * is about 3.8 times the cache size. + * Example 1: One Xeon E3 with 8 MB L3 cache + * STREAM_ARRAY_SIZE should be >= 4 million, giving + * an array size of 30.5 MB and a total memory requirement + * of 91.5 MB. + * Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP) + * STREAM_ARRAY_SIZE should be >= 20 million, giving + * an array size of 153 MB and a total memory requirement + * of 458 MB. + * (b) The size should be large enough so that the 'timing calibration' + * output by the program is at least 20 clock-ticks. + * Example: most versions of Windows have a 10 millisecond timer + * granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds. + * If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec. + * This means the each array must be at least 1 GB, or 128M elements. + * + * Version 5.10 increases the default array size from 2 million + * elements to 10 million elements in response to the increasing + * size of L3 caches. The new default size is large enough for caches + * up to 20 MB. + * Version 5.10 changes the loop index variables from "register int" + * to "ssize_t", which allows array indices >2^32 (4 billion) + * on properly configured 64-bit systems. Additional compiler options + * (such as "-mcmodel=medium") may be required for large memory runs. + * + * Array size can be set at compile time without modifying the source + * code for the (many) compilers that support preprocessor definitions + * on the compile line. E.g., + * gcc -O -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream.100M + * will override the default size of 10M with a new size of 100M elements + * per array. 
+ */ +#ifndef STREAM_ARRAY_SIZE +# define STREAM_ARRAY_SIZE 10000000 +#endif + +/* 2) STREAM runs each kernel "NTIMES" times and reports the *best* result + * for any iteration after the first, therefore the minimum value + * for NTIMES is 2. + * There are no rules on maximum allowable values for NTIMES, but + * values larger than the default are unlikely to noticeably + * increase the reported performance. + * NTIMES can also be set on the compile line without changing the source + * code using, for example, "-DNTIMES=7". + */ +#ifdef NTIMES +#if NTIMES<=1 +# define NTIMES 10 +#endif +#endif +#ifndef NTIMES +# define NTIMES 10 +#endif + +/* Users are allowed to modify the "OFFSET" variable, which *may* change the + * relative alignment of the arrays (though compilers may change the + * effective offset by making the arrays non-contiguous on some systems). + * Use of non-zero values for OFFSET can be especially helpful if the + * STREAM_ARRAY_SIZE is set to a value close to a large power of 2. + * OFFSET can also be set on the compile line without changing the source + * code using, for example, "-DOFFSET=56". + */ +#ifndef OFFSET +# define OFFSET 0 +#endif + +/* + * 3) Compile the code with optimization. Many compilers generate + * unreasonably bad code before the optimizer tightens things up. + * If the results are unreasonably good, on the other hand, the + * optimizer might be too smart for me! + * + * For a simple single-core version, try compiling with: + * cc -O stream.c -o stream + * This is known to work on many, many systems.... + * + * To use multiple cores, you need to tell the compiler to obey the OpenMP + * directives in the code. This varies by compiler, but a common example is + * gcc -O -fopenmp stream.c -o stream_omp + * The environment variable OMP_NUM_THREADS allows runtime control of the + * number of threads/cores used when the resulting "stream_omp" program + * is executed. + * + * To run with single-precision variables and arithmetic, simply add + * -DSTREAM_TYPE=float + * to the compile line. + * Note that this changes the minimum array sizes required --- see (1) above. + * + * The preprocessor directive "TUNED" does not do much -- it simply causes the + * code to call separate functions to execute each kernel. Trivial versions + * of these functions are provided, but they are *not* tuned -- they just + * provide predefined interfaces to be replaced with tuned code. + * + * + * 4) Optional: Mail the results to mccalpin@cs.virginia.edu + * Be sure to include info that will help me understand: + * a) the computer hardware configuration (e.g., processor model, memory type) + * b) the compiler name/version and compilation flags + * c) any run-time information (such as OMP_NUM_THREADS) + * d) all of the output from the test case. + * + * Thanks! 
+ * + *-----------------------------------------------------------------------*/ + +# define HLINE "-------------------------------------------------------------\n" + +# ifndef MIN +# define MIN(x,y) ((x)<(y)?(x):(y)) +# endif +# ifndef MAX +# define MAX(x,y) ((x)>(y)?(x):(y)) +# endif + +#ifndef STREAM_TYPE +#define STREAM_TYPE double +#endif + +static STREAM_TYPE a[STREAM_ARRAY_SIZE+OFFSET], + b[STREAM_ARRAY_SIZE+OFFSET], + c[STREAM_ARRAY_SIZE+OFFSET]; + +static double avgtime[4] = {0}, maxtime[4] = {0}, + mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX}; + +static char *label[4] = {"Copy: ", "Scale: ", + "Add: ", "Triad: "}; + +static double bytes[4] = { + 2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE + }; + +extern double mysecond(); +extern void checkSTREAMresults(); +#ifdef TUNED +extern void tuned_STREAM_Copy(); +extern void tuned_STREAM_Scale(STREAM_TYPE scalar); +extern void tuned_STREAM_Add(); +extern void tuned_STREAM_Triad(STREAM_TYPE scalar); +#endif +#ifdef _OPENMP +extern int omp_get_num_threads(); +#endif +int +main() + { + int quantum, checktick(); + int BytesPerWord; + int k; + ssize_t j; + STREAM_TYPE scalar; + double t, times[4][NTIMES]; + + /* --- SETUP --- determine precision and check timing --- */ + + printf(HLINE); + printf("STREAM version $Revision: 5.10 $\n"); + printf(HLINE); + BytesPerWord = sizeof(STREAM_TYPE); + printf("This system uses %d bytes per array element.\n", + BytesPerWord); + + printf(HLINE); +#ifdef N + printf("***** WARNING: ******\n"); + printf(" It appears that you set the preprocessor variable N when compiling this code.\n"); + printf(" This version of the code uses the preprocesor variable STREAM_ARRAY_SIZE to control the array size\n"); + printf(" Reverting to default value of STREAM_ARRAY_SIZE=%llu\n",(unsigned long long) STREAM_ARRAY_SIZE); + printf("***** WARNING: ******\n"); +#endif + + printf("Array size = %llu (elements), Offset = %d (elements)\n" , (unsigned long long) STREAM_ARRAY_SIZE, OFFSET); + printf("Memory per array = %.1f MiB (= %.1f GiB).\n", + BytesPerWord * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.0), + BytesPerWord * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.0/1024.0)); + printf("Total memory required = %.1f MiB (= %.1f GiB).\n", + (3.0 * BytesPerWord) * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.), + (3.0 * BytesPerWord) * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024./1024.)); + printf("Each kernel will be executed %d times.\n", NTIMES); + printf(" The *best* time for each kernel (excluding the first iteration)\n"); + printf(" will be used to compute the reported bandwidth.\n"); + +#ifdef _OPENMP + printf(HLINE); +#pragma omp parallel + { +#pragma omp master + { + k = omp_get_num_threads(); + printf ("Number of Threads requested = %i\n",k); + } + } +#endif + +#ifdef _OPENMP + k = 0; +#pragma omp parallel +#pragma omp atomic + k++; + printf ("Number of Threads counted = %i\n",k); +#endif + + /* Get initial value for system clock. 
*/ +#pragma omp parallel for + for (j=0; j= 1) + printf("Your clock granularity/precision appears to be " + "%d microseconds.\n", quantum); + else { + printf("Your clock granularity appears to be " + "less than one microsecond.\n"); + quantum = 1; + } + + t = mysecond(); +#pragma omp parallel for + for (j = 0; j < STREAM_ARRAY_SIZE; j++) + a[j] = 2.0E0 * a[j]; + t = 1.0E6 * (mysecond() - t); + + printf("Each test below will take on the order" + " of %d microseconds.\n", (int) t ); + printf(" (= %d clock ticks)\n", (int) (t/quantum) ); + printf("Increase the size of the arrays if this shows that\n"); + printf("you are not getting at least 20 clock ticks per test.\n"); + + printf(HLINE); + + printf("WARNING -- The above is only a rough guideline.\n"); + printf("For best results, please be sure you know the\n"); + printf("precision of your system timer.\n"); + printf(HLINE); + + /* --- MAIN LOOP --- repeat test cases NTIMES times --- */ + + scalar = 3.0; + for (k=0; k + +double mysecond() +{ + struct timeval tp; + struct timezone tzp; + int i; + + i = gettimeofday(&tp,&tzp); + return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 ); +} + +#ifndef abs +#define abs(a) ((a) >= 0 ? (a) : -(a)) +#endif +void checkSTREAMresults () +{ + STREAM_TYPE aj,bj,cj,scalar; + STREAM_TYPE aSumErr,bSumErr,cSumErr; + STREAM_TYPE aAvgErr,bAvgErr,cAvgErr; + double epsilon; + ssize_t j; + int k,ierr,err; + + /* reproduce initialization */ + aj = 1.0; + bj = 2.0; + cj = 0.0; + /* a[] is modified during timing check */ + aj = 2.0E0 * aj; + /* now execute timing loop */ + scalar = 3.0; + for (k=0; k epsilon) { + err++; + printf ("Failed Validation on array a[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",aj,aAvgErr,abs(aAvgErr)/aj); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array a: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,aj,a[j],abs((aj-a[j])/aAvgErr)); + } +#endif + } + } + printf(" For array a[], %d errors were found.\n",ierr); + } + if (abs(bAvgErr/bj) > epsilon) { + err++; + printf ("Failed Validation on array b[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",bj,bAvgErr,abs(bAvgErr)/bj); + printf (" AvgRelAbsErr > Epsilon (%e)\n",epsilon); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array b: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,bj,b[j],abs((bj-b[j])/bAvgErr)); + } +#endif + } + } + printf(" For array b[], %d errors were found.\n",ierr); + } + if (abs(cAvgErr/cj) > epsilon) { + err++; + printf ("Failed Validation on array c[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",cj,cAvgErr,abs(cAvgErr)/cj); + printf (" AvgRelAbsErr > Epsilon (%e)\n",epsilon); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array c: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,cj,c[j],abs((cj-c[j])/cAvgErr)); + } +#endif + } + } + printf(" For array c[], %d errors were found.\n",ierr); + } + if (err == 0) { + printf ("Solution Validates: avg error less than %e on all three arrays\n",epsilon); + } +#ifdef VERBOSE + printf ("Results Validation Verbose Results: \n"); + printf (" Expected a(1), b(1), c(1): %f %f %f \n",aj,bj,cj); + printf (" Observed a(1), b(1), c(1): %f %f %f \n",a[1],b[1],c[1]); + printf (" Rel Errors on a, b, c: 
%e %e %e \n",abs(aAvgErr/aj),abs(bAvgErr/bj),abs(cAvgErr/cj)); +#endif +} + +#ifdef TUNED +/* stubs for "tuned" versions of the kernels */ +void tuned_STREAM_Copy() +{ + ssize_t j; +#pragma omp parallel for + for (j=0; j +# include +# include +# include +# include +# include +# include +# include +# include "mpi.h" + +/*----------------------------------------------------------------------- + * INSTRUCTIONS: + * + * 1) STREAM requires different amounts of memory to run on different + * systems, depending on both the system cache size(s) and the + * granularity of the system timer. + * You should adjust the value of 'STREAM_ARRAY_SIZE' (below) + * to meet *both* of the following criteria: + * (a) Each array must be at least 4 times the size of the + * available cache memory. I don't worry about the difference + * between 10^6 and 2^20, so in practice the minimum array size + * is about 3.8 times the cache size. + * Example 1: One Xeon E3 with 8 MB L3 cache + * STREAM_ARRAY_SIZE should be >= 4 million, giving + * an array size of 30.5 MB and a total memory requirement + * of 91.5 MB. + * Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP) + * STREAM_ARRAY_SIZE should be >= 20 million, giving + * an array size of 153 MB and a total memory requirement + * of 458 MB. + * (b) The size should be large enough so that the 'timing calibration' + * output by the program is at least 20 clock-ticks. + * Example: most versions of Windows have a 10 millisecond timer + * granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds. + * If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec. + * This means the each array must be at least 1 GB, or 128M elements. + * + * Version 5.10 increases the default array size from 2 million + * elements to 10 million elements in response to the increasing + * size of L3 caches. The new default size is large enough for caches + * up to 20 MB. + * Version 5.10 changes the loop index variables from "register int" + * to "ssize_t", which allows array indices >2^32 (4 billion) + * on properly configured 64-bit systems. Additional compiler options + * (such as "-mcmodel=medium") may be required for large memory runs. + * + * Array size can be set at compile time without modifying the source + * code for the (many) compilers that support preprocessor definitions + * on the compile line. E.g., + * gcc -O -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream.100M + * will override the default size of 10M with a new size of 100M elements + * per array. + */ + +// ----------------------- !!! NOTE CHANGE IN DEFINITION !!! ------------------ +// For the MPI version of STREAM, the three arrays with this many elements +// each will be *distributed* across the MPI ranks. +// +// Be careful when computing the array size needed for a particular target +// system to meet the minimum size requirement to ensure overflowing the caches. +// +// Example: +// Assume 4 nodes with two Intel Xeon E5-2680 processors (20 MiB L3) each. +// The *total* L3 cache size is 4*2*20 = 160 MiB, so each array must be +// at least 640 MiB, or at least 80 million 8 Byte elements. +// Note that it does not matter whether you use one MPI rank per node or +// 16 MPI ranks per node -- only the total array size and the total +// cache size matter. +// +#ifndef STREAM_ARRAY_SIZE +# define STREAM_ARRAY_SIZE 10000000 +#endif + +/* 2) STREAM runs each kernel "NTIMES" times and reports the *best* result + * for any iteration after the first, therefore the minimum value + * for NTIMES is 2. 
+ * There are no rules on maximum allowable values for NTIMES, but + * values larger than the default are unlikely to noticeably + * increase the reported performance. + * NTIMES can also be set on the compile line without changing the source + * code using, for example, "-DNTIMES=7". + */ +#ifdef NTIMES +#if NTIMES<=1 +# define NTIMES 10 +#endif +#endif +#ifndef NTIMES +# define NTIMES 10 +#endif + +// Make the scalar coefficient modifiable at compile time. +// The old value of 3.0 cause floating-point overflows after a relatively small +// number of iterations. The new default of 0.42 allows over 2000 iterations for +// 32-bit IEEE arithmetic and over 18000 iterations for 64-bit IEEE arithmetic. +// The growth in the solution can be eliminated (almost) completely by setting +// the scalar value to 0.41421445, but this also means that the error checking +// code no longer triggers an error if the code does not actually execute the +// correct number of iterations! +#ifndef SCALAR +#define SCALAR 0.42 +#endif + + +// ----------------------- !!! NOTE CHANGE IN DEFINITION !!! ------------------ +// The OFFSET preprocessor variable is not used in this version of the benchmark. +// The user must change the code at or after the "posix_memalign" array allocations +// to change the relative alignment of the pointers. +// ----------------------- !!! NOTE CHANGE IN DEFINITION !!! ------------------ +#ifndef OFFSET +# define OFFSET 0 +#endif + + +/* + * 3) Compile the code with optimization. Many compilers generate + * unreasonably bad code before the optimizer tightens things up. + * If the results are unreasonably good, on the other hand, the + * optimizer might be too smart for me! + * + * For a simple single-core version, try compiling with: + * cc -O stream.c -o stream + * This is known to work on many, many systems.... + * + * To use multiple cores, you need to tell the compiler to obey the OpenMP + * directives in the code. This varies by compiler, but a common example is + * gcc -O -fopenmp stream.c -o stream_omp + * The environment variable OMP_NUM_THREADS allows runtime control of the + * number of threads/cores used when the resulting "stream_omp" program + * is executed. + * + * To run with single-precision variables and arithmetic, simply add + * -DSTREAM_TYPE=float + * to the compile line. + * Note that this changes the minimum array sizes required --- see (1) above. + * + * The preprocessor directive "TUNED" does not do much -- it simply causes the + * code to call separate functions to execute each kernel. Trivial versions + * of these functions are provided, but they are *not* tuned -- they just + * provide predefined interfaces to be replaced with tuned code. + * + * + * 4) Optional: Mail the results to mccalpin@cs.virginia.edu + * Be sure to include info that will help me understand: + * a) the computer hardware configuration (e.g., processor model, memory type) + * b) the compiler name/version and compilation flags + * c) any run-time information (such as OMP_NUM_THREADS) + * d) all of the output from the test case. + * + * Thanks! 
+ * + *-----------------------------------------------------------------------*/ + +# define HLINE "-------------------------------------------------------------\n" + +# ifndef MIN +# define MIN(x,y) ((x)<(y)?(x):(y)) +# endif +# ifndef MAX +# define MAX(x,y) ((x)>(y)?(x):(y)) +# endif + +#ifndef STREAM_TYPE +#define STREAM_TYPE double +#endif + +//static STREAM_TYPE a[STREAM_ARRAY_SIZE+OFFSET], +// b[STREAM_ARRAY_SIZE+OFFSET], +// c[STREAM_ARRAY_SIZE+OFFSET]; + +// Some compilers require an extra keyword to recognize the "restrict" qualifier. +double * restrict a, * restrict b, * restrict c; + +size_t array_elements, array_bytes, array_alignment; +static double avgtime[4] = {0}, maxtime[4] = {0}, + mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX}; + +static char *label[4] = {"Copy: ", "Scale: ", + "Add: ", "Triad: "}; + +static double bytes[4] = { + 2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE + }; + +extern void checkSTREAMresults(STREAM_TYPE *AvgErrByRank, int numranks); +extern void computeSTREAMerrors(STREAM_TYPE *aAvgErr, STREAM_TYPE *bAvgErr, STREAM_TYPE *cAvgErr); +#ifdef TUNED +extern void tuned_STREAM_Copy(); +extern void tuned_STREAM_Scale(STREAM_TYPE scalar); +extern void tuned_STREAM_Add(); +extern void tuned_STREAM_Triad(STREAM_TYPE scalar); +#endif +#ifdef _OPENMP +extern int omp_get_num_threads(); +#endif +int +main() + { + int quantum, checktick(); + int BytesPerWord; + int i,k; + ssize_t j; + STREAM_TYPE scalar; + double t, times[4][NTIMES]; + double *TimesByRank; + double t0,t1,tmin; + int rc, numranks, myrank; + STREAM_TYPE AvgError[3] = {0.0,0.0,0.0}; + STREAM_TYPE *AvgErrByRank; + + /* --- SETUP --- call MPI_Init() before anything else! --- */ + + rc = MPI_Init(NULL, NULL); + t0 = MPI_Wtime(); + if (rc != MPI_SUCCESS) { + printf("ERROR: MPI Initialization failed with return code %d\n",rc); + exit(1); + } + // if either of these fail there is something really screwed up! + MPI_Comm_size(MPI_COMM_WORLD, &numranks); + MPI_Comm_rank(MPI_COMM_WORLD, &myrank); + + /* --- NEW FEATURE --- distribute requested storage across MPI ranks --- */ + array_elements = STREAM_ARRAY_SIZE / numranks; // don't worry about rounding vs truncation + array_alignment = 64; // Can be modified -- provides partial support for adjusting relative alignment + + // Dynamically allocate the three arrays using "posix_memalign()" + // NOTE that the OFFSET parameter is not used in this version of the code! 
+ array_bytes = array_elements * sizeof(STREAM_TYPE); + k = posix_memalign((void **)&a, array_alignment, array_bytes); + if (k != 0) { + printf("Rank %d: Allocation of array a failed, return code is %d\n",myrank,k); + MPI_Abort(MPI_COMM_WORLD, 2); + exit(1); + } + k = posix_memalign((void **)&b, array_alignment, array_bytes); + if (k != 0) { + printf("Rank %d: Allocation of array b failed, return code is %d\n",myrank,k); + MPI_Abort(MPI_COMM_WORLD, 2); + exit(1); + } + k = posix_memalign((void **)&c, array_alignment, array_bytes); + if (k != 0) { + printf("Rank %d: Allocation of array c failed, return code is %d\n",myrank,k); + MPI_Abort(MPI_COMM_WORLD, 2); + exit(1); + } + + // Initial informational printouts -- rank 0 handles all the output + if (myrank == 0) { + printf(HLINE); + printf("STREAM version $Revision: 1.8 $\n"); + printf(HLINE); + BytesPerWord = sizeof(STREAM_TYPE); + printf("This system uses %d bytes per array element.\n", + BytesPerWord); + + printf(HLINE); +#ifdef N + printf("***** WARNING: ******\n"); + printf(" It appears that you set the preprocessor variable N when compiling this code.\n"); + printf(" This version of the code uses the preprocesor variable STREAM_ARRAY_SIZE to control the array size\n"); + printf(" Reverting to default value of STREAM_ARRAY_SIZE=%llu\n",(unsigned long long) STREAM_ARRAY_SIZE); + printf("***** WARNING: ******\n"); +#endif + if (OFFSET != 0) { + printf("***** WARNING: ******\n"); + printf(" This version ignores the OFFSET parameter.\n"); + printf("***** WARNING: ******\n"); + } + + printf("Total Aggregate Array size = %llu (elements)\n" , (unsigned long long) STREAM_ARRAY_SIZE); + printf("Total Aggregate Memory per array = %.1f MiB (= %.1f GiB).\n", + BytesPerWord * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.0), + BytesPerWord * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.0/1024.0)); + printf("Total Aggregate memory required = %.1f MiB (= %.1f GiB).\n", + (3.0 * BytesPerWord) * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.), + (3.0 * BytesPerWord) * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024./1024.)); + printf("Data is distributed across %d MPI ranks\n",numranks); + printf(" Array size per MPI rank = %llu (elements)\n" , (unsigned long long) array_elements); + printf(" Memory per array per MPI rank = %.1f MiB (= %.1f GiB).\n", + BytesPerWord * ( (double) array_elements / 1024.0/1024.0), + BytesPerWord * ( (double) array_elements / 1024.0/1024.0/1024.0)); + printf(" Total memory per MPI rank = %.1f MiB (= %.1f GiB).\n", + (3.0 * BytesPerWord) * ( (double) array_elements / 1024.0/1024.), + (3.0 * BytesPerWord) * ( (double) array_elements / 1024.0/1024./1024.)); + + printf(HLINE); + printf("Each kernel will be executed %d times.\n", NTIMES); + printf(" The *best* time for each kernel (excluding the first iteration)\n"); + printf(" will be used to compute the reported bandwidth.\n"); + printf("The SCALAR value used for this run is %f\n",SCALAR); + +#ifdef _OPENMP + printf(HLINE); +#pragma omp parallel + { +#pragma omp master + { + k = omp_get_num_threads(); + printf ("Number of Threads requested for each MPI rank = %i\n",k); + } + } +#endif + +#ifdef _OPENMP + k = 0; +#pragma omp parallel +#pragma omp atomic + k++; + printf ("Number of Threads counted for rank 0 = %i\n",k); +#endif + + } + + /* --- SETUP --- initialize arrays and estimate precision of timer --- */ + +#pragma omp parallel for + for (j=0; j= 1) + printf("Your timer granularity/precision appears to be " + "%d microseconds.\n", quantum); + else { + printf("Your timer granularity 
appears to be " + "less than one microsecond.\n"); + quantum = 1; + } + } + + /* Get initial timing estimate to compare to timer granularity. */ + /* All ranks need to run this code since it changes the values in array a */ + t = MPI_Wtime(); +#pragma omp parallel for + for (j = 0; j < array_elements; j++) + a[j] = 2.0E0 * a[j]; + t = 1.0E6 * (MPI_Wtime() - t); + + if (myrank == 0) { + printf("Each test below will take on the order" + " of %d microseconds.\n", (int) t ); + printf(" (= %d timer ticks)\n", (int) (t/quantum) ); + printf("Increase the size of the arrays if this shows that\n"); + printf("you are not getting at least 20 timer ticks per test.\n"); + + printf(HLINE); + + printf("WARNING -- The above is only a rough guideline.\n"); + printf("For best results, please be sure you know the\n"); + printf("precision of your system timer.\n"); + printf(HLINE); +#ifdef VERBOSE + t1 = MPI_Wtime(); + printf("VERBOSE: total setup time for rank 0 = %f seconds\n",t1-t0); + printf(HLINE); +#endif + } + + /* --- MAIN LOOP --- repeat test cases NTIMES times --- */ + + // This code has more barriers and timing calls than are actually needed, but + // this should not cause a problem for arrays that are large enough to satisfy + // the STREAM run rules. + // MAJOR FIX!!! Version 1.7 had the start timer for each loop *after* the + // MPI_Barrier(), when it should have been *before* the MPI_Barrier(). + // + + scalar = SCALAR; + for (k=0; k= 0 ? (a) : -(a)) +#endif +void computeSTREAMerrors(STREAM_TYPE *aAvgErr, STREAM_TYPE *bAvgErr, STREAM_TYPE *cAvgErr) +{ + STREAM_TYPE aj,bj,cj,scalar; + STREAM_TYPE aSumErr,bSumErr,cSumErr; + ssize_t j; + int k; + + /* reproduce initialization */ + aj = 1.0; + bj = 2.0; + cj = 0.0; + /* a[] is modified during timing check */ + aj = 2.0E0 * aj; + /* now execute timing loop */ + scalar = SCALAR; + for (k=0; k epsilon) { + err++; + printf ("Failed Validation on array a[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",aj,aAvgErr,abs(aAvgErr)/aj); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array a: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,aj,a[j],abs((aj-a[j])/aAvgErr)); + } +#endif + } + } + printf(" For array a[], %d errors were found.\n",ierr); + } + if (abs(bAvgErr/bj) > epsilon) { + err++; + printf ("Failed Validation on array b[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",bj,bAvgErr,abs(bAvgErr)/bj); + printf (" AvgRelAbsErr > Epsilon (%e)\n",epsilon); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array b: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,bj,b[j],abs((bj-b[j])/bAvgErr)); + } +#endif + } + } + printf(" For array b[], %d errors were found.\n",ierr); + } + if (abs(cAvgErr/cj) > epsilon) { + err++; + printf ("Failed Validation on array c[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",cj,cAvgErr,abs(cAvgErr)/cj); + printf (" AvgRelAbsErr > Epsilon (%e)\n",epsilon); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array c: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,cj,c[j],abs((cj-c[j])/cAvgErr)); + } +#endif + } + } + printf(" For array c[], %d errors were found.\n",ierr); + } + if (err == 0) { + printf ("Solution Validates: avg error less than %e on all three 
arrays\n",epsilon); + } +#ifdef VERBOSE + printf ("Results Validation Verbose Results: \n"); + printf (" Expected a(1), b(1), c(1): %f %f %f \n",aj,bj,cj); + printf (" Observed a(1), b(1), c(1): %f %f %f \n",a[1],b[1],c[1]); + printf (" Rel Errors on a, b, c: %e %e %e \n",abs(aAvgErr/aj),abs(bAvgErr/bj),abs(cAvgErr/cj)); +#endif +} + +#ifdef TUNED +/* stubs for "tuned" versions of the kernels */ +void tuned_STREAM_Copy() +{ + ssize_t j; +#pragma omp parallel for + for (j=0; j +#include +#include +#include +#include +#include +#include + +#ifdef _OPENMP +#include +#endif + +#ifndef STREAM_ARRAY_SIZE +#define STREAM_ARRAY_SIZE 100000000 +#endif + +#define NTIMES 10 +#define OFFSET 0 + +# define HLINE "-------------------------------------------------------------\n" + +#ifndef MIN +#define MIN(x,y) ((x)<(y)?(x):(y)) +#endif +#ifndef MAX +#define MAX(x,y) ((x)>(y)?(x):(y)) +#endif + +#ifndef STREAM_TYPE +#define STREAM_TYPE double +#endif + +#define STREAM_RESTRICT __restrict__ + +static STREAM_TYPE* STREAM_RESTRICT a; +static STREAM_TYPE* STREAM_RESTRICT b; +static STREAM_TYPE* STREAM_RESTRICT c; + +static double avgtime[4] = {0}, maxtime[4] = {0}, + mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX}; + +static char *label[4] = {"Copy: ", "Scale: ", + "Add: ", "Triad: "}; + +static double bytes[4] = { + 2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, + 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE + }; + +double mysecond(); +void checkSTREAMresults(); + + +int main() { + int quantum, checktick(); + int BytesPerWord; + int k; + ssize_t j; + STREAM_TYPE scalar; + double t, times[4][NTIMES]; + + /* --- SETUP --- determine precision and check timing --- */ + + printf(HLINE); + printf("STREAM Crossroads Memory Bandwidth\n"); + printf("(Based on the original STREAM benchmark by John D. McCalpin)\n"); + printf(HLINE); + BytesPerWord = sizeof(STREAM_TYPE); + printf("This system uses %d bytes per array element.\n", + BytesPerWord); + + printf(HLINE); + + const unsigned long long int array_size = (sizeof(STREAM_TYPE) * (STREAM_ARRAY_SIZE + OFFSET)); + + printf("Array size = %llu (elements), Offset = %d (elements)\n" , (unsigned long long) array_size, OFFSET); + printf("Memory per array = %.1f MiB (= %.1f GiB).\n", + BytesPerWord * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.0), + BytesPerWord * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.0/1024.0)); + printf("Total memory required = %.1f MiB (= %.1f GiB).\n", + (3.0 * BytesPerWord) * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.), + (3.0 * BytesPerWord) * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024./1024.)); + printf("Each kernel will be executed %d times.\n", NTIMES); + printf(" The *best* time for each kernel (excluding the first iteration)\n"); + printf(" will be used to compute the reported bandwidth.\n"); + + printf("Allocating arrays ...\n"); + + /////////////////////////////////////////////////////////////////////////////// + // VENDOR NOTIFICATION + // + // Memory allocation routines should be changed to reflect allocation in memory + // hierarchy level. 
All modifications should be reported in benchmark response + // + /////////////////////////////////////////////////////////////////////////////// + + posix_memalign((void**) &a, 64, array_size ); + posix_memalign((void**) &b, 64, array_size ); + posix_memalign((void**) &c, 64, array_size ); + + /////////////////////////////////////////////////////////////////////////////// + // END VENDOR NOTIFICATION + /////////////////////////////////////////////////////////////////////////////// + +#ifdef _OPENMP + k = 0; +#pragma omp parallel +#pragma omp atomic + k++; + printf ("Number of Threads counted = %i\n",k); +#endif + + printf("Populating values and performing first touch ... \n"); + /* Get initial value for system clock. */ +#pragma omp parallel for + for (j=0; j= 1) + printf("Your clock granularity/precision appears to be " + "%d microseconds.\n", quantum); + else { + printf("Your clock granularity appears to be " + "less than one microsecond.\n"); + quantum = 1; + } + + t = mysecond(); +#pragma omp parallel for + for (j = 0; j < STREAM_ARRAY_SIZE; j++) + a[j] = 2.0E0 * a[j]; + t = 1.0E6 * (mysecond() - t); + + printf("Each test below will take on the order" + " of %d microseconds.\n", (int) t ); + printf(" (= %d clock ticks)\n", (int) (t/quantum) ); + printf("Increase the size of the arrays if this shows that\n"); + printf("you are not getting at least 20 clock ticks per test.\n"); + + printf(HLINE); + + printf("WARNING -- The above is only a rough guideline.\n"); + printf("For best results, please be sure you know the\n"); + printf("precision of your system timer.\n"); + printf(HLINE); + + /* --- MAIN LOOP --- repeat test cases NTIMES times --- */ + + scalar = 3.0; + for (k=0; k < NTIMES; k++) + { + times[0][k] = mysecond(); + #pragma omp parallel for + for (j=0; j + +double mysecond() +{ + struct timeval tp; + struct timezone tzp; + int i; + + i = gettimeofday(&tp,&tzp); + return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 ); +} + +#ifndef abs +#define abs(a) ((a) >= 0 ? 
(a) : -(a)) +#endif +void checkSTREAMresults () +{ + STREAM_TYPE aj,bj,cj,scalar; + STREAM_TYPE aSumErr,bSumErr,cSumErr; + STREAM_TYPE aAvgErr,bAvgErr,cAvgErr; + double epsilon; + ssize_t j; + int k,ierr,err; + + /* reproduce initialization */ + aj = 1.0; + bj = 2.0; + cj = 0.0; + /* a[] is modified during timing check */ + aj = 2.0E0 * aj; + /* now execute timing loop */ + scalar = 3.0; + for (k=0; k epsilon) { + err++; + printf ("Failed Validation on array a[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",aj,aAvgErr,abs(aAvgErr)/aj); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array a: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,aj,a[j],abs((aj-a[j])/aAvgErr)); + } +#endif + } + } + printf(" For array a[], %d errors were found.\n",ierr); + } + if (abs(bAvgErr/bj) > epsilon) { + err++; + printf ("Failed Validation on array b[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",bj,bAvgErr,abs(bAvgErr)/bj); + printf (" AvgRelAbsErr > Epsilon (%e)\n",epsilon); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array b: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,bj,b[j],abs((bj-b[j])/bAvgErr)); + } +#endif + } + } + printf(" For array b[], %d errors were found.\n",ierr); + } + if (abs(cAvgErr/cj) > epsilon) { + err++; + printf ("Failed Validation on array c[], AvgRelAbsErr > epsilon (%e)\n",epsilon); + printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",cj,cAvgErr,abs(cAvgErr)/cj); + printf (" AvgRelAbsErr > Epsilon (%e)\n",epsilon); + ierr = 0; + for (j=0; j epsilon) { + ierr++; +#ifdef VERBOSE + if (ierr < 10) { + printf(" array c: index: %ld, expected: %e, observed: %e, relative error: %e\n", + j,cj,c[j],abs((cj-c[j])/cAvgErr)); + } +#endif + } + } + printf(" For array c[], %d errors were found.\n",ierr); + } + if (err == 0) { + printf ("Solution Validates: avg error less than %e on all three arrays\n",epsilon); + } +#ifdef VERBOSE + printf ("Results Validation Verbose Results: \n"); + printf (" Expected a(1), b(1), c(1): %f %f %f \n",aj,bj,cj); + printf (" Observed a(1), b(1), c(1): %f %f %f \n",a[1],b[1],c[1]); + printf (" Rel Errors on a, b, c: %e %e %e \n",abs(aAvgErr/aj),abs(bAvgErr/bj),abs(cAvgErr/cj)); +#endif +} diff --git a/parthenon b/parthenon index 11c53d1c..c75ce20f 160000 --- a/parthenon +++ b/parthenon @@ -1 +1 @@ -Subproject commit 11c53d1cd4ada0629e06d069b70b410234ed0bde +Subproject commit c75ce20f938a4adaedb4425584954c3e74d56868 diff --git a/sparta b/sparta index 83d5f3a9..ca0ce28f 160000 --- a/sparta +++ b/sparta @@ -1 +1 @@ -Subproject commit 83d5f3a92c5fc0b59d4d973c6b1dddc4d77a7147 +Subproject commit ca0ce28fd76080d8b2828db77adde14fdc382c76 diff --git a/trilinos b/trilinos index f3ff0b54..5aaae1ad 160000 --- a/trilinos +++ b/trilinos @@ -1 +1 @@ -Subproject commit f3ff0b54c5158790295daff089ff0d286bda3c2c +Subproject commit 5aaae1ada6fe1ce777e671a0ff84fdc4f0779406 diff --git a/utils/pav_config/hosts/crossroads.yaml b/utils/pav_config/hosts/crossroads.yaml index 4e8b115c..e94ebcf9 100755 --- a/utils/pav_config/hosts/crossroads.yaml +++ b/utils/pav_config/hosts/crossroads.yaml @@ -6,14 +6,15 @@ variables: crayversion: "16.0.0" craympichversion: "8.1.26" partn: 'standard' + mpis: - { name: "cray-mpich", version: "{{craympichversion}}", mpicc: "cc", mpicxx: "CC", mpifc: "ftn", mpival: "cray"} compilers: - - 
{ name: "intel-classic", version: "{{intelversion}}", cc: "cc", cxx: "CC", pe_env: intel, fc: "ftn" } - - { name: "intel-oneapi", version: "{{intelversion}}", cc: "cc", cxx: "CC", pe_env: intel, fc: "ftn" } - - { name: "intel", version: "{{intelversion}}", cc: "cc", cxx: "CC", pe_env: intel, fc: "ftn" } - - { name: "cce", version: "{{crayversion}}", cc: "cc", cxx: "CC", pe_env: cray, fc: "ftn" } - + - { name: "intel-classic", version: "{{intelversion}}", cc: "cc", cxx: "CC", pe_env: intel, fc: "ftn", blas_cflags: '-DUSE_CBLAS=1 -O3 -qopenmp', arch_opt: ''} + - { name: "intel-oneapi", version: "{{intelversion}}", cc: "cc", cxx: "CC", pe_env: intel, fc: "ftn", blas_cflags: '-DUSE_CBLAS=1 -O3 -fopenmp', arch_opt: ''} + - { name: "intel", version: "{{intelversion}}", cc: "cc", cxx: "CC", pe_env: intel, fc: "ftn", blas_cflags: '-DUSE_CBLAS=1 -O3 -qopenmp', arch_opt: ''} + - { name: "cce", version: "{{crayversion}}", cc: "cc", cxx: "CC", pe_env: cray, fc: "ftn", blas_cflags: '-DUSE_CBLAS=1 -O3 -openmp', arch_opt: ''} + # - { name: "gcc", version: "12.2.0", pe_env: "PrgEnv-gnu", cc: "cc", cxx: "CC", fc: "ftn", blas_cflags: '-DUSE_CBLAS=1 -O3 -fopenmp', arch_opt: ''} scratch: - name: xrscratch path: "/lustre/xrscratch1/{{pav.user}}" diff --git a/utils/pav_config/test_src/dgemm b/utils/pav_config/test_src/dgemm new file mode 120000 index 00000000..8c5d18b9 --- /dev/null +++ b/utils/pav_config/test_src/dgemm @@ -0,0 +1 @@ +../../../microbenchmarks/dgemm \ No newline at end of file diff --git a/utils/pav_config/test_src/ior b/utils/pav_config/test_src/ior new file mode 120000 index 00000000..06a5443f --- /dev/null +++ b/utils/pav_config/test_src/ior @@ -0,0 +1 @@ +../../../microbenchmarks/ior \ No newline at end of file diff --git a/utils/pav_config/test_src/mdtest b/utils/pav_config/test_src/mdtest new file mode 120000 index 00000000..e7d2e746 --- /dev/null +++ b/utils/pav_config/test_src/mdtest @@ -0,0 +1 @@ +../../../microbenchmarks/mdtest \ No newline at end of file diff --git a/utils/pav_config/test_src/osumb b/utils/pav_config/test_src/osumb new file mode 120000 index 00000000..1315ac3b --- /dev/null +++ b/utils/pav_config/test_src/osumb @@ -0,0 +1 @@ +../../../microbenchmarks/osumb \ No newline at end of file diff --git a/utils/pav_config/test_src/stream b/utils/pav_config/test_src/stream new file mode 120000 index 00000000..55789187 --- /dev/null +++ b/utils/pav_config/test_src/stream @@ -0,0 +1 @@ +../../../microbenchmarks/stream \ No newline at end of file diff --git a/utils/pav_config/tests/stream.yaml b/utils/pav_config/tests/stream.yaml index fd438cc8..ed62f732 100644 --- a/utils/pav_config/tests/stream.yaml +++ b/utils/pav_config/tests/stream.yaml @@ -75,30 +75,33 @@ _base: STREAM ARRAY SIZE CALCULATIONS: ############### - STREAM - XRDS DOCUMENTATION: 4 x (45 MiB cache / processor) x (2 processors) / (3 arrays) / (8 bytes / element) = 15 Mi elements = 15000000 + FORMULA: + 4 x ((cache / socket) x (num sockets)) / (num arrays) / 8 (size of double) = 15 Mi elements = 15e6 ***************************************************************************************************** HASWELL: Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz CACHE: 40M SOCKETS: 2 - 4 * ( 40M * 2 ) / 3 ARRAYS / 8 Bytes/element = 13.4 Mi elements = 13400000 + 4 * ( 40M * 2 ) / 3 ARRAYS / 8 = 13.4 Mi elements = 13.4e6 ***************************************************************************************************** BROADWELL: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz CACHE: 45M SOCKETS: 2 - 4 * ( 45M * 2 ) / 3 ARRAYS / 8 
BYTES/ELEMENT = 15.0 Mi elements = 15000000 + 4 * ( 45M * 2 ) / 3 ARRAYS / 8 = 15.0 Mi elements = 15e6 ***************************************************************************************************** SAPPHIRE RAPIDS: Intel(R) Xeon(R) Platinum 8480+ - CACHE: 105 + CACHE: 105M SOCKETS: 2 - 4 x (105M * 2 ) / 3 ARRAYS / 8 BYTES/ELEMENT = 35 Mi elements = 35000000 + 4 x ( 105M * 2 ) / 3 ARRAYS / 8 = 35 Mi elements = 35e6 scheduler: slurm schedule: nodes: '10' # 'ALL' tasks_per_node: 1 share_allocation: false + + variables: ntimes: '10' + stream_array_size: '40' permute_on: - compilers @@ -129,14 +132,14 @@ _base: - 'fi' - 'export PAV_CC PAV_FC PAV_CFLAGS PAV_FFLAGS' - 'make clean' - - 'make {{target}} || exit 1' + - 'make all || exit 1' - '[ -x {{target}} ] || exit 1' run: env: CC: '{{compilers.cc}}' OMP_NUM_THREADS: '{{omp_num_threads}}' - GOMP_CPU_AFFINITY: '0-{{omp_num_threads-1}}' +# GOMP_CPU_AFFINITY: '0-{{omp_num_threads-1}}' preamble: - 'module load friendly-testing' - 'module load {{compilers.name}}/{{compilers.version}}' @@ -144,7 +147,7 @@ _base: cmds: - 'NTIMES={{ntimes}}' - 'N={{stream_array_size}}000000' - - 'echo "GOMP_CPU_AFFINITY: $GOMP_CPU_AFFINITY"' + # - 'echo "GOMP_CPU_AFFINITY: $GOMP_CPU_AFFINITY"' - 'echo "OMP_NUM_THREADS: $OMP_NUM_THREADS"' - 'echo "NTIMES=$NTIMES"' - 'echo "N=${N}"' @@ -292,7 +295,7 @@ spr_ddr5_xrds: "{{sys_name}}": [ darwin ] variables: arch: "spr" - stream_array_size: '105' + stream_array_size: '40' target: "xrds-stream.exe" omp_num_threads: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] omp_places: [cores, sockets] @@ -321,7 +324,7 @@ spr_hbm_xrds: "{{sys_name}}": [ darwin ] variables: arch: "spr" - stream_array_size: '105' + stream_array_size: '40' target: "xrds-stream.exe" omp_num_threads: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] omp_places: [cores, sockets] @@ -360,16 +363,13 @@ cts1_ats5: numnodes: '1' omp_num_threads: '1' stream_array_size: '40' + target: "stream_mpi.exe" schedule: nodes: "{{numnodes}}" share_allocation: true tasks_per_node: "{{tpn}}" - run: - env: - GOMP_CPU_AFFINITY: '' - result_parse: regex: triad_once: @@ -396,8 +396,8 @@ xrds_ats5: variables: tpn: [8, 32, 56, 88, 112] arch: "spr" - target: "xrds-stream.exe" - stream_array_size: '105' + target: "xrds_stream.exe" + stream_array_size: '40' ntimes: 20 #omp_places: [cores, sockets] #omp_proc_bind: [true] @@ -407,7 +407,6 @@ xrds_ats5: chunk: '{{chunk_ids.0}}' schedule: - partition: 'hbm' nodes: "{{numnodes}}" share_allocation: true tasks_per_node: "{{tpn}}" @@ -430,9 +429,6 @@ xrds_ats5: - 'module load {{compilers.name}}/{{compilers.version}}' - 'module load {{mpis.name}}/{{mpis.version}}' - env: - GOMP_CPU_AFFINITY: '' - result_parse: regex: triad_once: @@ -444,3 +440,9 @@ xrds_ats5: result_evaluate: per_proc_bw: 'sum(triad_once)/len(triad_once)' total_bw: 'sum(triad_once)' + +roci_ats5: + inherits_from: xrds_ats5 + + schedule: + partition: 'hbm' diff --git a/utils/pavilion b/utils/pavilion index 69b2d45d..f502ca86 160000 --- a/utils/pavilion +++ b/utils/pavilion @@ -1 +1 @@ -Subproject commit 69b2d45d696e623127c106b50525ba65daa23d76 +Subproject commit f502ca86fa27f4bc894aa19232c9f1f42361e269