Contributors: David A. Bader (Georgia Institute of Technology), Jonathan Berry (Sandia National Laboratories), Simon Kahan (Pacific Northwest National Laboratory and University of Washington), Richard Murphy (Micron Technology), E. Jason Riedy (Georgia Institute of Technology), and Jeremiah Willcock (Indiana University).
Version History:
- V0.1
- Draft, created 28 July 2010
- V0.2
- Draft, created 29 September 2010
- V0.3
- Draft, created 30 September 2010
- V1.0
- Created 1 October 2010
- V1.1
- Created 3 October 2010
- V3.0
- Created XXX 2013
Version 0.1 of this document was part of the Graph 500 community benchmark effort, led by Richard Murphy (then at Sandia National Laboratories). The intent is that there will be at least three variants of implementations, on shared memory and threaded systems, on distributed memory clusters, and on external memory map-reduce clouds. This specification is for the first of potentially several benchmark problems. The version number jumped to three to synchronize with the reference code.
Data-intensive supercomputer applications are an increasingly important workload, but are ill-suited for platforms designed for 3D physics simulations. Application performance cannot be improved without a meaningful benchmark. Graphs are a core part of most analytics workloads. Backed by a steering committee of 30 international HPC experts from academia, industry, and national laboratories, this specification establishes a large-scale benchmark for these applications. It will offer a forum for the community and provide a rallying point for data-intensive supercomputing problems. This is the first serious approach to augment the Top 500 with data-intensive applications.
The intent of this benchmark problem (“Search”) is to develop a compact application that has multiple analysis techniques (multiple kernels) accessing a single data structure representing a weighted, undirected graph. In addition to a kernel to construct the graph from the input tuple list, there is one additional computational kernel to operate on the graph.
This benchmark includes a scalable, reproducible data generator which produces edge tuples containing the start vertex and end vertex for each edge. The first kernel constructs an undirected graph in a format usable by all subsequent kernels. No subsequent modifications are permitted to benefit specific kernels. The second kernel performs multiple breadth-first searches of the graph. The third kernel performs multiple single-source shortest path computations on the graph. Each run of the second and third kernel is independent of the others and uses only the output of the initial construction from the first kernel.
All kernels are timed and reported. The ranking used for the official Graph500 listing is described in the section on Evaluation Criteria. The other data is useful both for explaining the results and for understanding how graph searches behave on available platforms.
This benchmark also specifies a reference implementation. This implementation is not tuned for any particular system or hardware platform. The reference implementation defines the edge list generator. The generator can be used separately from the timed kernels.
Submissions are required to include performance of the reference implementation if the reference implementation runs on their platform. The performance of the reference implementation gives some indication of the performance of portable graph search code written by typical programmers. Submissions are encouraged to include results from platform-tuned and optimized codes along with the results from the most applicable reference code. If no reference code applies to a particular platform, the reference code performance results need not be included, although the graph data must be generated correctly as with the reference generator.
- Generator:
- Changed graph generator parameters.
- Begin with a tree to connect all vertices.
- Use a location-based hash for a PRNG. All implementations should produce identical graphs. The edge list need not be generated explicitly but may be computed on-the-fly.
- “Permute” edge list locations by index multiplication rather than a full permutation. This scatters the tree edges around the edge list without excess data motion.
- All kernels:
- Reduced the number of search roots from 64 to eight because the graph is fully connected.
- Both search kernels (2 and 3) use a single, unified, and simplified validation routine.
- Kernel 1, graph construction:
- Removed restrictions on internal data structure.
- No longer computes the number of vertices.
- Kernel 2, BFS:
- No significant changes to the specification, but the reference implementation should be faster.
- Kernel 3, single-source shortest paths:
- New kernel.
- Results:
- New submission format. Submissions provide sizes and times but do not need to compute their own statistics.
- Require running the reference code if possible as in the Top500 list.
- D.A. Bader, J. Feo, J. Gilbert, J. Kepner, D. Koester, E. Loh, K. Madduri, W. Mann, Theresa Meuse, HPCS Scalable Synthetic Compact Applications #2 Graph Analysis (SSCA#2 v2.2 Specification), 5 September 2007.
- Richard C. Murphy, Kyle B. Wheeler, Brian W. Barrett, James A. Ang, “Introducing the Graph 500,” Cray User’s Group (CUG), May 5, 2010.
- Richard C. Murphy, Jonathan Berry, William McLendon, Bruce Hendrickson, Douglas Gregor, Andrew Lumsdaine, “DFS: A Simple to Write Yet Difficult to Execute Benchmark,” IEEE International Symposium on Workload Characterizations 2006 (IISWC06), San Jose, CA, 25-27 October 2006.
The benchmark performs the following steps, where BFS refers to a breadth-first search and SSSP to a single-source shortest paths computation, both described below:
- Generate the random edge list.
- Randomly sample 8 unique initial vertices.
- Construct a graph from the edge list (timed, Kernel 1).
- For each initial vertex:
- Compute the BFS parent array (timed, Kernel 2).
- Validate that the parent array is a correct BFS search tree for the given search root.
- For each initial vertex:
- Compute the SSSP parent array and distances (timed, Kernel 3).
- Validate that the parent array and distance vector are a correct SSSP search tree for the given search root.
- Compute and output performance information.
Only the sections marked as timed are included in the performance information. Note that Kernel 2 and Kernel 3 are run in separate loops, not consecutively off the same initial vertex. All mentions of “random” refer to the reproducible pseudo-random number generator included in the reference implementation. This benchmark is an artificial system measurement and not a direct representation of actual applications. Therefore no extra information, such as optimal parameter settings, may be passed between kernel invocations, although Kernel 1 may pre-compute reasonable data statistics and parameters used by all later kernels without further changes.
The benchmark takes only one parameter as input:
- SCALE
- The SCALE parameter controls the overall size of the graph. The generated graph contains 2^SCALE vertices. The number of entries in the generated edge list is 2^SCALE * edgefactor, where edgefactor is an internal parameter described below.
The benchmark also contains internal parameters with required settings for submission. Experimenting with different settings is useful for testing and exploration but not permitted for submitted results.
- edgefactor = 16
- The ratio of the number of entries in the generated edge list to the number of vertices.
- maxweight = 255
- The maximum weight of an entry in the generated edge list. Because an edge may appear multiple times in the list and its entries' weights are summed, this is not necessarily the maximum edge weight in the graph.
- A = 0.55, B = 0.1
- The parameters A and B control quadrant probabilities in the RMAT edge generator subject to the restrictions that 0 ≤ A ≤ 1, 0 ≤ B ≤ 1, and A+2B ≤ 1.
- noisefact = 0.1
- The RMAT generator perturbs A and B by a random quantity weighted by noisefact.
- nroots = 8
- The number of search roots used for running Kernels 2 and 3.
The rest of the specification uses two parameters for the graph size rather than repeating the expressions above.
- NV = 2^SCALE
- The number of vertices.
- NE = edgefactor * NV
- The number of entries in the edge list.
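For concreteness, a minimal C sketch of these two derived sizes follows; note the 64-bit integer literal, which is needed once SCALE reaches 31.

```c
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    const int scale = 26;                    /* example value; submissions choose their own SCALE */
    const int64_t edgefactor = 16;           /* required setting for submitted results */
    const int64_t nv = INT64_C(1) << scale;  /* NV = 2^SCALE vertices */
    const int64_t ne = edgefactor * nv;      /* NE = edgefactor * NV edge list entries */
    printf("NV = %" PRId64 ", NE = %" PRId64 "\n", nv, ne);
    return 0;
}
```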
The pseudo-random number generator (PRNG) used in this benchmark, threefry32x4_10 from the Random123 package referenced below, essentially hashes a location-based counter into four random 32-bit bitstrings. Each use of the PRNG defines the mapping from that use's location to the two PRNG parameters. Given two 64-bit integers I and J, PRNG(I, J) returns four 32-bit floating-point numbers.
A location-based PRNG guarantees the numbers will be reproducible across different platforms. We use floating-point numbers to spread bias across the interval rather than defining how to iterate for rejection sampling.
- John K. Salmon, Mark A. Moraes, Ron O. Dror, and David E. Shaw. 2011. Parallel random numbers: as easy as 1, 2, 3. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC ‘11). ACM, New York, NY, USA. http://dx.doi.org/10.1145/2063384.2063405
- Random123 software distribution: http://www.deshawresearch.com/resources_random123.html
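The following C sketch shows one plausible way to wrap Random123's threefry4x32 generator with ten rounds behind the PRNG(I, J) interface. The packing of I and J into the counter words, the use of a zero key, and the scaling of the four 32-bit outputs into [0, 1) are illustrative assumptions; the reference implementation defines the exact mapping.

```c
#include <stdint.h>
#include <Random123/threefry.h>   /* threefry4x32_R, threefry4x32_ctr_t, threefry4x32_key_t */

/* Illustrative wrapper: PRNG(I, J) -> four doubles in [0, 1).
   How I and J fill the counter/key words and how the 32-bit outputs map to
   [0, 1) are assumptions; see the reference code for the exact scheme. */
static void prng(int64_t I, int64_t J, double out[4])
{
    threefry4x32_ctr_t ctr;
    threefry4x32_key_t key;
    ctr.v[0] = (uint32_t)(uint64_t)I;
    ctr.v[1] = (uint32_t)((uint64_t)I >> 32);
    ctr.v[2] = (uint32_t)(uint64_t)J;
    ctr.v[3] = (uint32_t)((uint64_t)J >> 32);
    key.v[0] = key.v[1] = key.v[2] = key.v[3] = 0;     /* hypothetical fixed key */

    threefry4x32_ctr_t r = threefry4x32_R(10, ctr, key);  /* ten rounds, per the spec */
    for (int k = 0; k < 4; ++k)
        out[k] = r.v[k] * (1.0 / 4294967296.0);        /* scale 32-bit output into [0, 1) */
}
```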
The benchmark defines a list of NE undirected edges that represent a fully connected, undirected graph on NV vertices. The edges are permuted in a pseudo-random fashion with a computable and invertible permutation. The list locations dictate where edge entries appear, and the list indices (unpermuted locations) determine the edge kind and provide the input for the PRNG. Vertex numbers are scrambled to eliminate generator locality. This list of edges may be generated explicitly, read from storage, or generated on-the-fly within Kernel 1. All edge generation must occur before Kernel 2.
Our goal is to provide a natural yet even starting line for all implementations. If an implementation distinguishes between available memory spaces, the edge list must be mapped onto those memory spaces in a balanced manner. The edge list also must be reproducible for the same size across such different platforms yet be sufficiently permuted not to allow “cheating” through knowledge of the edge index.
If in doubt, use the reference implementation edge generation routines.
The list of NE edges abstractly is a single array of edge entries. Each edge entry consists of two vertices and a weight. Each vertex is an integer at least zero and less than NV. The weight is an integer at least zero and at most maxweight. The edge list must provide at least 48 bits per vertex and eight bits per weight. Each implementation maps the array onto its representation of distinguished memory spaces in a balanced manner.
An implementation using a single, undistinguished memory space (e.g. OpenMP, Cilk, Cray XMT) maps the edge array into the global edge list directly: entry k of the global array is entry k of the edge list. While many of these programming systems are implemented on top of distinguished memory spaces (e.g. NUMA systems), the programming system itself abstracts the mapping so the spaces appear uniform. The programming system may optimize for the mapping, but the benchmark code must not include those optimizations unless the code explicitly treats the spaces as distinguished. Programming systems may not special-case Graph500 code for submitted results.
An implementation with separate, distinguished memory spaces (e.g. MPI, UPC, OpenCL with multiple devices) maps the local arrays into the global edge list in a contiguous, balanced fashion. Assume all memory spaces are equally sized and are enumerated with integers starting at zero. Given NP total memory spaces, let
- NE_space(i) = floor(NE/NP) + (i < NE%NP ? 1 : 0) be the number of entries in space i,
- NE_begin(i) = i * floor(NE/NP) + min(i, NE%NP) be the first index stored in space i, and
- NE_end(i) = NE_begin(i+1) be one past the last index stored in space i.
Then memory space i stores list locations starting with NE_begin(i) up to but not including NE_end(i). This specification does not dictate the mapping of memory spaces to distinguished memories.
This allocation guarantees that no memory space holds more than one edge more or less than any other memory space. If the memory spaces are not equally sized, the edge list must be allocated in each proportional to the memory space’s share of the total memory size, and no memory space of the same size may hold more than one more or less than one fewer edge than any other space of the same size.
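A minimal C sketch of these index computations for equally sized memory spaces (integer division provides the floor):

```c
#include <stdint.h>

/* Number of edge list entries owned by memory space i of NP equally sized spaces. */
static int64_t ne_space(int64_t NE, int64_t NP, int64_t i)
{
    return NE / NP + (i < NE % NP ? 1 : 0);
}

/* First global edge list index stored in memory space i. */
static int64_t ne_begin(int64_t NE, int64_t NP, int64_t i)
{
    int64_t r = NE % NP;
    return i * (NE / NP) + (i < r ? i : r);
}

/* One past the last global edge list index stored in memory space i. */
static int64_t ne_end(int64_t NE, int64_t NP, int64_t i)
{
    return ne_begin(NE, NP, i + 1);
}
```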
This benchmark permutes edge list indices from the list location k’ to an edge list index k based on the group structure of integers modulo NE. This section is written more generally than required for result submission.
Given an integer Z constant for a given SCALE and edgefactor that is relatively prime to the number of edge list entries NE, let
- k = Z * k’ mod NE, and
- k’ = Zinv * k mod NE.
Here Zinv is the integer inverse of Z modulo NE.
For submitted results, Z is the first integer relatively prime to NE such that Z > floor(3*NE/4). Z and Zinv can be computed in many ways. One simple way, as in Algorithm \ref{alg:compute.perm} below and the sketch that follows, relies on the Euclidean algorithm for computing the greatest common divisor of a proposed Z and the given NE.
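A sketch of one way to compute Z, Zinv, and the permutation; it uses the Euclidean algorithm for the relative-primality test and the extended Euclidean algorithm for the inverse. The 128-bit intermediate in the final routine is a GCC/Clang extension used here only to avoid overflow at large SCALE.

```c
#include <stdint.h>

/* Greatest common divisor via Euclid's algorithm. */
static int64_t gcd(int64_t a, int64_t b)
{
    while (b != 0) { int64_t t = a % b; a = b; b = t; }
    return a;
}

/* Modular inverse of z modulo ne via the extended Euclidean algorithm;
   assumes gcd(z, ne) == 1. */
static int64_t mod_inverse(int64_t z, int64_t ne)
{
    int64_t r0 = ne, r1 = z, t0 = 0, t1 = 1;
    while (r1 != 0) {
        int64_t q = r0 / r1;
        int64_t r = r0 - q * r1; r0 = r1; r1 = r;
        int64_t t = t0 - q * t1; t0 = t1; t1 = t;
    }
    return (t0 % ne + ne) % ne;        /* normalize into [0, ne) */
}

/* First Z > floor(3*NE/4) that is relatively prime to NE. */
static int64_t compute_Z(int64_t ne)
{
    int64_t z = 3 * ne / 4 + 1;
    while (gcd(z, ne) != 1) ++z;
    return z;
}

/* k = Z * k' mod NE; the product can exceed 64 bits at large SCALE,
   so use a wider intermediate type or an equivalent mulmod. */
static int64_t permute(int64_t kprime, int64_t Z, int64_t ne)
{
    return (int64_t)(((unsigned __int128)Z * (unsigned __int128)kprime)
                     % (unsigned __int128)ne);
}
```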
The first NV-1 unpermuted list entries, those with 0 ≤ k < NV-1, are tree edges that guarantee the graph is connected. The remaining unpermuted edges are RMAT edges. The case when NV-1 > NE will never occur in submitted results and is left unspecified.
Given an unpermuted edge index k, the vertices for index k are floor(k/2) and k+1. The weight is given by maxweight * PRNG(NV,k). Algorithm \ref{alg:tree.edge} provides the high-level implementation of the tree edge generator.
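A minimal sketch of this mapping, assuming the first of the four values returned by PRNG(NV, k) is scaled and floored to form the weight (the reference code and Algorithm \ref{alg:tree.edge} define the exact choice):

```c
#include <stdint.h>
#include <math.h>

/* Illustrative tree edge for unpermuted index k, 0 <= k < NV-1.
   r[] holds the four values returned by PRNG(NV, k); using r[0] for the
   weight, and flooring it, are assumptions for illustration only. */
static void tree_edge(int64_t k, const double r[4], int maxweight,
                      int64_t *v0, int64_t *v1, int *weight)
{
    *v0 = k / 2;       /* floor(k/2) */
    *v1 = k + 1;
    *weight = (int)floor(maxweight * r[0]);
}
```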
The additional edges come from a RMAT edge generator similar to the Recursive MATrix (R-MAT) scale-free graph generation algorithm [Chakrabarti, et al., 2004]. For ease of discussion, the description of this R-MAT generator uses an adjacency matrix data structure; however, implementations may use any alternate approach that outputs the equivalent list of edge tuples. This model recursively sub-divides the adjacency matrix of the graph into four equal-sized partitions and distributes edges within these partitions with unequal probabilities.
Each edge chooses one of the four partitions with probabilities A, B, C, and D, respectively. These probabilities, the initiator parameters, are provided in Table \ref{tbl:initiator}. For this undirected graph, only parameters A and B are independent. The parameters are perturbed for each level as in [Seshadhri, et al., 2011]. Algorithm \ref{alg:rmat.edge} provides the high-level listing for generating RMAT edges and shows the mapping from edge index to PRNG arguments.
| A = 0.55 | B = 0.1 |
| C = B = 0.1 | D = 1 - (A+B+C) = 0.25 |
The RMAT generator is parallel down to the bit level. The location-based PRNG guarantees any parallelization produces the same result up to differences in floating-point arithmetic. All IEEE-754-conforming platforms should produce identical results; this benchmark is to be run with the default rounding direction.
The PRNG takes two arguments and returns four pseudo-random numbers. For edge index k, the first PRNG argument always is k. The weight is generated by using zero for the second argument. For each bit level s from 0 to SCALE-1 in least- to most-significant bit order, the second argument in the PRNG is 1+floor(s / 2). Given four returned pseudo-random values labeled 0 through 3, bit level s uses two values. The first value, labeled 2*(s % 2), provides the parameter perturbation. The second, labeled 1+2*(s % 2), provides the quadrant. The example code in Algorithm \ref{alg:rmat.edge} and the reference implementation implement this more efficiently by generating all the 2*SCALE pseudo-random numbers into one column-major array.
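For illustration, the per-level quadrant choice can be sketched as follows, taking the pre-generated values as a flat array indexed by bit level as described above. The perturbation applied to A and B is an assumed form; Algorithm \ref{alg:rmat.edge} and the reference implementation are authoritative.

```c
#include <stdint.h>

/* Illustrative RMAT vertex-pair generator for one edge.  rnd[] holds the
   2*SCALE pseudo-random values for this edge laid out so that rnd[2*s] is
   the perturbation value and rnd[2*s+1] the quadrant value for bit level s,
   matching the column-major layout described above. */
static void rmat_edge(int SCALE, const double *rnd, int64_t *i_out, int64_t *j_out)
{
    const double A = 0.55, B = 0.1, noisefact = 0.1;
    int64_t i = 0, j = 0;

    for (int s = 0; s < SCALE; ++s) {
        double u_perturb = rnd[2 * s];
        double u_quad    = rnd[2 * s + 1];

        /* Perturb A and B (assumed form: scale by 1 + noisefact*(2u - 1)). */
        double a = A * (1.0 + noisefact * (2.0 * u_perturb - 1.0));
        double b = B * (1.0 + noisefact * (2.0 * u_perturb - 1.0));
        double c = b;                      /* C = B for the undirected graph */

        /* Choose a quadrant and set bit s of the row and/or column index. */
        if (u_quad < a) {
            /* quadrant A: neither bit set */
        } else if (u_quad < a + b) {
            j |= INT64_C(1) << s;          /* quadrant B */
        } else if (u_quad < a + b + c) {
            i |= INT64_C(1) << s;          /* quadrant C */
        } else {
            i |= INT64_C(1) << s;          /* quadrant D */
            j |= INT64_C(1) << s;
        }
    }
    *i_out = i;
    *j_out = j;
}
```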
To remove vertex numbering locality, vertex numbers are scrambled. The scrambled numbers remain in the range [0, 2^SCALE). The exact scrambling algorithm is provided in the reference code. The scrambling uses two 64-bit seed values derived from PRNG(-1, -1).
- D. Chakrabarti, Y. Zhan, and C. Faloutsos, R-MAT: A recursive model for graph mining, SIAM Data Mining 2004.
- C. Seshadhri, A. Pinar, and T.G. Kolda, “An In-depth Study of Stochastic Kronecker Graphs,” 2011 IEEE 11th International Conference on Data Mining (ICDM), pp.587-596, 11-14 Dec. 2011 doi: 10.1109/ICDM.2011.23. Pre-print at http://arxiv.org/abs/1102.5046 .
The first kernel may transform the edge list to any data structures (held in internal or external memory) that are used unmodified for the remaining kernels. For instance, Kernel 1 may construct a (sparse) graph from a list of tuples; each tuple contains endpoint vertex identifiers for an edge, and a weight that represents data assigned to the edge.
The graph may be represented in any manner, but it must not be modified by or between subsequent kernels. A constant number of scalars (not proportional to the graph size) may be collected during construction for later use. The general graph structure must not include data proportional to the graph size that is specific to the algorithms implementing Kernel 2 or Kernel 3, such as short-cut edges or spanning trees. Using Kernel 3 as an example, pre-computing the Δ for Δ-stepping algorithms is permitted, but pre-computing the component hierarchies for Thorup's algorithm is not.
There are various internal memory representations for sparse graphs, including (but not limited to) sparse matrices and (multi-level) linked lists. For the purposes of this application, the kernel is provided only the total number of vertices, the edge list, and the edge list’s size. Further information must be computed within this kernel. Algorithm \ref{alg:kernel.1} provides a high-level sample implementation of Kernel 1.
The process of constructing the graph data structure (in internal or external memory) from the set of tuples must be timed and is reported in the output.
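As one possible, untuned realization (not a required data structure), Kernel 1 could build a compressed sparse row (CSR) representation in a few passes over the edge list, keeping duplicate entries and storing each undirected entry in both directions. A minimal sketch:

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative CSR build for Kernel 1.  xoff has nv+1 entries; the
   neighbors and weights of vertex v occupy adj[xoff[v] .. xoff[v+1]-1]
   and wgt[...].  Duplicate edge entries are kept as-is; error handling
   is omitted for brevity. */
static void build_csr(int64_t nv, int64_t ne,
                      const int64_t *v0, const int64_t *v1, const int64_t *w,
                      int64_t **xoff_out, int64_t **adj_out, int64_t **wgt_out)
{
    int64_t *xoff = calloc(nv + 1, sizeof *xoff);
    int64_t *adj  = malloc(2 * ne * sizeof *adj);
    int64_t *wgt  = malloc(2 * ne * sizeof *wgt);
    int64_t *cur  = malloc(nv * sizeof *cur);

    /* Pass 1: count the degree contribution of each endpoint. */
    for (int64_t k = 0; k < ne; ++k) { ++xoff[v0[k] + 1]; ++xoff[v1[k] + 1]; }
    /* Prefix sum turns counts into row offsets. */
    for (int64_t v = 0; v < nv; ++v) xoff[v + 1] += xoff[v];
    /* Pass 2: place each entry in both endpoints' adjacency ranges. */
    for (int64_t v = 0; v < nv; ++v) cur[v] = xoff[v];
    for (int64_t k = 0; k < ne; ++k) {
        adj[cur[v0[k]]] = v1[k]; wgt[cur[v0[k]]++] = w[k];
        adj[cur[v1[k]]] = v0[k]; wgt[cur[v1[k]]++] = w[k];
    }
    free(cur);
    *xoff_out = xoff; *adj_out = adj; *wgt_out = wgt;
}
```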
The search keys must be randomly sampled without replacement from the vertices in the graph. If there are fewer than eight vertices, select all vertices. This should never occur with the graph sizes in this benchmark. The number of vertices selected is included in the output, but this step is untimed. These vertices are used in all kernels below and need be sampled only once. The search vertices are derived from the output of PRNG(NE, k) for k > 0, treating the random 128 bits as a pair of double-precision floating point numbers. Algorithm \ref{alg:sample.roots} shows a high-level sample implementation.
A Breadth-First Search (BFS) of a graph starts with a single source vertex, then, in phases, finds and labels its neighbors, then the neighbors of its neighbors, etc. This is a fundamental method on which many graph algorithms are based. A formal description of BFS can be found in Cormen, Leiserson, and Rivest. We specify the input and output for a BFS benchmark, and we impose some constraints on the computation. However, we do not constrain the choice of BFS algorithm itself, as long as the implementation produces a correct BFS tree as output.
This benchmark’s memory access pattern (internal or external) is data-dependent with small average prefetch depth. As in a simple concurrent linked-list traversal benchmark, performance reflects an architecture’s throughput when executing concurrent threads, each of low memory concurrency and high memory reference density. Unlike such a benchmark, this one also measures resilience to hot-spotting when many of the memory references are to the same location; efficiency when every thread’s execution path depends on the asynchronous side-effects of others; and the ability to dynamically load balance unpredictably sized work units. Measuring synchronization performance is not a primary goal here.
You may not search from multiple initial vertices concurrently. No information can be passed between different invocations of this kernel. The kernel may return a depth array to be used in validation.
ALGORITHM NOTE We allow a benign race condition when vertices at BFS level k are discovering vertices at level k + 1. Specifically, we do not require synchronization to ensure that the first visitor must become the parent while locking out subsequent visitors. As long as the discovered BFS tree is correct at the end, the algorithm is considered to be correct.
For each initial vertex, the routine must return the valid breadth-first search parent information per vertex in the graph. The parent of the initial vertex is itself. The graph is fully connected, so all vertices have parents. Algorithm \ref{alg:kernel.2} provides a sample (and inefficient) high-level implementation of Kernel 2.
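For illustration only, a sequential level-by-level sketch over the CSR arrays from the Kernel 1 sketch above; a tuned implementation would differ substantially, and Algorithm \ref{alg:kernel.2} remains the normative sample.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sequential BFS sketch over a CSR graph given by xoff[] and adj[].
   Returns the parent array with parent[root] == root; the first visitor
   of a vertex becomes its parent. */
static int64_t *bfs_parents(int64_t nv, const int64_t *xoff, const int64_t *adj,
                            int64_t root)
{
    int64_t *parent = malloc(nv * sizeof *parent);
    int64_t *queue  = malloc(nv * sizeof *queue);
    for (int64_t v = 0; v < nv; ++v) parent[v] = -1;

    int64_t head = 0, tail = 0;
    parent[root] = root;
    queue[tail++] = root;
    while (head < tail) {
        int64_t u = queue[head++];
        for (int64_t p = xoff[u]; p < xoff[u + 1]; ++p) {
            int64_t v = adj[p];
            if (parent[v] < 0) {        /* unvisited: adopt u as the parent */
                parent[v] = u;
                queue[tail++] = v;
            }
        }
    }
    free(queue);
    return parent;
}
```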
A single-source shortest paths (SSSP) computation finds the shortest distance from a given starting vertex to every other vertex in the graph. A formal description of SSSP on graphs with non-negative weights also can be found in Cormen, Leiserson, and Rivest. We specify the input and output for a SSSP benchmark, and we impose some constraints on the computation. However, we do not constrain the choice of SSSP algorithm itself, as long as the implementation produces a correct SSSP distance vector and parent tree as output. This is a separate kernel and cannot use data computed by Kernel 2 (BFS).
This kernel extends the overall benchmark with additional tests and data access per vertex. Many but not all algorithms for SSSP are similar to BFS and suffer from similar issues of hot-spotting and duplicate memory references.
You may not search from multiple initial vertices concurrently. No information can be passed between different invocations of this kernel.
ALGORITHM NOTE We allow benign race conditions within SSSP as well. We do not require that a first visitor must prevent subsequent visitors from taking the parent slot. As long as the SSSP distances and parent tree are correct at the end, the algorithm is considered to be correct.
For each initial vertex, the routine must return the distance of each vertex from the initial vertex and the parent of each vertex in a valid single-source shortest path tree. The parent of the initial vertex is itself. The graph is fully connected, so all vertices have parents. Algorithm \ref{alg:kernel.3} provides a sample (and inefficient) high-level implementation of Kernel 3.
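For illustration only, a deliberately simple sequential Dijkstra sketch over CSR arrays. It assumes duplicate edge entries have already been combined into a single edge per direction whose weight is their sum, matching the validation rule below; its quadratic vertex selection is far too slow to time at scale, and Algorithm \ref{alg:kernel.3} remains the normative sample.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simple O(NV^2) Dijkstra sketch over a CSR graph (xoff, adj, wgt).
   Fills dist[] and parent[] with parent[root] == root. */
static void sssp_dijkstra(int64_t nv, const int64_t *xoff, const int64_t *adj,
                          const int64_t *wgt, int64_t root,
                          int64_t *dist, int64_t *parent)
{
    char *done = calloc(nv, 1);
    for (int64_t v = 0; v < nv; ++v) { dist[v] = INT64_MAX; parent[v] = -1; }
    dist[root] = 0;
    parent[root] = root;

    for (int64_t iter = 0; iter < nv; ++iter) {
        /* Select the unfinished vertex with the smallest tentative distance. */
        int64_t u = -1;
        for (int64_t v = 0; v < nv; ++v)
            if (!done[v] && dist[v] < INT64_MAX && (u < 0 || dist[v] < dist[u]))
                u = v;
        if (u < 0) break;
        done[u] = 1;
        /* Relax every edge leaving u. */
        for (int64_t p = xoff[u]; p < xoff[u + 1]; ++p) {
            int64_t v = adj[p];
            if (dist[u] + wgt[p] < dist[v]) {
                dist[v] = dist[u] + wgt[p];
                parent[v] = u;
            }
        }
    }
    free(done);
}
```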
The Shortest Path Problem: Ninth DIMACS Implementation Challenge. C. Demetrescu, A.V. Goldberg, and D.S. Johnson, eds. DIMACS series in discrete mathematics and theoretical computer science, American Mathematical Society, 2009.
9th DIMACS Implementation Challenge - Shortest Paths. http://www.dis.uniroma1.it/~challenge9/
It is not intended that the results of full-scale runs of this benchmark can be validated by exact comparison to a standard reference result. At full scale, the data set is enormous, and its exact details depend on the BFS or SSSP algorithm used. Therefore, the validation of an implementation of the benchmark uses soft checking of the results. Validation is not part of the timed results.
The executable specification verifies its results by comparing them with results computed directly from the tuple list. Note that the SSSP kernel uses the sum of the weights for a particular edge, which may have multiple entries in the input tuple list. Because the edges need to be re-collapsed into a graph form, only a sampling of the edges is checked. Every tree edge must be checked. Additionally, select 2*SCALE vertices similarly to sampling the root vertices, but instead of using PRNG(NE, k), use PRNG(ceil(NE / kerneltime), k), where kerneltime is the time required by the kernel being verified. This produces a less predictable sequence of vertices. For each such vertex, only the lesser of edgefactor and the vertex's degree adjacent edges need be checked. This specification does not require any particular choice of adjacent edges.
Here we specify the validation for the SSSP computation (Kernel 3) and treat BFS (Kernel 2) as a special case. Let w(u, v) be the weight of an edge {u, v}, and let d(u) be the distance of vertex u from the source. After each search, run a function that ensures that the discovered SSSP tree of parents is correct by ensuring that:
- the SSSP tree is a tree rooted at the search vertex and without cycles,
- a node and its parent are joined by an edge of the original graph,
- w(u, v) + d(u) - d(v) ≥ 0 for all unordered input edges {u, v} where d(u) < d(v),
- abs(w(u, v) + d(u) - d(v)) ≤ 1 for a BFS tree, and
- w(u, v) + d(u) - d(v) == 0 when u is a parent of v.
A BFS tree is an SSSP tree with every edge weight set to one and a maximum constraint gap (the third item above) of one. The distance d(u) is then the depth of vertex u.
Algorithm \ref{alg:verify} shows a sample validation routine. This sample optionally takes a distance vector d and a parameter is_bfs. The latter, is_bfs, is set to 1 (true) for BFS validation and 0 (false) for SSSP validation. The same core validation routine may be used for both kernels.
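A minimal sketch of the per-edge distance checks above (for the BFS case, pass w = 1 for every edge, per the BFS special case described above); the tree-structure and cycle checks are omitted, and Algorithm \ref{alg:verify} remains the normative sample.

```c
#include <stdint.h>

/* Distance checks for one undirected edge {u, v} of (summed) weight w
   against the distance array d[]; is_bfs selects the tighter BFS rule.
   Returns 1 if the edge is consistent with a valid tree, 0 otherwise. */
static int check_edge(int64_t u, int64_t v, int64_t w,
                      const int64_t *d, int is_bfs)
{
    if (d[u] > d[v]) { int64_t t = u; u = v; v = t; }  /* orient so d(u) <= d(v) */
    int64_t gap = w + d[u] - d[v];
    if (gap < 0) return 0;                  /* d(v) may not exceed d(u) + w */
    if (is_bfs && gap > 1) return 0;        /* BFS: |w + d(u) - d(v)| <= 1 */
    return 1;
}

/* A tree edge from parent p to child v must be tight: d(v) == d(p) + w. */
static int check_tree_edge(int64_t p, int64_t v, int64_t w, const int64_t *d)
{
    return d[v] == d[p] + w;
}
```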
Start timing a search immediately prior to visiting the search root. Stop timing that search when its output has been written to memory. Do not time any I/O outside of the search routine. If your algorithm relies on problem-specific data, such as a degree threshold in Kernel 2 or a setting for Δ or short-cut edges in a Δ-stepping algorithm for Kernel 3, you must include the setup time for such structures in each search. The spirit of the benchmark is to gauge the performance of a single search. We run many searches in order to compute means and variances, not to amortize data analysis time.
This output format is still very much in revision; the goal is something easily machine-parseable for simple submission.
Each submission for the Graph500 list consists of a collection of headers followed by per-search-root data. The submission system will accept US-ASCII; use other character sets at your own risk. Submissions must include a reference implementation if possible and may include a tuned implementation. These need not be run on the same scale data; custom implementations can use more efficient structures to scale to larger data. All times and rates must be provided to at least 8 significant digits.
Each timing data set is preceded by the following header information in a simple tagged format similar to message headers. The tag is followed by a colon character, whitespace, and then the information. The following tags are defined:
- MACHINE
- The name used for the entry on the Graph500 list.
- COMMENT
- An optional comment on the machine.
- IMPLEMENTATION
- Denotes the implementation used. If the implementation is a reference implementation, the information begins with “Reference” and will be one of the following. Otherwise the implementation is considered custom and the information is an optional description.
- Reference sequential
- Reference OpenMP
- Reference MPI
- Reference MPI+OpenMP
- Reference UPC (if available)
- Reference OpenCL (if available)
- Reference MPI+OpenCL (if available)
- SCALE
- Graph generation parameter
- EDGEFACTOR
- Graph generation parameter, 16 for current submitted results
- NROOT
- Number of searches run, 8 for current submitted results
- K1TIME
- Time required for Kernel 1, graph construction.
- PRNGCHECK
- The first 32-bit integer produced by the pseudo-random number generator when given SCALE and EDGEFACTOR as its two inputs.
The line-oriented timing data set includes both times and data for external verification. Each line consists of comma-separated fields. The first line defines the order of the columns using the names below. Each subsequent line collects the following data in a comma-separated format:
- root
- the search root,
- k2time
- the time for Kernel 2,
- k2max
- the largest depth found in Kernel 2,
- k3time
- the time for Kernel 3, and
- k3max
- the longest path length found in Kernel 3.
Additional columns will be ignored but could include verification time or other information. If a kernel is not run, output -1 for the time and max data.
An example submission as formatted by Algorithm \ref{alg:output}’s high-level sample code:
    MACHINE: An old server
    COMMENT: Utterly unoptimized.
    IMPLEMENTATION: Pseudo-reference, unoptimized Octave
    SCALE: 13
    EDGEFACTOR: 16
    NROOT: 8
    PRNGCHECK: 2125733328
    K1TIME: 3.29949856e-02
    K2TEPSMEAN: 1.93051023e+05
    K2TEPSSTDDEV: 3.20682627e+02
    K3TEPSMEAN: 1.29841800e+04
    K3TEPSSTDDEV: 4.75862957e+00
    root,k2time,k2max,k2vtime,k3time,k3max,k3vtime
    1035,6.83776140e-01,6,3.69050503e-02,1.01047730e+01,509,4.41679955e-02
    2009,6.76754951e-01,7,3.07211876e-02,1.00874569e+01,617,4.44149971e-02
    3098,6.76012993e-01,6,3.07238102e-02,1.00858800e+01,513,4.43210602e-02
    3123,6.80670023e-01,7,3.09062004e-02,1.00944400e+01,607,4.44040298e-02
    6102,6.77286148e-01,6,3.05948257e-02,1.00998359e+01,503,4.43980694e-02
    6136,6.77664995e-01,7,3.06560993e-02,1.00921929e+01,515,4.43840027e-02
    7263,6.82636023e-01,7,3.10678482e-02,1.01108069e+01,518,4.29000854e-02
    8013,6.76799059e-01,6,3.07610035e-02,1.00825830e+01,540,4.38771248e-02
In approximate order of importance, the goals of this benchmark are to promote the following:
- fair adherence to the intent of the benchmark specification
- minimum execution time for a given problem size, and
- maximum problem size for a given machine.
The Graph500 ranking is defined by the performance metric TEPS defined below. Ties with respect to TEPS are broken in favor of the larger problem.
There are many other possible metrics and ranking options available. Other possible rankings include considering size first and various combined metrics to balance both size and performance. The Graph500 ranking is based on TEPS because current platforms require large data sizes to achieve high TEPS.
Graph500 encourages submitting results for varying sizes and not just the highest-performing entry. These submissions will be made available for analysis.
In order to compare the performance of Graph 500 “Search” implementations across a variety of architectures, programming models, and productivity languages and frameworks, we adopt the performance metric described in this section. In the spirit of well-known computing rates such as floating-point operations per second (FLOPS) as measured by the LINPACK benchmark and global updates per second (GUPS) as measured by the HPCC RandomAccess benchmark, we define a rate called traversed edges per second (TEPS). We measure TEPS through the benchmarking of Kernels 2 and 3 as follows. Let time_k(n) be the measured execution time of Kernel 2 or Kernel 3. We define the normalized performance rate (number of edge traversals per second) as:
TEPS(n) = NE / time_k(n) .
The generator in this specification produces a fully connected, undirected graph, so the results of every kernel depend on the entire graph with NE = edgefactor * 2^SCALE edges. Using NE rather than counting individual traversals is analogous to defining the FLOPS of matrix multiplication as 2 * n^3 or LU decomposition as 4/3 * n^3 rather than counting the actual operations performed in optimized kernels.
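As a worked example, at SCALE = 26 with edgefactor = 16 the graph has NE = 16 * 2^26 = 2^30 ≈ 1.07e9 edge list entries, so a single search that completes in 0.5 seconds is reported as roughly 2.15e9 TEPS, regardless of how many edges the particular algorithm actually touched.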
A high-level sample driver for the above routines is given in Algorithm \ref{alg:driver}.