MLPerf Inference Rules

Table of Contents

1. Overview
- 1.1. Definitions (read this section carefully)
2. General rules
3. Scenarios
4. Benchmarks
- 4.1. Benchmarks
  - 4.1.1. Constraints for the Closed division
  - 4.1.2. Relaxed constraints for the Open division
5. Load Generator
- 5.1. LoadGen Operation
6. Divisions
- 6.1. Closed Division
- 6.2. Open Division
7. Data Sets
- 7.1. Pre- and post-processing
- 7.2. Test Data Traversal Order
8. Model
- 8.1. Weight Definition and Quantization
- 8.2. Model Equivalence
9. FAQ
10. Appendix Number of Queries

Version 0.7 Updated July 17, 2019 This version has been updated, but is not yet final.

Points of contact: David Kanter (david@mlcommons.org), Vijay Janapa Reddi (vjreddi@g.harvard.edu)

1. Overview

This document describes how to implement one or more benchmarks in the MLPerf Inference Suite and how to use those implementations to measure the performance of an ML system performing inference.

There are seperate rules for the submission, review, and publication process for all MLPerf benchmarks here.

The MLPerf name and logo are trademarks. In order to refer to a result using the MLPerf name, the result must conform to the letter and spirit of the rules specified in this document. The MLPerf organization reserves the right to solely determine if a use of its name or logo is acceptable.

1.1. Definitions (read this section carefully)

The following definitions are used throughout this document:

A sample is the unit on which inference is run. E.g., an image, or a sentence.

A query is a set of N samples that are issued to an inference system together. N is a positive integer. For example, a single query contains 8 images.

Quality always refers to a model’s ability to produce “correct” outputs.

A system under test consists of a defined set of hardware and software resources that will be measured for performance. The hardware resources may include processors, accelerators, memories, disks, and interconnect. The software resources may include an operating system, compilers, libraries, and drivers that significantly influences the running time of a benchmark.

A reference implementation is a specific implementation of a benchmark provided by the MLPerf organization. The reference implementation is the canonical implementation of a benchmark. All valid submissions of a benchmark must be equivalent to the reference implementation.

A run is a complete execution of a benchmark implementation on a system under the control of the load generator that consists of completing a set of inference queries, including data pre- and post-processing, meeting a latency requirement and a quality requirement in accordance with a scenario.

A run result consists of the scenario-specific metric.

2. General rules

The following rules apply to all benchmark implementations.

2.1. Strive to be fair

Benchmarking should be conducted to measure the framework and system performance as fairly as possible. Ethics and reputation matter.

2.2. System and framework must be consistent

The same system and framework must be used for a suite result or set of benchmark results reported in a single context.

2.3. Benchmark implementations must be shared

Source code used for the benchmark implementations must be open-sourced under a license that permits a commercial entity to freely use the implementation for benchmarking. The code must be available as long as the results are actively used.

2.4. Non-determinism is restricted

The only forms of acceptable non-determinism are:

Floating point operation order
Random traversal of the inputs
Rounding

All random numbers must be based on fixed random seeds and a deterministic random number generator. The deterministic random number generator is the Mersenne Twister 19937 generator ([std::mt19937](http://www.cplusplus.com/reference/random/mt19937/)). The random seeds will be announced two weeks before the benchmark submission deadline.

2.5. Benchmark detection is not allowed

The framework and system should not detect and behave differently for benchmarks.

2.6. Input-based optimization is not allowed

The implementation should not encode any information about the content of the input dataset in any form.

2.7. Replicability is mandatory

Results that cannot be replicated are not valid results.

2.8. Audit Process

In each round, two submissions will be audited: one selected by the review committee, and one at random from all submissions. A "submission" for audit purposes shall denote a combination of a submitter and a platform (equivalent to a line in the results table). Only Available submissions in Closed division are auditable.

The process of random selection is in two stages: first a submitter is randomly chosen from all submitters with auditable submissions, then one of those submissions is randomly chosen.

An auditor shall be chosen by the review committee who has no conflict of interest with the submitter.

The burden is on the submitter to provide sufficient materials to demonstrate that the submission is compliant with the rules. Any such materials, including software, documentation, testing results and machine access will be provided to the auditor under NDA.

The submitter shall provide two days of hardware access, at a time mutually agreed with the auditor. The first day will be used to run a pre-agreed list of tests, and to verify other system parameters if needed. The second day will allow the auditor to run additional tests based on outcome of the first day.

The auditor shall write a report describing the work that was performed, a list of unresolved issues, and a recommendation on whether the submission is compliant.

The submitter will provide the auditor an NDA within seven days of the auditor’s selection. The auditor and submitter will negotiate and execute the NDA within 14 days of the auditor’s selection.

The auditor will submit their report to the submitter no more than thirty days after executing all relevant NDAs. The submitter will make any necessary redactions due to NDAs and forward the finalized report to the review committee within seven days. The auditor will confirm the accuracy of the forwarded report.

Submissions that fail the audit at a material level will be moved to open or removed, by review committee decision.

3. Scenarios

In order to enable representative testing of a wide variety of inference platforms and use cases, MLPerf has defined four different scenarios as described in the table below.

Scenario	Query Generation	Duration	Samples/query	Latency Constraint	Tail Latency	Performance Metric
Single stream	LoadGen sends next query as soon as SUT completes the previous query	1024 queries and 600 seconds	1	None	90%	90%-ile measured latency
Server	LoadGen sends new queries to the SUT according to a Poisson distribution	270,336 queries and 600 seconds	1	Benchmark specific	99%	Maximum Poisson throughput parameter supported
Offline	LoadGen sends all samples to the SUT at start in a single query	1 query and 600 seconds	At least 24,576	None	N/A	Measured throughput

The number of queries is selected to ensure sufficient statistical confidence in the reported metric. Specifically, the top line in the following table. Lower lines are being evaluated for future versions of MLPerf Inference (e.g., 95% tail latency for v0.6 and 99% tail latency for v0.7).

Tail Latency Percentile	Confidence Interval	Margin-of-Error	Inferences	Rounded Inferences
90%	99%	0.50%	23,886	3*2^13 = 24,576
95%	99%	0.25%	50,425	7*2^13 = 57,344
97%	99%	0.15%	85,811	11*2^13 = 90,112
99%	99%	0.05%	262,742	33*2^13 = 270,336

A submission may comprise any combination of benchmark and scenario results.

The number of runs required for each scenario is defined below:

Single Stream: 1
Server: 1
Offline: 1

Each sample has the following definition:

Model	definition of one sample
Resnet50-v1.5	one image
SSD-ResNet34	one image
SSD-MobileNet-v1	one image
3D UNET	one image
RNNT	one raw speech sample up to 15 seconds
BERT	one sequence
DLRM	up to 700 user-item pairs (more details in FAQ)

4. Benchmarks

The MLPerf organization provides a reference implementation of each benchmark, which includes the following elements: Code that implements the model in a framework. A plain text “README.md” file that describes:

Problem
- Dataset/Environment
- Publication/Attribution
- Data pre- and post-processing
- Performance, accuracy, and calibration data sets
- Test data traversal order (CHECK)
Model
- Publication/Attribution
- List of layers
- Weights and biases
Quality and latency
- Quality target
- Latency target(s)
Directions
- Steps to configure machine
- Steps to download and verify data
- Steps to run and time

A “download_dataset” script that downloads the accuracy, speed, and calibration datasets.

A “verify_dataset” script that verifies the dataset against the checksum.

A “run_and_time” script that executes the benchmark and reports the wall-clock time.

4.1. Benchmarks

4.1.1. Constraints for the Closed division

There are two benchmark suites, one for Datacenter systems and one for Edge (defined herein as non-datacenter) systems. A Datacenter submission must use ECC in their DRAM and HBM memories, and ECC must be enabled for all performance and accuracy runs. No requirements are imposed on SRAM. The suites share multiple benchmarks, but characterize them with different requirements. Read the specifications carefully.

The Datacenter suite includes the following benchmarks:

Area	Task	Model	Dataset	QSL Size	Quality	Server latency constraint
Vision	Image classification	Resnet50-v1.5	ImageNet (224x224)	1024	99% of FP32 (76.46%)	15 ms
Vision	Object detection (large)	SSD-ResNet34	COCO (1200x1200)	64	99% of FP32 (0.20 mAP)	100 ms
Vision	Medical image segmentation	3D UNET	BraTS 2019 (224x224x160)	16	99% of FP32 and 99.9% of FP32 (0.85300 mean DICE score)	N/A
Speech	Speech-to-text	RNNT	Librispeech dev-clean (samples < 15 seconds)	2513	99% of FP32 (1 - WER, where WER=7.452253714852645%)	1000 ms
Language	Language processing	BERT	SQuAD v1.1 (max_seq_len=384)	10833	99% of FP32 and 99.9% of FP32 (f1_score=90.874%)	130 ms
Commerce	Recommendation	DLRM	1TB Click Logs	204800	99% of FP32 and 99.9% of FP32 (AUC=80.25%)	30 ms

Each Datacenter benchmark requires the following scenarios:

Area	Task	Required Scenarios
Vision	Image classification	Server, Offline
Vision	Object detection (large)	Server, Offline
Vision	Medical image segmentation	Offline
Speech	Speech-to-text	Server, Offline
Language	Language processing	Server, Offline
Commerce	Recommendation	Server, Offline

The Edge suite includes the following benchmarks:

Area	Task	Model	Dataset	QSL Size	Quality
Vision	Image classification	Resnet50-v1.5	ImageNet (224x224)	1024	99% of FP32 (76.46%)
Vision	Object detection (large)	SSD-ResNet34	COCO (1200x1200)	64	99% of FP32 (0.20 mAP)
Vision	Object detection (small)	SSD-MobileNets-v1	COCO (300x300)	256	99% of FP32 (0.22 mAP)
Vision	Medical image segmentation	3D UNET	BraTS 2019 (224x224x160)	16	99% of FP32 and 99.9% of FP32 (0.85300 mean DICE score)
Speech	Speech-to-text	RNNT	Librispeech dev-clean (samples < 15 seconds)	2513	99% of FP32 (1 - WER, where WER=7.452253714852645%)
Language	Language processing	BERT	SQuAD v1.1 (max_seq_len=384)	10833	99% of FP32 (f1_score=90.874%)

Each Edge benchmark requires the following scenarios, and sometimes permit an optional scenario:

Area	Task	Required Scenarios
Vision	Image classification	Single Stream, Offline
Vision	Object detection (large)	Single Stream, Offline
Vision	Object detection (small)	Single Stream, Offline
Vision	Medical image segmentation	Single Stream, Offline
Speech	Speech-to-text	Single Stream, Offline
Language	Language processing	Single Stream, Offline

For Edge submitters' convenience, they are allowed to obtain only a Single Stream result and have Offline results inferred according to the following rules:

The corresponding Offline result is 1000 divided by the Mean latency in milliseconds. For example, a Single Stream mean latency of 20 milliseconds would imply an Offline result of 50 samples per second. The inferred Offline accuracy will be the same as the Single Stream accuracy.

To simplify automated processing of inferred results, the submitter should create copies of the correponding Single Stream directories under results/ and measurements/, calling them offline for the Offline scenario.

Accuracy results must be reported to five significant figures with round to even. For example, 98.9995% should be recorded as 99.000%.

For performance runs, the LoadGen will select queries uniformly at random (with replacement) from a test set. The minimum size of the performance test set for each benchmark is listed as 'QSL Size' in the table above. However, the accuracy test must be run with one copy of the MLPerf specified validation dataset.

For 3DUNet, the logical destination for the benchmark output is considered to be the network.

4.1.2. Relaxed constraints for the Open division

An Open benchmark must perform a task matching an existing Closed benchmark, and be substitutable in LoadGen for that benchmark.
The accuracy dataset must be the same as an existing Closed benchmark.
Accuracy constraints are not applicable: instead the submission must report the accuracy obtained.
Latency constraints are not applicable: instead the submission must report the latency constraints under which the reported performance was obtained.
The minimum number of queries should be set using the formula in Appendix Number of Queries.
Scenario constraints are not applicable: any combination of scenarios is permitted.
A open submission must be classified as "Available", "Preview", or "Research, Development, or Internal".
The model can be of any origin (trained on any dataset, quantized in any way, and sparsified in anyway).

5. Load Generator

5.1. LoadGen Operation

The LoadGen is provided in C++ with Python bindings and must be used by all submissions. The LoadGen is responsible for:

Generating the queries according to one of the scenarios.
Tracking the latency of queries.
Validating the accuracy of the results.
Computing final metrics.

Latency is defined as the time from when the LoadGen was scheduled to pass a query to the SUT, to the time it receives a reply.

Single-stream: LoadGen measures average latency using a single test run. For the test run, LoadGen sends an initial query then continually sends the next query as soon as the previous query is processed.
Server: LoadGen determines the system throughput using multiple test runs. Each test run evaluates a specific throughput value in queries-per-second (QPS). For a specific throughput value, queries are generated at that QPS using a Poisson distribution. LoadGen will use a binary search to find a candidate value. If a run fails, it will reduce the value by a small delta then try again.
Offline: LoadGen measures throughput using a single test run. For the test run, LoadGen sends all samples at once in a single query.

The run procedure is as follows:

LoadGen signals system under test (SUT).
SUT starts up and signals readiness.
LoadGen starts clock and begins generating queries.
LoadGen stops generating queries as soon as the benchmark-specific minimum number of queries have been generated and the benchmark specific minimum time has elapsed.
LoadGen waits for all queries to complete, and errors if all queries fail to complete.
LoadGen computes metrics for the run.

The execution of LoadGen is restricted as follows:

LoadGen must run on the processor that most faithfully simulates queries arriving from the most logical source, which is usually the network or an I/O device such as a camera. For example, if the most logical source is the network and the system is characterized as host - accelerator, then LoadGen should run on the host unless the accelerator incorporates a NIC.
The trace generated by LoadGen must be stored in the DRAM that most faithfully simulates queries arriving from the most logical source, which is usually the network or an I/O device such as a camera. It may be pinned. Similarly, the response provided to Loadgen must be stored in the DRAM that most faithfully simulates transfer to the most logical destination, which is a CPU process unless otherwise specified for the benchmark.
```
Submitters seeking to use anything other than the DRAM attached to the processor on which loadgen is running must
seek prior approval, and must provide with their submission sufficient details system architecture and software to
show how the I/O bandwidth utilized by each benchmark/scenario combination can be transferred between that memory and
the network or I/O device.
```
Caching values derived from the shapes of input tensors is allowed. Caching of any other queries, query parameters, or intermediate results is prohibited. In particular, caching values derived from activations is prohibited.
The LoadGen must be compiled from a tagged approved revision of the mlperf/inference GitHub repository without alteration. Pull requests addressing portability issues and adding new functionality are welcome.

LoadGen generates queries based on trace. The trace is constructed by uniformly sampling (with replacement) from a library based on a fixed random seed and deterministic generator. The size of the library is listed in as 'QSL Size' in the 'Benchmarks' table above. The trace is usually pre-generated, but may optionally be incrementally generated if it does not fit in memory. LoadGen validates accuracy via a separate test run that use each sample in the test library exactly once but is otherwise identical to the above normal metric run.

One LoadGen validation run is required for each submitted performance result even if two or more performance results share the same source code.

Note: The same code must be run for both the accuracy and performance LoadGen modes. This means the same output should be passed in QuerySampleComplete in both modes.

6. Divisions

There are two divisions of the benchmark suite, the Closed division and the Open division.

6.1. Closed Division

The Closed division requires using pre-processing, post-processing, and model that is equivalent to the reference or alternative implementation. The closed division allows calibration for quantization and does not allow any retraining.

The unqualified name “MLPerf” must be used when referring to a Closed Division suite result, e.g. “a MLPerf result of 4.5.”

6.2. Open Division

The Open division allows using arbitrary pre- or post-processing and model, including retraining. The qualified name “MLPerf Open” must be used when referring to an Open Division suite result, e.g. “a MLPerf Open result of 7.2.”

In 0.7 Restricted retraining rules characterize a subset of Open division retraining possibilities that are expected to be straightforward for customers to use. The restrictions are optional; conformance will be indicated by a tag on the submission.

7. Data Sets

For each benchmark, MLPerf will provide pointers to:

An accuracy data set, to be used to determine whether a submission meets the quality target, and used as a validation set
A speed/performance data set that is a subset of the accuracy data set to be used to measure performance

For each benchmark, MLPerf will provide pointers to:

A calibration data set, to be used for quantization (see quantization section), that is a small subset of the training data set used to generate the weights

Each reference implementation shall include a script to verify the datasets using a checksum. The dataset must be unchanged at the start of each run.

7.1. Pre- and post-processing

As input, before preprocessing:

all imaging benchmarks take uncropped uncompressed bitmap
BERT takes text
RNN-T takes a waveform
DLRM takes a variable sized set of items, each a sequence of embedding indices

Sample-independent pre-processing that matches the reference model is untimed. However, it must be pre-approved and added to the following list:

May resize to processed size (e.g. SSD-large)
May reorder channels / do arbitrary transpositions
May pad to arbitrary size (don’t be creative)
May do a single, consistent crop
Mean subtraction and normalization provided reference model expect those to be done
May convert data among numerical formats

Any other pre- and post-processing time is included in the wall-clock time for a run result.

7.2. Test Data Traversal Order

Test data is determined by the LoadGen. For scenarios where processing multiple samples can occur (i.e., and offline), any ordering is allowed subject to latency requirements.

8. Model

CLOSED: MLPerf provides a reference implementation of each benchmark. The benchmark implementation must use a model that is equivalent, as defined in these rules, to the model used in the reference implementation.

OPEN: The benchmark implementation may use a different model to perform the same task. Retraining is allowed.

8.1. Weight Definition and Quantization

CLOSED: MLPerf will provide trained weights and biases in fp32 format for both the reference and alternative implementations.

MLPerf will provide a calibration data set for all models except GNMT. Submitters may do arbitrary purely mathematical, reproducible quantization using only the calibration data and weight and bias tensors from the benchmark owner provided model to any numerical format that achieves the desired quality. The quantization method must be publicly described at a level where it could be reproduced.

To be considered principled, the description of the quantization method must be much much smaller than the non-zero weights it produces.

Calibration is allowed and must only use the calibration data set provided by the benchmark owner. Submitters may choose to use only a subset of the calibration data set.

Additionally, MLPerf may provide an INT8 reference for some models. Model weights and input activations are scaled per tensor, and must preserve the same shape modulo padding. Convolution layers are allowed to be in either NCHW or NHWC format. No other retraining is allowed.

OPEN: Weights and biases must be initialized to the same values for each run, any quantization scheme is allowed that achieves the desired quality.

8.2. Model Equivalence

All implementations are allowed as long as the latency and accuracy bounds are met and the reference weights are used. Reference weights may be modified according to the quantization rules.

Examples of allowed techniques include, but are not limited to:

Arbitrary frameworks and runtimes: TensorFlow, TensorFlow-lite, ONNX, PyTorch, etc, provided they conform to the rest of the rules
Running any given control flow or operations on or off an accelerator
Arbitrary data arrangement
Different in-memory representations of inputs, weights, activations, and outputs
Variation in matrix-multiplication or convolution algorithm provided the algorithm produces asymptotically accurate results when evaluated with asymptotic precision
Mathematically equivalent transformations (e.g. Tanh versus Logistic, ReluX versus ReluY, any linear transformation of an activation function)
Approximations (e.g. replacing a transcendental function with a polynomial)
Processing queries out-of-order within discretion provided by scenario
Replacing dense operations with mathematically equivalent sparse operations
Hand picking different numerical precisions for different operations
Fusing or unfusing operations
Dynamically switching between one or more batch sizes
Different implementations based on scenario (e.g., single stream vs. offline) or dynamically determined batch size or input size
Mixture of experts combining differently quantized weights
Stochastic quantization algorithms with seeds for reproducibility
Reducing ImageNet classifiers with 1001 classes to 1000 classes
Dead code elimination
Sorting samples in a query when it improves performance even when all samples are distinct
Incorporating explicit statistical information about the calibration set (eg. min, max, mean, distribution)
Empirical performance and accuracy tuning based on the performance and accuracy set (eg. selecting batch sizes or numerics experimentally)
Sorting an embedding table based on frequency of access in the training set. (Submtters should include in their submission details of how the ordering was derived.)

The following techniques are disallowed:

Wholesale weight replacement or supplements
Discarding non-zero weight elements, including pruning
Caching queries or responses
Coalescing identical queries
Modifying weights during the timed portion of an inference run (no online learning or related techniques)
Weight quantization algorithms that are similar in size to the non-zero weights they produce
Hard coding the total number of queries
Techniques that boost performance for fixed length experiments but are inapplicable to long-running services except in the offline scenario
Using knowledge of the LoadGen implementation to predict upcoming lulls or spikes in the server scenario
Treating beams in a beam search differently. For example, employing different precision for different beams
Changing the number of beams per beam search relative to the reference
Incorporating explicit statistical information about the performance or accuracy sets (eg. min, max, mean, distribution)
Techniques that take advantage of upsampled images. For example, downsampling inputs and kernels for the first convolution.
Techniques that only improve performance when there are identical samples in a query. For example, sorting samples in SSD.

9. FAQ

Q: Do I have to use the reference implementation framework?

A: No, you can use another framework provided that it matches the reference in the required areas.

Q: Do I have to use the reference implementation scripts?

A: No, you don’t have to use the reference scripts. The reference is there to settle conformance questions - with a few exceptions, a submission to the closed division must match what the reference is doing.

Q: Can I submit a single benchmark (e.g., object detection) in a suite (e.g., data center), or do I have to submit all benchmarks?

A: You can submit any of the benchmarks that are interesting, from just one benchmark to the entire set of benchmarks. Keep in mind that submitting one benchmark typically requires running several scenarios as described in Section 4. For example, submitting object detection in the data center suite requires the server and offline scenario and submitting object detection in the edge suite requires the single stream and offline scenarios.

Q: Why does a run require so many individual inference queries?

A: The numbers were selected to be sufficiently large to statistically verify that the system meets the latency requirements.

Q: For my submission, I am going to use a different model format (e.g., ONNX vs TensorFlow Lite). Should the conversion routine/script be included in the submission? Or is it sufficient to submit the converted model?

A: The goal is reproducibility, so you should include the conversion routine/scripts.

Q: Is it permissible to exceed both the minimum number of queries and minimum time duration in a valid test run?

A: Yes.

Q: Can we give the driver a hint to preload the image data to somewhere closer to the accelerator?

A: No.

Q: Can we preload image data somewhere closer to the accelerator that is mapped into host memory?

A: No.

Q: Can we preload image data in host memory somewhere that is mapped into accelerator memory?

A: Yes, provided the image data isn’t eventually cached on the device.

Q: For the server scenario, there are 'Scheduled samples per second', 'Completed samples per second', and the user input target QPS. Which one is reported as the final metric?

A: Scheduled samples per second

Q: What can I cache based on the query indices?

A: Query indices are an artifact of using a finite set of samples to represent an infinite set, and would have no counterpart in production scenarios. As such, the system under test should not cache any information associated with query indices.

9.1. SSD

Q: Is non-maximal suppression (NMS) timed?

A: Yes. NMS is a per image operation. NMS is used to make sure that in object detection, a particular object is identified only once. Production systems need NMS to ensure high-quality inference.

Q: Is COCO eval timed?

A: No. COCO eval compares the proposed boxes and classes in all the images against ground truth in COCO dataset. COCO eval is not possible in production.

9.2. Softmax

Q: In classification and segmentation models (ResNet50, 3DUNet) the final softmax does not change the order of class probabilities. Can it be omitted?

A: Yes.

9.3. DLRM

Q: For DLRM, what’s the distribution of user-item pairs per sample for all scenarios?

A: For all scenarios, the distribution of user-item pairs per sample is specified by dist_quantile.txt. To verify that your sample aggregation trace matches the reference, please follow the steps in dist_trace_verification.txt. Or simply download the reference dlrm_trace_of_aggregated_samples.txt from Zenodo (MD5:3db90209564316f2506c99cc994ad0b2).

Q: What is dist_trace_verification.txt?

The benchmark provides a pre-defined quantile distribution in ./tools/dist_quantile.txt from which the samples will be drawn using the inverse transform algorithm. This algorithm relies on randomly drawn numbers from the interval [0,1) and that depend on the --numpy-rand-seed, which specific value will be provided shortly before MLPerf inference submissions.

Q: What is the rational for the distribution of user-item pairs?

In the case of DLRM we have agreed that we should use multiple samples drawn from a distribution, similar to the one shown on Fig. 5: "Queries for personalized recommendation models" in the DeepRecSys paper.

Q: Generating dlrm_trace_of_aggregated_samples.txt uses a pseudo-random number generator. How can submitters verify their system pseudo-random number generator is compatible?

Submitters can verify their compatibility by using the default --numpy-rand-seed and comparing the trace generated on their system with ./tools/dist_trace_verification.txt using the following command

./run_local.sh pytorch dlrm terabyte cpu --count-samples=100 --scenario Offline --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=128

Q: I understand that --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt is the only compliant setting for MLPerf, but what are the alternative settings and what do they do?

The DLRM MLPerf inference code has an option to aggregate multiple consecutive samples together into a single aggregated sample. The number of samples to be aggregated can be selected using either of the following options

fixed [--samples-to-aggregate-fix]
drawn uniformly from interval [--samples-to-aggregate-min, --samples-to-aggregate-max]
drawn from a custom distribution, with its quantile (inverse of CDP) specified in --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt.

9.4. Audit

Q: What characteristics of my submission will make it more likely to be audited?

A: A submission is more likely to be audited if:

the submission’s performance is not consistent with the known or expected characteristics of the hardware
the review committee lacks insight into how the measured performance was achieved
the hardware and software is not reasonably available to the general public

Q: What should I be expected to provide for audit?

A: You should expect to provide the following:

An explanation of the hardware and software mechanisms required to achieve the measured performance
Hardware access to enable the auditor to replicate submission runs (or partial runs in the case of very long-running submission)
Hardware access to enable performance tests through the APIs used in the submission, to verify that performance-critical elements perform as claimed

The auditor may also request source code access to binary elements of the submission software. Where information or access is not provided, the auditor’s report will list the issues that could not be resolved.

Q: Is it expected that an audit will be concluded during the review period? A: No. We should try to finish the audit before the publication date.

10. Appendix Number of Queries

In order to be statistically valid, a certain number of queries are necessary to verify a given latency-bound performance result. How many queries are necessary? Every query either meets the latency bound or exceeds the latency bound. The math for determining the appropriate sample size for a latency bound throughput experiment is exactly the same as determining the appropriate sample size for an electoral poll given an infinite electorate. Three variables determine the sample size: the tail latency percentage, confidence, and margin. Confidence is the probability that a latency bound is within a particular margin of the reported result.

A 99% confidence bound was somewhat arbitrarily selected. For systems with noisy latencies, it is possible to obtain better MLPerf results by cherry picking the best runs. Approximately 1 in 100 runs will be marginally better. Please don’t do this. It is very naughty and will make the MLPerf community feel sad.

The margin should be set to a value much less than the difference between the tail latency percentage and one. Conceptually, the margin ought to be small compared to the distance between the tail latency percentile and 100%. A margin of 0.5% was selected. This margin is one twentieth of the difference between the tail latency percentage and one. In the future, when the tail latency percentage rises, the margin should fall by a proportional amount. The full equation is:

Margin = (1 - TailLatency) / 20

NumQueries = NormsInv((1 - Confidence) / 2)^2 * (TailLatency)(TailLatency - 1) / Margin^2

Concretely:

NumQueries = NormsInv((1 - 0.99) / 2)^2 * (0.9)(1 - 0.9) / 0.005^2 = NormsInv(0.005)^2 * 3600 = (-2.58)^2 * 3,600 = 23,886

To keep the numbers nice, the sample sizes are rounded up. Here is a table showing proposed sample sizes for subsequent rounds of MLPerf:

Tail Latency Percentile	Confidence Interval	Margin-of-Error	Inferences	Rounded Inferences
90%	99%	0.50%	23,886	3*2^13 = 24,576
95%	99%	0.25%	50,425	7*2^13 = 57,344
99%	99%	0.05%	262,742	33*2^13 = 270,336

These are mostly for the Server scenario which has tight bounds for tail latency. The other scenario may continue to use lower samples sizes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

inference_rules.adoc

inference_rules.adoc

MLPerf Inference Rules

1. Overview

1.1. Definitions (read this section carefully)

2. General rules

2.1. Strive to be fair

2.2. System and framework must be consistent

2.3. Benchmark implementations must be shared

2.4. Non-determinism is restricted

2.5. Benchmark detection is not allowed

2.6. Input-based optimization is not allowed

2.7. Replicability is mandatory

2.8. Audit Process

3. Scenarios

4. Benchmarks

4.1. Benchmarks

4.1.1. Constraints for the Closed division

4.1.2. Relaxed constraints for the Open division

5. Load Generator

5.1. LoadGen Operation

6. Divisions

6.1. Closed Division

6.2. Open Division

7. Data Sets

7.1. Pre- and post-processing

7.2. Test Data Traversal Order

8. Model

8.1. Weight Definition and Quantization

8.2. Model Equivalence

9. FAQ

9.1. SSD

9.2. Softmax

9.3. DLRM

9.4. Audit

10. Appendix Number of Queries

Files

inference_rules.adoc

Latest commit

History

inference_rules.adoc

File metadata and controls

MLPerf Inference Rules

1. Overview

1.1. Definitions (read this section carefully)

2. General rules

2.1. Strive to be fair

2.2. System and framework must be consistent

2.3. Benchmark implementations must be shared

2.4. Non-determinism is restricted

2.5. Benchmark detection is not allowed

2.6. Input-based optimization is not allowed

2.7. Replicability is mandatory

2.8. Audit Process

3. Scenarios

4. Benchmarks

4.1. Benchmarks

4.1.1. Constraints for the Closed division

4.1.2. Relaxed constraints for the Open division

5. Load Generator

5.1. LoadGen Operation

6. Divisions

6.1. Closed Division

6.2. Open Division

7. Data Sets

7.1. Pre- and post-processing

7.2. Test Data Traversal Order

8. Model

8.1. Weight Definition and Quantization

8.2. Model Equivalence

9. FAQ

9.1. SSD

9.2. Softmax

9.3. DLRM

9.4. Audit

10. Appendix Number of Queries