Update training_rules.adoc #448

Open · wants to merge 3 commits into base: master
14 changes: 14 additions & 0 deletions .idea/workspace.xml

Some generated files are not rendered by default.

3 changes: 2 additions & 1 deletion training_rules.adoc
@@ -10,7 +10,7 @@ March 25, 2021
== Overview
This document describes how to implement the MLPerf Training Suite using an ML framework and how to use that implementation to measure the performance of an ML software framework or hardware.

-There are seperate rules for the submission, review, and publication process for all MLPerf benchmarks https://github.com/mlperf/policies/blob/master/submission_rules.adoc[here].
+There are separate rules for the submission, review, and publication process for all MLPerf benchmarks https://github.com/mlperf/policies/blob/master/submission_rules.adoc[here].

The MLPerf name and logo are trademarks. In order to refer to a result using the MLPerf name, the result must conform to the letter and spirit of the rules specified in this document. The MLPerf organization reserves the right to solely determine if a use of its name or logo is acceptable.

@@ -215,6 +215,7 @@ OPEN: If applicable, the test dataset must be extracted in the same manner as th
CLOSED: the training and test data must be traversed in the same conceptual order as the reference implementation. For instance, the data might be traversed sequentially or randomly with uniform distribution. Batch size, shard size, and the random number generator will affect order.

Where data pipelines randomly order data, arbitrary sharding, batching, and packing are allowed provided that (1) the data is still overall randomly ordered and not ordered to improve convergence and (2) each datum still appears exactly once.
+(Un)padding or (un)packing are both allowed as offline or online preprocessing steps, including removal or addition of zero tokens. When packing, it is permitted to reorder and compress the dataset. However, the overall data traversal order, taking into account any packing, must still be as random as the reference implementation. For instance, it is allowed to (a) pack items into groups offline and then randomly reorder the groups each run, or (b) randomly order the items and then pack them into groups online as traversed, provided that in both cases the groups are much smaller than the overall dataset. It is not allowed to sort the data for packing and then use the same sorted order for every run.
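
To make the allowed and disallowed orderings concrete, here is a minimal sketch, assuming a toy dataset and a hypothetical greedy packer; none of these names come from the reference implementation:

[source,python]
----
import random

# Toy stand-in for a tokenized dataset: 1,000 "samples" of varying length.
rng0 = random.Random(0)
dataset = [list(range(rng0.randint(5, 200))) for _ in range(1000)]

def pack_greedily(samples, max_len):
    """Hypothetical packer: greedily combine consecutive samples into
    groups whose total token count stays <= max_len."""
    groups, current, current_len = [], [], 0
    for s in samples:
        if current and current_len + len(s) > max_len:
            groups.append(current)
            current, current_len = [], 0
        current.append(s)
        current_len += len(s)
    if current:
        groups.append(current)
    return groups

# (a) Allowed: pack once offline, then randomly reorder the groups each run.
offline_groups = pack_greedily(dataset, max_len=512)
def epoch_order_a(run_seed):
    order = list(offline_groups)
    random.Random(run_seed).shuffle(order)   # fresh random order every run
    return order

# (b) Allowed: randomly order the items each run, then pack as traversed.
def epoch_order_b(run_seed):
    items = list(dataset)
    random.Random(run_seed).shuffle(items)
    return pack_greedily(items, max_len=512)

# Not allowed: sorting the dataset for packing and then reusing that same
# sorted order in every run.
----
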
@ShriyaPalsamudram (Contributor) commented on Sep 19, 2024:

@parmitam can we state explicitly that this only applies to BERT, because this rule does not apply to any other benchmark?

Contributor:

I agree. This section should say only that padding/un-padding is allowed but that packing should be done if and only if it is done by the reference. And the packing algorithm should be the one the reference uses.

This is an exception that was added for the BERT benchmark since GraphCore needed it at the last minute, and unfortunately the packing code was never put into the reference. This paragraph should be moved to Section 14, "Appendix: Benchmark Specific Rules".

Contributor:

Can we change the wording from "be as random as the reference" to "be at least as random as the reference"? There are bugs in the BERT reference where it does not fully randomize when run on a small number of accelerators (I think the crossover point is 32 accelerators).

Contributor:

When using packing, the number of samples per batch becomes variable, and the batch size impacts (a) which RCP is used, (b) the LR schedule, and (c) the eval schedule. With the packing algorithm proposed by GraphCore (and used by NVIDIA and NVIDIA's partners since 2021), it was empirically measured that ~2.0x as many samples are processed per batch, so the committee agreed that, for GraphCore's packing algorithm, the code would report the batch size as 2x larger when using the packed dataset.
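
For concreteness, a hedged sketch of that batch-size accounting; the group representation carries over from the sketch above, and the fixed 2x factor is the committee's agreed value for GraphCore's packer, while everything else is illustrative:

[source,python]
----
def packing_ratio(groups):
    """Average number of original samples per packed sequence."""
    return sum(len(g) for g in groups) / len(groups)

# For BERT with GraphCore's packer the measured ratio is ~2.0, so a run that
# feeds `seqs_per_step` packed sequences per step reports its batch size as:
#   reported_batch_size = 2 * seqs_per_step
# which in turn determines the RCP, LR schedule, and eval schedule used.
----
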

Contributor:

GraphCore's algorithm uses "Non-negative Least Squares Histogram-Packing", which is described in a PowerPoint slide that was shared with the committee in 2021. I don't think that slide ever got uploaded to the Google Drive, so I've forwarded a copy of it to Shriya. There may have also been a simpler greedy algorithm evaluated at the same time that achieved similar packing ratios, but I can't find any documentation about that.
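
Since the NNLS formulation apparently exists only in that slide, the following is a sketch of the simpler greedy alternative mentioned here, written as first-fit-decreasing over sequence lengths; it is an assumption for illustration, not GraphCore's actual code:

[source,python]
----
def greedy_histogram_pack(lengths, max_len=512):
    """First-fit-decreasing: place each sequence length into the first open
    group with enough remaining capacity, opening a new group otherwise."""
    groups = []      # each group is a list of sequence lengths
    capacity = []    # remaining token budget of each open group
    for length in sorted(lengths, reverse=True):
        for i, cap in enumerate(capacity):
            if length <= cap:
                groups[i].append(length)
                capacity[i] -= length
                break
        else:
            groups.append([length])
            capacity.append(max_len - length)
    return groups

# Achieved packing ratio: len(lengths) / len(greedy_histogram_pack(lengths)).
# Per the rule above, the resulting groups must still be randomly reordered
# in every run; the sort here is only an offline packing step.
----
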


For DLRM, submissions are allowed to use a preshuffled dataset and are not obligated to shuffle the data once more during training. However, the reference implementation uses both preshuffled data and an approximate "batch shuffle" performed on-the-fly. Reference runs should also use a different seed in each run, so that the order of the training batches in each reference run is different. Even though submissions are allowed not to shuffle the data on-the-fly, they are obligated to match the convergence behavior of the reference, which does perform the on-the-fly "batch shuffle". Using a preshuffled dataset with a hand-crafted, advantageous data ordering is disallowed.
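
For illustration, a minimal sketch of an approximate on-the-fly "batch shuffle" of the kind described, assuming a windowed shuffle with a per-run seed; the function name and window size are hypothetical, not the reference's actual parameters:

[source,python]
----
import random

def batch_shuffle(batches, run_seed, window=8):
    """Approximately shuffle a preshuffled batch stream: buffer `window`
    batches at a time and emit them in random order. A different
    `run_seed` per run gives each run a different batch order."""
    rng = random.Random(run_seed)
    buf = []
    for batch in batches:
        buf.append(batch)
        if len(buf) == window:
            rng.shuffle(buf)
            yield from buf
            buf.clear()
    rng.shuffle(buf)   # flush any remaining tail batches
    yield from buf
----
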
