Update training_rules.adoc #448
base: master
@@ -10,7 +10,7 @@ March 25, 2021
 == Overview
 This document describes how to implement the MLPerf Training Suite using an ML framework and how to use that implementation to measure the performance of an ML software framework or hardware.
 
-There are seperate rules for the submission, review, and publication process for all MLPerf benchmarks https://github.com/mlperf/policies/blob/master/submission_rules.adoc[here].
+There are separate rules for the submission, review, and publication process for all MLPerf benchmarks https://github.com/mlperf/policies/blob/master/submission_rules.adoc[here].
 
 The MLPerf name and logo are trademarks. In order to refer to a result using the MLPerf name, the result must conform to the letter and spirit of the rules specified in this document. The MLPerf organization reserves the right to solely determine if a use of its name or logo is acceptable.
@@ -215,6 +215,7 @@ OPEN: If applicable, the test dataset must be extracted in the same manner as th
 CLOSED: the training and test data must be traversed in the same conceptual order as the reference implementation. For instance, the data might be traversed sequentially or randomly with uniform distribution. Batch size, shard size, and the random number generator will affect order.
 
 Where data pipelines randomly order data, arbitrary sharding, batching, and packing are allowed provided that (1) the data is still overall randomly ordered and not ordered to improve convergence and (2) each datum still appears exactly once.
+(Un)padding or (un)packing are both allowed as offline or online preprocessing steps, including removal or addition of zero tokens. When packing, It is permitted to reorder and compress the dataset. However, the overall data traversal order, taking into account any packing, must still be as a random as the reference application. For instance: It is allowed to (a) pack items into groups offline then to randomly reorder the groups each run or to (b) randomly order the items then pack them into groups as traversed online provided that in both cases the groups are much smaller than the overall dataset. It is not allowed to sort for packing and use the same sorted order for every run.
Review comment: Can we change the wording from "be as a random as the reference" to "be at least as random as the reference"? There are bugs in the BERT reference where it does not fully randomize when run on a small number of accelerators (I think the crossover point is 32 accelerators).

Review comment: When using packing, the number of samples per batch becomes variable, and the batch size impacts (a) which RCP is used, (b) the LR schedule, and (c) the eval schedule. With the packing algorithm proposed by GraphCore (and used by NVIDIA and NVIDIA's partners since 2021), it was empirically measured that ~2.0x as many samples are processed per batch, so the committee agreed that for GraphCore's packing algorithm the code would report the batch size as 2x larger when using the packed dataset.

Review comment: GraphCore's algorithm uses "Non-negative Least Squares Histogram-Packing", which is described in a PowerPoint slide that was shared with the committee in 2021. I don't think that slide ever got uploaded to the Google Drive, so I've forwarded a copy of it to Shriya. There may also have been a simpler greedy algorithm evaluated at the same time that achieved similar packing ratios, but I can't find any documentation about that.
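For illustration, the "simpler greedy algorithm" mentioned in the comment above might look like the sketch below. This is a hypothetical first-fit packer, not GraphCore's actual NNLS histogram-packing code (which was only circulated as a slide); the function name and parameters are made up for this example. Note that it sorts sequences by length, which is exactly why the proposed rule forbids reusing the same sorted group order every run — packed groups must still be reshuffled per run.

```python
import random

def greedy_pack(lengths, max_len):
    """Hypothetical first-fit-decreasing packer: group sequence lengths so
    that each packed group's total length stays within max_len."""
    groups = []  # each group is a list of sequence lengths
    for length in sorted(lengths, reverse=True):  # longest sequences first
        for group in groups:
            if sum(group) + length <= max_len:
                group.append(length)  # fits into an existing group
                break
        else:
            groups.append([length])  # start a new group
    return groups

if __name__ == "__main__":
    random.seed(0)
    lengths = [random.randint(16, 512) for _ in range(1000)]
    groups = greedy_pack(lengths, max_len=512)
    # The ratio len(lengths) / len(groups) is the packing ratio: how many
    # samples land in each batch slot versus one padded sequence per slot.
    # This is the quantity the committee measured at ~2.0x for BERT data,
    # motivating the 2x batch-size reporting rule discussed above.
    print(f"{len(lengths)} sequences packed into {len(groups)} groups")
    print(f"packing ratio ~ {len(lengths) / len(groups):.2f}x")
```

A real submission would of course also carry the token data alongside the lengths and would have to reshuffle the resulting groups with a fresh seed each run.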
 
 For DLRM the submissions are allowed to use a preshuffled dataset and are not obligated to shuffle the data once more during training. However, the reference implementation uses both preshuffled data and an approximate "batch shuffle" performed on-the-fly. Reference runs should also use a different seed in each run, so that the order of the training batches in each reference run is different. Even though the submissions are allowed to not shuffle the data on-the-fly, they are obligated to match the convergence behavior of the reference which does perform on-the-fly "batch-shuffle". Using a preshuffled dataset with a hand-crafted, advantageous data ordering is disallowed.
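The traversal pattern (b) allowed by the proposed rule — randomly order the items each run, then pack them into groups online as they are traversed — can be sketched as follows. This is an illustrative toy, not reference code; the function name and the fixed `group_size` are assumptions for the example.

```python
import random

def traverse_packed(items, group_size, seed):
    """Hypothetical traversal: shuffle items with a per-run seed, then pack
    consecutive items into small groups online as they are visited."""
    order = list(items)
    random.Random(seed).shuffle(order)  # fresh random order each run
    # Groups are much smaller than the dataset, so the overall traversal
    # order stays random, and every datum appears exactly once.
    return [order[i:i + group_size] for i in range(0, len(order), group_size)]

if __name__ == "__main__":
    data = list(range(12))
    # A different seed per run yields a different group order each run,
    # mirroring the requirement that reference runs vary their seed.
    print(traverse_packed(data, group_size=3, seed=1))
    print(traverse_packed(data, group_size=3, seed=2))
```

The disallowed variant would be to sort `items` once for packing and replay that identical order every run.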
Review comment: @parmitam can we state explicitly that this only applies to BERT, because this rule does not apply to any other benchmark?
Review comment: I agree. This section should say only that padding/un-padding is allowed, but that packing should be done if and only if it is done by the reference, and the packing algorithm should be the one the reference uses. This is an exception that was added for the bert benchmark because GraphCore needed it at the last minute, and unfortunately the packing code was never put into the reference. This paragraph should be moved to Section 14, "Appendix: Benchmark Specific Rules".