Update training_rules.adoc #448
base: master
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
@johntran-nv @petermattson Could you review this?
Closer! :-) IMO, the only change needed is to add this paragraph: (Un)padding or (un)packing are both allowed as offline or online preprocessing steps, including removal or addition of zero tokens. When packing, it is permitted to reorder and compress the dataset. However, the overall data traversal order, taking into account any packing, must still be as random as the reference application. For instance: it is allowed to (a) pack items into groups offline and then randomly reorder the groups each run, or (b) randomly order the items and then pack them into groups as traversed online, provided that in both cases the groups are much smaller than the overall dataset. It is not allowed to sort for packing and use the same sorted order for every run. I'd revert the changes and stick this on the end of the first CLOSED: para in the section. WDYT?
Way more elegant! I’ll send an update
This reverts commit 7eec2fd.
@petermattson done! Thanks
@@ -215,6 +215,7 @@ OPEN: If applicable, the test dataset must be extracted in the same manner as th
CLOSED: the training and test data must be traversed in the same conceptual order as the reference implementation. For instance, the data might be traversed sequentially or randomly with uniform distribution. Batch size, shard size, and the random number generator will affect order.

Where data pipelines randomly order data, arbitrary sharding, batching, and packing are allowed provided that (1) the data is still overall randomly ordered and not ordered to improve convergence and (2) each datum still appears exactly once.
(Un)padding or (un)packing are both allowed as offline or online preprocessing steps, including removal or addition of zero tokens. When packing, it is permitted to reorder and compress the dataset. However, the overall data traversal order, taking into account any packing, must still be as random as the reference application. For instance: it is allowed to (a) pack items into groups offline and then randomly reorder the groups each run, or (b) randomly order the items and then pack them into groups as traversed online, provided that in both cases the groups are much smaller than the overall dataset. It is not allowed to sort for packing and use the same sorted order for every run.
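As a hedged illustration of option (b) above (shuffle first, then pack as traversed online), here is a minimal Python sketch. The names `dataset`, `max_tokens`, and `shuffled_packed_batches` are hypothetical, and this is not the reference implementation's pipeline; it only shows why the traversal stays as random as the shuffle and why each datum appears exactly once.

```python
import random

def shuffled_packed_batches(dataset, max_tokens, seed):
    """Yield packs of sequences whose total token count fits in max_tokens.

    The shuffle happens before packing, so packing cannot impose a
    convergence-friendly order; a fresh seed each run gives a fresh order.
    Assumes each sequence has len(seq) <= max_tokens.
    """
    rng = random.Random(seed)
    order = list(range(len(dataset)))
    rng.shuffle(order)

    pack, used = [], 0
    for i in order:
        seq = dataset[i]
        if pack and used + len(seq) > max_tokens:
            yield pack          # emit the current pack and start a new one
            pack, used = [], 0
        pack.append(seq)
        used += len(seq)
    if pack:
        yield pack              # flush the final, possibly partial pack
```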
@parmitam Can we state explicitly that this only applies to BERT, because this rule does not apply to any other benchmark?
I agree. This section should say only that padding/un-padding is allowed, but that packing should be done if and only if it is done by the reference. And the packing algorithm should be the one the reference uses.
This is an exception that was added for the BERT benchmark since GraphCore needed it at the last minute, and unfortunately the packing code was never put into the reference. This paragraph should be moved to Section 14, "Appendix: Benchmark Specific Rules".
Can we change the wording from "be as random as the reference" to "be at least as random as the reference"? There are bugs in the BERT reference where it does not fully randomize when run on a small number of accelerators (I think the crossover point is 32 accelerators).
When using packing, the number of samples per batch becomes variable. And the batch size impacts (a) which RCP is used, (b) the LR schedule, and (c) the eval schedule. With the packing algorithm proposed by GraphCore (and used by NVIDIA and NVIDIA's partners since 2021), it was empirically measured that ~2.0x as many samples are processed per batch, so the committee agreed that for GraphCore's packing algorithm the code would report the batch size as 2x larger when using the packed dataset.
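Illustrative arithmetic only: the constant and function below are made up for this sketch; the ~2.0x figure is the empirical measurement cited above for GraphCore's algorithm, not a general rule.

```python
PACKING_FACTOR = 2.0  # measured average samples per packed sequence (GraphCore)

def reported_batch_size(packed_sequences_per_batch):
    # e.g. 256 packed sequences per batch -> report 512 samples per batch;
    # this reported value drives RCP lookup, the LR schedule, and evals.
    return int(packed_sequences_per_batch * PACKING_FACTOR)
```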
GraphCore's algorithm uses "Non-negative Least Squares Histogram-Packing", which is described in a PowerPoint slide that was shared with the committee in 2021. I don't think that slide ever got uploaded to the Google Drive, so I've forwarded a copy of it to Shriya. There may also have been a simpler greedy algorithm evaluated at the same time that achieved similar packing ratios, but I can't find any documentation about that.
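For reference, a simple greedy packer of the kind mentioned above might look like the sketch below (first-fit decreasing by length). This is explicitly not the NNLS histogram-packing algorithm from the slide, and the names are hypothetical. Note that per the rule text, a sorted packing like this must still be re-randomized (e.g. shuffle the resulting packs with a fresh seed) before every run.

```python
def greedy_pack(lengths, max_seq_len):
    """First-fit-decreasing: assign sequence indices to packs by length."""
    packs = []  # each entry: [tokens_used, list_of_indices]
    for i in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for pack in packs:
            if pack[0] + lengths[i] <= max_seq_len:
                pack[0] += lengths[i]   # sequence i fits in this pack
                pack[1].append(i)
                break
        else:
            packs.append([lengths[i], [i]])  # no room anywhere: open a new pack
    return [indices for _, indices in packs]
```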
packing rule update