Adjust test tolerance #19947
Merged
Conversation
tianleiwu force-pushed the tlwu/default_test_threshold branch 2 times, most recently from 2f5575a to a16da8a on March 18, 2024 06:36
tianleiwu force-pushed the tlwu/default_test_threshold branch from a16da8a to 5e2a247 on March 18, 2024 07:05
tianleiwu changed the title from "Adjust test tolerance and disable tf32 in tests" to "Adjust test tolerance" on Mar 18, 2024
snnn previously approved these changes on Mar 19, 2024
snnn approved these changes on Mar 19, 2024
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request on May 7, 2024
Description
Improve the precision of tests.
Changes include:
(1) Update checkers.cc to use a consistent default tolerance.
(2) Allow different default tolerances for different providers at runtime (previously, the threshold of a test was decided at compile time).
(3) Explicitly set absolute and relative error tolerances for tests that fail to pass the new default thresholds.
Default Thresholds Change
Note that the formula of testing is
abs(expected - value) < absolute + relative * abs(expected)
Default test thresholds when both absolute and relative tolerance are not set:

| type | provider | absolute (before) | absolute (after) | relative (before) | relative (after) |
| -- | -- | -- | -- | -- | -- |
| double | CPU | 0.001 | 0.00001 | 0 | 0.00001 |
| double | CUDA | 0.005 | 0.00001 | 0 | 0.00001 |
| double | TRT | 0.005 | 0.00001 | 0 | 0.00001 |
| double | ROCM | 0.005 | 0.00001 | 0 | 0.00001 |
| double | DML | 0.005 | 0.00001 | 0 | 0.00001 |
| float | CPU | 0.0001 | 0.00001 | 0 | 0.0001 |
| float | CUDA | 0.005 | 0.00001 | 0 | 0.0001 |
| float | TRT | 0.005 | 0.00001 | 0 | 0.0001 |
| float | ROCM | 0.005 | 0.00001 | 0 | 0.0001 |
| float | DML | 0.005 | 0.00001 | 0 | 0.0001 |
| float | Training* | 0.005 | 0.001 | 0 | 0.0001 |
| half | CPU | 0.001 | 0.0025 | 0 | 0.001 |
| half | CUDA | 0.005 | 0.0025 | 0 | 0.001 |
| half | TRT | 0.005 | 0.0025 | 0 | 0.001 |
| half | ROCM | 0.005 | 0.0025 | 0 | 0.001 |
| half | DML | 0.02 | 0.005 | 0 | 0.001 |
| half | Training* | 0.005 | 0.005 | 0 | 0.001 |
| bfloat16 | CPU | 0.0001 | 0.02 | 0 | 0.01 |
| bfloat16 | CUDA | 0.0001 | 0.02 | 0.05 | 0.01 |
| bfloat16 | TRT | 0.0001 | 0.02 | 0.05 | 0.01 |
| bfloat16 | ROCM | 0.0001 | 0.02 | 0.05 | 0.01 |
| bfloat16 | DML | 0.0001 | 0.02 | 0.05 | 0.01 |
| bfloat16 | Training* | 0.0001 | 0.02 | 0.05 | 0.01 |

*Training means the build flag ENABLE_TRAINING_CORE is defined. The provider can be any one.
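As a minimal sketch (an illustration, not the actual checkers.cc code), the comparison rule above can be written as:

```cpp
#include <cmath>

// Sketch of the tolerance rule described above:
//   abs(expected - value) < absolute + relative * abs(expected)
// This is illustrative only; the real logic lives in checkers.cc.
bool WithinTolerance(double expected, double value,
                     double absolute, double relative) {
  return std::fabs(expected - value) <
         absolute + relative * std::fabs(expected);
}
```

With the new float defaults (absolute 0.00001, relative 0.0001), a large expected value earns a proportionally larger allowance: an error of 0.05 passes against an expected value of 1000 but fails against an expected value of 1.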
Threshold for provider
Previously, the threshold might change according to build flags:

```
#if defined(USE_CUDA) || defined(USE_ROCM) || defined(USE_DML)
constexpr float threshold = 0.005f;
#else
constexpr float threshold = 0.0001f;
#endif
```
For a CPU-only build, the threshold is 0.0001. For a CUDA build, the threshold for the CPU provider (some tests in a CUDA build actually run with the CPU provider) is changed to 0.005.
After this change, the threshold depends only on the data type and the provider used in the test. It will not change with build flags for non-training builds.
Default thresholds for training might be different from inference (please refer to the above table). There are a few factors: training has gradient outputs; TF32 is not disabled in training; some training tests have iterations, so error might accumulate. How to set different thresholds based on these factors could be a future task.
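The runtime selection can be pictured with a small sketch (hypothetical names and a reduced set of cases; the actual implementation is in checkers.cc), using the "after" columns of the table for a non-training build. The fp16 rows show why a runtime lookup is needed: DML keeps a looser absolute tolerance than the other providers.

```cpp
#include <string>

// Hypothetical sketch of choosing default tolerances at runtime from
// (data type, provider) instead of compile-time #if blocks. Numbers
// come from the "after" columns of the table above (non-training
// build); function and field names are illustrative only.
struct Tolerance {
  double absolute;
  double relative;
};

Tolerance DefaultTolerance(const std::string& type,
                           const std::string& provider) {
  if (type == "half") {
    // fp16 on DML keeps a looser absolute threshold (0.005 vs 0.0025).
    return {provider == "DML" ? 0.005 : 0.0025, 0.001};
  }
  if (type == "bfloat16") return {0.02, 0.01};
  if (type == "float") return {0.00001, 0.0001};
  return {0.00001, 0.00001};  // double
}
```

Because the choice happens when the test runs, the CPU provider gets the same defaults in a CPU-only build and in a CUDA build, which was not the case with the old `#if` blocks.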
Motivation and Context