
Roialign fix and half_pixel mode support #3482

Open · bpickrel wants to merge 75 commits into develop
Conversation

bpickrel (Contributor) commented Sep 26, 2024

Fix bugs in the implementation of the ROIAlign operation that were found when attempting to run it with the half_pixel coordinate conversion mode, and add more thorough tests. Some bugs are mode-specific and some are not.

The ROIAlign operation was first proposed in the paper that introduced the Mask R-CNN model (https://arxiv.org/abs/1703.06870v3). It is a variant of the ROIPool operation that was found to give significantly better accuracy. In the Torch, Onnxruntime, and MIGraphX implementations, ROIPool and ROIAlign are implemented as the same op with different values of the coordinate conversion mode attribute: output_half_pixel for ROIPool and half_pixel for ROIAlign. Thus, there is no working ROIAlign op without fixing the half_pixel mode.
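
To illustrate the difference between the two modes, here is a minimal NumPy sketch of how an ROI is mapped onto the feature map, based on my reading of the onnxruntime kernel. The helper is illustrative only; the example values mirror one of the test ROIs below.

import numpy as np

def scale_roi(roi, spatial_scale, half_pixel):
    # Illustrative only, not MIGraphX or ORT code. roi is (x1, y1, x2, y2).
    # half_pixel shifts the scaled coordinates by -0.5 and does not clamp the
    # extent; output_half_pixel (the legacy ROIPool-style mode) skips the shift
    # but clamps the ROI width/height to at least 1.
    offset = 0.5 if half_pixel else 0.0
    x1, y1, x2, y2 = np.asarray(roi, dtype=np.float64) * spatial_scale - offset
    w, h = x2 - x1, y2 - y1
    if not half_pixel:
        w, h = max(w, 1.0), max(h, 1.0)
    return (x1, y1), (w, h)

# The same ROI yields different sampling regions under the two modes:
print(scale_roi([1.1, 0.73, 2.6, 1.13], spatial_scale=2.0, half_pixel=True))
print(scale_roi([1.1, 0.73, 2.6, 1.13], spatial_scale=2.0, half_pixel=False))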

Note, by the way, that these same coordinate conversion modes are also attributes of the Resize op.

MIGraphX uses the Onnxruntime implementation of ROIAlign as its functional specification and should give identical results.

This change is a prerequisite for torch-migraphx PR #143 but does not close it.

bpickrel and others added 30 commits July 17, 2024 22:23
…code. Tests need to be completed, including updating generated onnx test files.
…_half_pixel_verify_test for first roi but fails for second
@@ -41,7 +41,7 @@ TEST_CASE(roialign_test)
{{"coordinate_transformation_mode", "output_half_pixel"},
{"spatial_scale", 2.0f},
{"output_height", 5},
- {"output_width", 5},
+ {"output_width", 3},
Contributor Author

This particular change was a big deal, btw, since the old code appeared to work fine until I gave it an output_height and output_width that were not the same.

migraphx::shape srois{migraphx::shape::float_type, {2, 4}};
std::vector<float> rois_data = {1.1, 0.73, 1.7, 1.13, 1.1, 0.73, 2.6, 1.13};
migraphx::shape sbi{migraphx::shape::int64_type, {2}}; // batch_index
std::vector<int64_t> bi_data = {0, 1};
Contributor

For good measure, can you add tests for the following cases:

  • Repeated batch indices
  • Missing batch indices (ie. not all batch items are computed on)
  • Number of ROIs != batch_size

You can probably just create one test case to cover all of these. Make the input batch_size 3 and the batch_indices something like {1,2,2,1} (and hence the rois shape will be {4,4}).
It would also be good to have a GPU verify test for this same case, just to be sure the GPU implementation matches.
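
For concreteness, a NumPy sketch of inputs matching this suggestion; the shapes follow the comment above, but the data values are placeholders.

import numpy as np

# batch_size 3 input, but only batches 1 and 2 are referenced; indices 1 and 2
# are each repeated, and the number of ROIs (4) != batch_size (3).
x = np.random.rand(3, 1, 10, 10).astype(np.float32)        # (N=3, C=1, H, W)
rois = np.array([[0.5, 0.5, 7.5, 7.5],
                 [1.0, 1.0, 5.0, 5.0],
                 [2.0, 2.0, 8.0, 8.0],
                 [0.0, 0.0, 9.0, 9.0]], dtype=np.float32)  # shape (4, 4)
batch_indices = np.array([1, 2, 2, 1], dtype=np.int64)     # batch 0 never used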

Contributor Author

Added such a case in ref. tests, along with other updates. Some of the new cases fail and I'm now debugging those.

@spolifroni-amd (Contributor) left a comment

A trailing whitespace character in the table needs to be removed.

@@ -697,7 +697,7 @@ Operator Support Matrix
| | | | functions are |
| | | | not enabled |
+--------------------------+-----------+-----------------+------------------------------+
| RoiAlign | ✅ | FP8, FP16, | |
| RoiAlign | ✅ | FP8, FP16, | |
Contributor

The extra whitespace at the end of this row is causing the table to be improperly formatted and not appear on the doc page.

Suggested change
| RoiAlign | ✅ | FP8, FP16, | |
| RoiAlign | ✅ | FP8, FP16, | |

…variety of options including pooling mode, transformation type, spatial scale, multiple input channels, non-symmetrical output shape, and ROI index list with skips and duplicates. Changed roialign_half_pixel_verify_test to match one of the new ref test cases. Cases using max pooling do not pass the test.
@@ -84,114 +84,164 @@ TEST_CASE(roialign_out_of_bound_test)
}
}

auto create_program(const std::string& trans_mode = "half_pixel",
Contributor Author

Notes on ref tests: added cases with all 4 combinations of trans_mode and pooling_mode and split them apart into separate named cases. The modified create_program() has a reshaped input with multiple channels, multiple layers, an ROI list that doesn't match 1-1 with the layers, non-unity scale and sampling ratio, one negative data value (any float value is legal) and an ROI that goes out of bounds (also legal). Also, the output height and width are no longer equal, which masked errors in the original implementation.

Need to debug the max pooling cases!

@bpickrel (Contributor Author)

The licensing check fail now occurring is for a file not related to this PR:

Error: The licenses for the following 1 file(s) either... do not match the year of commit, have a different copyright format or have not been synced from the latest roialign_fix branch:
['src/targets/gpu/kernels/include/migraphx/kernels/float8.hpp']

@bpickrel (Contributor Author)

Looks fine, just a few small things. I haven't been able to fully wrap my head around all the math in the ref and gpu impl, but the index changes look reasonable. Do we have a way to directly test against ORT (without manually extracting gold outputs)? If so, I think it would be worthwhile to add a few more tests comparing with ORT.

I think it would be possible to add a test following the model of the existing tests in test/py/. With luck it wouldn't be very much extra work, half a day or so. @pfultz2 what do you think? The rationale for adding an op test here is that the ROIAlign op is defined in terms of the Onnxruntime implementation, so it makes sense to have a specialized test with ORT as the reference.

Note my recent comment that I learned the ORT implementation of the max pooling option is buggy and can't be used for a test reference until the fix is released. I don't know whether max pooling is widely used with this op or not.
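
For reference, a rough sketch of what such a test/py/-style check might look like. This is not the eventual test_roialign.py; the model construction, tensor names, and attribute values are illustrative, and the MIGraphX Python calls (parse_onnx, get_target, compile, run, argument) are assumed from the standard Python bindings.

import numpy as np
import onnx
from onnx import TensorProto, helper
import onnxruntime as ort
import migraphx


def make_roialign_model(path):
    # Build a small RoiAlign model in half_pixel mode (illustrative shapes/attributes).
    node = helper.make_node(
        "RoiAlign", ["x", "rois", "batch_ind"], ["y"],
        coordinate_transformation_mode="half_pixel",
        mode="avg",
        output_height=5,
        output_width=3,
        sampling_ratio=2,
        spatial_scale=2.0)
    graph = helper.make_graph(
        [node], "roialign_check",
        [helper.make_tensor_value_info("x", TensorProto.FLOAT, [2, 2, 4, 3]),
         helper.make_tensor_value_info("rois", TensorProto.FLOAT, [2, 4]),
         helper.make_tensor_value_info("batch_ind", TensorProto.INT64, [2])],
        [helper.make_tensor_value_info("y", TensorProto.FLOAT, [2, 2, 5, 3])])
    model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 16)])
    onnx.save(model, path)


def test_roialign_vs_ort(path="roialign_check.onnx"):
    make_roialign_model(path)
    data = np.random.rand(2, 2, 4, 3).astype(np.float32)
    roi_data = np.array([[0.1, 0.15, 0.6, 0.35],
                         [1.1, 0.73, 2.6, 1.13]], dtype=np.float32)
    index_data = np.array([0, 1], dtype=np.int64)

    # Reference result from onnxruntime.
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    gold = sess.run(["y"], {"x": data, "rois": roi_data, "batch_ind": index_data})[0]

    # MIGraphX result on the reference target ("gpu" would be exercised the same way).
    prog = migraphx.parse_onnx(path)
    prog.compile(migraphx.get_target("ref"))
    out = prog.run({"x": migraphx.argument(data),
                    "rois": migraphx.argument(roi_data),
                    "batch_ind": migraphx.argument(index_data)})[-1]
    # Conversion assumes migraphx.argument exposes the buffer protocol.
    assert np.allclose(np.array(out), gold, rtol=1e-05, atol=1e-08)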

@bpickrel (Contributor Author)

Looks fine, just a few small things. I haven't been able to fully wrap my head around all the math in the ref and gpu impl, but the index changes look reasonable.

Do you want me to go over it with you? I can explain the intent of nearly everything but the indexing is still very difficult to unravel.

# XXXXX 0x562d956ec8f0 (0x562d956ec8f0 + 0 * 2 + channel 0) * 4 * 3
# XXXXX 0x562d956ec920 (0x562d956ec8f0 + 0 * 2 + channel 1) * 4 * 3
res = sess.run(['y'], {'x': data, 'rois': roi_data, 'batch_ind': index_data})
assert np.allclose(mgx_result, res, rtol=1e-05, atol=1e-08, equal_nan=False)
Contributor Author

The tolerances are the NumPy defaults.

@bpickrel bpickrel marked this pull request as draft October 29, 2024 15:49
@bpickrel (Contributor Author)

Requesting re-review after a recent change: I added a Python test, test_roialign.py, to check MIGraphX output directly against onnxruntime, and found that MIGraphX results were internally consistent but produced the right values in a transposed shape. Fixing this required changes to internal computations, and I updated both the ref. and GPU implementations to emit the corrected shape.

Repeat of an earlier comment: we can't do a similar check vs. onnxruntime for "max" pooling mode because the ORT implementation of max pooling in ROIAlign has a known bug.

@bpickrel bpickrel marked this pull request as ready for review November 12, 2024 16:44
@migraphx-bot (Collaborator)

Test  Batch  Rate new (400bd0)  Rate old (c51bea)  Diff  Compare
torchvision-resnet50 64 3,258.94 3,257.81 0.03%
torchvision-resnet50_fp16 64 6,988.19 6,987.81 0.01%
torchvision-densenet121 32 2,431.87 2,434.57 -0.11%
torchvision-densenet121_fp16 32 4,099.62 4,065.61 0.84%
torchvision-inceptionv3 32 1,636.68 1,637.17 -0.03%
torchvision-inceptionv3_fp16 32 2,761.86 2,759.26 0.09%
cadene-inceptionv4 16 775.66 776.31 -0.08%
cadene-resnext64x4 16 808.05 811.75 -0.46%
slim-mobilenet 64 7,525.92 7,533.16 -0.10%
slim-nasnetalarge 64 211.28 211.39 -0.05%
slim-resnet50v2 64 3,497.50 3,504.83 -0.21%
bert-mrpc-onnx 8 1,147.54 1,146.47 0.09%
bert-mrpc-tf 1 464.53 473.89 -1.98%
pytorch-examples-wlang-gru 1 413.15 425.31 -2.86%
pytorch-examples-wlang-lstm 1 389.69 408.68 -4.65% 🔴
torchvision-resnet50_1 1 806.71 771.75 4.53% 🔆
cadene-dpn92_1 1 399.87 399.01 0.22%
cadene-resnext101_1 1 382.86 383.85 -0.26%
onnx-taau-downsample 1 343.04 343.09 -0.02%
dlrm-criteoterabyte 1 33.33 33.31 0.05%
dlrm-criteoterabyte_fp16 1 52.71 52.71 0.01%
agentmodel 1 7,901.33 8,235.67 -4.06% 🔴
unet_fp16 2 58.79 58.90 -0.19%
resnet50v1_fp16 1 948.60 940.89 0.82%
resnet50v1_int8 1 1,002.37 1,025.93 -2.30%
bert_base_cased_fp16 64 1,171.54 1,170.88 0.06%
bert_large_uncased_fp16 32 363.60 363.69 -0.02%
bert_large_fp16 1 200.49 200.14 0.18%
distilgpt2_fp16 16 2,202.57 2,200.77 0.08%
yolov5s 1 543.48 535.15 1.56%
tinyllama 1 43.46 43.41 0.10%
vicuna-fastchat 1 175.77 178.09 -1.30%
whisper-tiny-encoder 1 417.88 418.18 -0.07%
whisper-tiny-decoder 1 427.73 427.58 0.03%

This build is not recommended to merge 🔴

@migraphx-bot (Collaborator)


     ✅ bert-mrpc-onnx: PASSED: MIGraphX meets tolerance

     ✅ bert-mrpc-tf: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance

     ✅ torchvision-resnet50_1: PASSED: MIGraphX meets tolerance

     ✅ cadene-dpn92_1: PASSED: MIGraphX meets tolerance

     ✅ cadene-resnext101_1: PASSED: MIGraphX meets tolerance

     ✅ dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance

     ✅ agentmodel: PASSED: MIGraphX meets tolerance

     ✅ unet: PASSED: MIGraphX meets tolerance

     ✅ resnet50v1: PASSED: MIGraphX meets tolerance

     ✅ bert_base_cased_fp16: PASSED: MIGraphX meets tolerance

     🔴 bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output


     ✅ bert_large: PASSED: MIGraphX meets tolerance

     ✅ yolov5s: PASSED: MIGraphX meets tolerance

     ✅ tinyllama: PASSED: MIGraphX meets tolerance

     ✅ vicuna-fastchat: PASSED: MIGraphX meets tolerance

     ✅ whisper-tiny-encoder: PASSED: MIGraphX meets tolerance

     ✅ whisper-tiny-decoder: PASSED: MIGraphX meets tolerance

     ✅ distilgpt2_fp16: PASSED: MIGraphX meets tolerance

@pfultz2 (Collaborator) commented Nov 12, 2024

You should capture the onnxruntime results and just create a ref test.
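
One possible shape for that suggestion, sketched below: run the model once through onnxruntime and print the output in a form that can be pasted into a C++ ref test as the gold data. The model file name and the input names/values are placeholders, not taken from the repository.

import numpy as np
import onnxruntime as ort


def dump_gold(model_path, feeds, precision=6):
    # Run the model once through onnxruntime and print its single output as a
    # C++ brace-initializer list, ready to paste into a ref test.
    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    (gold,) = sess.run(None, feeds)
    flat = ", ".join(f"{v:.{precision}f}" for v in gold.flatten())
    print(f"// gold shape: {list(gold.shape)}")
    print("std::vector<float> gold = {" + flat + "};")


# Placeholder file and inputs; fixed data keeps the gold values reproducible.
dump_gold("roialign_test.onnx",
          {"x": (np.arange(2 * 2 * 4 * 3, dtype=np.float32) / 10.0).reshape(2, 2, 4, 3),
           "rois": np.array([[0.1, 0.15, 0.6, 0.35],
                             [1.1, 0.73, 2.6, 1.13]], dtype=np.float32),
           "batch_ind": np.array([0, 1], dtype=np.int64)})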



if __name__ == "__main__":
    test_roialign()
Collaborator

This file is not even used in our test suite. It should just be removed and a ref test should be used.
