Enable ORT accuracy tests to verify int8 #1904

Closed
TedThemistokleous opened this issue Jun 29, 2023 · 28 comments · Fixed by #2300, #2896 or #2903
Labels
onnxruntime (PR changes interaction between MIGraphX and Onnxruntime), roadmap (Tasks to finish for a release)

Comments

@TedThemistokleous
Collaborator

rocMLIR will be added to MIGraphX. That work covers all data types, but this issue is specifically to ensure the existing tests in DLM pass using int8.

@TedThemistokleous TedThemistokleous self-assigned this Jun 29, 2023
@TedThemistokleous TedThemistokleous added the roadmap and onnxruntime labels Jun 29, 2023
@TedThemistokleous
Collaborator Author

Seems like #2300 will aid in accuracy for these runs.

@TedThemistokleous TedThemistokleous linked a pull request Oct 6, 2023 that will close this issue
@TedThemistokleous
Collaborator Author

Reopening; need to rerun testing with these changes and update the ORT EP.

@TedThemistokleous
Collaborator Author

TedThemistokleous commented Nov 4, 2023

Resnet50 runs; further analysis is needed to compare against fp16 through the same pipeline.

Plan of attack

  • Add changes to DLM to run e2e pipeline with int8 <- has accuracy result using real dataset
  • Storage for imagenet data set for DLM runs
  • Add test for bert
  • Add test for distilgpt2

Added models will leverage existing e2e code; may need to write/borrow code from benchmark.py that's used in parity checks. A rough sketch of such a parity check is below.
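
(A minimal sketch of the kind of parity/accuracy check intended here, comparing the int8 model under the MIGraphX EP against a CPU fp32 reference. Model paths, input shape and tolerances are placeholders, not the actual DLM/benchmark.py code.)

import numpy as np
import onnxruntime as ort

# Placeholder model paths / input shape; the real tests run through the DLM e2e pipeline.
ref_sess  = ort.InferenceSession("model_fp32.onnx", providers=["CPUExecutionProvider"])
int8_sess = ort.InferenceSession("model_int8.onnx", providers=["MIGraphXExecutionProvider"])

input_name = ref_sess.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

ref_out  = ref_sess.run(None, {input_name: x})[0]
int8_out = int8_sess.run(None, {input_name: x})[0]

# int8 quantization is expected to diverge from fp32 more than fp16 does,
# so the tolerance here is purely illustrative.
print("max abs diff:", np.abs(ref_out - int8_out).max())
assert np.allclose(ref_out, int8_out, rtol=1e-2, atol=1e-2)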

@TedThemistokleous
Collaborator Author

Doing this as part of QA validation for the resnet50 pipeline added in onnxruntime-inference-examples.

@TedThemistokleous
Collaborator Author

Got DLM changes for benchmark.py, but they are failing off 6.0; sorting out issues with the run scripts.

@TedThemistokleous
Collaborator Author

TedThemistokleous commented Dec 6, 2023

Able to get bert-large, bert-base-cased and distilgpt2 model runs into DLM by reusing existing runs. Missing GPT2 as referenced in #1905. Will need to add gpt2 plus an equivalent int8 run.

We're letting onnxruntime do the quantization of the model before we do a run through the MIGraphX EP right now.
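
For reference, a minimal sketch of that dynamic quantization step using onnxruntime's quantization tooling (the benchmark script uses its own helpers, so the exact call and the paths below are assumptions):

from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization rewrites MatMul weights to int8 and inserts
# DynamicQuantizeLinear / MatMulInteger nodes for the activations,
# which is the pattern the MIGraphX EP has to handle below.
quantize_dynamic(
    model_input="./onnx_models/gpt2_1_fp32_gpu.onnx",   # placeholder input path
    model_output="./onnx_models/gpt2_1_int8_gpu.onnx",
    weight_type=QuantType.QInt8,
)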

Running these by hand to verify

bert_large_uncased int8 Quantized.

Finished quantizing model: ./onnx_models/bert_large_uncased_1_int8_gpu.onnx
Run onnxruntime on bert-large-uncased with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-large-uncased', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 19:36:35.944933', 'test_times': 100, 'latency_variance': '0.13', 'latency_90_percentile': '175.18', 'latency_95_percentile': '175.84', 'latency_99_percentile': '211.92', 'average_latency_ms': '176.60', 'QPS': '5.66'}
Run onnxruntime on bert-large-uncased with input shape [16, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-large-uncased', 'inputs': 1, 'threads': 16, 'batch_size': 16, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 19:36:54.211083', 'test_times': 100, 'latency_variance': '1164.47', 'latency_90_percentile': '5226.89', 'latency_95_percentile': '5381.81', 'latency_99_percentile': '5932.76', 'average_latency_ms': '3711.79', 'QPS': '4.31'}

bert_base_cased int8 Quantized

Finished quantizing model: ./onnx_models/bert_base_cased_1_int8_gpu.onnx
Run onnxruntime on bert-base-cased with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-base-cased', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 19:51:15.985903', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '12.94', 'latency_95_percentile': '15.32', 'latency_99_percentile': '15.75', 'average_latency_ms': '12.85', 'QPS': '77.80'}
Run onnxruntime on bert-base-cased with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-base-cased', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 19:51:17.620586', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '54.57', 'latency_95_percentile': '54.61', 'latency_99_percentile': '55.08', 'average_latency_ms': '54.20', 'QPS': '18.45'}
Run onnxruntime on bert-base-cased with input shape [32, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-base-cased', 'inputs': 1, 'threads': 16, 'batch_size': 32, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 19:51:23.099155', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '127.41', 'latency_95_percentile': '127.57', 'latency_99_percentile': '128.02', 'average_latency_ms': '126.68', 'QPS': '252.61'}
Run onnxruntime on bert-base-cased with input shape [32, 384]

distilgpt2 int8 Quantized

Size of quantized ONNX model(MB):116.36144828796387
Finished quantizing model: ./onnx_models/distilgpt2_1_int8_gpu.onnx
Run onnxruntime on distilgpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 19:54:43.699617', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '8.07', 'latency_95_percentile': '10.20', 'latency_99_percentile': '10.26', 'average_latency_ms': '8.14', 'QPS': '122.79'}
Run onnxruntime on distilgpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 19:54:44.836106', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '45.96', 'latency_95_percentile': '45.97', 'latency_99_percentile': '46.01', 'average_latency_ms': '45.78', 'QPS': '21.84'}
Run onnxruntime on distilgpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 19:54:49.480584', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '32.90', 'latency_95_percentile': '32.95', 'latency_99_percentile': '33.15', 'average_latency_ms': '32.70', 'QPS': '244.62'}
Run onnxruntime on distilgpt2 with input shape [8, 384]

@TedThemistokleous
Collaborator Author

Got a gpt2 run with int8 quant here.

quantized model saved to:./onnx_models/gpt2_1_int8_gpu.onnx
Size of quantized ONNX model(MB):157.34468364715576
Finished quantizing model: ./onnx_models/gpt2_1_int8_gpu.onnx
Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 20:57:05.271922', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '17.57', 'latency_95_percentile': '17.61', 'latency_99_percentile': '17.68', 'average_latency_ms': '15.39', 'QPS': '65.00'}
Run onnxruntime on gpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 20:57:07.146352', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '72.94', 'latency_95_percentile': '73.17', 'latency_99_percentile': '73.99', 'average_latency_ms': '72.80', 'QPS': '13.74'}
Run onnxruntime on gpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 20:57:14.528135', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '52.48', 'latency_95_percentile': '52.50', 'latency_99_percentile': '52.58', 'average_latency_ms': '52.30', 'QPS': '152.97'}
Run onnxruntime on gpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 20:57:19.830109', 'test_times': 100, 'latency_variance': '0.01', 'latency_90_percentile': '529.51', 'latency_95_percentile': '530.92', 'latency_99_percentile': '544.40', 'average_latency_ms': '529.13', 'QPS': '15.12'}
Fusion statistics is saved to csv file: benchmark_fusion_20231206-205813.csv
Detail results are saved to csv file: /tmp/results.csv
Summary results are saved to csv file: benchmark_summary_20231206-205813.csv

@TedThemistokleous
Collaborator Author

changes pushed into #2468

@TedThemistokleous
Collaborator Author

Not seeing the proper code path when running a trace compile for the tests in DLM with MIGRAPHX_TRACE_EVAL=1, after investigating a large drop in throughput compared to the fp16 versions. Sorting this out before closing this out.

@TedThemistokleous
Collaborator Author

Seeing about an order of magnitude drop between fp16 and int8 runs, e.g. distilgpt2 below:

{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:41:38.877793', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '10.61', 'latency_95_percentile': '10.79', 'latency_99_percentile': '11.09', 'average_latency_ms': '10.57', 'QPS': '94.61'}
Run onnxruntime on distilgpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:41:40.258887', 'test_times': 100, 'latency_variance': '0.19', 'latency_90_percentile': '80.93', 'latency_95_percentile': '81.12', 'latency_99_percentile': '81.47', 'average_latency_ms': '61.06', 'QPS': '16.38'}
Run onnxruntime on distilgpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:41:46.467054', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '36.39', 'latency_95_percentile': '36.77', 'latency_99_percentile': '36.95', 'average_latency_ms': '35.85', 'QPS': '223.17'}
Run onnxruntime on distilgpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:41:50.101417', 'test_times': 100, 'latency_variance': '1.09', 'latency_90_percentile': '362.16', 'latency_95_percentile': '409.24', 'latency_99_percentile': '554.73', 'average_latency_ms': '363.10', 'QPS': '22.03'}

{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:43:06.612680', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '1.03', 'latency_95_percentile': '1.03', 'latency_99_percentile': '1.04', 'average_latency_ms': '1.01', 'QPS': '987.05'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:43:23.724592', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.03', 'latency_95_percentile': '2.04', 'latency_99_percentile': '2.11', 'average_latency_ms': '2.00', 'QPS': '500.15'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:44:01.631328', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.02', 'latency_95_percentile': '2.13', 'latency_99_percentile': '2.16', 'average_latency_ms': '1.97', 'QPS': '4051.12'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:44:19.493161', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '10.90', 'latency_95_percentile': '10.92', 'latency_99_percentile': '11.03', 'average_latency_ms': '10.72', 'QPS': '746.60'}

@TedThemistokleous
Collaborator Author

It appears we're bouncing between all 3 EPs when doing int8 runs.

Attempting to run the model in the driver I'm seeing the following:

terminate called after throwing an instance of 'migraphx::version_2_8_0::exception'
  what():  /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/AMDMIGraphX/src/onnx/onnx_parser.cpp:417: parse_graph: Unknown operator: DynamicQuantizeLinear
Aborted (core dumped)

This is most likely why the EP doesn't claim that op, and why the remaining nodes fall back onto the other EPs.
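
A quick way to confirm the placement is to open the quantized model with the full provider priority list and verbose logging; ORT then reports which nodes land on which EP (a hedged sketch, not the DLM harness):

import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0  # verbose: the session log includes node -> EP placement

sess = ort.InferenceSession(
    "gpt2_1_int8_gpu.onnx",
    sess_options=so,
    providers=["MIGraphXExecutionProvider",
               "ROCMExecutionProvider",
               "CPUExecutionProvider"],
)
print(sess.get_providers())  # providers actually registered for this session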

@causten
Collaborator

causten commented Dec 12, 2023

Hmm we added support for that operator

@TedThemistokleous
Collaborator Author

Looks like we're using an older version.

MIGraphX Version: 2.8.0.7f8f0fd0f

Also, DynamicQuantizeLinear isn't in the EP's list of supported ops. Have a patch coming to add it in.

@TedThemistokleous
Collaborator Author

TedThemistokleous commented Dec 12, 2023

Got fixes up to here:

Upstream - microsoft/onnxruntime#18798
Internal - ROCm/onnxruntime#26

Running a test with latest develop + the onnxruntime change + the DLM container for the gpt2 int8 test. Will run the other models and analyze once complete as well.

Using this to build the end-to-end pipeline:

python3 tools/run_models.py --tags migx_onnxrt_gpt2_quant_benchmarks --liveOutput --cleanDockerCache                               --additionalContext "{'guest_os':'UBUNTU', \
                              'docker_build_arg':{\
                              'BASE_DOCKER':'compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-rel-6.0:88_ubuntu22.04_py3.10_pytorch_release-2.1_011de5c', \      
                              'ORT_UNIT_TESTS':'false', 'ORT_BUILD':'true', 'ONNXRUNTIME_BRANCH':'add_dynamic_quantize_linear','ONNXRUNTIME_REPO':'https://github.com/ROCmSoftwarePlatform/onnxruntime', 'MIGX_BUILD':'true'}}"

@TedThemistokleous
Collaborator Author

@causten for the MIGraphX side, looking at develop, you're right, we should have this op in. Need to figure out where APT is getting things here.

@TedThemistokleous
Collaborator Author

TedThemistokleous commented Dec 13, 2023

A build off develop seems to read the int8 model correctly:

@2689 = @return(@2688,@1075,@1077,@1211,@1213,@1347,@1349,@1483,@1485,@1619,@1621,@1755,@1757,@1891,@1893,@2027,@2029,@2163,@2165,@2299,@2301,@2435,@2437,@2571,@2573), target_id=0


[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver read gpt2_1_int8_gpu.onnx

[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver run gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids

From the ORT benchmark test the fp16 run got

{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:43:06.612680', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '1.03', 'latency_95_percentile': '1.03', 'latency_99_percentile': '1.04', 'average_latency_ms': '1.01', 'QPS': '987.05'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:43:23.724592', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.03', 'latency_95_percentile': '2.04', 'latency_99_percentile': '2.11', 'average_latency_ms': '2.00', 'QPS': '500.15'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:44:01.631328', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.02', 'latency_95_percentile': '2.13', 'latency_99_percentile': '2.16', 'average_latency_ms': '1.97', 'QPS': '4051.12'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:44:19.493161', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '10.90', 'latency_95_percentile': '10.92', 'latency_99_percentile': '11.03', 'average_latency_ms': '10.72', 'QPS': '746.60'}

Edit: rerunning this test off develop + latest changes for FP16, I get the following timings, which are about 50% lower than before. My understanding is that fast math is disabled for both cases, so that shouldn't be the cause.


Model saved to ./onnx_models/gpt2_1_fp16_gpu.onnx
Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-13 03:06:14.856181', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '1.83', 'latency_95_percentile': '1.84', 'latency_99_percentile': '1.87', 'average_latency_ms': '1.82', 'QPS': '549.61'}
Run onnxruntime on gpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-13 03:06:42.573559', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '3.63', 'latency_95_percentile': '3.64', 'latency_99_percentile': '3.67', 'average_latency_ms': '3.55', 'QPS': '282.03'}
Run onnxruntime on gpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-13 03:07:27.814779', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '3.41', 'latency_95_percentile': '3.66', 'latency_99_percentile': '3.71', 'average_latency_ms': '3.38', 'QPS': '2368.26'}
Run onnxruntime on gpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-13 03:07:56.767108', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '18.09', 'latency_95_percentile': '18.24', 'latency_99_percentile': '18.71', 'average_latency_ms': '17.62', 'QPS': '454.15'}

Trying a perf run with only int8 enabled in the driver, we're seeing a rate on our end of only ~585 inferences/sec:

gpu::code_object::quantizelinear_kernel: 1.37962ms / 60 = 0.0229937ms, 21%
gpu::code_object::contiguous_kernel: 0.835693ms / 36 = 0.0232137ms, 13%
gpu::code_object::mlir_reshape_quant_dot: 0.755034ms / 23 = 0.0328275ms, 12%
gpu::code_object::layernorm_mul_add_quantizelinear_kernel: 0.581659ms / 24 = 0.0242358ms, 9%
gpu::code_object::dequantizelinear_add_add_kernel: 0.525061ms / 23 = 0.0228287ms, 8%
gpu::code_object::mlir_reshape_quant_dot_dequantizelinear_add: 0.387697ms / 13 = 0.0298228ms, 6%
gpu::code_object::mlir_quant_dot: 0.338115ms / 12 = 0.0281762ms, 6%
gpu::code_object::dequantizelinear_add_pow_mul_add_mul_tanh_add_mul_mul_quantizelinear_kernel: 0.309476ms / 12 = 0.0257897ms, 5%
gpu::code_object::softmax_kernel: 0.289521ms / 12 = 0.0241267ms, 5%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul_where: 0.283153ms / 12 = 0.0235961ms, 5%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.279093ms / 12 = 0.0232578ms, 5%
load: 0.119553ms / 219 = 0.000545903ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.117032ms / 1 = 0.117032ms, 2%
multibroadcast: 0.111057ms / 98 = 0.00113324ms, 2%
hip::hip_copy_literal: 0.0854196ms / 149 = 0.000573286ms, 2%
reshape_lazy: 0.0637735ms / 95 = 0.0006713ms, 1%
transpose: 0.0545841ms / 48 = 0.00113717ms, 1%
slice: 0.0419648ms / 36 = 0.00116569ms, 1%
gpu::code_object::add_layernorm_quantizelinear_kernel: 0.0241358ms / 1 = 0.0241358ms, 1%
gpu::code_object::gather_kernel: 0.0231269ms / 1 = 0.0231269ms, 1%
gpu::code_object::add_kernel: 0.0230521ms / 1 = 0.0230521ms, 1%
gpu::code_object::convert_kernel: 0.0227299ms / 1 = 0.0227299ms, 1%
@param: 0.00901038ms / 26 = 0.000346553ms, 1%
hip::hip_allocate_memory: 0.0008224ms / 1 = 0.0008224ms, 1%
check_context::migraphx::gpu::context: 0.00066418ms / 1 = 0.00066418ms, 1%

Batch size: 1
Rate: 585.383 inferences/sec
Total time: 1.70828ms
Total instructions time: 6.66105ms
Overhead time: 0.18396ms, -4.95276ms
Overhead: 11%, -290%
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --int8

A solely fp16 run gives us:

Summary:
gpu::code_object::mlir_reshape_dot: 0.826736ms / 23 = 0.0359451ms, 16%
gpu::code_object::convert_kernel: 0.57429ms / 25 = 0.0229716ms, 11%
gpu::code_object::layernorm_mul_add_kernel: 0.566644ms / 24 = 0.0236102ms, 11%
gpu::code_object::contiguous_kernel: 0.533037ms / 24 = 0.0222099ms, 10%
gpu::code_object::add_add_kernel: 0.514395ms / 23 = 0.022365ms, 10%
gpu::code_object::mlir_reshape_dot_add: 0.391352ms / 13 = 0.030104ms, 8%
gpu::code_object::mlir_transpose_reshape_dot: 0.324614ms / 12 = 0.0270511ms, 6%
gpu::code_object::add_pow_mul_add_mul_tanh_add_mul_mul_kernel: 0.300936ms / 12 = 0.025078ms, 6%
gpu::code_object::mlir_reshape_transpose_slice_reshape_slice_transpose_dot_mul_where: 0.276505ms / 12 = 0.023042ms, 6%
gpu::code_object::softmax_kernel: 0.275471ms / 12 = 0.0229559ms, 6%
gpu::code_object::mlir_reshape_transpose_slice_dot: 0.27312ms / 12 = 0.02276ms, 6%
gpu::code_object::mlir_dot_add_convert: 0.148124ms / 1 = 0.148124ms, 3%
multibroadcast: 0.0958457ms / 98 = 0.000978017ms, 2%
load: 0.0918247ms / 171 = 0.000536986ms, 2%
hip::hip_copy_literal: 0.0814319ms / 149 = 0.000546523ms, 2%
reshape_lazy: 0.0574439ms / 83 = 0.000692095ms, 2%
slice: 0.0288253ms / 24 = 0.00120106ms, 1%
gpu::code_object::add_layernorm_kernel: 0.0230485ms / 1 = 0.0230485ms, 1%
gpu::code_object::gather_kernel: 0.0226782ms / 1 = 0.0226782ms, 1%
gpu::code_object::add_kernel: 0.0221361ms / 1 = 0.0221361ms, 1%
transpose: 0.0186464ms / 24 = 0.000776932ms, 1%
@param: 0.0092624ms / 26 = 0.000356246ms, 1%
hip::hip_allocate_memory: 0.0007636ms / 1 = 0.0007636ms, 1%
check_context::migraphx::gpu::context: 0.0006462ms / 1 = 0.0006462ms, 1%

Batch size: 1
Rate: 624.128 inferences/sec
Total time: 1.60223ms
Total instructions time: 5.45778ms
Overhead time: 0.148524ms, -3.85554ms
Overhead: 9%, -241%
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --fp16

With mixed int8 and fp16 we get the following

Summary:
gpu::code_object::quantizelinear_kernel: 1.38454ms / 60 = 0.0230757ms, 20%
gpu::code_object::contiguous_kernel: 0.819941ms / 36 = 0.0227761ms, 12%
gpu::code_object::mlir_reshape_quant_dot: 0.759796ms / 23 = 0.0330346ms, 11%
gpu::code_object::convert_kernel: 0.597603ms / 25 = 0.0239041ms, 9%
gpu::code_object::layernorm_mul_add_quantizelinear_kernel: 0.576765ms / 24 = 0.0240319ms, 8%
gpu::code_object::dequantizelinear_add_add_kernel: 0.526679ms / 23 = 0.0228991ms, 8%
gpu::code_object::mlir_reshape_quant_dot_dequantizelinear_add: 0.388696ms / 13 = 0.0298997ms, 6%
gpu::code_object::mlir_quant_dot: 0.339814ms / 12 = 0.0283178ms, 5%
gpu::code_object::dequantizelinear_add_pow_mul_add_mul_tanh_add_mul_mul_quantizelinear_kernel: 0.311157ms / 12 = 0.0259297ms, 5%
gpu::code_object::softmax_kernel: 0.286588ms / 12 = 0.0238823ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul_where: 0.28528ms / 12 = 0.0237733ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.282882ms / 12 = 0.0235735ms, 4%
load: 0.139606ms / 243 = 0.000574509ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear_add_convert: 0.11139ms / 1 = 0.11139ms, 2%
multibroadcast: 0.110408ms / 98 = 0.00112661ms, 2%
hip::hip_copy_literal: 0.0868224ms / 149 = 0.000582701ms, 2%
reshape_lazy: 0.0709634ms / 95 = 0.000746983ms, 1%
transpose: 0.054488ms / 48 = 0.00113517ms, 1%
slice: 0.0448662ms / 36 = 0.00124628ms, 1%
gpu::code_object::add_kernel: 0.0239382ms / 1 = 0.0239382ms, 1%
gpu::code_object::add_layernorm_quantizelinear_kernel: 0.0238166ms / 1 = 0.0238166ms, 1%
gpu::code_object::gather_kernel: 0.0233278ms / 1 = 0.0233278ms, 1%
@param: 0.0093938ms / 26 = 0.0003613ms, 1%
hip::hip_allocate_memory: 0.0007862ms / 1 = 0.0007862ms, 1%
check_context::migraphx::gpu::context: 0.0006766ms / 1 = 0.0006766ms, 1%

Batch size: 1
Rate: 455.81 inferences/sec
Total time: 2.1939ms
Total instructions time: 7.26023ms
Overhead time: 0.193532ms, -5.06633ms
Overhead: 9%, -231%
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --fp16 --int8

After building DynamicQuantizeLinear into the MIGraphX EP + newer develop, we're seeing the following.

Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-13 02:18:41.520488', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '17.70', 'latency_95_percentile': '17.72', 'latency_99_percentile': '17.74', 'average_latency_ms': '17.60', 'QPS': '56.81'}

QPS: 56.81, which is almost half of the initial int8 run.

@TedThemistokleous
Collaborator Author

I think it is related to fast math.

Running fp16 on our driver as a baseline

Summary:
gpu::code_object::mlir_reshape_dot: 0.764539ms / 23 = 0.0332408ms, 14%
gpu::code_object::convert_kernel: 0.593802ms / 25 = 0.0237521ms, 11%
gpu::code_object::layernorm_mul_add_kernel: 0.570796ms / 24 = 0.0237832ms, 11%
gpu::code_object::contiguous_kernel: 0.538753ms / 24 = 0.022448ms, 10%
gpu::code_object::add_add_kernel: 0.51837ms / 23 = 0.0225378ms, 10%
gpu::code_object::mlir_reshape_dot_add: 0.387976ms / 13 = 0.0298443ms, 8%
gpu::code_object::mlir_transpose_reshape_dot: 0.331672ms / 12 = 0.0276394ms, 7%
gpu::code_object::add_pow_mul_add_mul_tanh_add_mul_mul_kernel: 0.304019ms / 12 = 0.0253349ms, 6%
gpu::code_object::softmax_kernel: 0.28573ms / 12 = 0.0238108ms, 6%
gpu::code_object::mlir_reshape_transpose_slice_reshape_slice_transpose_dot_mul_where: 0.285179ms / 12 = 0.0237649ms, 6%
gpu::code_object::mlir_reshape_transpose_slice_dot: 0.285079ms / 12 = 0.0237566ms, 6%
gpu::code_object::mlir_dot_add_convert: 0.148713ms / 1 = 0.148713ms, 3%
multibroadcast: 0.11099ms / 98 = 0.00113255ms, 3%
load: 0.0965821ms / 171 = 0.000564807ms, 2%
hip::hip_copy_literal: 0.0866334ms / 149 = 0.000581432ms, 2%
reshape_lazy: 0.0606654ms / 83 = 0.000730908ms, 2%
slice: 0.0382431ms / 24 = 0.00159346ms, 1%
gpu::code_object::add_layernorm_kernel: 0.0234014ms / 1 = 0.0234014ms, 1%
gpu::code_object::gather_kernel: 0.0232851ms / 1 = 0.0232851ms, 1%
gpu::code_object::add_kernel: 0.0230552ms / 1 = 0.0230552ms, 1%
transpose: 0.0194102ms / 24 = 0.00080876ms, 1%
@param: 0.00970014ms / 26 = 0.000373082ms, 1%
hip::hip_allocate_memory: 0.0007364ms / 1 = 0.0007364ms, 1%
check_context::migraphx::gpu::context: 0.0006434ms / 1 = 0.0006434ms, 1%

Batch size: 1
Rate: 562.497 inferences/sec
Total time: 1.77779ms
Total instructions time: 5.50797ms
Overhead time: 0.151713ms, -3.73019ms
Overhead: 9%, -210%
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver perf gpt2_1_fp16_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --disable-fast-math --fp16

This is similar to the fp16 run through the EP, which was around 549 QPS; in the EP we have fast math off by default right now due to the accuracy issue we saw previously.

@TedThemistokleous
Collaborator Author

TedThemistokleous commented Dec 14, 2023

Running this off latest develop, I'm seeing this error now when trying to run the latest int8 model. Rolling back to before the DynamicQuantizeLinear support that was added into the opset two days ago.

@1010 = convert[target_type=4](@1009) -> uint8_type, {1}, {1}, target_id=0
@1011 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1005) -> float_type, {1, 768}, {0, 0}, target_id=0
@1012 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1010) -> uint8_type, {1, 768}, {0, 0}, target_id=0
@1013 = quantizelinear(@999,@1011,@1012) -> uint8_type, {1, 768}, {768, 1}, target_id=0
@1014 = mul(@1005,@858) -> float_type, {1}, {1}, target_id=0


terminate called after throwing an instance of 'migraphx::version_1::exception'
  what():  /workspace/migraphx/src/src/include/migraphx/check_shapes.hpp:210: same_type: quant_dot: Types do not match
Aborted (core dumped)
root@aus-navi3x-02:/workspace/onnxruntime/build/Linux/Release/onnxruntime/transformers/onnx_models# cd /workspace/migraphx/src/

@TedThemistokleous
Collaborator Author

I think this is related to migraphx::shape::uint8_type cropping up in the output. If I loosen the same_type constraint to only the fp8 types in quant_dot for now, I can get gpt2 to read correctly like before.

It appears either we're not adding uint8 as part of our supported types, or we're hitting a case where the convert listed above happens after we perform the quant, so when we go to compute_shape() we fail.

I'll add an eliminate_data_type pass and see if this helps convert uint8->int8, although I think we'd need to be concerned about narrowing here for accuracy; currently building something to test this.

@TedThemistokleous
Collaborator Author

TedThemistokleous commented Dec 22, 2023

Runs still seem to fail. Without the eliminate_data_type pass for uint8, I'm getting rocBLAS failures.

With the added pass I get

Reading: gpt2_1_int8_gpu.onnx
terminate called after throwing an instance of 'migraphx::version_1::exception'
  what():  /workspace/AMDMIGraphX/src/targets/gpu/mlir.cpp:706: run_high_level_pipeline: Invalid MLIR created: Error: 'migraphx.dot' op operand #0 must be !migraphx.shaped of 32-bit float or 16-bit float or bfloat16 type values, but got '!migraphx.shaped<32x768xi8, 768x1>'
Note: see current operation: %0 = "migraphx.dot"(%arg0, %arg1) : (!migraphx.shaped<32x768xi8, 768x1>, !migraphx.shaped<768x2304xi8, 2304x1>) -> !migraphx.shaped<32x2304xi8, 2304x1>

Aborted (core dumped)

Pushed up changes to debug_quant_dot branch. May try a later ROCm build since I'm still using build 88 from the 6.0 release cycle.

@TedThemistokleous
Collaborator Author

Tried some more things, taking a look at quantizelinear, which defaults to returning uint8 if we only have two inputs. Messing with that also seems to break with the same MLIR result above.

Running the following after a read gives me this output, where we're seeing the uint8 type popping up:

migraphx-driver read gpt2_1_int8_gpu.onnx | grep quant_dot -b25 | head -26
83303-@1359 = gather[axis=0](@375,@1074) -> int64_type, {1}, {0}, target_id=0
83375-@1360 = gather[axis=0](@374,@1073) -> int64_type, {1}, {0}, target_id=0
83447-@1361 = slice[axes={0},starts={-1},ends={9223372036854775807}](@373) -> int64_type, {1}, {1}, target_id=0
83553-@1362 = unsqueeze[axes={0},steps={}](@1359) -> int64_type, {1}, {1}, target_id=0
83634-@1363 = unsqueeze[axes={0},steps={}](@1360) -> int64_type, {1}, {1}, target_id=0
83715-@1364 = squeeze[axes={0}](@1361) -> int64_type, {1}, {0}, target_id=0
83785-@1365 = concat[axis=0](@1362,@1363,@1068) -> int64_type, {3}, {1}, target_id=0
83864-@1366 = unsqueeze[axes={0},steps={}](@1364) -> int64_type, {1}, {1}, target_id=0
83945-@1367 = concat[axis=0](@1069,@1366) -> int64_type, {2}, {1}, target_id=0
84018-@1368 = reshape[dims={-1, 768}](@1358) -> float_type, {1, 768}, {768, 1}, target_id=0
84104-@1369 = reshape[dims={768}](@1368) -> float_type, {768}, {1}, target_id=0
84178-@1370 = concat[axis=0](@1369,@372) -> float_type, {769}, {1}, target_id=0
84252-@1371 = reduce_max[axes={0}](@1370) -> float_type, {1}, {1}, target_id=0
84325-@1372 = reduce_min[axes={0}](@1370) -> float_type, {1}, {1}, target_id=0
84398-@1373 = sub(@1371,@1372) -> float_type, {1}, {1}, target_id=0
84460-@1374 = div(@1373,@371) -> float_type, {1}, {1}, target_id=0
84521-@1375 = sub(@370,@1372) -> float_type, {1}, {1}, target_id=0
84582-@1376 = div(@1375,@1374) -> float_type, {1}, {1}, target_id=0
84644-@1377 = clip(@1376,@370,@369) -> float_type, {1}, {1}, target_id=0
84711-@1378 = nearbyint(@1377) -> float_type, {1}, {1}, target_id=0
84773-@1379 = convert[target_type=4](@1378) -> uint8_type, {1}, {1}, target_id=0
84848-@1380 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1374) -> float_type, {1, 768}, {0, 0}, target_id=0
84958-@1381 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1379) -> uint8_type, {1, 768}, {0, 0}, target_id=0
85068-@1382 = quantizelinear(@1368,@1380,@1381) -> uint8_type, {1, 768}, {768, 1}, target_id=0
85157-@1383 = mul(@1374,@1227) -> float_type, {1}, {1}, target_id=0
85219:@1384 = quant_dot(@1382,@1225) -> int32_type, {1, 2304}, {2304, 1}, target_id=0

@TedThemistokleous
Collaborator Author

Looks like this uint8 is sneaking in from how we handle DynamicQuantizeLinear. Not sure why we're assuming the type is supposed to be uint8 here instead of int8. Will need to investigate further next week.
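
For context, the ONNX spec defines DynamicQuantizeLinear with a uint8 output and a uint8 zero point, which is where that type comes from; a numpy transcription of the spec's formulas (illustrative only, no guard for a degenerate zero range):

import numpy as np

def dynamic_quantize_linear(x):
    # ONNX DynamicQuantizeLinear: uint8 quantization with the range adjusted to include 0.
    qmin, qmax = 0.0, 255.0
    x_min = min(0.0, float(x.min()))
    x_max = max(0.0, float(x.max()))
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = np.clip(np.rint((qmin - x_min) / scale), qmin, qmax)
    y = np.clip(np.rint(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return y, np.float32(scale), np.uint8(zero_point)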

Changing the output target type for the zero point to int8 around line 140 in parse_dynamicquantizelinear seems to fix the uint8 insertion now:

83303-@1359 = gather[axis=0](@375,@1074) -> int64_type, {1}, {0}, target_id=0
83375-@1360 = gather[axis=0](@374,@1073) -> int64_type, {1}, {0}, target_id=0
83447-@1361 = slice[axes={0},starts={-1},ends={9223372036854775807}](@373) -> int64_type, {1}, {1}, target_id=0
83553-@1362 = unsqueeze[axes={0},steps={}](@1359) -> int64_type, {1}, {1}, target_id=0
83634-@1363 = unsqueeze[axes={0},steps={}](@1360) -> int64_type, {1}, {1}, target_id=0
83715-@1364 = squeeze[axes={0}](@1361) -> int64_type, {1}, {0}, target_id=0
83785-@1365 = concat[axis=0](@1362,@1363,@1068) -> int64_type, {3}, {1}, target_id=0
83864-@1366 = unsqueeze[axes={0},steps={}](@1364) -> int64_type, {1}, {1}, target_id=0
83945-@1367 = concat[axis=0](@1069,@1366) -> int64_type, {2}, {1}, target_id=0
84018-@1368 = reshape[dims={-1, 768}](@1358) -> float_type, {1, 768}, {768, 1}, target_id=0
84104-@1369 = reshape[dims={768}](@1368) -> float_type, {768}, {1}, target_id=0
84178-@1370 = concat[axis=0](@1369,@372) -> float_type, {769}, {1}, target_id=0
84252-@1371 = reduce_max[axes={0}](@1370) -> float_type, {1}, {1}, target_id=0
84325-@1372 = reduce_min[axes={0}](@1370) -> float_type, {1}, {1}, target_id=0
84398-@1373 = sub(@1371,@1372) -> float_type, {1}, {1}, target_id=0
84460-@1374 = div(@1373,@371) -> float_type, {1}, {1}, target_id=0
84521-@1375 = sub(@370,@1372) -> float_type, {1}, {1}, target_id=0
84582-@1376 = div(@1375,@1374) -> float_type, {1}, {1}, target_id=0
84644-@1377 = clip(@1376,@370,@369) -> float_type, {1}, {1}, target_id=0
84711-@1378 = nearbyint(@1377) -> float_type, {1}, {1}, target_id=0
84773-@1379 = convert[target_type=5](@1378) -> int8_type, {1}, {1}, target_id=0
84847-@1380 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1374) -> float_type, {1, 768}, {0, 0}, target_id=0
84957-@1381 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1379) -> int8_type, {1, 768}, {0, 0}, target_id=0
85066-@1382 = quantizelinear(@1368,@1380,@1381) -> uint8_type, {1, 768}, {768, 1}, target_id=0
85155-@1383 = mul(@1374,@1227) -> float_type, {1}, {1}, target_id=0
85217:@1384 = quant_dot(@1382,@1225) -> int32_type, {1, 2304}, {2304, 1}, target_id=0

@TedThemistokleous
Collaborator Author

Added a convert step at the end of parse_dynamicquantizelinear to handle this, as we'll otherwise bump up against the MLIR converts.

Upscaled to int16 before doing the convert, to handle saturation prior to the convert to int8 (uint8 -> int16, subtract 127 -> int8); a numpy sketch of this is below.
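
A numpy illustration of the widening step: narrowing a uint8 zero point above 127 straight to int8 wraps around, while widening to int16, shifting, and saturating keeps a meaningful value (the shift constant follows the thread; purely illustrative):

import numpy as np

zp_u8 = np.array([200], dtype=np.uint8)                  # a uint8 zero point above 127

wrapped   = zp_u8.astype(np.int8)                        # direct narrowing wraps to -56
widened   = zp_u8.astype(np.int16) - 127                 # widen to int16, then shift
saturated = np.clip(widened, -128, 127).astype(np.int8)  # saturate before narrowing: 73

print(wrapped, saturated)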

Still seeing a perf drop though. Need to go over in the morning whether I need to add more to simplify_qdq.

Block performing the conversion

@1422 = sub(@1420,@1421) -> float_type, {1}, {1}, target_id=0
@1423 = div(@1422,@420) -> float_type, {1}, {1}, target_id=0
@1424 = sub(@419,@1421) -> float_type, {1}, {1}, target_id=0
@1425 = div(@1424,@1423) -> float_type, {1}, {1}, target_id=0
@1426 = clip(@1425,@419,@418) -> float_type, {1}, {1}, target_id=0
@1427 = nearbyint(@1426) -> float_type, {1}, {1}, target_id=0
@1428 = convert[target_type=4](@1427) -> uint8_type, {1}, {1}, target_id=0
@1429 = convert[target_type=7](@1428) -> int16_type, {1}, {1}, target_id=0
@1430 = add(@1429,@417) -> int16_type, {1}, {1}, target_id=0
@1431 = convert[target_type=5](@1430) -> int8_type, {1}, {1}, target_id=0
@1432 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1423) -> float_type, {1, 768}, {0, 0}, target_id=0
@1433 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1431) -> int8_type, {1, 768}, {0, 0}, target_id=0
@1434 = quantizelinear[out_type=nullopt](@1417,@1432,@1433) -> int8_type, {1, 768}, {768, 1}, target_id=0

Perf output

Summary:
gpu::code_object::reduce_min_kernel: 2.06482ms / 49 = 0.0421391ms, 11%
gpu::code_object::reduce_max_sub_mul_kernel: 2.0575ms / 49 = 0.0419897ms, 11%
gpu::code_object::mul_quantizelinear_kernel: 1.67524ms / 48 = 0.0349009ms, 9%
gpu::code_object::mul_kernel: 1.59728ms / 50 = 0.0319455ms, 9%
gpu::code_object::mlir_quant_dot: 1.35165ms / 47 = 0.0287586ms, 8%
gpu::code_object::quantizelinear_kernel: 1.20892ms / 37 = 0.0326734ms, 7%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.16375ms / 49 = 0.0237501ms, 7%
gpu::code_object::concat_kernel: 1.15656ms / 49 = 0.0236034ms, 7%
gpu::code_object::convert_kernel: 1.15122ms / 50 = 0.0230244ms, 6%
gpu::code_object::contiguous_kernel: 1.13786ms / 48 = 0.0237053ms, 6%
gpu::code_object::neg_div_clip_nearbyint_add_kernel: 1.13631ms / 49 = 0.0231901ms, 6%
gpu::code_object::layernorm_mul_add_kernel: 0.585409ms / 24 = 0.0243921ms, 4%
gpu::code_object::dequantizelinear_add_add_kernel: 0.538739ms / 23 = 0.0234234ms, 3%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.389363ms / 13 = 0.029951ms, 3%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.363619ms / 13 = 0.0279707ms, 2%
load: 0.360501ms / 600 = 0.000600835ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.295485ms / 12 = 0.0246237ms, 2%
gpu::code_object::dequantizelinear_add_mul_mul_mul_mul_add_neg_sub_exp_add_div_mul_kernel: 0.288516ms / 12 = 0.024043ms, 2%
multibroadcast: 0.251145ms / 296 = 0.000848464ms, 2%
reshape_lazy: 0.128273ms / 180 = 0.00071263ms, 1%
hip::hip_copy_literal: 0.104623ms / 151 = 0.00069287ms, 1%
transpose: 0.063942ms / 48 = 0.00133213ms, 1%
slice: 0.0456442ms / 36 = 0.00126789ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0248973ms / 1 = 0.0248973ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0237355ms / 1 = 0.0237355ms, 1%
gpu::code_object::gather_kernel: 0.0237274ms / 1 = 0.0237274ms, 1%
@param: 0.0103636ms / 26 = 0.0003986ms, 1%
hip::hip_allocate_memory: 0.0011818ms / 1 = 0.0011818ms, 1%
check_context::migraphx::gpu::context: 0.0008362ms / 1 = 0.0008362ms, 1%

Batch size: 1
Rate: 157.206 inferences/sec
Total time: 6.36109ms
Total instructions time: 19.2011ms
Overhead time: 0.435307ms, -12.84ms
Overhead: 7%, -202%
[ MIGraphX Version: 2.9.0. ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --int8

Hey @pfultz2, any idea on how best to speed this one up? Should our quantize also be in MLIR here, not just the dequantize?

@TedThemistokleous
Collaborator Author

Latest changes in the PR seem to speed things up (remove flattening via reshape/concat, serialize min/max operations)

This appears to create about a 20% speedup alone relative to the original run on the int8 model.

Summary:
gpu::code_object::reduce_min_min_kernel: 2.04025ms / 49 = 0.0416377ms, 13%
gpu::code_object::reduce_max_max_sub_mul_kernel: 2.03854ms / 49 = 0.0416029ms, 13%
gpu::code_object::mul_quantizelinear_kernel: 1.66959ms / 48 = 0.0347831ms, 10%
gpu::code_object::mlir_quant_dot: 1.34554ms / 47 = 0.0286286ms, 9%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.16123ms / 49 = 0.0236985ms, 7%
gpu::code_object::convert_kernel: 1.14344ms / 50 = 0.0228688ms, 7%
gpu::code_object::div_neg_clip_nearbyint_kernel: 1.12494ms / 49 = 0.022958ms, 7%
gpu::code_object::mul_kernel: 1.11627ms / 49 = 0.022781ms, 7%
gpu::code_object::contiguous_kernel: 0.845617ms / 36 = 0.0234894ms, 6%
gpu::code_object::quantizelinear_kernel: 0.835653ms / 36 = 0.0232126ms, 5%
gpu::code_object::layernorm_mul_add_kernel: 0.583655ms / 24 = 0.024319ms, 4%
gpu::code_object::dequantizelinear_add_add_kernel: 0.537387ms / 23 = 0.0233647ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.388378ms / 13 = 0.0298752ms, 3%
load: 0.332797ms / 537 = 0.000619734ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.29401ms / 12 = 0.0245009ms, 2%
gpu::code_object::dequantizelinear_add_mul_mul_mul_mul_add_neg_sub_exp_add_div_mul_kernel: 0.286982ms / 12 = 0.0239152ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.280409ms / 12 = 0.0233674ms, 2%
multibroadcast: 0.252798ms / 295 = 0.000856943ms, 2%
hip::hip_copy_literal: 0.105973ms / 150 = 0.000706487ms, 1%
reshape_lazy: 0.0981803ms / 131 = 0.000749468ms, 1%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul: 0.0888045ms / 1 = 0.0888045ms, 1%
transpose: 0.059129ms / 48 = 0.00123185ms, 1%
slice: 0.0458985ms / 36 = 0.00127496ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0248396ms / 1 = 0.0248396ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0239322ms / 1 = 0.0239322ms, 1%
gpu::code_object::gather_kernel: 0.0237319ms / 1 = 0.0237319ms, 1%
@param: 0.0106455ms / 26 = 0.000409442ms, 1%
check_context::migraphx::gpu::context: 0.0011866ms / 1 = 0.0011866ms, 1%
hip::hip_allocate_memory: 0.00104084ms / 1 = 0.00104084ms, 1%

Batch size: 1
Rate: 191.057 inferences/sec
Total time: 5.23404ms
Total instructions time: 16.7609ms
Overhead time: 0.382316ms, -11.5268ms
Overhead: 7%, -220%
[ MIGraphX Version: 2.9.0. ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --disable-fast-math --int8

Still seeing a large amount of time spent in the reduction min/max steps.

Curious whether the block above, before the quantizelinear, can be fused, since it adds a significant amount of time to the run:

gpu::code_object::reduce_min_min_kernel: 2.04496ms / 49 = 0.0417338ms, 13%
gpu::code_object::reduce_max_max_sub_mul_kernel: 2.04381ms / 49 = 0.0417105ms, 13%
gpu::code_object::mul_quantizelinear_kernel: 1.68006ms / 48 = 0.0350012ms, 10%
gpu::code_object::mlir_quant_dot: 1.35275ms / 47 = 0.028782ms, 9%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.16634ms / 49 = 0.0238029ms, 7%
gpu::code_object::convert_kernel: 1.1519ms / 50 = 0.0230379ms, 7%
gpu::code_object::div_neg_clip_nearbyint_kernel: 1.13642ms / 49 = 0.0231923ms, 7%
gpu::code_object::mul_kernel: 1.1247ms / 49 = 0.0229531ms, 7%

@TedThemistokleous TedThemistokleous linked a pull request Mar 18, 2024 that will close this issue
@TedThemistokleous
Collaborator Author

@pfultz2 the gpt2 model this issue stemmed from has the following pattern repeated everywhere as part of the inserted dynamic quantization step:

[image: repeated dynamic quantization subgraph]

@TedThemistokleous TedThemistokleous linked a pull request Mar 19, 2024 that will close this issue
@TedThemistokleous
Collaborator Author

Have initial changes up after also reworking MatMulInteger, following a discussion with @pfultz2.

@causten we're seeing about a 30% increase once we're properly handling the input as quant_dot instead of just dots for the onnx model.
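
For reference, MatMulInteger is an integer GEMM with the zero points subtracted first and an int32 accumulator, which is what quant_dot maps to; a numpy transcription of the op (illustrative only):

import numpy as np

def matmul_integer(a, b, a_zero_point=0, b_zero_point=0):
    # ONNX MatMulInteger: subtract zero points, accumulate in int32;
    # the int32 result is rescaled later by a dequantizelinear.
    a32 = a.astype(np.int32) - np.int32(a_zero_point)
    b32 = b.astype(np.int32) - np.int32(b_zero_point)
    return a32 @ b32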

Summary of a run with only the change to the MatMulInteger parser. Toggling disable-fast-math on/off gives around the same ballpark of speedup (211-212 QPS).

Summary:
gpu::code_object::reduce_min_kernel: 2.06349ms / 49 = 0.042112ms, 14%
gpu::code_object::reduce_max_sub_mul_kernel: 2.05264ms / 49 = 0.0418906ms, 13%
gpu::code_object::mlir_quant_dot: 1.87745ms / 61 = 0.0307779ms, 12%
gpu::code_object::concat_kernel: 1.15502ms / 49 = 0.0235718ms, 8%
gpu::code_object::quantizelinear_sub_convert_add_convert_kernel: 1.14867ms / 49 = 0.0234422ms, 8%
gpu::code_object::mul_kernel: 1.13639ms / 49 = 0.0231916ms, 8%
gpu::code_object::neg_div_clip_nearbyint_convert_kernel: 1.13261ms / 49 = 0.0231144ms, 8%
gpu::code_object::contiguous_kernel: 1.12907ms / 48 = 0.0235223ms, 8%
gpu::code_object::quantizelinear_kernel: 0.836898ms / 36 = 0.0232472ms, 6%
gpu::code_object::layernorm_mul_add_kernel: 0.585637ms / 24 = 0.0244015ms, 4%
gpu::code_object::convert_mul_add_add_kernel: 0.545402ms / 23 = 0.0237131ms, 4%
gpu::code_object::convert_mul_add_kernel: 0.310343ms / 13 = 0.0238725ms, 2%
load: 0.307436ms / 515 = 0.000596963ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.294008ms / 12 = 0.0245007ms, 2%
gpu::code_object::convert_mul_add_mul_mul_add_mul_exp_add_div_kernel: 0.28876ms / 12 = 0.0240634ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.285526ms / 12 = 0.0237939ms, 2%
multibroadcast: 0.220696ms / 246 = 0.000897138ms, 2%
reshape_lazy: 0.119711ms / 180 = 0.000665061ms, 1%
hip::hip_copy_literal: 0.100321ms / 151 = 0.00066438ms, 1%
transpose: 0.0685118ms / 48 = 0.00142733ms, 1%
slice: 0.0486096ms / 36 = 0.00135027ms, 1%
gpu::code_object::convert_mul_kernel: 0.0291348ms / 1 = 0.0291348ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0249158ms / 1 = 0.0249158ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0237881ms / 1 = 0.0237881ms, 1%
gpu::code_object::gather_kernel: 0.023749ms / 1 = 0.023749ms, 1%
gpu::code_object::convert_kernel: 0.0230378ms / 1 = 0.0230378ms, 1%
@param: 0.00976646ms / 26 = 0.000375633ms, 1%
hip::hip_allocate_memory: 0.000916ms / 1 = 0.000916ms, 1%
check_context::migraphx::gpu::context: 0.000771ms / 1 = 0.000771ms, 1%

Batch size: 1
Rate: 212.998 inferences/sec
Total time: 4.69488ms
Total instructions time: 15.8433ms
Overhead time: 0.376437ms, -11.1484ms
Overhead: 8%, -237%
[ MIGraphX Version: 2.10.0. ] Complete: bin/driver perf ../int8_models/gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --disable-fast-math --int8

Changes pushed to #2903.

@TedThemistokleous
Collaborator Author

TedThemistokleous commented Mar 20, 2024

Seeing a larger speedup with the MatMulInteger (#2903) + DynamicQuantizeLinear (#2896) fixes when running through ORT right now for GPT2. Testing other models through the driver appeared to show the correct speedup as well.

For GPT2 (shown below) it appears we're slightly faster than the fp16 runs now.

int8

root@aus-navi3x-02:/onnxruntime/onnxruntime/python/tools/transformers# python3 benchmark.py -g -m gpt2 --model_class AutoModelForCausalLM  --sequence_length 32 384 --batch_sizes 1 8  --provider=migraphx -p int8 --disable_gelu --disable_layer_norm --disable_attention --disable_skip_layer_norm --disable_embed_layer_norm --disable_bias_skip_layer_norm --disable_bias_gelu -o no_opt
Arguments: Namespace(models=['gpt2'], model_source='pt', model_class='AutoModelForCausalLM', engines=['onnxruntime'], cache_dir='./cache_models', onnx_dir='./onnx_models', use_gpu=True, provider='migraphx', precision=<Precision.INT8: 'int8'>, verbose=False, overwrite=False, optimizer_info=<OptimizerInfo.NOOPT: 'no_opt'>, validate_onnx=False, fusion_csv=None, detail_csv=None, result_csv=None, input_counts=[1], test_times=100, batch_sizes=[1, 8], sequence_lengths=[32, 384], disable_ort_io_binding=False, num_threads=[16], force_num_layers=None, disable_attention=True, disable_skip_layer_norm=True, disable_embed_layer_norm=True, disable_bias_skip_layer_norm=True, disable_bias_gelu=True, disable_layer_norm=True, disable_gelu=True, enable_gelu_approximation=False, disable_shape_inference=False, enable_gemm_fast_gelu=False, use_mask_index=False, use_raw_attention_mask=False, no_attention_mask=False, use_multi_head_attention=False, disable_group_norm=False, disable_skip_group_norm=False, disable_packed_kv=False, disable_packed_qkv=False, disable_bias_add=False, disable_bias_splitgelu=False, disable_nhwc_conv=False, use_group_norm_channels_first=False, disable_rotary_embeddings=False)
OptimizerInfo is set to no_opt, graph optimizations specified in FusionOptions are not applied.
Model class name: AutoModelForCausalLM
Skip export since model existed: ./onnx_models/gpt2_1.onnx
Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2024-03-20 13:29:43.346929', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.96', 'latency_95_percentile': '3.00', 'latency_99_percentile': '3.11', 'average_latency_ms': '2.67', 'QPS': '374.99'}
Run onnxruntime on gpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2024-03-20 13:30:08.887040', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '7.75', 'latency_95_percentile': '7.78', 'latency_99_percentile': '7.81', 'average_latency_ms': '7.54', 'QPS': '132.68'}
Run onnxruntime on gpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2024-03-20 13:30:44.223341', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '5.99', 'latency_95_percentile': '6.31', 'latency_99_percentile': '6.46', 'average_latency_ms': '5.92', 'QPS': '1351.07'}
Run onnxruntime on gpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2024-03-20 13:31:09.392412', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '49.11', 'latency_95_percentile': '49.53', 'latency_99_percentile': '49.77', 'average_latency_ms': '48.40', 'QPS': '165.29'}
Detail results are saved to csv file: benchmark_detail_20240320-133156.csv
Summary results are saved to csv file: benchmark_summary_20240320-133156.csv

fp16 runs

root@aus-navi3x-02:/onnxruntime/onnxruntime/python/tools/transformers# python3 benchmark.py -g -m gpt2 --model_class AutoModelForCausalLM  --sequence_length 32 384 --batch_sizes 1 8  --provider=migraphx -p fp16 --disable_gelu --disable_layer_norm --disable_attention --disable_skip_layer_norm --disable_embed_layer_norm --disable_bias_skip_layer_norm --disable_bias_gelu -o no_opt
Arguments: Namespace(models=['gpt2'], model_source='pt', model_class='AutoModelForCausalLM', engines=['onnxruntime'], cache_dir='./cache_models', onnx_dir='./onnx_models', use_gpu=True, provider='migraphx', precision=<Precision.FLOAT16: 'fp16'>, verbose=False, overwrite=False, optimizer_info=<OptimizerInfo.NOOPT: 'no_opt'>, validate_onnx=False, fusion_csv=None, detail_csv=None, result_csv=None, input_counts=[1], test_times=100, batch_sizes=[1, 8], sequence_lengths=[32, 384], disable_ort_io_binding=False, num_threads=[16], force_num_layers=None, disable_attention=True, disable_skip_layer_norm=True, disable_embed_layer_norm=True, disable_bias_skip_layer_norm=True, disable_bias_gelu=True, disable_layer_norm=True, disable_gelu=True, enable_gelu_approximation=False, disable_shape_inference=False, enable_gemm_fast_gelu=False, use_mask_index=False, use_raw_attention_mask=False, no_attention_mask=False, use_multi_head_attention=False, disable_group_norm=False, disable_skip_group_norm=False, disable_packed_kv=False, disable_packed_qkv=False, disable_bias_add=False, disable_bias_splitgelu=False, disable_nhwc_conv=False, use_group_norm_channels_first=False, disable_rotary_embeddings=False)
OptimizerInfo is set to no_opt, graph optimizations specified in FusionOptions are not applied.
Model class name: AutoModelForCausalLM
Skip export since model existed: ./onnx_models/gpt2_1.onnx
Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2024-03-20 13:35:52.919367', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.76', 'latency_95_percentile': '2.78', 'latency_99_percentile': '2.79', 'average_latency_ms': '2.71', 'QPS': '368.84'}
Run onnxruntime on gpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2024-03-20 13:36:16.149473', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '7.67', 'latency_95_percentile': '7.69', 'latency_99_percentile': '7.72', 'average_latency_ms': '7.49', 'QPS': '133.57'}
Run onnxruntime on gpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2024-03-20 13:36:48.681642', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '6.07', 'latency_95_percentile': '6.33', 'latency_99_percentile': '6.52', 'average_latency_ms': '6.00', 'QPS': '1334.37'}
Run onnxruntime on gpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2024-03-20 13:37:10.933650', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '48.85', 'latency_95_percentile': '49.07', 'latency_99_percentile': '49.15', 'average_latency_ms': '47.97', 'QPS': '166.79'}
Detail results are saved to csv file: benchmark_detail_20240320-133755.csv
Summary results are saved to csv file: benchmark_summary_20240320-133755.csv
