Enable ORT accuracy tests to verify int8 #1904
Seems like #2300 will aid in accuracy for these runs. |
Reopening, need to rerun testing with these changes + update the ORT EP |
Resnet50 runs; need to do further analysis to compare it to fp16 through the same pipeline. Plan of attack:
Added models will leverage existing e2e code; may need to write/borrow code for benchmark.py that's used in parity checks. |
Doing this as part of QA validation for the resnet50 pipeline added in onnxruntime-inference-examples |
Got DLM changes for benchmark.py, but they're failing off 6.0; sorting out issues with the run scripts. |
Tracked by JIRA: https://ontrack-internal.amd.com/browse/SWDEV-410597 |
Able to get bert-large, bert-based-cased, and distilgpt2 model runs into DLM by reusing existing runs. Missing GPT2 as referenced in #1905; will need to add gpt2 + an equivalent int8 run. We're letting onnxruntime do the quantization of the model before we do a run through the MIGraphX EP right now. Running these by hand to verify:
bert_large_uncased int8 Quantized
bert_based_cased int8 Quantized
distilgpt2 int8 Quantized
|
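For reference, a minimal sketch of the ORT-side quantization step described above, using onnxruntime's dynamic quantization API; the file names are placeholders and the exact options used in the DLM runs may differ.

```python
# Quantize an exported fp32 ONNX model to int8 with onnxruntime before
# handing it to the MIGraphX EP (paths below are placeholders).
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="distilgpt2.onnx",
    model_output="distilgpt2_int8.onnx",
    weight_type=QuantType.QInt8,  # int8 weights; activations get DynamicQuantizeLinear nodes
)
```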
Got a gpt2 run with int8 quant here.
|
Changes pushed into #2468 |
Not seeing the proper code path when running a trace compile for the tests in DLM with MIGRAPHX_TRACE_EVAL=1, after investigating a large drop in outputs compared to the fp16 versions. Sorting this out before I close this out. |
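A minimal sketch of reproducing that trace from Python, assuming a placeholder model path and input name; MIGRAPHX_TRACE_EVAL is the variable referenced above and must be set before the EP compiles the program.

```python
import os
os.environ["MIGRAPHX_TRACE_EVAL"] = "1"  # set before the session is created

import numpy as np
import onnxruntime as ort

# Placeholder model path and input name, only to trigger a compile + eval
# so the trace is printed.
sess = ort.InferenceSession("distilgpt2_int8.onnx",
                            providers=["MIGraphXExecutionProvider"])
sess.run(None, {"input_ids": np.ones((1, 16), dtype=np.int64)})
```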
Seeing about an order of magnitude drop between fp16 and int8 runs, e.g. distilgpt2 below
|
It appears we're bouncing between all 3 EPs when doing int8 runs. Attempting to run the model in the driver, I'm seeing the following:
Which would most likely be why we don't find that op in the EP and then fall back, with the rest of the nodes put onto the other EPs. |
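One way to confirm that placement from Python is verbose session logging, which prints the graph partitioning across the EPs; the model path below is a placeholder.

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0  # VERBOSE: logs which nodes land on which EP

sess = ort.InferenceSession(
    "gpt2_int8.onnx",  # placeholder path
    sess_options=so,
    providers=[
        "MIGraphXExecutionProvider",
        "ROCMExecutionProvider",
        "CPUExecutionProvider",
    ],
)
print(sess.get_providers())
```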
Hmm we added support for that operator |
Looks like we're using an older version.
Also, DynamicQuantizeLinear isn't in the EP's list of ops. Have a patch coming to add it in. |
Got fixes up to here: Upstream - microsoft/onnxruntime#18798. Running a test with latest develop + the change in onnxruntime + the DLM container for the gpt2 int8 test. Will try to run the other models and analyze once complete as well. Using this to build the end-to-end pipeline
|
@causten for the MIGraphX side, looking at develop, you're right, we should have this op in. Need to figure out where APT is getting things here. |
A build off develop seems to read the int8 model correctly
From the ORT benchmark test the fp16 run got
edit
Trying a perf run with only int8 on in the driver, we're seeing a value on our end that's only
A solely-fp16 run gives us
With mixed int8 and fp16 we get the following
After building DynamicQuantizeLinear into the MIGraphX EP + a newer develop, we're seeing the following.
|
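For reference, a sketch of toggling the fp16/int8 behaviour of the EP from Python via provider options; the option names below are the ones documented for the MIGraphX EP in recent ONNX Runtime releases and may differ on older builds, and the model path is a placeholder.

```python
import onnxruntime as ort

providers = [(
    "MIGraphXExecutionProvider",
    {
        "device_id": 0,
        "migraphx_fp16_enable": 1,  # enable one or both to mirror the runs above
        "migraphx_int8_enable": 1,
    },
)]
sess = ort.InferenceSession("gpt2.onnx", providers=providers)  # placeholder path
```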
I think it is related to fast math. Running fp16 on our driver as a baseline,
which, similar to the fp16 run through the EP, is around 549 QPS; in the EP we have fast math off by default right now due to the accuracy issue we saw previously. |
Running this off latest develop, I'm seeing this error now when trying to run the latest int8 model. Rolling back the support added when DynamicQuantizeLinear was added into the opset two days ago.
|
I think this is related to migraphx::shape::uint8_type cropping up in the output. If I loosen the same_type constraint in quant_dot to only the fp8 types for now, I can get gpt2 to read correctly like before. It appears either we're not adding uint8 as part of our supported types, or we're hitting a case where the convert listed above happens after we perform the quant, so when we go to compute_shape() we fail. I'll add an eliminate_data_type pass and see if this helps to convert uint8->int8, albeit I think we'd need to be concerned about narrowing here for accuracy; currently building something to test this. |
Runs seem to still fail. Without the eliminate_data_type pass for uint8, I'm getting rocBLAS failures. With the added pass I get
Pushed up changes to |
Tried some more stuff, taking a look at quantizelinear, which defaults to uint8 if we only have two input ops. Running the following after a read gives me this output, where we're seeing the uint8 type popping up:
|
Looks like this uint8 is sneaking in from how we handle DynamicQuantizeLinear. Not sure why we're assuming the type is supposed to be uint8 here instead of int8. Will need to investigate further next week. Changing the output target type for the zero point to uint type around line 140 in parse_dynamicquantizelinear seems to fix inserting uint8 now
|
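For context, the ONNX spec defines DynamicQuantizeLinear with a uint8 output and a uint8 zero point, which is where the uint8 above enters the graph. A numpy sketch of the spec's reference computation (not MIGraphX code):

```python
import numpy as np

def dynamic_quantize_linear(x: np.ndarray):
    qmin, qmax = 0.0, 255.0
    # The quantization range must include zero.
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # guard all-zero input in this sketch
    zero_point = np.uint8(np.clip(round(qmin - x_min / scale), qmin, qmax))
    y = np.clip(np.rint(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return y, np.float32(scale), zero_point
```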
Added a convert step at the end of parse_dynamicquantizelinear to handle this, as we'll bump up against MLIR converts. Upscaled to int16 before doing the convert, to handle saturation before the convert to int8 (uint8 -> int16 - 127 -> int8). Still seeing a perf drop though. Need to go over in the morning whether I need to add more to simplify_qdq. Block performing the conversion:
Perf output:
Hey @pfultz2, got any idea on how best to speed this one up? Should our quantize also be in MLIR here, not just the dequantize? |
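For illustration only, a numpy sketch of the rebias described above: widen the uint8 values to int16, subtract the offset, saturate, then narrow to int8. The offset and bounds follow the comment's description rather than the actual parser code.

```python
import numpy as np

def rebias_uint8_to_int8(y_u8: np.ndarray, offset: int = 127) -> np.ndarray:
    widened = y_u8.astype(np.int16) - np.int16(offset)   # avoid wraparound during the shift
    return np.clip(widened, -128, 127).astype(np.int8)   # saturate before narrowing
```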
Latest changes in the PR seem to speed things up (removing flattening via reshape/concat, serializing min/max operations). This appears to create about a 20% speedup alone relative to the original run on the int8 model.
Still seeing a lot of time spent in the reduction min/max steps. Curious whether the above block before the quantizelinear can be fused, since it adds a significant amount of time to the run.
|
@pfultz2 the gpt2 model this issue stemmed from has the following repeated everywhere as part of the inserted dynamic quantization step |
Have initial changes after also reworking MatMulInteger after talking with @pfultz2. @causten we're seeing about a 30% increase once we're properly handling the input as quant_dot instead of just dot for the onnx model. Summary of a run with only the change to the MatMulInteger parser: with disable fast math on/off we get around the same ballpark of a speedup (211-212 QPS).
Changes pushed to: #2903 |
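For context, the ONNX MatMulInteger contract that makes quant_dot the natural mapping: 8-bit inputs with zero points and int32 accumulation. A numpy reference (no MIGraphX internals assumed):

```python
import numpy as np

def matmul_integer(a_u8, b_i8, a_zero_point=0, b_zero_point=0):
    a = a_u8.astype(np.int32) - np.int32(a_zero_point)
    b = b_i8.astype(np.int32) - np.int32(b_zero_point)
    return a @ b  # accumulated in int32, as a quant_dot would be
```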
Seeing a larger speedup with MatMulInteger (#2903) + fixes to DynamicQuantizeLinear (#2896) when running through ORT right now for GPT2. Testing other models through the driver appeared to show the correct speedup as well. For GPT2 (shown below) it appears we're slightly faster than the fp16 runs now. int8 runs
fp16 runs
|
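A rough, hypothetical sketch of the kind of QPS comparison described above; the model paths, input name, and shape are placeholders, and the reported numbers come from the ORT benchmark scripts rather than this snippet.

```python
import time
import numpy as np
import onnxruntime as ort

def qps(model_path, iters=100):
    sess = ort.InferenceSession(model_path, providers=["MIGraphXExecutionProvider"])
    feed = {"input_ids": np.ones((1, 32), dtype=np.int64)}  # placeholder input
    sess.run(None, feed)  # warm-up / compile
    start = time.perf_counter()
    for _ in range(iters):
        sess.run(None, feed)
    return iters / (time.perf_counter() - start)

print("int8:", qps("gpt2_int8.onnx"))
print("fp16:", qps("gpt2_fp16.onnx"))
```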
rocMLIR will be added to MIGraphX. This will be for all data types, but this issue will be to ensure the existing tests in DLM pass using int8.