[Performance] Getting NaN values when I block all the nodes in fp16 conversion #21345
Comments
The solution is to run some nodes in fp32 instead of fp16. It is easy to find out which nodes shall be kept in fp32 by looking at the output statistics: if a value falls outside the range [-65504, 65504], that node needs fp32. You can build from source and add the statistics dump; the console output will then show the statistics of each node.
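As an illustration of that workflow, here is a minimal sketch that exposes every intermediate tensor of the fp32 model as a graph output and reports which ones exceed the fp16 range; the model path is a placeholder and the random feed only stands in for real input data.

import numpy as np
import onnx
import onnxruntime

FP16_MAX = 65504.0

# Placeholder path; use your own fp32 model.
model = onnx.shape_inference.infer_shapes(onnx.load("my_fp32_onnx_model.onnx"))

# Expose every intermediate tensor (with an inferred type) as a graph output.
existing = {o.name for o in model.graph.output}
value_infos = {vi.name: vi for vi in model.graph.value_info}
for node in model.graph.node:
    for out in node.output:
        if out in value_infos and out not in existing:
            model.graph.output.append(value_infos[out])

sess = onnxruntime.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])

# Random placeholder feed; replace with real inputs to get meaningful statistics.
feed = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    feed[inp.name] = np.random.randn(*shape).astype(np.float32)

for out_info, value in zip(sess.get_outputs(), sess.run(None, feed)):
    arr = np.asarray(value)
    if arr.dtype in (np.float32, np.float64) and np.abs(arr).max() > FP16_MAX:
        print(f"{out_info.name}: max abs value {np.abs(arr).max():.3e} exceeds the fp16 range")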
Hi, thanks for your reply. I followed your suggestion, identified some of the nodes whose outputs exceeded the range [-65504, 65504], and blocked them from being converted from fp32 to fp16. However, the output of the fp16 model is still far from the fp32 model. I looked into the statistics again and observed that some of the values at the output nodes turned into Inf, so I believe the problem is the Cast layer, because those values exceed the fp16 range. Do you have any ideas on how to resolve this?
@jinhonglu, the keep_io_types parameter can be a list of input and output names that should stay in fp32 instead of being converted to fp16. You can add the output name to that list.
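For illustration, such a call might look like the following sketch; the file path and output name are hypothetical, and the float16 module may come from onnxruntime.transformers or onnxconverter_common depending on your setup.

import onnx
from onnxruntime.transformers import float16  # or: from onnxconverter_common import float16

model = onnx.load("my_fp32_onnx_model.onnx")
model_fp16 = float16.convert_float_to_float16(
    model,
    keep_io_types=["mask_output"],  # hypothetical graph output name to keep in fp32
)
onnx.save(model_fp16, "fp16_model.onnx")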
Thanks, the fp16 model now shows only a small difference from the fp32 model (about [0.0004, 0.001], compared with 0.29 before). However, at inference time the fp16 model's results are still far from the fp32 model's on real data. By the way, I am working on an audio task: I apply a mask computed from the model's output and then run an iSTFT, and I suspect the remaining difference is still too large for audio. The current conversion code is below.
Later I found out that, since the fp16 model was converted on CPU, running it on CPU gives results that are mostly the same as the fp32 model. But when I ran the fp16 model on GPU, the results were completely different. I then rebuilt onnxruntime with GPU support (CUDAExecutionProvider), reran the fp32 model to find the node names to put in keep_io_types, converted to fp16, and ran it with onnxruntime-gpu. The fp16 statistics again contain Inf values. What is wrong here?
@jinhonglu, could you use the same input tensors, dump the CPU inference stdout to one text file, redirect the GPU inference console output to another text file, and share both files? We can compare the results to find out which node/operator causes the difference.
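A minimal sketch of such a comparison, running identical inputs through the CPU and CUDA execution providers and reporting the largest mismatch (the model path is a placeholder and the random feed only stands in for real data):

import numpy as np
import onnxruntime

MODEL = "fp16_model.onnx"  # placeholder path

cpu_sess = onnxruntime.InferenceSession(MODEL, providers=["CPUExecutionProvider"])
gpu_sess = onnxruntime.InferenceSession(MODEL, providers=["CUDAExecutionProvider"])

# Placeholder feed; replace with the real audio inputs.
feed = {}
for inp in cpu_sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    feed[inp.name] = np.random.randn(*shape).astype(np.float32)

names = [o.name for o in cpu_sess.get_outputs()]
for name, c, g in zip(names, cpu_sess.run(None, feed), gpu_sess.run(None, feed)):
    diff = np.abs(np.asarray(c, dtype=np.float32) - np.asarray(g, dtype=np.float32))
    print(f"{name}: max abs diff {diff.max():.3e}, mean abs diff {diff.mean():.3e}")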
Both files are uploaded (fp16_model_cpu_stat.txt and the CUDA statistics file). For your convenience, I have also listed the nodes/operators that exceeded the range under CPU and under CUDA.
Looking at the CUDA text file, I realise the first Inf occurs at the output named '/band_split/to_features.25/to_features.25.0/ReduceSum_output_0'. But this node should not have been converted to fp16, as it is in the keep_io_types list. Is this caused by max_finite_val?
@jinhonglu, the keep_io_types list only applies to graph inputs and outputs. You can use the other two parameters, op_block_list (a list of operator types such as ["ReduceSum"]) or node_block_list (a list of node names): onnxruntime/onnxruntime/python/tools/transformers/float16.py Lines 190 to 192 in 281ed8c
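For illustration, a conversion that blocks an operator type or a specific node might look like this sketch (the node name is inferred from the output name mentioned above and may differ in your graph; the float16 module may also come from onnxconverter_common):

import onnx
from onnxruntime.transformers import float16  # or: from onnxconverter_common import float16

model = onnx.load("my_fp32_onnx_model.onnx")
model_fp16 = float16.convert_float_to_float16(
    model,
    keep_io_types=True,
    op_block_list=["ReduceSum"],  # block every node of this operator type
    node_block_list=["/band_split/to_features.25/to_features.25.0/ReduceSum"],  # or block individual nodes by name
)
onnx.save(model_fp16, "fp16_model.onnx")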
Currently the fp16 model suffers from back-to-back Cast operations (ReduceSum(fp32) -> Output(fp32) -> Cast(fp16) -> Cast(fp32) -> Pow(fp32)). What I want is to connect the fp32 ReduceSum output directly to the fp32 Pow. I have noticed #8787 and #17953, but the cast remover does not seem to take effect when all optimizers are disabled; my conversion call uses min_positive_val=5.96e-08 and max_finite_val=65504.0, and I create the session with onnxruntime.SessionOptions(). Is there any way to remove these back-to-back Casts? I have tried adding the Cast node names to node_block_list, but it has no effect. This seems to be resolved by https://github.com/microsoft/onnxconverter-common/pull/286
@jinhonglu, thanks for identifying the root cause. You can also follow that approach to work around it. Let me add the same post-processing to float16.convert_float_to_float16 for the next release, 1.19.
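For reference, the idea behind that post-processing can be sketched as follows; this is a simplified illustration of removing fp32 -> fp16 -> fp32 Cast pairs, not the library's actual implementation.

import onnx
from onnx import TensorProto

def remove_back_to_back_casts(model: onnx.ModelProto) -> onnx.ModelProto:
    """Drop Cast(fp32->fp16) followed by Cast(fp16->fp32) pairs left around blocked nodes."""
    graph = model.graph
    producer = {out: n for n in graph.node for out in n.output}
    consumers = {}
    for n in graph.node:
        for name in n.input:
            consumers.setdefault(name, []).append(n)
    graph_outputs = {o.name for o in graph.output}

    rename = {}   # tensor produced by the second Cast -> original fp32 tensor
    to_remove = []
    for node in graph.node:
        if node.op_type != "Cast" or not any(a.name == "to" and a.i == TensorProto.FLOAT for a in node.attribute):
            continue
        prev = producer.get(node.input[0])
        if prev is None or prev.op_type != "Cast":
            continue
        if not any(a.name == "to" and a.i == TensorProto.FLOAT16 for a in prev.attribute):
            continue
        # Only safe when the fp16 intermediate feeds nothing else and neither Cast output is a graph output.
        if len(consumers.get(prev.output[0], [])) != 1 or prev.output[0] in graph_outputs or node.output[0] in graph_outputs:
            continue
        rename[node.output[0]] = prev.input[0]
        to_remove.extend([node, prev])

    for n in graph.node:
        for i, name in enumerate(n.input):
            if name in rename:
                n.input[i] = rename[name]
    for n in to_remove:
        graph.node.remove(n)
    return model

Calling remove_back_to_back_casts(model_fp16) before saving should give a similar effect to the upstream fix for simple cases like the pattern above.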
@tianleiwu Furthermore, do you have any experience converting a mixed-precision ONNX model to a TensorRT engine? I tried the TensorRT provider in onnxruntime with trt_fp16_enable turned on to run the model above, but the engine builder seems to force all nodes to fp16 and is incompatible with a mixed-precision model.
@jinhonglu, you can use the fp32 ONNX model with the TRT EP; you only need to set the trt_fp16_enable flag in the TRT provider options. onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion/engine_builder_ort_trt.py Line 64 in a6c5e2c
For optimization, the ONNX model only needs constant folding and shape inference, since most optimizations are done during engine building inside TRT. Do not overdo it. Example code in the demo: onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion/diffusion_models.py Lines 451 to 455 in a6c5e2c
onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion/engine_builder_ort_trt.py Line 45 in a6c5e2c
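For example, a session over the fp32 model with the TRT EP building fp16 internally might be created like this (paths and cache settings are placeholders):

import onnxruntime

trt_options = {
    "trt_fp16_enable": True,          # let TensorRT build the engine in fp16
    "trt_engine_cache_enable": True,  # cache the built engine between runs
    "trt_engine_cache_path": "./trt_cache",
}

sess = onnxruntime.InferenceSession(
    "my_fp32_onnx_model.onnx",
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",  # fallback for nodes TRT does not handle
    ],
)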
I followed your instructions, but the result from the engine is not correct. The following is my setting:
@jinhonglu, please look at NVIDIA/TensorRT#2691 for other options, such as a plugin or adding constraints with the Polygraphy CLI, to resolve the TensorRT fp16 precision issue. For example, if you create a custom TRT engine with the second approach, you can embed the engine into an ONNX file. Here is a Python script to embed an externally (trtexec) compiled engine file: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/tensorrt/gen_trt_engine_wrapper_onnx_model.py#L156-L187
I have followed https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy/examples/cli/run/08_adding_precision_constraints and passed the comparison test between the fp32 ONNX output and the fp16 engine with my own constraints using Polygraphy. Below is the postprocess constraint I created.
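(The exact script is not reproduced here; the following is an illustrative sketch of such a network postprocess in the style of that Polygraphy example, with a hypothetical layer selection. TensorRT must also be told to obey the constraints, e.g. via BuilderFlag.OBEY_PRECISION_CONSTRAINTS or Polygraphy's precision-constraints option.)

import tensorrt as trt

def postprocess(network: trt.INetworkDefinition) -> None:
    # Pin layers matching the flagged ops to fp32; the name filter is hypothetical.
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if "ReduceSum" in layer.name or "Pow" in layer.name:
            layer.precision = trt.float32
            for j in range(layer.num_outputs):
                layer.set_output_type(j, trt.float32)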
The ops I kept in float32 are based on the ONNX analysis above (or should I do a separate TensorRT analysis instead?). After that, I converted the model and built the engine using Polygraphy.
However, the inference result is still far from the ONNX model's.
@tianleiwu I ran through the previous procedure again. However, the difference between the fp32 and fp16 model outputs is as large as 0.004, whereas the same script on my previous model gives a difference of only 0.0005. I attached the fine-tuned model analysis and the previous model analysis. Any suggestions?
@jinhonglu, based on my experience, a difference of 0.004 between fp32 and fp16 model outputs is acceptable. You can evaluate end-to-end metrics (precision/recall, etc.) to confirm.
Describe the issue
Since the mixed-precision conversion is not working well, I tried to figure out which nodes should be converted to fp16 for the best performance. As a first step, I blocked all the nodes. However, I got NaN outputs from the fp16 model. Ideally, this fp16 model should behave exactly like the fp32 model.
To reproduce
import numpy as np
import onnx
import onnxruntime
import torch
from onnxruntime.transformers import float16  # or: from onnxconverter_common import float16

model = onnx.load("my_fp32_onnx_model")

# Block every node, so the "fp16" model should be numerically identical to fp32.
list_ = [node.name for node in model.graph.node]

model_fp16 = float16.convert_float_to_float16(
    model,
    min_positive_val=1e-7,
    max_finite_val=1e4,
    keep_io_types=True,
    disable_shape_infer=False,
    node_block_list=list_,
)
onnx.save(model_fp16, "fp16_model.onnx")

ort_session = onnxruntime.InferenceSession("fp16_model.onnx", providers=["CUDAExecutionProvider"])
# ort_inputs and batch_output_mask (the fp32 reference output) are prepared elsewhere.
batch_output_mask_fp16 = torch.tensor(ort_session.run(None, ort_inputs)[0])
print(sum(batch_output_mask_fp16.cpu().numpy()))
print(sum(batch_output_mask.cpu().numpy()))
print(np.abs(batch_output_mask_fp16.cpu().numpy() - batch_output_mask.cpu().numpy()).mean())
Urgency
No response
Platform
Linux
OS Version
Ubuntu 22.04.4 LTS
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
onnxruntime-gpu 1.18.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
12.5
Model File
No response
Is this a quantized model?
No