Occasional NaN results during inference #19851
If the Python-based inference script returns correct results but the C++ program doesn't, the most likely cause is that the pre-processing step in C++ doesn't match the Python pre-processing. The underlying op implementations in the C++ and Python ORT runtimes are exactly the same, and unless there is a strange bug in the C++ API layer (we are not aware of any), it is not possible for C++ ORT to produce different results than Python ORT for the same raw input (assuming the ORT versions are the same for Python and C++).
A good check would be to see whether the "raw" inputs to ORT (after pre-processing) match between C++ and Python. You can search through the existing (open and closed) issues in the repo - there are quite a few along the lines of "Python works but C++ doesn't" (especially for image models) - and it almost invariably turns out to be a quirk of OpenCV producing incorrectly pre-processed inputs. Please look through those issues to see if any of them applies to your case as well.
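As a minimal sketch of such a check (the buffer and function names here are hypothetical), printing the head of the preprocessed input buffer on the C++ side makes it easy to diff against the same tensor printed from Python:

```cpp
// Minimal sketch: print the first few preprocessed input values on the
// C++ side so they can be diffed against the same tensor printed from
// Python (e.g. print(input_array.ravel()[:16]) with NumPy).
#include <cstddef>
#include <cstdio>
#include <vector>

void DumpHead(const std::vector<float>& preprocessed, std::size_t n = 16) {
    for (std::size_t i = 0; i < n && i < preprocessed.size(); ++i) {
        std::printf("[%zu] %.8f\n", i, preprocessed[i]);
    }
}
```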
Hi @hariharans29, thank you for your quick response.
@goutamyg, you can try building from source and enabling dumping of node inputs/outputs: https://onnxruntime.ai/docs/build/inferencing.html#debugnodeinputsoutputs
@tianleiwu The build is successful, but 1 out of the 4 test cases fails (I think these tests are related to the onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1 flag). I have attached the log file for your reference.
@hariharans29 I checked the C++ code you suggested. It resembles the model architecture I am using (i.e., two inputs and multiple outputs). However, in my case the output is a set of float values that are used to compute the bounding-box coordinates as the tracking output. I tried changing my code to handle the inputs and outputs accordingly, but the stochastic NaN output persists.
Thanks for the feedback. It is going to be very hard to debug your C++ app, as it has some components (like OpenCV) that I am not fully familiar with.
Also, please confirm that the Python ORT version is the same as the C++ ORT version.
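For reference, one way to check the version on the C++ side is the version string exposed by the C API (on the Python side, compare against onnxruntime.__version__):

```cpp
// Print the ONNX Runtime version linked into the C++ app.
#include <onnxruntime_c_api.h>
#include <cstdio>

int main() {
    std::printf("ORT version: %s\n", OrtGetApiBase()->GetVersionString());
    return 0;
}
```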
If the build passes, it is worth a shot to see which layer starts producing NaNs using this approach.
@hariharans29 I don't find the libonnxruntime.so file in the build/ folder. The available .so files are libtest_execution_provider.so and libonnxruntime_providers_shared.so, which throw errors when I include them in the cmake file. I did
and the compilation output, before initiating the test cases, says:
Am I missing something? Also, I will upload a minimal reproducible example in a day or two. I confirm I am using the same onnxruntime version (v1.12.1) for Python and C++ inference.
To get a release-flavor libonnxruntime.so, please include the --build_shared_lib flag in your build command.
Now I am using the libonnxruntime.so built with that flag.
You can take a look at the env variables available here (one such env variable will dump intermediate results): https://onnxruntime.ai/docs/build/inferencing.html#debugnodeinputsoutputs. You may have to dump out the intermediate results and go through them to see which ones contain the first NaNs.
Based on this, I set the following env variables as
where I also tried
but no file containing the intermediate results was saved in the destination folder. ChatGPT says that
Please take a look at some samples in the repo, e.g. onnxruntime/onnxruntime/python/tools/transformers/models/gpt2/parity_check_helper.py (line 39 at commit 1fb6cbd).
Try running something like the following on Linux to dump to stdout, then redirect it to a file:
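As a sketch of the same idea, the dump-related environment variables from the linked page can also be set programmatically before the session is created; the variable names and values below follow that page and should be verified against your ORT version:

```cpp
// Sketch: enable node input/output dumping to stdout, assuming a build with
// onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1. The env variable names follow
// the linked docs page; double-check them for your ORT version.
#include <cstdlib>   // setenv (POSIX)
#include <onnxruntime_cxx_api.h>

int main() {
    setenv("ORT_DEBUG_NODE_IO_DUMP_INPUT_DATA", "1", /*overwrite=*/1);
    setenv("ORT_DEBUG_NODE_IO_DUMP_OUTPUT_DATA", "1", /*overwrite=*/1);
    setenv("ORT_DEBUG_NODE_IO_DUMP_DATA_DESTINATION", "stdout", /*overwrite=*/1);

    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "dump");
    Ort::Session session(env, "model.onnx", Ort::SessionOptions{});  // hypothetical model path
    // ... run inference as usual, then capture stdout in the shell:
    //     ./app > intermediate_dump.txt 2>&1
    return 0;
}
```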
@tianleiwu Thank you. Now I have the intermediate feature maps dumped to a txt file. From my analysis, the NaN results (which occur at some point in the network) are caused by some arbitrarily large values present in the input data: the first 16 samples of one of the inputs in the dumped file do not match the values I print before creating the Ort tensor, and the first eight dumped values are of the order of 10^30. This probably produces ever larger values in the feature maps of subsequent layers, eventually leading to NaNs. Can you please confirm whether this is the appropriate way to create an Ort tensor from a std::vector? It is partly based on an existing implementation.
That looks like a bug to me: inputTensorValues_Z is a local variable, so its memory is released after the function returns. You need to keep the input data alive until the inference run is done: for example, bind to blob_z instead and make sure blob_z's lifetime is long enough. When the input/output tensors are on CPU, you might need to call SynchronizeBoundInputs and SynchronizeBoundOutputs.
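A minimal sketch of the fix (input/output names and the shape are placeholders; the vector name follows the thread): Ort::Value::CreateTensor does not copy the buffer it is given, it only wraps it, so the backing std::vector must outlive Run().

```cpp
// Sketch: CreateTensor wraps the caller's buffer without copying, so the
// vector must stay alive until Run() returns. Keeping it in the caller's
// scope (or as a class member) fixes the dangling-memory bug described above.
#include <onnxruntime_cxx_api.h>
#include <vector>

std::vector<Ort::Value> RunOnce(Ort::Session& session,
                                std::vector<float>& inputTensorValues_Z,
                                const std::vector<int64_t>& shape_Z) {
    Ort::MemoryInfo mem_info =
        Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

    // The tensor only *views* inputTensorValues_Z; no copy is made.
    Ort::Value input_z = Ort::Value::CreateTensor<float>(
        mem_info, inputTensorValues_Z.data(), inputTensorValues_Z.size(),
        shape_Z.data(), shape_Z.size());

    const char* input_names[] = {"z"};      // placeholder input name
    const char* output_names[] = {"score"}; // placeholder output name

    // inputTensorValues_Z is owned by the caller, so it is still alive here.
    return session.Run(Ort::RunOptions{nullptr}, input_names, &input_z, 1,
                       output_names, 1);
}
```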
The code works fine after keeping the std::vectors inputTensorValues_Z and inputTensorValues_X alive during the inference run. Thank you very much @tianleiwu @hariharans29
Describe the issue
My C++ inference script for visual object tracking occasionally generates NaN outputs for all the frames in the video, i.e., roughly 4 out of 10 runs; otherwise the outputs are as expected. I have added a check to verify whether the input contains any NaNs, and the input data seems fine. The Python-based inference script does not have this issue.
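A sketch of such a NaN check, assuming the input lives in a std::vector<float>:

```cpp
// Sketch of the input sanity check: returns true if any element is NaN.
#include <algorithm>
#include <cmath>
#include <vector>

bool HasNaN(const std::vector<float>& input) {
    return std::any_of(input.begin(), input.end(),
                       [](float v) { return std::isnan(v); });
}
```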
To reproduce
Follow the instructions here: https://github.com/goutamyg/MVT.cpp/tree/main/onnxruntime. The link has the code and the pretrained model.
Urgency
Somewhat urgent
Platform
Linux
OS Version
Ubuntu 22.04.3 LTS
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
onnxruntime-linux-x64-1.12.1 (also tested with the recent 1.16.1)
ONNX Runtime API
C++
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
https://drive.google.com/file/d/15dI9j7UQc35pcWjD0133eRzLh0P_fRvx/view
Is this a quantized model?
No