Error in ONNX conversion #563

Closed
fgrosa opened this issue Feb 1, 2022 · 9 comments

fgrosa commented Feb 1, 2022

I tried to run the following script based on the example in the README.md changing backend from pytorch to onnx:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from hummingbird.ml import convert, load

# Create some random data for binary classification
num_classes = 2
X = np.random.rand(1000, 4)
y = np.random.randint(num_classes, size=1000)

# Create and train a model (scikit-learn RandomForestClassifier in this case)
skl_model = RandomForestClassifier(n_estimators=10, max_depth=10)
skl_model.fit(X, y)

# Use Hummingbird to convert the model to ONNX
model = convert(skl_model, 'onnx', X[0:1])

# Run predictions on CPU
model.predict(X)

# Save the model
model.save('hb_model')

# Load the model back
model = load('hb_model')

Everything works as expected, but at the end of the execution I get the following error:

Fatal Python error: take_gil: PyMUTEX_LOCK(gil->mutex) failed
Python runtime state: finalizing (tstate=0x7fceb1409790)

Abort trap: 6

Is there something wrong in my installation or is something in the onnx conversion?

Contributor

ksaur commented Feb 1, 2022

Hello and welcome! We have not seen that error before, and I'm pretty sure I personally have hit all errors onnx has to offer. :-D So I do not think it is specific to Hummingbird, but let's see.
What platform are you using? Could it be related to rclpy/#805? Can you tell us a little more about your system and versions?

Author

fgrosa commented Feb 1, 2022

Hi @ksaur, thanks for the quick reply! It does look like the same error, though I am not sure whether it is related.
My platform is the following:

  • Operating System: MacOS, 11.6.1 (non M1)
  • Installation type: pip
  • Version: hummingbird_ml[extra]>=0.4.2

Contributor

ksaur commented Feb 1, 2022

Ok, and do I understand correctly that it only happens when you change from torch to onnx for the convert? Is it possible that you just use torch instead?

Also, does it still happen when you kill python, and THEN call load on the model? Are you still able to run predictions?
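One way to run that check is to execute each step in its own fresh interpreter, so that a crash or hang in one step cannot affect the next. The following is only a minimal sketch using the standard library; the helper name `run_isolated` and the timeout are illustrative assumptions, and `hb_model` is the path saved by the script above.

```python
import subprocess
import sys


def run_isolated(source, timeout_s=120.0):
    """Run `source` in a fresh Python process and classify how it ended.

    Returns (status, returncode). A negative return code on POSIX means
    the child was killed by that signal number (6 is SIGABRT).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child never exited; subprocess.run has already killed it.
        return "hang", None
    if proc.returncode < 0:
        return "killed by signal %d" % -proc.returncode, proc.returncode
    return ("ok" if proc.returncode == 0 else "error"), proc.returncode


if __name__ == "__main__":
    # Load the previously saved model in a separate interpreter.
    print(run_isolated("from hummingbird.ml import load; load('hb_model')"))
```

If the load step ends with "ok" while the convert step ends with a signal or a hang, that would narrow the problem to the conversion path.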

Author

fgrosa commented Feb 2, 2022

That is correct, it happens only with onnx. Unfortunately I currently need to use onnx, because I need to apply the models in C++ and our framework uses ONNXRuntime for this.

If I load the converted model and I call predict on a test sample, I get the correct output. The error only appears when I call convert.

Contributor

ksaur commented Feb 2, 2022

Hmm ok. Can you share the full stack trace with us, so we can see where the error occurs in convert and try to sort it out? Could you also share version information for your OS/GPU/Python/onnx libs? I'm assuming you installed onnxruntime-gpu as opposed to regular onnxruntime?
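A quick way to gather that version information is sketched below, using only the standard library (Python 3.8+). The package list is an assumption about which distributions are relevant here; anything not installed is reported as None.

```python
import platform
from importlib import metadata

# Assumed list of relevant PyPI distribution names; adjust as needed.
PACKAGES = (
    "hummingbird-ml",
    "onnx",
    "onnxruntime",
    "onnxruntime-gpu",
    "onnxmltools",
    "skl2onnx",
    "scikit-learn",
    "torch",
)


def collect_versions(packages=PACKAGES):
    """Return a dict with OS, Python, and installed package versions."""
    report = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this environment
    return report


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Seeing onnxruntime and onnxruntime-gpu side by side also shows at a glance which of the two is actually installed.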

I am curious...does it work on your platform if you instead install onnxruntime and run on CPU? That would be a good baseline check to rule out any other issues.

We mostly test on Linux/Windows GPU, but do our best to also support Mac! We don't have access to a MacOS-enabled GPU in our pipeline.

I'm not sure if this will be helpful, but one idea could be to check onnx[runtime,tools,...] versions maybe? We had some bug (unrelated to this) which caused us to pin to this version of sklearn-onnx in our workflow. (This is likely not the issue, but I'm just trying to think of all possible ideas.) See also the versions in our most recent version of the workflow for MacOS python3.7-3.9 for which onnx versions we expect to work on CPU at least.

Please let us know how it goes, and we will try to help you further troubleshoot!

Author

fgrosa commented Feb 2, 2022

We are indeed working on CPU, and therefore already use onnxruntime. In principle we need older versions of onnx, onnxruntime, and onnxmltools (onnx==1.8.0, onnxruntime==1.7.0, and onnxmltools==1.7.0). I also tried the latest versions: with those I get no error, but the script remains stuck after the last line and does not exit. I also tried installing the sklearn-onnx version that you have in your workflow, but unfortunately it does not help. I got the same behaviour on Linux (CentOS7) as well. Since I don't get a full stack trace, is there a command you would suggest I use? Thanks!

Contributor

ksaur commented Feb 2, 2022

If it is CPU-only, then I am surprised it doesn't work because we have tested onnx+cpu+mac quite a bit! Are you running the exact code above, or any modifications? What version of Python?

In the past, we've used this to force traceback:

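import sys
import traceback

try:
    do_stuff()  # placeholder for the code under investigation
except Exception:
    # Print the formatted traceback of the exception that was caught
    print(traceback.format_exc())
    # or print the stack entries from the traceback object itself
    traceback.print_tb(sys.exc_info()[2])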
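Note that a "Fatal Python error" raised while the interpreter is finalizing is not a Python exception, so a try/except block will not see it. As a possible complement, the standard-library faulthandler module can dump the Python tracebacks of all threads on fatal signals such as SIGABRT, and can also dump them after a timeout if the script hangs. This is a minimal sketch; the helper names, log file name, and timeout are illustrative assumptions, and it may or may not capture anything useful for this particular failure.

```python
import faulthandler


def enable_crash_diagnostics(log_file, hang_timeout_s=60.0):
    """Dump tracebacks to `log_file` on fatal signals or after a timeout.

    `log_file` must be a real file (with a file descriptor) and must stay
    open for as long as faulthandler is enabled.
    """
    # Tracebacks of all threads on SIGSEGV, SIGABRT, SIGFPE, SIGBUS, SIGILL.
    faulthandler.enable(file=log_file, all_threads=True)
    # If the process is still alive after the timeout, dump all threads
    # once and keep running (exit=False).
    faulthandler.dump_traceback_later(hang_timeout_s, exit=False, file=log_file)


def disable_crash_diagnostics():
    """Cancel the pending timeout dump and uninstall the signal handlers."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()


if __name__ == "__main__":
    log = open("hb_fault.log", "w")  # kept open on purpose until exit
    enable_crash_diagnostics(log, hang_timeout_s=60.0)
    # ... run the convert / predict / save / load steps here ...
```

The same handlers can be switched on without editing the script by running it with `python -X faulthandler script.py`, which writes to stderr.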

For CentOS7, I would expect it to work; we've tested on other Fedora-based distros but not that one specifically. Does it also hang there, or does it print the GIL error? Is this on a VM on the same machine as your Mac, or a separate machine?

Contributor

ksaur commented Feb 3, 2022

I tried just now on CentOS7 with the above code with the older onnx versions and wasn't able to reproduce the error; it worked as expected, so I'm not sure.

Contributor

ksaur commented Mar 10, 2022

Hi @fgrosa, I'm going to close this since I couldn't reproduce it. If you are still stuck, please reopen or file another issue! Thanks!

ksaur closed this as completed Mar 10, 2022