
Inference performance issues with Lightgbm when using categoricals #647

Open
ozancicek opened this issue Aug 26, 2023 · 1 comment

Hi,

When running inference with a LightGBM model that has categorical features, I experienced much higher latencies than when treating all features as numerical, especially as the vocabulary size increased. After looking around a bit, it seems that nodes with categorical split conditions (such as value in value_1|value_2|value_3, i.e. a set-membership split) are expanded into "==" nodes, which can increase the node count by an amount that depends on the vocabulary size. I guess this is because onnxruntime does not support set-membership splits natively.

I was wondering whether you have any plans to support such categorical split conditions natively? We were looking forward to using categorical features, but because of the unpredictable latencies we weren't able to use them on a latency-critical path.

Here is a toy example. The slowdown may depend on the setup, but using a few categorical feature columns was enough to double the average runtime.

from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split

from onnxruntime import InferenceSession
from onnxmltools.convert.common.data_types import FloatTensorType
from onnxmltools.convert import convert_lightgbm
import lightgbm as lgb
import numpy as np

data = fetch_covtype()
X, y = data.data, data.target
X = X.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape) # (435759, 54)

# Variant 1: fit all features as numerical
model = lgb.LGBMClassifier(max_depth=6, n_estimators=100, seed=0)
model.fit(X_train, y_train)

# Variant 2: take the top 3 most important features, fit them as categorical
categorical_features = np.argsort(model.booster_.feature_importance())[-3:].tolist()
cardinalities = [len(np.unique(X[:, c])) for c in categorical_features]
print(cardinalities) # [1978, 5827, 5785]

cat_model = lgb.LGBMClassifier(max_depth=6, n_estimators=100, seed=0)
cat_model.fit(X_train, y_train, categorical_feature=categorical_features)

# Input type declaration for the converter
init = [("X", FloatTensorType([None, X_train.shape[1]]))]

onx = convert_lightgbm(model, None, init, zipmap=False)
s = InferenceSession(onx.SerializeToString())

onx_cat = convert_lightgbm(cat_model, None, init, zipmap=False)
s_cat = InferenceSession(onx_cat.SerializeToString())
%%timeit -n20
out = s.run(None, {"X": X[:1000, :]})
# 21.8 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)

%%timeit -n20
out = s_cat.run(None, {"X": X[:1000, :]})
# 43 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
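To make the expansion concrete, here is a minimal self-contained sketch (hypothetical, not the converter's actual code): a single set-membership split is rewritten as a chain of equality nodes, so the decision-node count grows linearly with the size of the category set the split tests for.

```python
# Hypothetical sketch (not the converter's actual code): expand a single
# "feature in {c1, c2, ...}" split into a chain of "==" nodes, as a
# backend without native set-membership support has to.

def expand_membership_split(feature, categories, left, right):
    """Rewrite 'feature in categories -> left, else right' as a chain of
    equality tests. Returns a nested-dict tree with len(categories) nodes."""
    tree = right  # fall-through: no category matched
    for cat in reversed(categories):
        tree = {"feature": feature, "eq": cat, "left": left, "right": tree}
    return tree

def count_nodes(tree):
    """Count decision nodes; leaves are plain strings."""
    if not isinstance(tree, dict):
        return 0
    return 1 + count_nodes(tree["left"]) + count_nodes(tree["right"])

def predict(tree, value):
    """Walk the expanded chain until a leaf is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if value == tree["eq"] else tree["right"]
    return tree

tree = expand_membership_split("color", ["red", "green", "blue"], "L", "R")
print(count_nodes(tree))       # 3: one "==" node per category in the set
print(predict(tree, "green"))  # L
print(predict(tree, "gray"))   # R
```

With the cardinalities printed above (thousands of distinct values per column), splits over large category subsets expand into correspondingly many nodes, which matches the observed slowdown.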
@xadupre (Collaborator) commented Oct 2, 2023

It is a known issue. A new rule must be added to the definitions of the TreeEnsembleRegressor and TreeEnsembleClassifier operators to support that scenario. That is the first step before implementing the new feature in onnxruntime and updating the converter.
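A rough way to see why a native set-membership mode would matter for latency (a sketch under assumed semantics, not onnxruntime code): a native node decides membership in one step with a hash-set lookup, whereas the expanded form visits one "==" node per category until it matches or the set is exhausted.

```python
# Hypothetical sketch of the two evaluation strategies (not onnxruntime code).

def split_native(value, categories):
    """One node: average-case O(1) membership test on a set."""
    return "left" if value in categories else "right"

def split_expanded(value, categories):
    """Chain of equality nodes: returns (decision, nodes_visited)."""
    visited = 0
    for cat in categories:
        visited += 1
        if value == cat:
            return "left", visited
    return "right", visited

cats = list(range(5000))  # large vocabulary, as in the cardinalities above
print(split_native(4999, set(cats)))  # left
print(split_expanded(4999, cats))     # ('left', 5000): worst case visits every node
print(split_expanded(-1, cats))       # ('right', 5000): a miss also visits every node
```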
