Performance regression in fit method with evaluation sets #10793

Open

ldesreumaux opened this issue Sep 1, 2024 · 1 comment

Comments

@ldesreumaux (Contributor)
I have observed a significant performance regression in XGBoost version 1.7 when using the fit method with evaluation sets in the scikit-learn estimators. The issue appears to have been introduced by this commit, which defaults to using QuantileDMatrix for both the training and evaluation sets.

While the optimization of prediction with QuantileDMatrix has been addressed in #9013, there remains a significant performance gap when using QuantileDMatrix for evaluation sets compared to DMatrix.

Here is a sample code to reproduce the issue:

import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import time

n_samples = 1000000
n_features = 100
seed = 42

np.random.seed(seed)

X = np.random.rand(n_samples, n_features)
y = np.random.randint(0, 2, size=n_samples)

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=seed)
X_eval1, X_eval2, y_eval1, y_eval2 = train_test_split(X_temp, y_temp, test_size=0.5, random_state=seed)

model = XGBClassifier(
    tree_method='hist',
    max_depth=6,
    n_estimators=500,
    eval_metric='logloss',
    random_state=seed
)

start_time = time.time()

model.fit(X_train, y_train, eval_set=[(X_eval1, y_eval1), (X_eval2, y_eval2)], verbose=True)

end_time = time.time()
execution_time = end_time - start_time

y_pred_eval1 = model.predict(X_eval1)
y_pred_eval2 = model.predict(X_eval2)

accuracy_eval1 = accuracy_score(y_eval1, y_pred_eval1)
accuracy_eval2 = accuracy_score(y_eval2, y_pred_eval2)

print(f"Accuracy on Evaluation Set 1: {accuracy_eval1:.4f}")
print(f"Accuracy on Evaluation Set 2: {accuracy_eval2:.4f}")

print(f"Execution Time: {execution_time:.2f} seconds")

Performance comparison (with current master branch):

  • With QuantileDMatrix: 66.13 seconds
  • With DMatrix: 36.35 seconds
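
For reference, one way to obtain the DMatrix timing above is to bypass the sklearn wrapper and call the native training API with plain DMatrix evaluation sets. The following is a minimal sketch of such a measurement (an assumed setup, not the exact benchmark script); it reuses X_train, y_train, X_eval1, y_eval1, X_eval2, y_eval2 and seed from the reproduction script above.

import time
import xgboost as xgb

# Training data is still quantized; only the evaluation sets differ.
dtrain = xgb.QuantileDMatrix(X_train, label=y_train)
deval1 = xgb.DMatrix(X_eval1, label=y_eval1)
deval2 = xgb.DMatrix(X_eval2, label=y_eval2)

params = {
    "tree_method": "hist",
    "max_depth": 6,
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "seed": seed,
}

start_time = time.time()
xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(deval1, "eval1"), (deval2, "eval2")],
)
print(f"Execution Time (DMatrix eval sets): {time.time() - start_time:.2f} seconds")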

Profiling graphs for the two cases are attached to the original issue (not reproduced here). They clearly show that the performance degradation is linked to the prediction step with QuantileDMatrix for the evaluation sets.

This sample code uses synthetic data, but I have observed the same order of magnitude of performance degradation with a real-world dataset.

If no further optimization is possible, I would suggest changing the default behavior to use a plain DMatrix for the evaluation sets.

@trivialfis (Member)

I agree that the gap is unexpectedly large. QDM was chosen because it reduces memory usage by compressing the data, but there is a cost in data lookup during prediction. I will see what can be done there, perhaps by using in-place predict or by optimizing the value lookup a bit more.
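
For context, "in place predict" refers to predicting directly from the raw array instead of going through a (Quantile)DMatrix. A minimal sketch with the existing Booster API, reusing model and X_eval1 from the reproduction script above (this only illustrates the API, not the proposed change inside fit):

# Predict straight from the numpy array, skipping DMatrix construction.
booster = model.get_booster()
preds = booster.inplace_predict(X_eval1)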
