
xgboost very slow for classification with many classes #2926

Closed
lesshaste opened this issue Dec 5, 2017 · 6 comments
Comments


lesshaste commented Dec 5, 2017

Environment info

Ubuntu

Compiler:

gcc

Package used (python/R/jvm/C++):

python

xgboost version used:

0.6

If installing from source, please provide

  1. The commit hash (git rev-parse HEAD)

git rev-parse HEAD
3dcf966

  2. Logs will be helpful (if logs are large, please upload as an attachment).

If you are using python package, please provide

  1. The python version and distribution

Python 3.5.2

  2. The command to install xgboost if you are not installing from source

Installed from source

Steps to reproduce

Run the following self-contained code, which also creates fake data for classification with 120 classes:

#!/usr/bin/python3

from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from time import time

(trainX, trainY) = make_classification(n_informative=10, n_redundant=0, n_samples=50000, n_classes=120)

print("Shape of trainX and trainY", trainX.shape, trainY.shape)
t0 = time()
rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)
rf.fit(trainX, trainY)
print("Time elapsed by RandomForestClassifier is: ", time()-t0)
t0 = time()
xgbrf = XGBClassifier(n_estimators=50, n_jobs=-1, verbose=True)
xgbrf.fit(trainX, trainY)
print("Time elapsed by XGBClassifier is: ", time()-t0)

What have you tried?

The RandomForestClassifier takes about 15 seconds, but xgboost does not terminate for me even after 10 minutes.


marugari commented Dec 11, 2017

XGBoost uses regression trees, so it builds 120 (= num_class) trees per iteration.

This issue should be helpful:
microsoft/LightGBM#524
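A rough cost model of what this means for the script above (a sketch only — `trees_built` is a hypothetical helper, not part of xgboost): with the multiclass objective, one regression tree is grown per class per boosting round, so total trees scale linearly with the number of classes.

```python
# Hypothetical cost model, not xgboost code: multiclass boosting grows
# one regression tree per class per round, so the total number of trees
# is n_estimators * num_class.
def trees_built(n_estimators, num_class):
    """Total regression trees grown for a multiclass booster."""
    return n_estimators * num_class

# For the reproduction script: 50 rounds x 120 classes = 6000 trees,
# versus the 50 trees that RandomForestClassifier builds in total.
print(trees_built(50, 120))  # 6000
```

So even with identical per-tree cost, the xgboost fit here does roughly 120 times the tree-building work of the random forest.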

lesshaste (Author)

@marugari

Does this mean xgboost will always be slow for multi-class classification?

You accidentally linked back to this issue in your comment (2926). Did you mean to link to another one?

marugari (Contributor)

@lesshaste Sorry, I have fixed the link.

khotilov (Member)

There is an inefficiency in prediction caching that results in nclass^2 complexity: #1689 (comment)
I haven't yet gotten around to fixing it, as I haven't had any practical need for so many classes, but I will.
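To illustrate why quadratic scaling in the number of classes hurts so much here (a sketch of the scaling being described, not the actual caching code):

```python
# Illustration of linear vs quadratic scaling in num_class
# (an assumption-level sketch of the reported caching behaviour,
# not xgboost internals).
def relative_cost(num_class, exponent):
    """Cost of a step that scales as num_class ** exponent."""
    return num_class ** exponent

# Going from 2 classes to 120 classes multiplies a linear-cost step
# by 60, but a quadratic-cost step by 3600.
print(relative_cost(120, 1) // relative_cost(2, 1))  # 60
print(relative_cost(120, 2) // relative_cost(2, 2))  # 3600
```

This is why a dataset that trains quickly with a handful of classes can appear to hang at 120 classes: the quadratic term dominates long before the linear tree-building cost does.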

lesshaste (Author)

That would be awesome! Thank you.

mfeurer commented Mar 20, 2018

This recent paper might be another potential way to speed up multiclass training.

@tqchen tqchen closed this as completed Jul 4, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Oct 24, 2018