
[python-package] LGBMClassifier produces empty trees #6080

Closed

znacer opened this issue Sep 6, 2023 · 3 comments

znacer commented Sep 6, 2023

Description

While exploring the trees produced by LGBMClassifier, it appears that some trees can have only one leaf.
This behavior seems strange, as we would expect at least one split in every tree.

Reproducible example

import numpy as np
from sklearn.datasets import load_digits
from lightgbm import LGBMClassifier
np.random.seed(0)

# data | regression | binary | multi class
data_mult = load_digits(as_frame=True)
# Model
model_mult = LGBMClassifier(verbosity=-1).fit(data_mult.data, data_mult.target)

for tree in model_mult._Booster.dump_model()["tree_info"]:
    if tree["num_leaves"] == 1:
        print(f"Tree {tree['tree_index']} has only one {tree['num_leaves']} leaf")

Environment info

LightGBM version or commit hash: 4.0.0

Command(s) you used to install LightGBM

pip install lightgbm

Additional Comments

This bug was found while working on an issue in the SHAP library.

@jameslamb jameslamb changed the title LGBMClassifier produces empty trees [python-package] LGBMClassifier produces empty trees Sep 6, 2023
jameslamb (Collaborator) commented Sep 6, 2023

Thanks for using LightGBM and for the report.

Short Answer

we would expect at least one split in every tree

The existence of trees with 0 splits is not a bug in LightGBM.

LightGBM comes with many settings to limit overfitting, and during training it may intentionally choose to produce a tree with 0 splits.

In fact, there's even a separate thread in the shap project specifically about adding support for such 0-split trees.

cc @thatlittleboy


Long Answer

For debugging activities like this, it's useful to include the logs produced by your program.

I ran the following modified version of your code (with verbosity=1):

import numpy as np
from sklearn.datasets import load_digits
from lightgbm import LGBMClassifier
np.random.seed(0)

# data | regression | binary | multi class
data_mult = load_digits(as_frame=True)

# Model
model_mult = LGBMClassifier(verbosity=1).fit(data_mult.data, data_mult.target)

for tree in model_mult._Booster.dump_model()["tree_info"]:
    if tree["num_leaves"] == 1:
        print(f"Tree {tree['tree_index']} has only one {tree['num_leaves']} leaf")

Doing that, I can see that the logs are full of this same warning:

[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000249 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 846
[LightGBM] [Info] Number of data points in the train set: 1797, number of used features: 54
[LightGBM] [Info] Start training from score -2.312090
[LightGBM] [Info] Start training from score -2.289867
[LightGBM] [Info] Start training from score -2.317724
[LightGBM] [Info] Start training from score -2.284388
[LightGBM] [Info] Start training from score -2.295377
[LightGBM] [Info] Start training from score -2.289867
[LightGBM] [Info] Start training from score -2.295377
[LightGBM] [Info] Start training from score -2.306488
[LightGBM] [Info] Start training from score -2.334819
[LightGBM] [Info] Start training from score -2.300917
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf

It appears that in this example, LightGBM added trees with splits for the first few iterations, and then could not find any additional splits satisfying its constraints for splitting.
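
For illustration, here is a minimal sketch (assuming the model_mult from the reproducible example above is still in scope; booster_ is the public accessor for what that example calls _Booster). The digits dataset has 10 classes, and LightGBM builds one tree per class per boosting iteration, so the tree index decomposes into an iteration and a class:

# assumes model_mult from the reproducible example above
for tree in model_mult.booster_.dump_model()["tree_info"]:
    if tree["num_leaves"] == 1:
        # one tree per class per boosting round in multiclass training
        iteration, klass = divmod(tree["tree_index"], 10)
        print(f"iteration {iteration}, class {klass}: tree with 0 splits")

Running this shows the 0-split trees clustered in the later iterations, which matches the warnings above.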

Examples of those constraints:

  • min_gain_to_split = how much gain (in terms of the objective function) must a split offer to be added to a tree in the model?
  • min_data_in_leaf = how many samples must, at a minimum, fall into each leaf?

You can find documentation on these and other parameters in LightGBM's parameter documentation.

If you search for that warning message in this repository's issues and on Stack Overflow, you'll find many explanations of this behavior.
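
To make the effect of these constraints concrete, here is a minimal sketch (assuming the same digits dataset; exact counts will vary) that varies min_data_in_leaf, exposed as min_child_samples in the scikit-learn API, and counts the resulting single-leaf trees:

from sklearn.datasets import load_digits
from lightgbm import LGBMClassifier

data = load_digits(as_frame=True)

# min_child_samples is the scikit-learn-API alias for min_data_in_leaf (default 20)
for min_samples in (20, 5, 1):
    model = LGBMClassifier(verbosity=-1, min_child_samples=min_samples)
    model.fit(data.data, data.target)
    n_stumps = sum(
        1
        for tree in model.booster_.dump_model()["tree_info"]
        if tree["num_leaves"] == 1
    )
    print(f"min_child_samples={min_samples}: {n_stumps} single-leaf trees")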

znacer (Author) commented Sep 6, 2023

Thanks for your quick reply and explanation.
I did not see that this issue was already explained; it is a duplicate. Closing the issue.

github-actions bot commented Dec 6, 2023

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 6, 2023