
Accessing Considered Splits and Gains During Tree Construction #6667

Open
JohnPaulR opened this issue Oct 8, 2024 · 0 comments
Summary

I would like to request a feature that enables the logging or extraction of all features considered for splits at each node during tree construction, along with their associated gain values (or impurity reductions). The goal is to access not only the best split but also the alternatives that were evaluated in order to identify and potentially trim variables that are essentially duplicative in their contribution to the model.

Motivation

This feature would help modelers better understand how LightGBM is considering features during tree-building. It could be particularly useful for feature engineering and model optimization, as it would allow practitioners to detect features that often compete for splits, meaning they are highly correlated or duplicative in their predictive power. By identifying such variables, it would be possible to simplify models, reduce dimensionality, and improve model interpretability without sacrificing accuracy.

Description

I propose adding functionality to LightGBM that would:

  1. Log or expose all considered features and split thresholds at each node, not just the selected split.
  2. Capture the gain (or impurity reduction) for every candidate split, so users can see which features were close competitors.
  3. Expose this information through a custom callback, an internal API hook, or a configurable parameter that enables detailed split logging during training.

This could be useful for:

  • Model optimization: Trimming redundant variables that offer little marginal value compared to similar features.
  • Feature selection: Understanding which variables frequently compete for splits can aid in feature selection or combination.
  • Model interpretability: Providing insights into the decision-making process of the algorithm, beyond just the final tree structure.

If this feature already exists or can be achieved through custom means (such as callbacks or hooks), please provide guidance on how to implement it.
