Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Even when forced_splits is set, the threshold is chosen from the bin_upper_bounds. #1829

Closed
AnchorBlues opened this issue Nov 7, 2018 · 7 comments

Comments

@AnchorBlues
Copy link

In the lightgbm, each numerical variable is replaced with discrete bins, and the threshold for split is chosen from the upper bounds of bin(variable name bin_upper_bounds).
Even when forced_splits is set, the lightgbm decides the threshold like that.
As a result, an unexpected result occurs.

Environment info

Operating System:
Ubuntu 14.04.5 LTS, Trusty Tahr

CPU/GPU model:
cpu
Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz

C++/Python/R version:

Python 3.6.6 (default, Oct 9 2018, 12:34:16)
[GCC 7.3.0] :: Anaconda, Inc. on linux

lightgbm version : newest version (merged the commit until the commit id ca4b666(new version for master branch (#1824))

source

data = load_breast_cancer()
x = data.data
y = data.target

np.random.seed(0)
# create a new feature that takes a value 0 or 1.
dummy = np.random.randint(0, 2, size=len(x)) 

x = np.c_[dummy, x]

# create json file that forces the tree to split by the dummy feature. threshold is 0.5.
s = """
{
    "feature": 0,
    "threshold": 0.5,
    "left": {
    },
    "right": {
    }
}
"""

with open("forced_splits-0.json", mode='w') as f:
    f.write(s)

# model training
model = lgb.LGBMClassifier(random_state=42, forced_splits="forced_splits-0.json", num_leaves=3)
model.fit(x, y)

# dump model
json_obj = model.booster_.dump_model(1)
trees = json_obj['tree_info']

# the first tree
tree = trees[0]

tree['tree_structure']

result

The json output of the first decision tree is as follows:

{'split_index': 0,
 'split_feature': 0,
 'split_gain': 5.158199787139893,
 'threshold': 1e+300,
 'decision_type': '<=',
 'default_left': True,
 'missing_type': 'None',
 'internal_value': 0,
 'internal_count': 569,
 'left_child': {'split_index': 1,
  'split_feature': 24,
  'split_gain': 57102.30078125,
  'threshold': 700.6500000000001,
  'decision_type': '<=',
  'default_left': True,
  'missing_type': 'None',
  'internal_value': 0.204333,
  'internal_count': 569,
  'left_child': {'leaf_index': 0,
   'leaf_value': 49.87522832525989,
   'leaf_count': 296},
  'right_child': {'leaf_index': 2,
   'leaf_value': 0.36087350821712294,
   'leaf_count': 273}},
 'right_child': {'leaf_index': 1,
  'leaf_value': 0.5021707657990948,
  'leaf_count': 295}}

In line 4, the threshold for split is 1e+300, not 0.5.

I investigated the cause of the problem.
The threshold for split is chosen from bin_upper_bounds, specifically, the minimum value that is greater than or equal to the threshold forced by forced_splits.
In this case, bin_upper_bounds = [1e-35, inf], so the value inf is chosen as a threshold for split.
That is because the minimum value in [1e-35, inf] that is greater than or equal to 0.5 is inf.
As a result, the lightgbm try to split the data using the threshold inf.

I hope that the data is splitted by the 0.5 if the threshold is set to 0.5 at forced_splits.

@guolinke
Copy link
Collaborator

it is not trivial as the contents of force-splits json should be used in bin mapper finding algorithm.

I think a better solution is to allow the pre-defined bin_upper_bounds.

@StrikerRUS
Copy link
Collaborator

@guolinke Should anything be done before 2.2.4 release here?

@StrikerRUS
Copy link
Collaborator

@guolinke

@guolinke
Copy link
Collaborator

guolinke commented Aug 1, 2019

@StrikerRUS As it is not trivial and not very critical, I think we can do it after v2.2.4.

@StrikerRUS
Copy link
Collaborator

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.

@StrikerRUS
Copy link
Collaborator

Reopening this as we have open PR.

@StrikerRUS StrikerRUS reopened this Sep 12, 2019
@StrikerRUS
Copy link
Collaborator

Implemented in #2325.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants