Column order impacting the predictions of LGBM (regression) #6671

Open
erykml opened this issue Oct 10, 2024 · 0 comments
erykml commented Oct 10, 2024

Hi 🙂

I encountered some unexpected behavior and wanted to understand the reasoning behind it. The issue is regarding the impact of column order on model predictions in a regression setup. I’ve seen similar questions on this topic and tried applying various suggestions to achieve deterministic results, but without success.

Below is a toy example with:

  • Two sets of features
  • Two sets of hyperparameters

With the default hyperparameters (params 1), I get the same results regardless of column order. With the second set (params 2), the results still match for feature set 1 but differ for feature set 2: exactly one observation in the test set gets a different prediction.

Could you please help me understand where the difference is coming from? In my actual use case, the discrepancies are larger than in this toy dataset.

If you need any further details regarding the environment, please let me know :)

Env:

macOS Sonoma 14.6.1
LightGBM 4.5.0
scikit-learn 1.5.1

Toy example:

import pandas as pd

import sklearn
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import lightgbm as lgb

california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target, name="target")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# feature set #1
# features_set = ['HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', 'MedInc']

# feature set #2
features_set = ["Longitude", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "MedInc",]

# params 1
# params = {
#     "verbose": -1,
#     "seed": 42,
# }

# params 2
params = {
    "boosting_type": "gbdt",
    "max_depth": 4,
    "bagging_fraction": 1.0,
    "bagging_freq": 0,
    "feature_fraction": 1.0,
    "learning_rate": 0.019324,
    "num_leaves": 128,
    "min_data_in_leaf": 16,
    "max_bin": 90,
    "num_iterations": 267,
    "min_gain_to_split": 0.0,
    "lambda_l1": 0.001356,
    "lambda_l2": 0.000581,
    "verbose": -1,
    "seed": 42,
    "num_thread": 1,
    "deterministic": True,
    "force_row_wise": True,
}

# model 1: original column order
train_data_1 = lgb.Dataset(X_train, label=y_train)
model_1 = lgb.train(params, train_data_1)
y_pred_1 = model_1.predict(X_test)
mse_1 = mean_squared_error(y_test, y_pred_1)

# model 2: same data and params, columns reordered according to features_set
train_data_2 = lgb.Dataset(X_train[features_set], label=y_train)
model_2 = lgb.train(params, train_data_2)
y_pred_2 = model_2.predict(X_test[features_set])
mse_2 = mean_squared_error(y_test, y_pred_2)

# True only if both MSE values are exactly equal
print(mse_1 == mse_2)
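The script above only compares the two MSE values, so it does not show which test row differs or by how much. A small helper along these lines could report that. This is only a sketch: `differing_predictions` is a hypothetical name, not part of LightGBM or scikit-learn, and the values in the demo call are made up for illustration.

```python
def differing_predictions(pred_a, pred_b):
    """Return (index, a, b, abs_diff) for every position where the
    two prediction sequences are not exactly equal."""
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction sequences must have the same length")
    return [
        (i, float(a), float(b), abs(float(a) - float(b)))
        for i, (a, b) in enumerate(zip(pred_a, pred_b))
        if a != b
    ]


# Demo with made-up values; in the toy example one would pass
# y_pred_1 and y_pred_2 instead.
for row in differing_predictions([1.0, 2.0, 3.0], [1.0, 2.5, 3.0]):
    print(row)
```

The comparison is deliberately exact (`!=`) rather than tolerance-based, since the question is whether the predictions are bit-for-bit identical across column orders.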
erykml changed the title from "Column order impacting the prediction of LGBM (regression)" to "Column order impacting the predictions of LGBM (regression)" on Oct 10, 2024