
[python-package] make Dataset pickleable #5098

Closed
5uperpalo opened this issue Mar 26, 2022 · 4 comments

Comments

@5uperpalo

Description

ctypes object pointers are added to the training (and evaluation) Dataset after it is used to train a LightGBM model in Python. This becomes an issue when I want to serialize the dataset with cloudpickle after using it for model training.

cloudpickle is used by RayTune to serialize objects when distributing them across multiple trial runs. I did not expect any Dataset attributes to change after using it to train a LightGBM model. As a workaround, I currently create a copy of the dataset I use for training, since I could not otherwise use it in the follow-up hyperparameter optimization steps performed by RayTune.

Reproducible example

import numpy as np
import pandas as pd
import lightgbm as lgb
from lightgbm import Dataset
import cloudpickle

train_df = pd.DataFrame(
    {
        "id": np.arange(0, 20),
        "cont_feature": np.arange(0, 20),
        "target": [0] * 5 + [1] * 15,
    },
)
lgbtrain = Dataset(
    train_df.drop(columns=["target", "id"]),
    train_df["target"],
    free_raw_data=False,
)

pickled_train_df_ok = cloudpickle.dumps(lgbtrain)

model = lgb.train(
    train_set=lgbtrain,
    params={},
)

pickled_train_df_NOTok = cloudpickle.dumps(lgbtrain)

Environment info

LightGBM version or commit hash:

lightgbm = "3.3.2"

Command(s) you used to install LightGBM

pip install lightgbm

python = "3.9.7"
cloudpickle = "2.0.0"
pandas = "1.4.1"
numpy = "1.22.3"

@jameslamb
Collaborator

> This becomes an issue

Thanks for the excellent report! We'll look into it as soon as possible. Can you please clarify what specifically you mean by "This becomes an issue"?

  • if you get an exception or warning, please include the message you see (so that others facing the same problem can find this issue through search engines)
  • if it is something else, please describe specifically what "becomes an issue" means

@5uperpalo
Author

5uperpalo commented Mar 26, 2022

hi @jameslamb, thanks for the quick response. By "This becomes an issue" I mean that the pickler used by RayTune is unable to serialize the Dataset object and raises a ValueError:

cloudpickle ValueError: ctypes objects containing pointers cannot be pickled

The only ways around it are either to make a simple copy of the Dataset object I use for training, or an over-complicated solution of defining my own customized serialization as described here:
https://docs.ray.io/en/latest/ray-core/serialization.html
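The error is not specific to cloudpickle or LightGBM: CPython's pickle refuses any ctypes object that contains a pointer. A minimal standalone sketch, independent of the LightGBM Dataset, reproduces the same message:

```python
import ctypes
import pickle

# Any ctypes object containing a pointer triggers the same error that
# cloudpickle surfaces for a Dataset that has been used for training.
handle = ctypes.c_void_p()

try:
    pickle.dumps(handle)
except ValueError as err:
    print(err)  # -> ctypes objects containing pointers cannot be pickled
```

This is why the Dataset pickles fine before training (its handle attributes are still None) but fails afterwards, once the native handles have been populated.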

@jameslamb jameslamb changed the title Python LightGBM : ctypes objects pointers added to training dataset after model train [python-package] ctypes objects pointers added to training dataset after model train Mar 26, 2022
@StrikerRUS
Collaborator

Indeed, the Dataset object isn't serializable. The Booster object (which is actually the trained LightGBM model, the return value of the train() function) is serializable because it can be re-created from its text representation. I don't think the same approach is applicable to the Dataset entity.

def __getstate__(self):
    this = self.__dict__.copy()
    handle = this['handle']
    this.pop('train_set', None)
    this.pop('valid_sets', None)
    if handle is not None:
        this["handle"] = self.model_to_string(num_iteration=-1)
    return this

def __setstate__(self, state):
    model_str = state.get('handle', None)
    if model_str is not None:
        handle = ctypes.c_void_p()
        out_num_iterations = ctypes.c_int(0)
        _safe_call(_LIB.LGBM_BoosterLoadModelFromString(
            c_str(model_str),
            ctypes.byref(out_num_iterations),
            ctypes.byref(handle)))
        state['handle'] = handle
    self.__dict__.update(state)
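The pattern Booster uses can be illustrated outside LightGBM with a minimal toy class (hypothetical names, not LightGBM code): on pickling, the unpicklable ctypes handle is replaced in the state dict with a serializable value it can be rebuilt from, and on unpickling the handle is reconstructed from that value.

```python
import ctypes
import pickle


class HandleWrapper:
    """Toy class wrapping an unpicklable ctypes handle."""

    def __init__(self, text="model"):
        self._text = text                        # serializable "source" of the handle
        self._handle = self._build_handle(text)  # unpicklable ctypes object

    @staticmethod
    def _build_handle(text):
        # Stand-in for LGBM_BoosterLoadModelFromString: any ctypes
        # pointer type (here c_char_p) refuses to be pickled directly.
        return ctypes.c_char_p(text.encode())

    def __getstate__(self):
        # Drop the raw pointer; keep only picklable attributes.
        state = self.__dict__.copy()
        state["_handle"] = None
        return state

    def __setstate__(self, state):
        # Rebuild the handle from the serialized representation.
        self.__dict__.update(state)
        self._handle = self._build_handle(self._text)


w = HandleWrapper("abc")
w2 = pickle.loads(pickle.dumps(w))
assert w2._handle.value == b"abc"
```

A Dataset could in principle follow the same route only if its native state can be fully reconstructed from something serializable (e.g. the raw data plus construction parameters), which is what makes the Booster case easier: it already has a text format to round-trip through.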

@jameslamb jameslamb changed the title [python-package] ctypes objects pointers added to training dataset after model train [python-package] make Dataset pickleable Jun 1, 2023
@StrikerRUS
Collaborator

Closed in favor of #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment, if you are not the topic starter) if you are actively working on implementing it.
