
How to use get_params() #135

Open
koaning opened this issue Aug 19, 2024 · 16 comments

@koaning

koaning commented Aug 19, 2024

I have a scikit-learn pipeline defined in the code below.

from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder, Binarizer
from sklearn.impute import SimpleImputer
from skrub import SelectCols
from sklearn.ensemble import HistGradientBoostingClassifier

feat_pipe = make_union(
    make_pipeline(
        SelectCols(["pclass", "sex"]),
        OneHotEncoder(sparse_output=False)
    ),
    SelectCols(["fare", "age"])
)

pipe = make_pipeline(
    feat_pipe, 
    HistGradientBoostingClassifier()
)

When I ask for the params of said pipeline I can see a long list of names that I can refer to when I do hyperparameter tuning.

pipe.get_params()

The list is long, but that is because it is nice and elaborate.

{'memory': None,
 'steps': [('featureunion',
   FeatureUnion(transformer_list=[('pipeline',
                                   Pipeline(steps=[('selectcols',
                                                    SelectCols(cols=['pclass',
                                                                     'sex'])),
                                                   ('onehotencoder',
                                                    OneHotEncoder(sparse_output=False))])),
                                  ('selectcols',
                                   SelectCols(cols=['fare', 'age']))])),
  ('histgradientboostingclassifier', HistGradientBoostingClassifier())],
 'verbose': False,
 'featureunion': FeatureUnion(transformer_list=[('pipeline',
                                 Pipeline(steps=[('selectcols',
                                                  SelectCols(cols=['pclass',
                                                                   'sex'])),
                                                 ('onehotencoder',
                                                  OneHotEncoder(sparse_output=False))])),
                                ('selectcols',
                                 SelectCols(cols=['fare', 'age']))]),
 'histgradientboostingclassifier': HistGradientBoostingClassifier(),
 'featureunion__n_jobs': None,
 'featureunion__transformer_list': [('pipeline',
   Pipeline(steps=[('selectcols', SelectCols(cols=['pclass', 'sex'])),
                   ('onehotencoder', OneHotEncoder(sparse_output=False))])),
  ('selectcols', SelectCols(cols=['fare', 'age']))],
 'featureunion__transformer_weights': None,
 'featureunion__verbose': False,
 'featureunion__verbose_feature_names_out': True,
 'featureunion__pipeline': Pipeline(steps=[('selectcols', SelectCols(cols=['pclass', 'sex'])),
                 ('onehotencoder', OneHotEncoder(sparse_output=False))]),
 'featureunion__selectcols': SelectCols(cols=['fare', 'age']),
 'featureunion__pipeline__memory': None,
 'featureunion__pipeline__steps': [('selectcols',
   SelectCols(cols=['pclass', 'sex'])),
  ('onehotencoder', OneHotEncoder(sparse_output=False))],
 'featureunion__pipeline__verbose': False,
 'featureunion__pipeline__selectcols': SelectCols(cols=['pclass', 'sex']),
 'featureunion__pipeline__onehotencoder': OneHotEncoder(sparse_output=False),
 'featureunion__pipeline__selectcols__cols': ['pclass', 'sex'],
 'featureunion__pipeline__onehotencoder__categories': 'auto',
 'featureunion__pipeline__onehotencoder__drop': None,
 'featureunion__pipeline__onehotencoder__dtype': numpy.float64,
 'featureunion__pipeline__onehotencoder__feature_name_combiner': 'concat',
 'featureunion__pipeline__onehotencoder__handle_unknown': 'error',
 'featureunion__pipeline__onehotencoder__max_categories': None,
 'featureunion__pipeline__onehotencoder__min_frequency': None,
 'featureunion__pipeline__onehotencoder__sparse_output': False,
 'featureunion__selectcols__cols': ['fare', 'age'],
 'histgradientboostingclassifier__categorical_features': 'warn',
 'histgradientboostingclassifier__class_weight': None,
 'histgradientboostingclassifier__early_stopping': 'auto',
 'histgradientboostingclassifier__interaction_cst': None,
 'histgradientboostingclassifier__l2_regularization': 0.0,
 'histgradientboostingclassifier__learning_rate': 0.1,
 'histgradientboostingclassifier__loss': 'log_loss',
 'histgradientboostingclassifier__max_bins': 255,
 'histgradientboostingclassifier__max_depth': None,
 'histgradientboostingclassifier__max_features': 1.0,
 'histgradientboostingclassifier__max_iter': 100,
 'histgradientboostingclassifier__max_leaf_nodes': 31,
 'histgradientboostingclassifier__min_samples_leaf': 20,
 'histgradientboostingclassifier__monotonic_cst': None,
 'histgradientboostingclassifier__n_iter_no_change': 10,
 'histgradientboostingclassifier__random_state': None,
 'histgradientboostingclassifier__scoring': 'loss',
 'histgradientboostingclassifier__tol': 1e-07,
 'histgradientboostingclassifier__validation_fraction': 0.1,
 'histgradientboostingclassifier__verbose': 0,
 'histgradientboostingclassifier__warm_start': False}

The reason this is nice is that it allows me to be very specific: I can tune each input argument of every component, like featureunion__pipeline__selectcols__cols or featureunion__pipeline__onehotencoder__sparse_output. This is very nice for grid search!

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    pipe, 
    param_grid={
        "featureunion__pipeline__onehotencoder__min_frequency": [None, 1, 5, 10]
    }
)

grid.fit(X, y)  # X, y: the dataset used throughout (Titanic-style, given the columns above)

The cool thing about this is that I am able to get a nice table as output too.

import pandas as pd

pd.DataFrame(grid.cv_results_).to_markdown()
|    | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_featureunion__pipeline__onehotencoder__min_frequency | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---:|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.557284 | 0.0364319 | 0.0053968 | 0.00091813 | nan | {'featureunion__pipeline__onehotencoder__min_frequency': None} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |
| 1 | 0.567849 | 0.0222483 | 0.00532556 | 0.000495336 | 1 | {'featureunion__pipeline__onehotencoder__min_frequency': 1} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |
| 2 | 0.567496 | 0.00920872 | 0.00557318 | 0.000404766 | 5 | {'featureunion__pipeline__onehotencoder__min_frequency': 5} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |
| 3 | 0.553523 | 0.023475 | 0.0052145 | 0.000855578 | 10 | {'featureunion__pipeline__onehotencoder__min_frequency': 10} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |

But when I look at IbisML I wonder if I am able to do the same thing.

import ibis_ml as iml

tfm = iml.Recipe(
    iml.ExpandDateTime(iml.date())
)

In IbisML it is the Recipe object that is scikit-learn compatible, not the ExpandDateTime object. So let's inspect:

tfm.get_params()

This yields the following.

{'steps': (ExpandDateTime(date(),
                 components=['dow', 'month', 'year', 'hour', 'minute', 'second']),),
 'expanddatetime': ExpandDateTime(date(),
                components=['dow', 'month', 'year', 'hour', 'minute', 'second'])}

In fairness, this is not completely unlike what scikit-learn does natively. In a scikit-learn pipeline you also have access to the steps argument, and you could theoretically make all the changes there directly by passing in new subpipelines. But there is a reason why scikit-learn does not stop there! It can go deeper into all the input arguments of all the estimators in the pipeline, because that makes the end cv_results_ output a lot nicer. And this is where I worry whether IbisML can do the same thing. It seems that I need to pass full objects, instead of being able to pluck out the individual attributes that I care about.

In this particular case, what if I want to measure the effect of including/excluding dow or hour? Is that possible? Can I use an underscore-syntax string just like in scikit-learn to configure that? Or do I need to overwrite the steps object?

@gtauzin

gtauzin commented Aug 23, 2024

I also ran into the same issue while trying to make use of IbisML within a pipeline that is then tuned using an sklearn search CV.

To make it work, I was naively considering wrapping the IbisML steps I need into sklearn transformers and combining them directly using an sklearn Pipeline (instead of the recipe). Is this a bad idea? I am no ibis expert, but it seems to me that having ibisML steps be scikit-learn estimators and just writing a few additional ibis-compatible utilities such as CV splitters would already open quite a few doors.

@koaning
Author

koaning commented Aug 23, 2024

The more I think about it, the more I wonder if one should perhaps be careful. After all, do we really want to use parallel grid search with IbisML as a backend? That might not play super nicely. Not to mention ... how would caching work? The sklearn memory system might not understand IbisML well enough.

Keeping the domains separate might also just be fine for the time being. There are a ton of details to get right.

@deepyaman
Collaborator

I also ran into the same issue while trying to make use of IbisML within a pipeline that is then tuned using an sklearn search CV.

To make it work, I was naively considering wrapping the IbisML steps I need into sklearn transformers and combining them directly using an sklearn Pipeline (instead of the recipe). Is this a bad idea?

How are you wrapping them? I think the issue will be that the step boundary will evaluate the transformations eagerly (i.e. you'll basically end up passing NumPy arrays between steps, whereas, wrapped in a Recipe object, all of that processing happens on the backend). I can verify this, to be sure.

I am no ibis expert, but it seems to me that having ibisML steps be scikit-learn estimators and just writing a few additional ibis-compatible utilities such as CV splitters would already open quite a few doors.

@gtauzin I think this is correct! As mentioned in #136 (comment), we've added a basic train_test_split, and hadn't prioritized implementing additional CV splitters yet. That said, I don't think there's anything stopping us from doing so.

Is there any chance you'd be open to sharing a bit more about how you're using IbisML (happy to connect separately)? Amongst other things, it would be helpful to know:

  1. which Ibis backend(s)
  2. which modeling library
  3. what data volume

@gtauzin

gtauzin commented Aug 25, 2024

The more I think about it, the more I wonder if one should perhaps be careful. After all, do we really want to use parallel grid search with IbisML as a backend? That might not play super nicely. Not to mention ... how would caching work? The sklearn memory system might not understand IbisML well enough.

@koaning This is a good point. AFAIK, pipeline caching relies on joblib Memory, which itself supports (only?) numpy arrays. I am not sure how it works when set_output(transform=...) is set to "polars" or "pandas". In any case, if the transformation has not been executed at the end of the cached step, then no caching can take place.
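For reference, this is the caching hook in question: sklearn's Pipeline hands fitted transformers to joblib.Memory via the memory argument. A minimal sketch with plain numpy inputs (whether joblib could meaningfully hash and cache a lazy Ibis expression is exactly the open question here):

```python
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Passing `memory=` makes Pipeline cache fitted transformers on disk
# via joblib.Memory, keyed on the transformer and its in-memory input.
cache_dir = tempfile.mkdtemp()
pipe = Pipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression())],
    memory=cache_dir,
)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
pipe.fit(X, y)  # a second identical fit would reuse the cached scaler
```

This caching is keyed on concrete array inputs, which is why a step whose transformation has not yet been executed has nothing cacheable to offer.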

Indeed, it is critical if one uses IbisML to define a feature-generation pipeline meant to be tuned with grid search and caching does not work. However, IMO it is less of a problem if one uses RandomizedSearchCV or OptunaSearchCV, as caching is less useful in that case (except if the "earliest" tuned params are all categorical). I think it may still be worth trying to list all the potential breaking points of having ibis sklearn transformers.

How are you wrapping them? I think the issue will be that the step boundary will evaluate the transformations eagerly (i.e. you'll basically end up passing NumPy arrays between steps, whereas, wrapped in a Recipe object, all of that processing happens on the backend). I can verify this, to be sure.

IMO, there are two potential pitfalls one should be careful about to get the same behavior with sklearn transformers as with the IbisML recipe:

  1. check_array is the culprit when it comes to casting into numpy arrays, so an IbisML transformer should not make use of it. However, it may be a good idea to write an ibis-friendly one so that the transformer works no matter the dataframe type of the input. This custom check_array would take any dataframe as input and return a (validated) ibis table. This would enable writing dataframe-backend-independent transformers (a strong point for me).
  2. set_output can force transformers to return a specific container. Having transformer.set_output(transform=None) does not modify the output of transform or fit_transform.

sklearn Pipeline (and FeatureUnion) do not call check_array and leave it to the transformers to handle the input type.
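To make the first pitfall concrete, here is a minimal sketch of such a wrapper (LazyStepWrapper is a made-up name, not IbisML's API) that deliberately skips check_array, so whatever table object comes in is handed to the wrapped step unchanged:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class LazyStepWrapper(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: exposes a step as an sklearn transformer
    WITHOUT calling check_array, so a lazy table object passes through
    untouched instead of being coerced to a numpy array."""

    def __init__(self, step):
        self.step = step  # assumed to implement fit(X, y) and transform(X)

    def fit(self, X, y=None):
        # no check_array / input validation here, on purpose
        self.step.fit(X, y)
        return self

    def transform(self, X):
        return self.step.transform(X)
```

Because the only __init__ argument is stored under the same attribute name, BaseEstimator's get_params/set_params work on the wrapper for free.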

I believe FeatureUnion is also quite useful for feature generation, so it is important to be compatible with it as well. It relies on an hstack method to concatenate features together. This hstack can be customized by defining and registering an IbisAdapter (thus allowing an "ibis" value for set_output's transform; see the pandas adapter for how this is done).

@gtauzin I think this is correct! As mentioned in #136 (comment), we've added a basic train_test_split, and hadn't prioritized implementing additional CV splitters yet. That said, I don't think there's anything stopping us from doing so.

I see. I don't expect there to be any problem having ibis CV splitters. Maybe it would even be possible to wrap them.

Is there any chance you'd be open to sharing a bit more about how you're using IbisML (happy to connect separately)? Amongst other things, it would be helpful to know:
1. which Ibis backend(s)
2. which modeling library
3. what data volume

I do not (yet) use ibis or IbisML on any project. I just started looking into it to see what was possible with it and what use I could make of it. At the moment, it is still unclear to me.

As for the modeling libraries, I write everything I need in a way that is compatible with the sklearn API. This gives me immediate access to any technique available in the sklearn ecosystem, which is very rich and easily extendable. I work mostly with tabular and time series data, so I am interested in dataframe support in general. Happy to provide better feedback once I have a better idea!

@gtauzin

gtauzin commented Aug 25, 2024

@deepyaman I have opened a PR #141 meant for discussion to suggest such a sklearn transformer wrapper.

@jitingxu1
Collaborator

It is possible to measure different parameters in one step without CV support; you need to overwrite the steps object. Something like this:

for components in components_list:
    recipe = ml.Recipe(ml.ExpandDate(ml.date(), components=components))
    pipe = Pipeline([("recipe", recipe), ("model", model)])  # Pipeline takes (name, step) pairs
    # fit and evaluate pipe
    pipe.fit(...).score(...)

It is possible to have underscore-syntax string parameters for steps, but we have no CV splitter right now.

In this particular case, what if I want to measure the effect of including/excluding dow or hour? Is that possible? Can I use an underscore-syntax string just like in scikit-learn to configure that? Or do I need to overwrite the steps object?

@jitingxu1
Collaborator

The more I think about it, the more I wonder if one should perhaps be careful. After all, do we really want to use parallel grid search with IbisML as a backend? That might not play super nicely. Not to mention ... how would caching work? The sklearn memory system might not understand IbisML well enough.

Keeping the domains separate might also just be fine for the time being. There are a ton of details to get right.

Thank you for your input. In IbisML, the backend (local or remote) handles the data preprocessing, while model training takes place on the training node. We can still run GridSearchCV on model hyperparameters using the preprocessed output from IbisML (without including the IbisML recipe in the sklearn pipeline). However, at present, we do not support integrating the tuning of data preprocessing steps within GridSearchCV.

recipe = ml.Recipe(.....)
X = recipe.fit_transform(X, y)  # preprocessed output from IbisML

# Build the sklearn pipeline and run GridSearchCV from here
...

IbisML does the feature transformation on a compute backend and transfers the preprocessed data from the compute backend to the training process; the data movement cost will be significantly high when training with large datasets.

The second reason is that random or grid search may not be suitable for tuning preprocessing parameters. Instead, preprocessing parameter tuning often requires feature analysis, for example, choosing between imputing a feature with the median or mean.

@deepyaman
Collaborator

@deepyaman I have opened a PR #141 meant for discussion to suggest such a sklearn transformer wrapper.

Just took a quick look at this. I think we should be able to do this, as you show, and also without inheriting from BaseEstimator (we have generally tried to implement the scikit-learn interface for compatibility without requiring it, for environments where people may use IbisML without scikit-learn).

The main gap (and also addressing this issue) is that get_params and set_params are implemented for Recipe objects, but not for Steps.

I and/or @jitingxu1 will try and take a closer look tomorrow!

@gtauzin

gtauzin commented Aug 26, 2024

@deepyaman @jitingxu1 You could indeed implement a get_params/set_params pair of methods at the recipe level that retrieves/sets parameters of each step, but you would need to give each step a name. The name could simply be the index in the step list.
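A toy sketch of that idea (purely illustrative; TinyRecipe/TinyStep are made-up names, not IbisML classes), naming each step after its lowercased class and exposing nested parameters with scikit-learn's double-underscore convention:

```python
class TinyStep:
    """Stand-in for an IbisML step with one tunable parameter."""
    def __init__(self, components):
        self.components = components


class TinyRecipe:
    """Toy recipe exposing nested params via the `step__param` convention."""
    def __init__(self, *steps):
        self.steps = list(steps)

    def _named_steps(self):
        # name each step after its lowercased class name, like make_pipeline does
        return [(type(s).__name__.lower(), s) for s in self.steps]

    def get_params(self, deep=True):
        params = {"steps": self.steps}
        if deep:
            for name, step in self._named_steps():
                params[name] = step
                for attr, value in vars(step).items():
                    params[f"{name}__{attr}"] = value
        return params

    def set_params(self, **params):
        lookup = dict(self._named_steps())
        for key, value in params.items():
            name, sep, attr = key.partition("__")
            if not sep:
                raise ValueError(f"expected '<step>__<param>', got {key!r}")
            setattr(lookup[name], attr, value)
        return self
```

With this, `TinyRecipe(TinyStep(["dow", "hour"])).get_params()` contains a `tinystep__components` key, and `set_params(tinystep__components=["dow"])` mutates the step in place, which is all a grid search needs. Duplicate step classes would need de-duplicated names, e.g. an index suffix.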

I think what I am missing is the reason for the existence of the Recipe itself. From what I understand, the Recipe was created to mimic a pipeline while ensuring that transformations are lazily executed. Now that we know this can be the case with a sklearn Pipeline, what are the pros of Recipe vs Pipeline?

@deepyaman
Collaborator

I think what I am missing is the reason of existence of the Recipe itself. From what I understand, the Recipe was created to mimic a pipeline ensuring that transformations are lazily executed. Now that we know it can be the case with a sklearn Pipeline, what are the pros of Recipe vs Pipeline?

@gtauzin Sorry, I didn't realize I hadn't answered this!

I actually had similar questions when I first started working on IbisML. Less on the exact point of Recipe vs. Pipeline, and more around Step vs. Estimator—IbisML took a lot of inspiration from tidymodels recipes, to provide what some may consider a more ergonomic API (e.g. selectors instead of ColumnTransformers and FeatureUnions). (To that point, I think @koaning has also done some separate thinking on simplifying some of these workflows in https://github.com/koaning/scikit-playtime.) At the same time, a lot of people may be very familiar with scikit-learn and its syntax and have no complaints.

In addition to that, at least initially, we weren't sure the extent to which we could/should integrate with scikit-learn. As some of the above discussions have covered, when everything is expressed in scikit-learn, one would also expect all those features of scikit-learn to work.

That said, especially since we're targeting Python users, if we can make more and more stuff also work seamlessly with scikit-learn—including not having to use Recipes—I think that's great! To be honest, what I would love more than anything is to leave my scikit-learn code unchanged but to leverage any Ibis backend efficiently—i.e. magically enable any scikit-learn user's workflow to scale—but that of course is more ambitious. :)

@jcrist I know you also have more history with the origin of IbisML; feel free to chime in if I missed or misrepresented anything!

@koaning
Author

koaning commented Aug 30, 2024

(To that point, I think @koaning has also done some separate thinking on simplifying some of these workflows in https://github.com/koaning/scikit-playtime.)

Perhaps to add context: all the tools in the playtime project are effectively just syntactic sugar around normal sklearn pipelines. All the functions and operators do is modify pipelines under the hood. It currently uses Narwhals to support multiple dataframe libraries, but I am assuming dataframes that fit into memory here, which is a different ballgame to what y'all are doing here.

@jcrist
Member

jcrist commented Aug 30, 2024

I know you also have more history with the origin of IbisML; feel free to chime in if I missed or misrepresented anything!

Sure. There are a number of reasons why we went with the current design.

  • We wanted to have an API that could work without scikit-learn, for users that may want to use ibis-ml for preprocessing before feeding into some other framework. To that end we'd also need our own pipeline-like concept.
  • The API design is heavily cribbed from tidymodels (hence the name Recipe). In ibis-ml a Recipe is a sequence of transforms applied to an input table adding/modifying/removing columns as they process. Personally I find this approach a lot easier to read and write than the sklearn ColumnTransformer approach (e.g. https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html). In this case the transformations of the input data mirror how I might write them using ibis itself, as an ordered sequence of mutations.

All that said, we did want you to be able to place an ibis_ml.Recipe into a sklearn pipeline for ease of use with the sklearn ecosystem. The intention here is that the recipe is the first step in the pipeline, followed by one (or more) sklearn transformers/estimators. By default the output type of a recipe is something sklearn compatible, while accepting a number of other input types (ibis.Table included). This makes it work well with existing sklearn code which expects aligned in-memory containers to be passed around for X/y.


Regarding #145, I think you'll find you'll run into issues if you drop the full Recipe and only use a pipeline with a series of steps. Sklearn pipelines plumb around X and y, but they don't let the individual steps modify y. The loop looks something like:

def fit(X, y=None):
    for step in steps[:-1]:
        X = step.fit_transform(X, y)  # only X is transformed; y is never modified
    return steps[-1].fit(X, y)

A few downsides with dropping the wrapping Recipe to rely on sklearn's builtin logic:

  • This is the big one. sklearn pipelines accept aligned in-memory containers (like two numpy arrays for X, y), and expect steps to apply transforms without reordering rows. When a Recipe gets these as inputs it coerces them to ibis types, then plumbs them through each step in turn. It does this while maintaining the original ordering so the output of transform will remain aligned with the input y. The trick here is that ibis itself doesn't maintain row ordering, we have to:

    • add a new ordering column on ingest to dictate the order
    • apply all the steps in the recipe
    • sort by (and then drop) the ordering column

    Maintaining the ordering on each step would be expensive; applying ordering only at the output of ibis_ml.Recipe makes this a lot cheaper. Removing Recipe so that a sklearn Pipeline manages the plumbing would prevent coordinating ordering between steps, or even knowing what the initial ordering was, without casting back to an in-memory output between steps.

  • Adding steps directly to a pipeline would make the demarcation between ibis-ml stuff and downstream sklearn stuff easier to mix up. Isolating ibis things to inside a Recipe prevents users from adding a sklearn step in between ibis steps (a performance footgun), encouraging better practices of ibis-things -> sklearn-things.

  • Not every estimator in sklearn is forgiving to non-numpy-like inputs. Some may call np.asarray on the input, some might explode. For ecosystem compatibility we really want the output type of transform/fit_transform to default to the standard numpy. Perhaps this is a bug, but it's harder to change the sklearn ecosystem than to just use the standard default value of set_output in ibis_ml.

  • And if every step needs to default to outputting an in-memory container, then every step would need to re-ingest that input, preventing efficient pipelining between steps.
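The three-step ordering trick described above can be sketched in plain Python (no Ibis here; `_order` is just an illustrative column name, not what IbisML actually uses):

```python
rows = [{"fare": 7.25}, {"fare": 71.28}, {"fare": 8.05}]

# 1. add an ordering column on ingest to dictate the order
tagged = [dict(r, _order=i) for i, r in enumerate(rows)]

# 2. apply all the steps in the recipe; the backend is free to reorder rows
shuffled = sorted(tagged, key=lambda r: r["fare"])  # simulate backend reordering
transformed = [dict(r, fare_x2=r["fare"] * 2) for r in shuffled]

# 3. sort by (and then drop) the ordering column
restored = sorted(transformed, key=lambda r: r["_order"])
output = [{k: v for k, v in r.items() if k != "_order"} for r in restored]
# `output` is row-aligned with the original input again, so it stays aligned with y
```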


All that said, I do think we can still make get_params/set_params work on a Recipe (exposing options on individual steps), but I don't think we should/could make the steps themselves valid sklearn transformers. Better to isolate ibis_ml things inside a container so we can better (and more efficiently) control how processing of ibis-related-steps happens.

@gtauzin

gtauzin commented Sep 13, 2024

Thanks @deepyaman @jcrist for taking the time to write such detailed answers to my question, I really appreciate it! Sorry for the time it took me to get back to you.

  • We wanted to have an API that could work without scikit-learn, for users that may want to use ibis-ml for preprocessing before feeding into some other framework. To that end we'd also need our own pipeline-like concept.

When I discovered IbisML, I first thought that it could be used to write backend-independent custom dataframe transformations. I was expecting that a Recipe could be inserted after a sklearn transformer that outputs a dataframe (e.g. if we applied set_output(transform="polars") to it), process it, and output a dataframe according to the Recipe's own set_output transform setting. Should I understand that this is not the purpose of IbisML? Would you rather recommend that I use Narwhals for that, as @koaning mentioned?

  • The API design is heavily cribbed from tidymodels (hence the name Recipe). In ibis-ml a Recipe is a sequence of transforms applied to an input table adding/modifying/removing columns as they process. Personally I find this approach a lot easier to read and write than the sklearn ColumnTransformer approach (e.g. https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html). In this case the transformations of the input data mirror how I might write them using ibis itself, as an ordered sequence of mutations.

Thanks, I have never used tidymodels, but I'll have a look. Anyway, it is also my experience that writing preprocessing / feature-generation pipelines mixing Pipeline, FeatureUnion, and ColumnTransformer can get very complex. I believe a lot of scikit-learn users end up writing a selector transformer, plus a wrapper combining a selector and a transformer, to reproduce exactly the behavior you are describing.

As @deepyaman mentioned, @koaning's scikit-playtime does that in a very neat way. However, it does not allow you to customize step names; it relies on make_pipeline and make_union to generate them (using the transformation class name). For data preprocessing that does not need tuning, that is perfectly fine, but when it comes to tuning a complex transformation pipeline, IMO it is quite nice to be able to refer to steps by custom names. For IbisML, I guess custom or generated step naming would be necessary to improve get_params/set_params.

A few downsides with dropping the wrapping Recipe to rely on sklearn's builtin logic:

  • This is the big one. sklearn pipelines accept aligned in-memory containers (like two numpy arrays for X, y), and expect steps to apply transforms without reordering rows. When a Recipe gets these as inputs it coerces them to ibis types, then plumbs them through each step in turn. It does this while maintaining the original ordering so the output of transform will remain aligned with the input y. The trick here is that ibis itself doesn't maintain row ordering, we have to:

    • add a new ordering column on ingest to dictate the order
    • apply all the steps in the recipe
    • sort by (and then drop) the ordering column

    Maintaining the ordering on each step would be expensive; applying ordering only at the output of ibis_ml.Recipe makes this a lot cheaper. Removing Recipe so that a sklearn Pipeline manages the plumbing would prevent coordinating ordering between steps, or even knowing what the initial ordering was, without casting back to an in-memory output between steps.

I had not thought of that; it makes a lot of sense. When working with pandas and scikit-learn, I always have to be careful to keep my index sorted. I guess this is a similar issue here.

All that said, I do think we can still make get_params/set_params work on a Recipe (exposing options on individual steps), but I don't think we should/could make the steps themselves valid sklearn transformers. Better to isolate ibis_ml things inside a container so we can better (and more efficiently) control how processing of ibis-related-steps happens.

Thanks for clarifying!

@deepyaman
Collaborator

When I discovered IbisML, I first thought that it could be used to write backend-independent custom dataframe transformations. I was expecting that a Recipe could be inserted after a sklearn transformer that outputs a dataframe (e.g. if we applied set_output(transform="polars") to it), process it, and output a dataframe according to the Recipe's own set_output transform setting. Should I understand that this is not the purpose of IbisML? Would you rather recommend that I use Narwhals for that, as @koaning mentioned?

Hmm, let me test this case that you describe (maybe tomorrow? 🤞). That said, the use case we imagine (and this can always change based on usage and feedback!) is more that you do your fit/transform workflows with IbisML immediately after feature engineering (which you can do with vanilla Ibis). This continues the theme of pushing computation to the scalable computing engine (database, distributed computing framework, whatever) where your data already lives. Intuitively, it feels like there is less value in being able to choose amongst Ibis backends after you have put the data into Polars.

@deepyaman
Collaborator

@koaning @gtauzin I've implemented get_params() and set_params() (for Recipe objects, which aim to be compatible with the scikit-learn estimator interface), and this is released in v0.1.3. I wanted to get all your thoughts on whether this issue is sufficiently addressed; I'll still leave #136 open until we more explicitly check the use of Recipes in hyperparameter tuning and other such workflows.

(As a side note, I know I still owe an answer re #135 (comment))

@gtauzin

gtauzin commented Sep 18, 2024

Thanks @deepyaman. For me, it looks good. I do not yet fully understand how useful sklearn-based tuning can be with an IbisML Recipe on a SQL backend. I'll be keeping a close eye on how it all evolves! :)
