How to use get_params()
#135
I also ran into the same issue while trying to use IbisML within a pipeline that is then tuned with an sklearn search CV. To make it work, I was naively considering wrapping the IbisML steps I need in sklearn transformers and combining them directly in an sklearn Pipeline (instead of the recipe). Is this a bad idea? I am no Ibis expert, but it seems to me that having IbisML steps be scikit-learn estimators, plus a few additional Ibis-compatible utilities such as CV splitters, would already open quite a few doors.
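The wrapping idea could be sketched roughly as follows. This is purely hypothetical: `IbisMLStepWrapper` is not part of IbisML, and the `fit`/`transform` signatures of the wrapped step are assumed, not taken from IbisML's actual API.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class IbisMLStepWrapper(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper exposing an IbisML-style step as an sklearn transformer."""

    def __init__(self, step):
        # Stored unmodified so that get_params()/set_params()/clone() work.
        self.step = step

    def fit(self, X, y=None):
        self.step.fit(X, y)
        return self

    def transform(self, X):
        return self.step.transform(X)
```

Because the wrapped step is a constructor argument, `get_params()` exposes it under the key `step`, so it can at least be swapped wholesale in a grid search, even before per-attribute tuning is solved.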
The more I think about it, the more I wonder if one should be careful here. After all, do we really want to run a parallel grid search with IbisML as a backend? That might not play super nicely. Not to mention: how would caching work? The sklearn memory system might not understand IbisML well enough. Keeping the domains separate might also just be fine for the time being. There must be a ton of details to get right.
How are you wrapping them? I think the issue will be that the step boundary will evaluate the transformations eagerly (i.e. you'll basically end up passing NumPy arrays between steps, whereas, wrapped in a
@gtauzin I think this is correct! As mentioned in #136 (comment), we've added a basic… Is there any chance you'd be open to sharing a bit more about how you're using IbisML (happy to connect separately)? Among other things, it would be helpful to know:
@koaning This is a good point. AFAIK, pipeline caching relies on joblib Memory, which itself supports (only?) NumPy arrays. I am not sure how it works when set_output is set to a transform of "polars" or "pandas". In any case, if the transformation has not been executed by the end of the cached step, then no caching can take place. Indeed, if one uses IbisML to define a feature generation pipeline to be tuned with grid search, it is critical that caching works. However, IMO it is less of a problem if one uses RandomizedSearchCV or OptunaSearchCV, as caching is less useful in those cases (except if the "earliest" tuned params are all categorical). I think it may still be worth trying to list all the potential breaking points of having Ibis sklearn transformers.
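For reference, the caching mechanism being discussed is sklearn's `Pipeline(memory=...)`, which caches *fitted transformers* on disk via joblib so repeated fits during a search can reuse earlier results. A minimal sketch with standard sklearn pieces (nothing IbisML-specific):

```python
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)

# Fitted transformers are cached on disk, keyed by their parameters and
# input; a grid search that revisits the same transformer settings can
# skip refitting them.
pipe = Pipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression())],
    memory=mkdtemp(),
)
pipe.fit(X, y)
```

Whether joblib can hash and pickle the objects a lazily-evaluated transformer passes along is exactly the open question raised above.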
IMO, there are two potential pitfalls one should be careful about in order to get the same behavior with sklearn transformers as with the IbisML recipe:
sklearn Pipeline (and FeatureUnion) do not call check_array and leave it to the transformers to handle the input type. I believe FeatureUnion is also quite useful for feature generation, so it is important to be compatible with it as well. It relies on an hstack method to concatenate features together. This hstack can be customized by defining and registering an IbisAdapter (thus allowing an "ibis" set_output transform; see here for the pandas one).
I see. I don't expect there should be any problem having ibis CV splitters. Maybe it would even be possible to wrap them.
I do not (yet) use Ibis or IbisML on any project. I just started looking into it to see what was possible with it and what use I could make of it. At the moment, that is still unclear to me. As for the modeling libraries, I write everything I need in a way that is compatible with the sklearn API. This gives me immediate access to any technique available in the sklearn ecosystem, which is very rich and easily extendable. I work mostly with tabular and time series data, so I am interested in dataframe support in general. Happy to provide better feedback once I have a clearer idea!
@deepyaman I have opened PR #141, meant for discussion, to suggest such an sklearn transformer wrapper.
It is possible to measure different parameters in one step without CV support; you need to overwrite the steps object. Something like this:

```python
for components in components_list:
    recipe = ml.Recipe(ml.ExpandDate(ml.date(), components=components))
    pipe = Pipeline(recipe, model)
    # fit and evaluate pipe
    pipe.fit(...).score(...)
```

It is possible to have an underscore-syntax-like string parameterization of steps, but we have no CV splitter right now.
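For comparison, this is what the underscore syntax looks like in a plain scikit-learn pipeline (a standard sklearn sketch, not IbisML code): the manual loop above is what `GridSearchCV` automates once parameters are addressable by name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=60, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# "stepname__param" addresses any constructor argument of any step.
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0]}, cv=3)
grid.fit(X, y)
```

The missing pieces for IbisML, per this thread, are per-step parameter names and an Ibis-aware CV splitter.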
Thank you for your input. In IbisML, the backend (local or remote) handles the data preprocessing, while model training takes place on the training node. We can still run GridSearchCV on model hyperparameters using the preprocessed output from IbisML (not including the IbisML recipe in the sklearn pipeline). However, at present, we do not support integrating the tuning of data preprocessing steps within GridSearchCV.

```python
recipe = ml.Recipe(.....)
X = recipe.fit_transform(X, y)  # preprocessed output from IbisML
# Build sklearn pipeline and do GridSearchCV starting from here
...
```

IbisML does the feature transformation on a compute backend, so it transfers the preprocessed data from the compute backend to the training process; the data movement cost will be significantly high for training with large datasets. The second reason is that random or grid search may not be suitable for tuning preprocessing parameters. Instead, preprocessing parameter tuning often requires feature analysis, for example, choosing between imputing a feature with the median or the mean.
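The split described above (preprocess once, then tune only the model) can be sketched with a stand-in transformer in place of the IbisML recipe; `StandardScaler` below is just a placeholder for `recipe.fit_transform`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

X_raw, y = make_classification(n_samples=60, random_state=0)

# Stand-in for `X = recipe.fit_transform(X, y)`: preprocessing runs once,
# outside the search, so only model hyperparameters are tuned.
X = StandardScaler().fit_transform(X_raw)

search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
```

Note the trade-off mentioned in the comment: because preprocessing happens outside the CV loop, its parameters cannot be searched over, and the preprocessed data must be moved to the training process once.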
Just took a quick look at this. I think we should be able to do this, as you show, and also without inheriting from… The main gap (and also addressing this issue) is that… I and/or @jitingxu1 will try and take a closer look tomorrow!
@deepyaman @jitingxu1 You could indeed implement a get_params / set_params pair of methods at the recipe level that retrieves/sets the parameters of each step, but you would need to give each step a name. The name could simply be the index in the step list. I think what I am missing is the reason for the existence of the Recipe itself. From what I understand, the Recipe was created to mimic a pipeline while ensuring that transformations are lazily executed. Now that we know this can be the case with an sklearn Pipeline, what are the pros of Recipe vs. Pipeline?
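A rough sketch of the index-as-name idea. This is hypothetical illustration only, not IbisML code: the `Recipe` class, the use of `vars()` to enumerate step attributes, and the `"i__attr"` key format are all assumptions.

```python
class Recipe:
    """Hypothetical: expose per-step params using list indices as step names."""

    def __init__(self, *steps):
        self.steps = list(steps)

    def get_params(self, deep=True):
        params = {"steps": self.steps}
        if deep:
            for i, step in enumerate(self.steps):
                params[str(i)] = step
                # Flatten each step's attributes under "<index>__<attr>",
                # mirroring sklearn's "<name>__<param>" convention.
                for key, value in vars(step).items():
                    params[f"{i}__{key}"] = value
        return params

    def set_params(self, **params):
        for key, value in params.items():
            if "__" in key:
                idx, attr = key.split("__", 1)
                setattr(self.steps[int(idx)], attr, value)
        return self
```

With this shape, `GridSearchCV(pipe, {"recipe__0__components": [...]})`-style addressing would become possible, at the cost of param names that are positional rather than descriptive.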
@gtauzin Sorry, I didn't realize I didn't answer this! I actually had similar questions when I first started working on IbisML. Less to the exact point about… In addition to that, at least initially, we weren't sure to what extent we could/should integrate with scikit-learn. As some of the above discussions have covered, when everything is expressed in scikit-learn, one would also expect all those features of scikit-learn to work. That said, especially since we're targeting Python users, if we can make more and more stuff also work seamlessly with scikit-learn, including not having to use…
@jcrist I know you also have more history with the origin of IbisML; feel free to chime in if I missed or misrepresented anything!
Perhaps to add context, all the tools in the playtime project are effectively just syntactic sugar around normal sklearn pipelines. All the functions and operators do is modify pipelines under the hood. It currently uses Narwhals to support multiple dataframe libraries, but I am assuming dataframes that fit into memory here, which is a different ballgame from what y'all are doing.
Sure. There are a number of reasons why we went with the current design.
All that said, we did want you to be able to place an… Regarding #145, I think you'll find you'll run into issues if you drop the full…

```python
def fit(X, y=None):
    for step in steps[:-1]:
        X = step.fit_transform(X, y)
    return steps[-1].fit(X, y)
```

A few downsides with dropping the wrapping:
All that said, I do think we can still make…
Thanks @deepyaman @jcrist for taking the time to write such detailed answers to my question, I really appreciate it! Sorry for the time it took me to get back to you.
When I discovered IbisML, I first thought that it could be used to write backend-independent custom dataframe transformations. I was expecting that a…
Thanks, I have never used… As @deepyaman mentioned, @koaning's scikit-playtime does that in a very neat way. However, it does not allow you to customize steps' names but relies on…
I had not thought of that; this makes a lot of sense. When working with…
Thanks for clarifying!
Hmm, let me test the case that you describe (maybe tomorrow? 🤞). That said, the use case we imagine (and this can always change based on usage and feedback!) is more that you do your fit/transform workflows with IbisML immediately after feature engineering (which you can do with vanilla Ibis). This continues the theme of pushing computation to the scalable computing engine (database, distributed computing framework, whatever) where your data already lives. Intuitively, it feels like there is less value in being able to choose among Ibis backends after you put the data into Polars.
@koaning @gtauzin I've implemented… (As a side note, I know I still owe an answer re #135 (comment).)
Thanks @deepyaman. It looks good to me. I am not yet fully understanding how useful sklearn-based tuning can be with an IbisML…
I have a scikit-learn pipeline defined in the code below.
When I ask for the params of said pipeline I can see a long list of names that I can refer to when I do hyperparameter tuning.
The list is long, but that is because it is nice and elaborate.
The reason why this is nice is that it allows me to be very specific. I can tune each input argument of every component, like featureunion__pipeline__selectcols__cols or featureunion__pipeline__onehotencoder__sparse_output. This is very nice for grid search! The cool thing about this is that I am able to get a nice table as output too.
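For readers without the original snippet, the kind of parameter listing described here can be reproduced with any small pipeline; `make_pipeline` auto-generates the lowercase step names that the double-underscore paths start from. (This pipeline is an illustrative stand-in, not the exact one from the issue.)

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

pipe = make_pipeline(OneHotEncoder(), LogisticRegression())

# get_params(deep=True) exposes every constructor argument of every step
# under "<stepname>__<param>" keys, e.g. "onehotencoder__handle_unknown".
params = pipe.get_params()
```

Every one of those keys is a valid target for GridSearchCV's param grid, which is what makes `cv_results_` so readable afterwards.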
But when I look at IbisML I wonder if I am able to do the same thing. In IbisML it is the Recipe object that is scikit-learn compatible, not the ExpandDateTime object. So let's inspect. This yields the following.
In fairness, this is not completely unlike what scikit-learn does natively. In a pipeline in scikit-learn you also have access to the steps argument, and you could theoretically make all the changes there directly by passing in new subpipelines. But there is a reason why scikit-learn does not stop there! It can go deeper into all the input arguments of all the estimators in the pipeline, because doing so makes the final cv_results_ output a lot nicer. And this is where I worry whether IbisML can do the same thing. It seems that I need to pass full objects, instead of being able to pluck out the individual attributes that I care about. In this particular case, what if I want to measure the effect of including/excluding dow or hour? Is that possible? Can I have an underscore-syntax-like string just like in scikit-learn to configure that? Or do I need to overwrite the steps object?