compliance with ScikitLearn API #39

Closed
ExpandingMan opened this issue Feb 8, 2017 · 10 comments

@ExpandingMan
Collaborator

Like many, I frequently use the "ScikitLearn" paradigm where I create a model object and then call functions like fit! and predict. In Julia, with multiple dispatch, it's trivially easy to get arbitrarily complicated machine learning methods to follow this paradigm.

Except... with this package, because model creation and training are done in a single function call. This means that in most cases one has to write custom code to plug in XGBoost; it is the only package I'm aware of where these steps are combined.

Has there been any thought to adding user-friendly code that allows model objects to be created separately? It looks like this is possible using the Booster class, but as it stands one would have to rewrite most of the xgboost function to set the parameters and so forth.
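
For concreteness, the kind of workflow I have in mind looks roughly like the sketch below; XGBoostModel, fit!, and predict are purely illustrative names here, not an existing API in this package:

```julia
# Hypothetical usage sketch: the type and function names are illustrative only.
model = XGBoostModel(num_round = 100, max_depth = 6, eta = 0.3)  # create the model object
fit!(model, X_train, y_train)   # train it later, possibly inside a generic pipeline
ŷ = predict(model, X_test)      # predict on new data
```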

@Allardvm
Contributor

Allardvm commented Feb 9, 2017

As far as I know, there are no concrete plans to provide this functionality for the Julia package, because the general approach is to provide a consistent interface for XGBoost across languages. It's this consistent version of the interface that is implemented for Julia, and I'm currently refactoring the package to be even more consistent (for some progress, see https://github.com/Allardvm/XGBoost.jl/tree/package-refactor).

For most languages, however, there is some alternative functionality specific to that language, such as the scikit-learn interface for Python, and the Julia package doesn't have this. In this sense, there is a case to be made for providing an alternative interface, but at this point there isn't a single dominant framework like scikit-learn for Julia that provides clear guidelines on what to implement (and how). For another gradient boosting library, I built a Julia interface that is somewhat similar to scikit-learn in Python (see https://github.com/Allardvm/LightGBM.jl); it would satisfy your use case and would be reasonably easy to implement for XGBoost as well.

For me, the key question would be how far to take such an alternative interface. On the one hand, I could try to build it to (also) be compatible with the main interface (e.g. you would be able to call nfold_cv(::XGB_Regressor)), but that might be somewhat unwieldy in the long run. On the other hand, having two incompatible interfaces in the same package would be confusing to new users and more difficult to maintain.

@ExpandingMan
Collaborator Author

ExpandingMan commented Feb 9, 2017

Well, other conformity issues aside, every Julia and Python machine learning package that I'm aware of has:

  • A model object of some sort.
  • A training function that takes the model, the training data X (matrix), and the target y (vector or matrix).
  • A prediction function that takes test data X (matrix).

The major exceptions I'm aware of are methods that accept higher-rank tensors. So, even though Julia lacks a package as universally used as ScikitLearn is in Python, most of the machine learning code currently available for it still works this way. I can't think of a single other example in either Julia or Python where one can't declare the model object independently of training it.

To give you an example of my use case, I frequently use:

  • ScikitLearn.jl
  • MultivariateStats.jl
  • MXNet.jl
  • TensorFlow.jl (or the Python tensorflow package with PyCall)
  • XGBoost.jl

In all of these cases except XGBoost.jl one can create a model object before training it. Other details of the interface aside, not having this breaks my process and necessitates specialized code for XGBoost. The other details don't matter to me nearly as much: if the fit! and predict functions have different names, or accept different arguments, it's very easy to write my own fit! and predict functions that adapt the arguments and call the necessary methods. Even training multiple models simultaneously in parallel, as I mentioned in #38, is pretty easy (though I still think it'd be nice to have that functionality built in). The options for dealing with model creation and training happening in the same step aren't nearly as good: I'd pretty much have to create a wrapper type. At the moment I'm just using a Vector{Booster}, which is of course no big deal, but it seems like such an inelegant solution.
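
To make the wrapper idea concrete, something like the sketch below is roughly what I end up writing; the type and its fields are just illustrative, and it assumes the xgboost(X, nrounds; label = y, ...) call from the README accepts booster parameters as keyword arguments:

```julia
import XGBoost

# Illustrative wrapper type only; not an existing API in this package.
mutable struct XGBWrapper
    num_round::Int
    params::Dict{Symbol,Any}
    booster    # holds the trained XGBoost.Booster after fit!, nothing before
end
XGBWrapper(num_round; kwargs...) = XGBWrapper(num_round, Dict{Symbol,Any}(kwargs), nothing)

# Training and prediction in the usual two-step style.
function fit!(m::XGBWrapper, X::AbstractMatrix, y::AbstractVector)
    m.booster = XGBoost.xgboost(X, m.num_round; label = y, m.params...)
    return m
end

predict(m::XGBWrapper, X::AbstractMatrix) = XGBoost.predict(m.booster, X)
```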

I could be wrong, but I strongly suspect many other users have exactly this same issue. I'm not trying to make an argument for any other aspect of the interface, only for this one, which, to me at least, causes headaches. Anyway, that's the best case I can make, so I won't try to persuade you guys any further. Thanks for maintaining this package; it seems quite robust and (other than this one issue) easy to use.

Edit: And, of course, adding this functionality would not mean that you'd have to get rid of the existing xgboost function, so it wouldn't break a thing.

@slundberg
Collaborator

slundberg commented Feb 14, 2017

I can see why you would want consistency, though in this case it is tricky since consistency with the other XGBoost bindings is also valuable. Since @Allardvm is working on the refactor I would leave it to him to see what is best.

One comment though is that you could call xgboost with 0 boosting rounds. Then just call update as many times as you like for training (and eval_set if you want stats).
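
Roughly like this; the argument order for update and eval_set below mirrors the Python Booster methods and is an assumption on my part, so double-check it against the Julia wrapper:

```julia
import XGBoost

# Build the DMatrix and a Booster with 0 boosting rounds: the model object
# now exists but is untrained.
dtrain = XGBoost.DMatrix(X_train, label = y_train)
bst = XGBoost.xgboost(dtrain, 0, eta = 0.3, max_depth = 6)

# Train incrementally, one round per call, printing eval stats along the way.
# (Argument order for update/eval_set is assumed from the Python API.)
for iter in 1:100
    XGBoost.update(bst, dtrain, iter)
    println(XGBoost.eval_set(bst, [(dtrain, "train")], iter))
end
```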

@ExpandingMan
Collaborator Author

One comment though is that you could call xgboost with 0 boosting rounds. Then just call update as many times as you like for training (and eval_set if you want stats).

Thanks for the tip. It doesn't completely "fix" all the data input/output steps but it's quite useful!

@cstjean

cstjean commented Mar 8, 2017

It wouldn't be hard to implement the ScikitLearn.jl interface on top of the existing code by creating a new type. It's what I did for DecisionTree.jl (see this file) and LowRankModels.jl. I understand the reluctance to add more code. FWIW, the ScikitLearnBase.jl interface has been essentially unchanged since it started, and sticks very close to scikit-learn. I could make a PR if there is interest.
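
To give a sense of the footprint, the wrapper would look something along these lines; XGBoostRegressor and its fields are placeholder names, and the xgboost call is abbreviated:

```julia
import ScikitLearnBase
import XGBoost

# Placeholder estimator type; field names are illustrative only.
mutable struct XGBoostRegressor
    num_round::Int
    max_depth::Int
    eta::Float64
    booster    # trained XGBoost.Booster after fit!
    XGBoostRegressor(; num_round = 100, max_depth = 6, eta = 0.3) =
        new(num_round, max_depth, eta, nothing)
end

# Generates get_params/set_params!/clone from the listed hyperparameters.
ScikitLearnBase.@declare_hyperparameters(XGBoostRegressor, [:num_round, :max_depth, :eta])

function ScikitLearnBase.fit!(m::XGBoostRegressor, X, y)
    m.booster = XGBoost.xgboost(X, m.num_round; label = y,
                                max_depth = m.max_depth, eta = m.eta)
    return m
end

ScikitLearnBase.predict(m::XGBoostRegressor, X) = XGBoost.predict(m.booster, X)
```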

If there isn't, well, I'm kinda curious to see what a pure-Julia XGBoost would look like, and might start work on that. Has there been any effort on that side?

@ExpandingMan
Collaborator Author

It would be kind of nice to make a common machine learning interface for Julia, sort of the way JuMP works for optimization. There is MLBase, but for me, it would be nicer to have an interface that I could plug any of ScikitLearn, XGBoost, TensorFlow, MXNet into.

I've toyed with the idea of doing this, but frankly about 95% of my time and effort is spent getting the data into the proper form to be ingested by machine learning, so the machine learning interfaces themselves have seemed like a relatively minor issue.

@cstjean

cstjean commented Mar 8, 2017

There is MLBase, but for me, it would be nicer to have an interface that I could plug any of ScikitLearn, XGBoost, TensorFlow, MXNet into.

That's the objective behind ScikitLearnBase. JuliaML is also working on something similar. It's more ambitious, but not mature yet. We've already discussed making those two interfaces compatible where possible, once the dust settles on their designs.

@Allardvm
Contributor

Allardvm commented Mar 9, 2017

I'll look into the ScikitLearnBase.jl interface. It closely matches what I planned for the XGBoost interface anyway, so we might as well make it compatible. Since I'm quite busy at the moment, progress is a bit slow, but you can check the current state here: https://github.com/Allardvm/XGBoost.jl/tree/package-refactor. This version is fully functional and you're welcome to contribute/test it.

@ViralBShah
Collaborator

ViralBShah commented Apr 18, 2019

Just checking in here after a long time. What's the current plan for APIs and such here? There's the ScikitLearn.jl approach and the MLJ.jl folks as well.

@ExpandingMan
Collaborator Author

Funny how little patience I have to go back and read my own incredibly verbose issue posts from a long time ago.

As of 2.0, the layout of this package largely matches libxgboost itself. The Booster object can be used in a ScikitLearn-like interface. Note also that MLJXGBoostInterface has now been overhauled to work with 2.0.
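
For anyone finding this now, the 2.0 interface already separates construction from training, roughly along these lines (keyword names quoted from memory, so check the current docs):

```julia
using XGBoost

# Construct the (untrained) Booster first, then train and predict as separate steps.
dtrain = DMatrix(X_train, y_train)
bst = Booster(dtrain; max_depth = 6, eta = 0.3, objective = "reg:squarederror")
update!(bst, dtrain; num_round = 100)
ŷ = predict(bst, X_test)
```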

I'm therefore closing this issue.
