support for multi-dimensional "label" for regressions? #38

Open
ExpandingMan opened this issue Feb 3, 2017 · 5 comments

Comments

@ExpandingMan
Collaborator

Hello all. I haven't dug too far into the source code yet, but I'm wondering if it's possible to do regressions where the "label" (target value) consists of multi-dimensional data points. (i.e. the label argument of the xgboost function would be an Array{T<:Number,2}.) This seems like a pretty important feature, but I can't find any literature about it in the xgboost documentation for any language.

It seems to me that even if it's not explicitly supported, this should be possible by setting a custom loss function. However, I get the following error any time I try to pass a matrix-valued "label":

ERROR: LoadError: MethodError: no method matching (::XGBoost.#_setinfo#8)(::Ptr{Void}, ::String, ::Array{Float64,2})
Closest candidates are:
  _setinfo{T<:Number}(::Ptr{Void}, ::String, ::Array{T<:Number,1}) at /home/user/.julia/v0.5/XGBoost/src/xgboost_lib.jl:10
 in (::XGBoost.##call#7#11)(::Array{Any,1}, ::Type{T}, ::Array{Float64,2}, ::Bool, ::Float32) at /home/user/.julia/v0.5/XGBoost/src/xgboost_lib.jl:59
 in (::Core.#kw#Type)(::Array{Any,1}, ::Type{XGBoost.DMatrix}, ::Array{Float64,2}, ::Bool, ::Float32) at ./<missing>:0
 in makeDMatrix(::Array{Float64,2}, ::Array{Float64,2}) at /home/user/.julia/v0.5/XGBoost/src/xgboost_lib.jl:137
 in #xgboost#20(::Array{Float64,2}, ::Array{Any,1}, ::Array{Any,1}, ::Array{Any,1}, ::Type{T}, ::Type{T}, ::Array{Any,1}, ::Array{Any,1}, ::XGBoost.#xgboost, ::Array{Float64,2}, ::Int64) at /home/user/.julia/v0.5/XGBoost/src/xgboost_lib.jl:147
 in (::XGBoost.#kw##xgboost)(::Array{Any,1}, ::XGBoost.#xgboost, ::Array{Float64,2}, ::Int64) at ./<missing>:0
 in include_from_node1(::String) at ./loading.jl:488
while loading /home/user/RatingsPrediction/xgboost0.jl, in expression starting on line 43

Taking a look at the source code I get the impression it is not designed to pass labels that aren't Vectors into the C code. Certainly the above error seems to indicate that it is impossible to set a "label" that cannot be converted to Vector.

Is there any way around this? Does the Python API support this? Thanks.

@slundberg
Collaborator

slundberg commented Feb 3, 2017 via email

@ExpandingMan
Collaborator Author

Thanks for your prompt response.

I don't see any significant problem with using multiple models (as far as I can think, in the case of gradient boosted trees this should be exactly equivalent to "one" multi-dimensional model). Of course, one usually doesn't have to resort to this (from an API standpoint), hence the issue. Apart from convenience, I'd be a bit concerned about performance issues if I were fitting in a high-dimensional space, but perhaps that's unwarranted.
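The "one model per output dimension" workaround discussed here can be sketched as a thin wrapper that splits the label matrix into columns and fits an independent model on each. This is only a minimal sketch in Python, not XGBoost's API: `PerOutputRegressor` and `MeanRegressor` are hypothetical names, and `MeanRegressor` is a trivial stand-in where a real booster would go. The only assumption about the underlying model is that it exposes `fit(X, y)` (returning the model) and `predict(X)`.

```python
class MeanRegressor:
    """Stand-in learner (hypothetical): always predicts the mean of its training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class PerOutputRegressor:
    """Fit one independent single-output model per label column (hypothetical helper)."""

    def __init__(self, make_model):
        self.make_model = make_model  # zero-argument factory returning a fresh model
        self.models_ = []

    def fit(self, X, Y):
        n_outputs = len(Y[0])
        self.models_ = []
        for j in range(n_outputs):
            column = [row[j] for row in Y]  # j-th target as a 1-D label vector
            self.models_.append(self.make_model().fit(X, column))
        return self

    def predict(self, X):
        # One list of predictions per output, each of length n_samples.
        per_output = [m.predict(X) for m in self.models_]
        # Transpose back to n_samples rows of n_outputs values.
        return [list(row) for row in zip(*per_output)]


X = [[1, 0], [2, 2], [0, 3], [4, 4]]
Y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]

model = PerOutputRegressor(MeanRegressor).fit(X, Y)
predictions = model.predict(X)
print(predictions[0])  # one prediction per label column
```

Swapping the stand-in for a factory that builds a real single-output regressor gives the multi-model setup. It requires one full fit per label column, which is where the performance concern for high-dimensional targets comes from.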

@slundberg
Collaborator

slundberg commented Feb 3, 2017 via email

@mangolzy

mangolzy commented Oct 13, 2022

I have a related point of confusion. According to some out-of-date documentation, e.g.
https://xgboost.readthedocs.io/en/release_0.72/python/python_api.html
label ([list] or numpy 1-D array, optional) – Label of the training data.
it seems that only a 1-D array is accepted as the label when constructing a DMatrix.
But in the current documentation,
https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.training
label (array_like) – Label of the training data.
the shape of the label is not restricted, and a 2-D array can indeed be passed as the label. Something strange happens, though: when we call dmatrix.get_label() on this 2-D label, it looks as if the underlying code has flattened the array and kept only the first "number of samples" elements, like this:

import numpy as np
import pandas as pd
import xgboost as xgb

X = pd.DataFrame(data=[[1,0], [2,2], [0,3], [4,4]])
y = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
dsoft_fake = xgb.DMatrix(X.values, label=y)
dsoft_fake.get_label()

output:

array([1., 2., 3., 4.], dtype=float32)
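
The reported values are consistent with the label matrix being flattened in row-major order and then cut to the number of rows. The pure-Python sketch below only reproduces that arithmetic to make the pattern visible. It does not call XGBoost and is not a statement about its internals; `flatten_and_truncate` is a hypothetical helper.

```python
def flatten_and_truncate(labels, n_rows):
    """Flatten a 2-D list row by row, then keep only the first n_rows values."""
    flat = [value for row in labels for value in row]
    return flat[:n_rows]


y = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

observed_like = flatten_and_truncate(y, n_rows=len(y))
first_column = [row[0] for row in y]

print(observed_like)  # [1, 2, 3, 4], the same values as the get_label() output above
print(first_column)   # [1, 4, 7, 10], what reading only the first target per sample would give
```

The two results differ, so the getter output is not simply the first label column: it mixes values from the first two samples.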

So my questions are:

  1. If a 2-D array is accepted as the label, how should it be used correctly, and under what circumstances or for what kind of problem?
  2. If we want the label of a single sample to be a vector, e.g. a soft label made up of probabilities for more than two classes that sum to 1, does xgboost support this now? In that case, I don't think a separate model for each dimension is suitable.

Thanks in advance for any explanation.

@trivialfis
Member

The matrix input for labels is a recent addition (1.6) for multi-output and multi-label; the getter is not yet able to return the matrix.
