[Draft] [R] Add QuantileDMatrix creation from dense matrices #9864
Let me briefly answer some of the questions first, then try to come up with improved documents.
yes.
yes, it should be part of the iterations. Yes, categorical features are supported.
Yes, everything created successfully should be freed accordingly. If the proxy is created successfully, it should be freed, regardless of how it's used.
The
I usually just let the C functions emit errors.
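The point about freeing everything that was created successfully can be sketched with a small scope guard. This is only an illustration under stated assumptions: DemoCreate, DemoFree, HandleGuard and BuildWithProxy are hypothetical names that stand in for XGProxyDMatrixCreate, XGDMatrixFree and the wrapper code, so that the example compiles on its own without XGBoost or R.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins for the C API so the sketch is self-contained.
// In real code these roles are played by XGProxyDMatrixCreate / XGDMatrixFree.
using Handle = void*;
int g_live_handles = 0;  // counts handles that were created but not yet freed

int DemoCreate(Handle* out) {
  *out = new int(0);
  ++g_live_handles;
  return 0;  // 0 means success, mirroring the C API convention
}

int DemoFree(Handle h) {
  delete static_cast<int*>(h);
  --g_live_handles;
  return 0;
}

// Frees the handle when the guard leaves scope, unless Free() was already called.
class HandleGuard {
 public:
  explicit HandleGuard(Handle h) : handle_(h) {}
  ~HandleGuard() {
    if (handle_ != nullptr) DemoFree(handle_);
  }
  HandleGuard(const HandleGuard&) = delete;
  HandleGuard& operator=(const HandleGuard&) = delete;

  // Explicit free so that the caller can inspect the return code.
  int Free() {
    assert(handle_ != nullptr);
    int rc = DemoFree(handle_);
    handle_ = nullptr;
    return rc;
  }

 private:
  Handle handle_;
};

// Creates a proxy, does work that may throw, and frees the proxy on every path.
int BuildWithProxy(bool fail) {
  Handle proxy = nullptr;
  if (DemoCreate(&proxy) != 0) return -1;  // creation failed: nothing to free
  HandleGuard guard(proxy);
  if (fail) throw std::runtime_error("iterator failed");  // guard still frees
  return guard.Free();
}
```

One caveat for an R wrapper: Rf_error performs a longjmp, which skips C++ destructors, so a guard like this would have to go out of scope before the error is raised.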
R-package/R/xgb.DMatrix.R
feature_weights = NULL,
as_quantile_dmatrix = FALSE,
ref = NULL,
max_bin = NULL
It's probably better to split it into a subclass.
Yes, since the code paths will be entirely different, it's better to have a dedicated function xgb.QuantileDMatrix, like in the Python interface.
Now I'm confused. If I understand it correctly, creating a QuantileDMatrix from a dense input involves setting the proxy dmatrix 4 times. From some quick tests, it seems to be possible to pass If the label were to be set on the proxy dmatrix at every iterator call, wouldn't that result in more work being done compared to setting it after? Aren't there conversions and memory copies happening inside those calls? Also, just to be sure - if one is supposed to set the feature names and types on the proxy dmatrix at every iterator call, does that mean that one can pass different subsets of columns in different iterations? By the way, what would happen in a case in which
Let me write some documents for the QDM internals. It is a special case of the iterator interface, which is also used by external memory support. As a result, some of the things seem redundant, but we kept them to simplify the code so that it can work with both use cases without any special handling. So, to answer the question of whether it can accept a subset of samples: yes, but it's for external memory support and #9864 (comment).
If you are trying to free the proxy, then it should succeed, assuming the proxy was successfully created. If you are trying to free the QDM, then it should fail.
If
And yes, the QDM can indeed take batches on initialization. See
Ref: #9734 (comment)
it can no longer take on iterators as described in #9864 (comment).
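The way an iterator-based consumer may walk over the same data more than once, which is why the proxy ends up being set several times even for a single dense input, can be pictured roughly as follows. This is a toy model, not the XGBoost implementation: BatchIterator, NextBatch, ResetBatches and Consume are hypothetical names, the `staged` pointer merely plays the role of the proxy dmatrix, and the number of passes is a free parameter here. The only convention assumed is that `next` returns 1 after staging a batch and 0 once the data is exhausted, and that `reset` rewinds to the first batch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical callback shapes, modelled after a reset/next iterator protocol.
using IterHandle = void*;
using NextFn = int (*)(IterHandle);
using ResetFn = void (*)(IterHandle);

struct BatchIterator {
  std::vector<std::vector<float>> batches;     // each inner vector is one batch
  std::size_t cursor = 0;                      // index of the next batch to stage
  const std::vector<float>* staged = nullptr;  // stands in for the proxy's data
  int times_staged = 0;                        // how often the "proxy" was set
};

// Stages the next batch; returns 0 when there is nothing left in this pass.
int NextBatch(IterHandle h) {
  auto* it = static_cast<BatchIterator*>(h);
  if (it->cursor == it->batches.size()) return 0;
  it->staged = &it->batches[it->cursor++];
  ++it->times_staged;
  return 1;
}

// Rewinds the iterator so that another full pass can start.
void ResetBatches(IterHandle h) {
  auto* it = static_cast<BatchIterator*>(h);
  it->cursor = 0;
  it->staged = nullptr;
}

// Toy consumer: runs `n_passes` full passes and counts the values it saw.
std::size_t Consume(IterHandle h, ResetFn reset, NextFn next, int n_passes) {
  assert(reset != nullptr && next != nullptr);
  std::size_t seen = 0;
  for (int pass = 0; pass < n_passes; ++pass) {
    reset(h);
    while (next(h) != 0) {
      seen += static_cast<BatchIterator*>(h)->staged->size();
    }
  }
  return seen;
}
```

With a single batch and four passes, `times_staged` ends at 4, which matches the observation above that the proxy is set 4 times for a dense input.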
XGB_DLL SEXP XGQuantileDMatrixFromMat_R(SEXP R_mat, SEXP missing, SEXP n_threads,
                                        SEXP max_bin, SEXP ref_dmat) {
  SEXP ret = PROTECT(R_MakeExternalPtr(nullptr, R_NilValue, R_NilValue));
  R_API_BEGIN();
  DMatrixHandle proxy_dmat_handle;
  CHECK_CALL(XGProxyDMatrixCreate(&proxy_dmat_handle));
  DMatrixHandle out_dmat;
  int res_code1, res_code2;

  try {
    xgboost::Json jconfig{xgboost::Object{}};
    /* FIXME: this 'missing' field should have R_NaInt when the input is an integer matrix. */
    jconfig["missing"] = Rf_asReal(missing);
    if (!Rf_isNull(n_threads)) {
      jconfig["nthread"] = Rf_asInteger(n_threads);
    }
    if (!Rf_isNull(max_bin)) {
      jconfig["max_bin"] = Rf_asInteger(max_bin);
    }
    std::string json_str = xgboost::Json::Dump(jconfig);

    DMatrixHandle ref_dmat_handle = nullptr;
    if (!Rf_isNull(ref_dmat)) {
      ref_dmat_handle = R_ExternalPtrAddr(ref_dmat);
    }

    std::string array_str = MakeArrayInterfaceFromRMat(R_mat);
    _RMatrixSingleIterator single_iterator(proxy_dmat_handle, array_str.c_str());

    res_code1 = XGQuantileDMatrixCreateFromCallback(
      &single_iterator,
      proxy_dmat_handle,
      ref_dmat_handle,
      _reset_RMatrixSingleIterator,
      _next_RMatrixSingleIterator,
      json_str.c_str(),
      &out_dmat);
    res_code2 = XGDMatrixFree(proxy_dmat_handle);
  } catch(IteratorError &err) {
    XGDMatrixFree(proxy_dmat_handle);
    Rf_error(XGBGetLastError());
  }

  CHECK_CALL(res_code2);
  CHECK_CALL(res_code1);

  R_SetExternalPtrAddr(ret, out_dmat);
  R_RegisterCFinalizerEx(ret, _DMatrixFinalizer, TRUE);
  R_API_END();

  UNPROTECT(1);
  return ret;
}
Depending on your roadmap, it might be desirable to implement this in R instead of C++. The data iterator is the common interface to QDM and external memory. I will share some old unpublished documents soon.
I think it'd be better to use a dedicated C++-only route for single-iteration QuantileDMatrix, and then later on implement a customizable DataIterator in R.
Is this still draft?
Yes, and I'm going to create a new MR as I see this kind of functionality would need to be coded differently.
Closing in favor of bigger PR that follows the same approach as in the python interface: #9913
ref #9810
This PR adds a wrapper over QuantileDMatrix for dense matrix inputs.
I am not very sure that I'm implementing all of this correctly, so it would be ideal if a maintainer could take a close look.
I have some doubts about how this class (QuantileDMatrix) is meant to be created:
If XGQuantileDMatrixCreateFromCallback fails, returning a non-zero code, does the proxy dmatrix still need to be freed manually through XGDMatrixFree?
Also a small note: there's a PR which hasn't yet been merged, as it's currently failing MSVC jobs, which introduces some refactorings around the handling of the 'missing' parameter in the JSON configs. Once that refactoring is merged, it should also be used by the code introduced here, as otherwise missing values in integer matrices will not be handled correctly.
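The concern about integer matrices can be made concrete with a small sketch. It assumes, as the FIXME in the diff suggests, that the 'missing' value should be R's integer NA sentinel when the input is an integer matrix and the caller asked for NA/NaN. R stores NA_integer_ as INT_MIN; kDemoNaInt stands in for R_NaInt here so that the example builds without R headers, and ResolveMissing is a hypothetical helper, not something taken from the referenced refactoring PR.

```cpp
#include <cassert>
#include <climits>
#include <cmath>

// Stand-in for R_NaInt: R represents NA_integer_ as the smallest int value.
constexpr int kDemoNaInt = INT_MIN;

// Hypothetical helper: picks the value to write into the "missing" config field.
// For integer matrices, a NaN request is mapped to the integer NA sentinel,
// because an integer column can never contain NaN itself.
double ResolveMissing(bool is_integer_matrix, double requested_missing) {
  if (is_integer_matrix && std::isnan(requested_missing)) {
    return static_cast<double>(kDemoNaInt);
  }
  return requested_missing;
}
```

An explicit, non-NaN 'missing' value passes through unchanged in this sketch; whether the real refactoring behaves the same way is not stated in this thread.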