Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add data split mode to DMatrix MetaInfo #8568

Merged
merged 27 commits into from
Dec 25, 2022
Merged

Conversation

rongou
Copy link
Contributor

@rongou rongou commented Dec 7, 2022

We need this information to change the behavior of the predictor when data is split column-wise.

Part of #8424

@rongou
Copy link
Contributor Author

rongou commented Dec 7, 2022

@trivialfis

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need the one in learner if DMatrix contains all the information?

@rongou
Copy link
Contributor Author

rongou commented Dec 7, 2022

Hmm, if we don't declare a training parameter, how would the user specify it? I imagine if someone wants to do column-wise split, they'd do something like:

param = { 'tree_method': 'gpu_hist', data_split_mode: 'col'}
xgb.train(param, ...)

@trivialfis
Copy link
Member

If the DMatrix contains the information the user doesn't need to specify that parameter right?

@rongou
Copy link
Contributor Author

rongou commented Dec 7, 2022

Yeah I guess the question is whether to specify it through dmatrix or as a training parameter.

dtrain = xgb.DMatrix('very-wide.txt.train', data_split_mode='col')
dtest = xgb.DMatrix('very-wide.txt.test', data_split_mode='col')
param = {'objective': 'binary:logistic', 'tree_method': 'gpu_hist'}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 500
xgb.train(param, dtrain, num_round, evals=watchlist)

vs.

dtrain = xgb.DMatrix('very-wide.txt.train')
dtest = xgb.DMatrix('very-wide.txt.test')
param = {'objective': 'binary:logistic', 'tree_method': 'gpu_hist', 'data_split_mode': 'col'}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 500
xgb.train(param, dtrain, num_round, evals=watchlist)

Which one do you think it's more natural or user friendly?

@rongou
Copy link
Contributor Author

rongou commented Dec 8, 2022

@trivialfis this is what it looks like if we take out the dsplit parameter out of training and put it in data loading. Let me know what you think. Still need to fix some downstream calls.

@trivialfis
Copy link
Member

I think the DMatrix one is better.

A little bit of background here, @rongou shared some potential changes with me offline for federated learning. One important factor is how we handle the data split parameter in the predictor. The original plan was to pass it as part of the learner train param or learner model param. We later thought it might be possible to make it part of the DMatrix, and all algorithms will dispatch based on the DMatrix it receives, which from my perspective is cleaner than the learner option.

cc @RAMitchell .

@rongou
Copy link
Contributor Author

rongou commented Dec 13, 2022

@trivialfis Turns out this actually solves a problem we talked about before. If the user has done some preprocessing to split the training data beforehand, now they can pass in data_split_mode='none' to avoid further splitting.

Other than that bit of additional functionality, this PR preserves the existing behavior. Do you think we should merge it? The CI should pass now.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on using DMatrix as the parameter holder. Some other questions in comments.

include/xgboost/c_api.h Outdated Show resolved Hide resolved
@rongou
Copy link
Contributor Author

rongou commented Dec 20, 2022

@trivialfis @hcho3 I think this is close to what we want. Please take another look. Thanks!

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good to me. Excellent work! Some questions regarding the interface and the scope of the new parameter.

@@ -126,12 +126,29 @@ XGB_DLL int XGBGetGlobalConfig(char const **out_config);

/*!
* \brief load a data matrix
* \deprecated since 1.7.3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the next big one would be 2.0 unless something major comes up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* \param out a loaded data matrix
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGDMatrixCreateFromFileV2(char const *config, DMatrixHandle *out);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered making it a proper API for URI? For instance FromURI? Since users actually need to use some of the URI formats like file.txt?format=csv. XGBoost/dmlc-core doesn't guess the format and is a source of bugs when users pass in only the file name.

Also, do you plan to introduce the need_split with other input sources as well? If not, we can make it an URI parameter and limit its use to this function. Otherwise, we need to have an additional parameter in all language bindings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to URI.

For need_split, we can get rid of it if we are just going to preserve the current behavior, which is to split for distributed training, no otherwise. If a user wants more flexibility, they can always use another api to load the data. For distributed training, most people are probably not using the file api anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed it as a parameter.

@rongou
Copy link
Contributor Author

rongou commented Dec 21, 2022

@trivialfis finally got the CI to pass. PTAL

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the work on preparing the DMatrix for column split!

@trivialfis trivialfis merged commit 3ceeb8c into dmlc:master Dec 25, 2022
@rongou rongou deleted the data-split-param branch September 25, 2023 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants