A potential refinement on document #123

0as1s · 2021-08-09T04:43:57Z

When I started deploying xgboost-operator on my Kubeflow cluster, I used https://github.com/kubeflow/xgboost-operator/blob/master/config/samples/xgboost-dist/utils.py#L47 as the reference for implementing my own data-reading function. It seemed natural to follow that function and manually read only the part of the data that belongs to each rank.

However, I found that DMatrix already has internal logic that reads only part of the data when it detects distributed mode. As a result, my manually sharded data was split a second time, and each rank ended up reading 1/(N*N) of the data instead of 1/N.
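
To make the arithmetic concrete, here is a minimal sketch of the double split. It does not call XGBoost at all. It assumes, purely for illustration, that both the manual reader and the loader split rows round-robin by rank; the helper names `round_robin_shard` and `rows_seen_by_rank` are hypothetical and the loader's real partitioning scheme may differ.

```python
def round_robin_shard(rows, rank, world_size):
    """Keep every world_size-th row, starting at offset `rank`."""
    return rows[rank::world_size]


def rows_seen_by_rank(total_rows, rank, world_size, loader_also_shards):
    """Return the row ids one worker ends up with.

    `loader_also_shards` models (as an assumption) a data loader that
    splits its input by rank again when running in distributed mode.
    """
    rows = list(range(total_rows))
    # Step 1: the manual split done in user code.
    manual = round_robin_shard(rows, rank, world_size)
    if loader_also_shards:
        # Step 2: the loader splits the already reduced shard once more.
        return round_robin_shard(manual, rank, world_size)
    return manual


if __name__ == "__main__":
    total, n = 1600, 4
    for rank in range(n):
        once = rows_seen_by_rank(total, rank, n, loader_also_shards=False)
        twice = rows_seen_by_rank(total, rank, n, loader_also_shards=True)
        print(f"rank {rank}: single split {len(once)}/{total}, "
              f"double split {len(twice)}/{total}")
```

Under these assumptions, with N = 4 each rank keeps 400 of 1600 rows after one split but only 100 after two, so the whole job trains on just a quarter of the data.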

I think it would help to add a comment to that function to guide users who rewrite it.
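
As a rough idea of what such a comment could look like, here is a sketch on a stand-in function. The name `read_rank_partition` and the `loader_shards_internally` flag are hypothetical and are not taken from the sample's `utils.py`; the wording of the note is only a suggestion.

```python
def read_rank_partition(rows, rank, world_size, loader_shards_internally=False):
    """Return the rows this worker should hand to its data loader.

    NOTE for anyone adapting this function to their own data:
    split by rank in exactly one place. If the loader you pass the
    result to already reads only its own share of the input in
    distributed mode, return the full input here. Splitting in both
    places leaves each worker with 1/(N*N) of the data instead of 1/N.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    if loader_shards_internally:
        # The loader will pick this worker's share, so pass everything on.
        return list(rows)
    # Otherwise do the manual 1/N split here (round-robin by rank).
    return list(rows[rank::world_size])
```

For example, `read_rank_partition(list(range(8)), 1, 4)` keeps rows 1 and 5, while the same call with `loader_shards_internally=True` returns all eight rows and leaves the split to the loader.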
