Add distributed training #2
Comments
I could try adding this through the scripts I currently have available. My setup requires that the user already has AWS credentials set up (through aws-cli or as env vars, I think). Also, I currently much prefer using aws-parallelcluster, but that involves running XGBoost communication over SLURM and not YARN. If we need YARN I'd have to go back and ensure that it works as expected, or I guess we could have a Spark-based benchmark, which I assume still works fine.
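For reference, a minimal sketch of checking that AWS credentials are resolvable before launching anything, assuming boto3 is installed (the actual scripts may handle this differently):

```python
# Sketch: verify AWS credentials are discoverable, either from `aws configure`
# (~/.aws/credentials) or from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars.
import boto3

session = boto3.session.Session()
creds = session.get_credentials()
if creds is None:
    raise RuntimeError(
        "No AWS credentials found; run `aws configure` or export AWS_* env vars"
    )
print("Credentials resolved via:", creds.method)
```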
@thvasilo I was thinking of using dask and running the benchmark locally on a big AWS machine, to make it easier to manage. But yes, it would be nice if you could put up your script in a separate directory (
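A minimal sketch of what the "dask on one big machine" idea could look like, assuming the `xgboost.dask` API and random placeholder data (a real benchmark run would load one of the shared datasets and size the cluster to the instance):

```python
# Sketch: single-machine dask benchmark with a LocalCluster and xgboost.dask.
import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # Size workers/threads to the EC2 instance being benchmarked.
    cluster = LocalCluster(n_workers=8, threads_per_worker=4)
    client = Client(cluster)

    # Placeholder data; replace with the actual benchmark dataset loader.
    X = da.random.random((10_000_000, 100), chunks=(1_000_000, 100))
    y = da.random.randint(0, 2, size=(10_000_000,), chunks=(1_000_000,))

    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    output = xgb.dask.train(
        client,
        {"tree_method": "hist", "objective": "binary:logistic"},
        dtrain,
        num_boost_round=100,
    )
    booster = output["booster"]
```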
@hcho3 I have an initial set of scripts for running dask benchmarks, but I use cuDF as the primary backend for data handling here: https://github.com/trivialfis/dxgb_bench I will add more datasets to it as I go. It can be extended with other backends like CPU dask or plain pandas. Would you like to take a look and see if it's suitable for merging here?
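A rough sketch of the cuDF-backed path, assuming dask-cuda and dask-cudf are installed and the data is available as CSV; dxgb_bench's actual loaders and parameters will differ:

```python
# Sketch: GPU dask benchmark with a LocalCUDACluster and a dask_cudf-backed dataset.
import dask_cudf
import xgboost as xgb
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster()           # one worker per visible GPU
    client = Client(cluster)

    df = dask_cudf.read_csv("data/*.csv")  # hypothetical path and schema
    X = df.drop(columns=["label"])
    y = df["label"]

    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    output = xgb.dask.train(
        client,
        {"tree_method": "gpu_hist", "objective": "binary:logistic"},
        dtrain,
        num_boost_round=100,
    )
```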
@trivialfis I will take a look, thanks! Is it fair to assume that dask will have the same performance characteristics as the underlying native distributed algorithm? My impression is that dask is a fairly lightweight cluster framework.
It would also be good to have a distributed benchmark suite on a Kubernetes cluster using the XGBoost Operator, if anyone is interested in contributing: https://github.com/kubeflow/xgboost-operator
Yes. But it will have higher memory consumption due to the pandas data representation and partition management overhead.
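One way to keep the per-partition overhead in check, sketched here under the assumption that the data comes in as a dask DataFrame, is to repartition to a target partition size before building the DMatrix:

```python
# Sketch: cap partition size so individual pandas partitions stay small in memory.
import dask.dataframe as dd

df = dd.read_csv("data/*.csv")               # hypothetical path
df = df.repartition(partition_size="256MB")  # split/merge partitions to ~256MB each
```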