For cross-validation, we are attempting to parallelize xgboost_ray.train using ray.remote tasks, where each remote task uses a different cross-validation split of the data. Unfortunately, parallelizing xgboost_ray.train results in the errors below. If the same tasks are run sequentially rather than in parallel, no errors occur (a sketch of the sequential variant follows the reproduction below). Below is a reproducible example based on the example in xgboost_ray's documentation. Run locally, it completes; it is only when parallelized on a remote Ray cluster that it fails with the errors below.
from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer
import ray
import numpy as np

ray.init("ray://ray-cluster-head-svc:10001")

@ray.remote
def f():
    train_x, train_y = load_breast_cancer(return_X_y=True)
    # The amount of data is increased because no ConnectionError is
    # encountered with small amounts of data.
    train_x = np.tile(train_x, (10000, 1))
    train_y = np.repeat(train_y, 10000)
    train_set = RayDMatrix(train_x, train_y)
    evals_result = {}
    bst = train(
        {
            "objective": "binary:logistic",
            "eval_metric": ["logloss", "error"],
        },
        train_set,
        evals_result=evals_result,
        evals=[(train_set, "train")],
        verbose_eval=False,
        ray_params=RayParams(
            num_actors=2,  # Number of remote actors
            cpus_per_actor=1))
    bst.save_model("model.xgb")
    print("Final training error: {:.4f}".format(
        evals_result["train"]["error"][-1]))

# The same data is used for both tasks for the sake of this example, but for
# cross-validation, the data passed to each task would be unique.
ray.get([f.remote() for i in range(2)])
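For reference, a minimal sketch of the sequential variant that completes without error; only the final line of the reproduction changes:

# Sequential variant: awaiting each task before launching the next avoids the error.
for i in range(2):
    ray.get(f.remote())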
Traceback:
(freenome-risk-prediction-py3.10) ➜ risk-prediction git:(main) ✗ python risk_prediction/mre.py
(autoscaler +9s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +9s) Adding 2 node(s) of type main.
(autoscaler +19s) Resized to 1 CPUs.
(autoscaler +19s) Adding 1 node(s) of type main.
(autoscaler +29s) Resized to 2 CPUs.
(autoscaler +45s) Adding 1 node(s) of type main.
(f pid=61, ip=10.76.2.5) [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.
Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in `ray_client_server_[port].out`
2023-08-10 14:11:59,613 WARNING dataclient.py:403 -- Encountered connection issues in the data channel. Attempting to reconnect.
2023-08-10 14:12:16,715 ERROR dataclient.py:330 -- Unrecoverable error in data channel.
Traceback (most recent call last):
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/worker.py", line 455, in _get
for chunk in resp:
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/worker.py", line 324, in _get_object_iterator
for chunk in self.server.GetObject(req, *args, **kwargs):
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/grpc/_channel.py", line 475, in __next__
return self._next()
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/grpc/_channel.py", line 881, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.FAILED_PRECONDITION
details = "'NoneType' object is not iterable"
debug_error_string = "UNKNOWN:Error received from peer ipv4:10.123.5.189:10001 {created_time:"2023-08-10T14:12:17.138947014+00:00", grpc_status:9, grpc_message:"\'NoneType\' object is not iterable"}"
>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zcarrico/risk-prediction/risk_prediction/mre.py", line 32, in <module>
ray.get([f.remote() for i in range(2)])
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 102, in wrapper
return getattr(ray, func.__name__)(*args, **kwargs)
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/api.py", line 42, in get
return self.worker.get(vals, timeout=timeout)
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/worker.py", line 434, in get
res = self._get(to_get, op_timeout)
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/worker.py", line 477, in _get
raise decode_exception(e)
ConnectionError: GRPC connection failed: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.FAILED_PRECONDITION
details = "'NoneType' object is not iterable"
debug_error_string = "UNKNOWN:Error received from peer ipv4:10.123.5.189:10001 {created_time:"2023-08-10T14:12:17.138947014+00:00", grpc_status:9, grpc_message:"\'NoneType\' object is not iterable"}"
>
I expect the same error will be encountered when parallelizing remote tasks for nested cross-validation for HPO.
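For concreteness, a hypothetical sketch of that nested pattern, reusing the remote function f from the reproduction above (the outer-trial count and fold count are illustrative placeholders):

@ray.remote
def outer_trial(n_folds):
    # Each outer HPO trial fans out one inner task per CV fold, so the
    # parallel xgboost_ray.train pattern appears one level deeper.
    return ray.get([f.remote() for _ in range(n_folds)])

# e.g. 4 hyperparameter trials x 5 inner folds
ray.get([outer_trial.remote(5) for _ in range(4)])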
Please let me know if you have any questions and thank you for the help and the great xgboost_ray library!
Far from an expert on the Ray library, but perhaps it's possible to use Ray Tune, with a hyperparameter specifying which i-th fold to use?
i.e. the same data is sent to each Tune run, but each run sorts/randomizes the data identically (the same random order is required inside each run so that each sample is part of the validation split exactly once), and the Ray Tune param_space identifies which i-th split to use as the validation split for each run. A rough sketch of the idea is below.
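A minimal sketch of that suggestion, assuming a deterministic KFold split keyed by a "fold" hyperparameter; the split logic is illustrative, and the actual xgboost_ray.train call is elided:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from ray import tune

def train_fold(config):
    # Identical data and a fixed random_state in every run, so the i-th
    # split is the same no matter which Tune trial computes it.
    x, y = load_breast_cancer(return_X_y=True)
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    train_idx, val_idx = list(kf.split(x))[config["fold"]]
    # ... build RayDMatrix objects from x[train_idx] / x[val_idx] and call
    # xgboost_ray.train on them here ...
    return {"fold": config["fold"], "n_val": len(val_idx)}

# One Tune trial per fold; each trial sees the same data but a different split.
tuner = tune.Tuner(
    train_fold,
    param_space={"fold": tune.grid_search(list(range(5)))},
)
tuner.fit()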
Thank you @Anner-deJong, and this works with customizable training functions, but xgboost_ray isn't customizable beyond argument input as far as I know. In other words, the fold values could be passed as a hyperparameter, but there's no way to pass a custom callable to xgboost_ray so that it selects data based on the fold-value hyperparameter.