GridSearchCV, RandomizedSearchCV don't support training on big datasets? #232
My Ray cluster worker nodes have 20 GB of memory each. I asked this question in the Ray project as well, and they advised me not to use ray.put() to transfer big data :(
Are you using Kubernetes / the Ray client, @ICESDHR?
Yeah, I use the Kubernetes Ray operator here, following these steps:
So I think there are some bugs here, don't you think? If so, I wonder if we could discuss a solution? If not, please help me figure out how to use this function, thanks :)
OK, I can see this being a problem. Would it be possible for you to load the data from S3/NFS/disk on the nodes? If yes, we could add support for that. How does that sound?
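For what it's worth, a minimal sketch of that idea outside of tune-sklearn (the file path, label column, and reader are placeholders, not part of any library API): each worker loads the data from shared storage itself, so nothing large has to be shipped from the driver with ray.put():

```python
import pandas as pd
import ray
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

ray.init(address="auto")  # assumes an existing cluster

DATA_PATH = "/mnt/nfs/train.csv"   # hypothetical path on shared NFS/S3 visible to every node
LABEL_COL = "label"                # hypothetical label column

@ray.remote
def fit_one_config(alpha):
    # The data is read on the worker node itself, so the large array never
    # travels over the driver -> cluster channel.
    df = pd.read_csv(DATA_PATH)
    X, y = df.drop(columns=[LABEL_COL]), df[LABEL_COL]
    return alpha, cross_val_score(Ridge(alpha=alpha), X, y, cv=3).mean()

results = ray.get([fit_one_config.remote(a) for a in (0.1, 1.0, 10.0)])
print(max(results, key=lambda r: r[1]))
```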
Thanks for your reply! It would be nice if official support could be provided. How soon would this patch be released?
After loading the big dataset, I face another two problems:
Is there a better solution?
It's not possible right now, but with the proposed changes you should be able to use Ray Datasets, which should solve both the 2 GB issue and ensure a minimum amount of copying. Will keep you updated.
I'm trying to use the Ray Data functions. I found that, operating as above, ray.data.from_pandas() still has this problem, but ray.data.read_csv() works :( Usually the data is processed with pandas first and then trained with Ray, so it would be great if this could be optimized.
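For reference, a minimal sketch of the two paths being compared (the file path is a placeholder): from_pandas() starts from a DataFrame that already lives on the driver and has to be transferred to the cluster, whereas read_csv() only sends the path and lets the workers read their own blocks:

```python
import pandas as pd
import ray

ray.init(address="auto")  # assumes an existing cluster

# Path A: build the dataset from a driver-side DataFrame. The whole DataFrame
# must be shipped from the driver, which is where the size problem remains.
df = pd.read_csv("/mnt/nfs/train.csv")            # hypothetical shared path
ds_from_pandas = ray.data.from_pandas(df)

# Path B: let Ray Data read the file on the cluster side. Only the path is
# sent; each worker loads its own blocks, so this scales past 2 GB.
ds_from_file = ray.data.read_csv("/mnt/nfs/train.csv")
```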
GridSearchCV and RandomizedSearchCV work well when the dataset is small, but they break down when the dataset is large.
I read the GridSearchCV implementation. In the _fit() function, you put the dataset into the Ray object store, but ray.put() uses gRPC to transfer the data, and gRPC protobuf doesn't support messages greater than 2 GB. Is that right?
I'm asking because I want to know whether you are aware of this situation and whether you have an optimization plan.
Thanks in advance for your reply!
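A minimal sketch of the situation described above, assuming the driver connects through the Ray client as with the Kubernetes Ray operator (the address is a placeholder); putting an array larger than 2 GB from the client side is where the failure is reported:

```python
import numpy as np
import ray

# Connect through the Ray client; replace the placeholder with the head
# node's client address.
ray.init("ray://<head-node>:10001")

# Roughly 3.2 GB of float64 data, i.e. larger than a single 2 GB protobuf message.
X = np.random.rand(20_000_000, 20)

# Over the Ray client this payload is serialized and sent via gRPC, which is
# where the reporter hits the limit; with a local (non-client) connection the
# array would go straight into the shared-memory object store instead.
ref = ray.put(X)
```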