GridSearchCV, RandomizedSearchCV don't support training on big datasets? #232
My Ray cluster worker nodes have 20 GB of memory each. I asked this question in the Ray project as well, and they advised me not to use ray.put() to transfer big data :(
Are you using Kubernetes / the Ray client, @ICESDHR?
Yeah, I use the Kubernetes Ray operator here, following these steps:
So I think there are some bugs here, don't you think? If so, I wonder if we could discuss a solution? If not, please help me figure out how to use this function, thanks :)
OK, I can see this being a problem. Would it be possible for you to load the data from S3/NFS/disk on the nodes? If yes, we could add support for that. How does that sound?
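For what it's worth, a minimal sketch of that idea outside of tune-sklearn (the file path, label column, and reader are placeholders, not part of any library API): each worker loads the data from shared storage itself, so nothing large has to be shipped from the driver with ray.put():

```python
import pandas as pd
import ray
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

ray.init(address="auto")  # assumes an existing cluster

DATA_PATH = "/mnt/nfs/train.csv"   # hypothetical path on shared NFS/S3 visible to every node
LABEL_COL = "label"                # hypothetical label column

@ray.remote
def fit_one_config(alpha):
    # The data is read on the worker node itself, so the large array never
    # travels over the driver -> cluster channel.
    df = pd.read_csv(DATA_PATH)
    X, y = df.drop(columns=[LABEL_COL]), df[LABEL_COL]
    return alpha, cross_val_score(Ridge(alpha=alpha), X, y, cv=3).mean()

results = ray.get([fit_one_config.remote(a) for a in (0.1, 1.0, 10.0)])
print(max(results, key=lambda r: r[1]))
```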
Thanks for your reply! It would be nice if official support could be provided. How soon would this patch be released?
After loading the big dataset, I face another two problems:
Is there a better solution?
It's not possible right now, but with the proposed changes you should be able to use Ray Datasets, which should solve both the 2 GB issue and ensure a minimum amount of copying. Will keep you updated.
I'm trying to use the Ray Data functions. I found that, operating as above, ray.data.from_pandas() still has this problem, but ray.data.read_csv() works :( Usually the data is processed with pandas first and then trained with Ray, so it would be great if this could be optimized.
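For reference, a minimal sketch of the two paths being compared (the file path is a placeholder): from_pandas() starts from a DataFrame that already lives on the driver and has to be transferred to the cluster, whereas read_csv() only sends the path and lets the workers read their own blocks:

```python
import pandas as pd
import ray

ray.init(address="auto")  # assumes an existing cluster

# Path A: build the dataset from a driver-side DataFrame. The whole DataFrame
# must be shipped from the driver, which is where the size problem remains.
df = pd.read_csv("/mnt/nfs/train.csv")            # hypothetical shared path
ds_from_pandas = ray.data.from_pandas(df)

# Path B: let Ray Data read the file on the cluster side. Only the path is
# sent; each worker loads its own blocks, so this scales past 2 GB.
ds_from_file = ray.data.read_csv("/mnt/nfs/train.csv")
```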
GridSearchCV and RandomizedSearchCV work well when the dataset is small, but they break down when the dataset is large.
I read the GridSearchCV implementation. In the _fit() function, you put the dataset into the Ray object store, but ray.put() uses gRPC to transfer the data, and gRPC protobuf doesn't support messages greater than 2 GB. Is that right?
I'm asking because I want to know whether you are aware of this situation and whether you have an optimization plan.
Thanks in advance for your reply!
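A minimal sketch of the situation described above, assuming the driver connects through the Ray client as with the Kubernetes Ray operator (the address is a placeholder); putting an array larger than 2 GB from the client side is where the failure is reported:

```python
import numpy as np
import ray

# Connect through the Ray client; replace the placeholder with the head
# node's client address.
ray.init("ray://<head-node>:10001")

# Roughly 3.2 GB of float64 data, i.e. larger than a single 2 GB protobuf message.
X = np.random.rand(20_000_000, 20)

# Over the Ray client this payload is serialized and sent via gRPC, which is
# where the reporter hits the limit; with a local (non-client) connection the
# array would go straight into the shared-memory object store instead.
ref = ray.put(X)
```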