Is your feature request related to a problem? Please describe.
Dask's read_parquet (as well as other loaders, like read_csv) can accept a list of files. This allows multiple dask.distributed workers to each load a partition of the data, without the data ever being loaded into client memory in its entirety. This is important for datasets that can't fit into a single machine's memory.
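For reference, this is roughly what the Dask side looks like (the bucket and file names are just placeholders):

```python
import dask.dataframe as dd

# read_parquet accepts a list of paths; each file becomes (at least) one
# partition, so workers can read them in parallel and the full dataset
# never has to pass through client memory.
df = dd.read_parquet([
    "s3://bucketname/partition1.parq",
    "s3://bucketname/partition2.parq",
])
print(len(df))  # row count, computed across workers
```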
Fugue SQL supports a LOAD statement that passes this through to the Dask backend. This statement, unfortunately, doesn't accept a list of files.
Describe the solution you'd like
Allow loading of multiple parquet files by passing a list. For example:
```sql
df = LOAD ["s3://bucketname/partition1.parq", "s3://bucketname/partition2.parq"]
SELECT count(*) FROM df;
```
This would tell dask.distributed to load the two files, one per worker, and perform the SELECT on them in parallel.
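From Python, this could look something like the sketch below; note that the list form inside LOAD is exactly the feature being requested, so this does not run today:

```python
from fugue_sql import fsql

# Proposed syntax: LOAD with a list of files, executed on the Dask engine.
fsql("""
df = LOAD ["s3://bucketname/partition1.parq", "s3://bucketname/partition2.parq"]
SELECT COUNT(*) FROM df
PRINT
""").run("dask")
```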
Describe alternatives you've considered
The current workaround is passing a glob pattern, which Dask expands. This works for some use cases, but not all of them, for example when the desired files don't share a common naming pattern.
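For comparison, the glob workaround looks like this (the pattern is illustrative); it only helps when the target files happen to match a single pattern:

```python
from fugue_sql import fsql

# Current workaround: a glob pattern that Dask expands when listing files.
# A hand-picked subset of partitions cannot be expressed this way.
fsql("""
df = LOAD "s3://bucketname/partition*.parq"
SELECT COUNT(*) FROM df
PRINT
""").run("dask")
```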
Additional context
Slack thread