Partitioned column not found when metadata is not available. #749
Comments
Duplicate of dask/dask#8666?
Hi @martindurant, thanks for your response, but I don't think it is a duplicate because
I've also further minimised my example. I've also noticed that only the partitioned columns with a single value are missing. If, for example, I change my dataset to the one below, my code succeeds.
df = pd.DataFrame([
    ('A', 'B', 1),
    ('A', 'C', 3),
    ('B', 'C', 3),
], columns=['group1', 'group2', 'value'])
@rjzamora, the fp engine allows a
The situation is that the root of the parquet dataset is not obvious when there is no _metadata file and the top level of partitioning has only one value.
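To make the ambiguity concrete, here is a minimal sketch of one way a reader could guess the dataset root from file paths alone. It is an illustration under assumptions, not fastparquet's or dask's actual code: the helpers `infer_root` and `partition_columns`, the `ds/` directory, and the `part.0.parquet` file names are all hypothetical. The guess used here is "the deepest directory shared by every data file", which swallows a top-level partition directory when that partition has only one value.

```python
import posixpath


def infer_root(paths):
    """Guess the dataset root as the deepest directory shared by all files.

    Hypothetical heuristic for illustration only.
    """
    return posixpath.commonpath([posixpath.dirname(p) for p in paths])


def partition_columns(paths, root):
    """Collect key names from the key=value directories found below root."""
    names = []
    for p in paths:
        # Path of the file's directory relative to the guessed root.
        rel = posixpath.dirname(p)[len(root):].strip("/")
        for part in rel.split("/"):
            if "=" in part:
                key = part.split("=", 1)[0]
                if key not in names:
                    names.append(key)
    return names


# Layout matching the dataset that works: group1 takes two values (A and B).
multi_valued = [
    "ds/group1=A/group2=B/part.0.parquet",
    "ds/group1=A/group2=C/part.0.parquet",
    "ds/group1=B/group2=C/part.0.parquet",
]

# Hypothetical layout where group1 only ever takes the value A.
single_valued = [
    "ds/group1=A/group2=B/part.0.parquet",
    "ds/group1=A/group2=C/part.0.parquet",
]

for paths in (multi_valued, single_valued):
    root = infer_root(paths)
    print(root, partition_columns(paths, root))
```

In the second layout the guessed root is `ds/group1=A`, so `group1` never appears as a partition directory below it and is dropped, which is consistent with the observation that only single-valued partition columns go missing. A `_metadata` file at the true root would remove the need to guess.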
This seems like the kind of thing I was hoping 8765 could help us with.
What happened:
dask.read_parquet is not able to find partitioned column names when metadata is not available and the fastparquet engine is used. pandas (with the fastparquet engine) and dask (with the pyarrow engine) are able to find them, though. The problem appears to lie with dask's fastparquet engine.
I've removed the metadata because, for my use case, it is faster to update the parquet dataset without it (i.e. periodic updates of the metadata are too expensive).
What you expected to happen:
All columns to be found by dask.
Minimal Complete Verifiable Example:
Anything else we need to know?:
I've also tried specifying the columns to load in the columns argument, but that returns an error:
Environment: