
Partitioned column not found when metadata is not available. #749

Open
dakenblack opened this issue Feb 10, 2022 · 5 comments

@dakenblack commented Feb 10, 2022

What happened:
dask.dataframe.read_parquet is not able to find partitioned column names when metadata is not available using the fastparquet engine. pandas (with the fastparquet engine) and dask (with the pyarrow engine) are both able to find them, so the problem appears to lie with dask's fastparquet engine.

I've removed the metadata because, for my use case, it is faster to update the parquet dataset without it (i.e. periodically updating the metadata is too expensive).

What you expected to happen:
All columns to be found by dask.

Minimal Complete Verifiable Example:

import pandas as pd
import dask.dataframe as dd
import fastparquet
import dask
from os import remove

def containsall(arr, elems):
    return all(e in arr for e in elems)

print(dask.__version__, fastparquet.__version__)
# OUT: 2022.01.1 0.8.0

df = pd.DataFrame([
    ('A', 'B', 1),
    ('A', 'C', 3),
], columns=['group1', 'group2', 'value'])

dd.from_pandas(df, npartitions=1).to_parquet(
    'test', 
    partition_on=['group1', 'group2'], 
    engine='fastparquet',
    write_metadata_file=False, # metadata NOT written
    overwrite=True,
    append=False,
)

# engine = 'pyarrow'  # succeeds with pyarrow
engine = 'fastparquet'
from_dd = dd.read_parquet('test', engine=engine)
from_pd = pd.read_parquet('test', engine=engine)
expected = ['group1', 'group2', 'value']

assert(containsall(from_pd.columns, expected))
assert(containsall(from_dd.columns, expected)) # fails

print('Success')
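For reference, the dataset written above ends up as a hive-style directory tree with no _metadata file at the root (the exact part-file names may vary; this is roughly what fastparquet produces):

test/
  group1=A/
    group2=B/
      part.0.parquet
    group2=C/
      part.0.parquet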

Anything else we need to know?:
I've also tried specifying the columns to load via the columns argument, but that raises an error:

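That attempt looked roughly like this (same dataset as in the example above):

# Explicitly listing the partitioned columns does not help, since dask's
# fastparquet engine never discovers 'group1' in the first place:
from_dd = dd.read_parquet(
    'test',
    engine='fastparquet',
    columns=['group1', 'group2', 'value'],
)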
ValueError: The following columns were not found in the dataset {'group1'}
The following columns were found Index(['value', 'group2'], dtype='object')

Environment:

  • Dask version: 2022.01.1
  • Python version: 3.9.5
  • Operating System: Windows 11
  • Install method (conda, pip, source): pip
  • fastparquet version : 0.8.0
@martindurant (Member)

Duplicate of dask/dask#8666?

@dakenblack (Author) commented Feb 10, 2022

Hi @martindurant, thanks for your response, but I don't think it is a duplicate because:

  • my issue relates to the situation where the metadata is not written to the dataset
  • the example in dask/dask#8666 actually works for me (even if I add write_metadata_file=False)

I've also further minimised my example.

I've also noticed that only partitioned columns with a single value are missing. If, for example, I change my dataset to the one below, my code succeeds.

df = pd.DataFrame([
    ('A', 'B', 1),
    ('A', 'C', 3),
    ('B', 'C', 3),
], columns=['group1', 'group2', 'value'])

@martindurant (Member)

@rjzamora, the fastparquet engine allows a base_path argument so that the top of the parquet dataset tree can be inferred correctly for this case, but I don't see how to pass it via dd.read_parquet.
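If reading with fastparquet directly is an option in the meantime, its ParquetFile accepts a root argument that plays the same role. A minimal workaround sketch, assuming the on-disk layout shown earlier (this bypasses dask and is not dd.read_parquet's API):

import glob
import fastparquet

# Hand fastparquet the part files explicitly and pin the top of the dataset
# tree with root=, so 'group1=A' is treated as a partition level rather than
# as part of the base path. The glob pattern assumes the two-level layout.
files = sorted(glob.glob('test/*/*/*.parquet'))
pf = fastparquet.ParquetFile(files, root='test')
print(pf.cats)       # partition fields and their values
df = pf.to_pandas()  # includes 'group1', 'group2' and 'value'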

@martindurant (Member)

The situation is that, when there is no _metadata file and the top level of partitioning has only one value, the root of the parquet dataset is not obvious.
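To make that concrete: given only the file paths below, a reader cannot tell whether the dataset root is test/ (with partitions group1 and group2) or test/group1=A/ (with the single partition group2), because every path agrees on group1=A:

test/group1=A/group2=B/part.0.parquet
test/group1=A/group2=C/part.0.parquet

With two distinct group1 values, as in the three-row example above, only the first interpretation is consistent, which is why that dataset reads correctly.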

@rjzamora (Member)

> the fastparquet engine allows a base_path argument so that the top of the parquet dataset tree can be inferred correctly for this case, but I don't see how to pass it via dd.read_parquet.

This seems like the kind of thing I was hoping 8765 could help us with.
