
Partitioned column not found when metadata is not available. #749

Open
dakenblack opened this issue Feb 10, 2022 · 5 comments

@dakenblack commented Feb 10, 2022

What happened:
dask.dataframe.read_parquet is not able to find partitioned column names when metadata is not available using the fastparquet engine. pandas (with the fastparquet engine) and dask (with the pyarrow engine) are both able to find them, so the problem appears to lie with dask's fastparquet engine.

I've removed the metadata because, for my use case, it is faster to update the parquet dataset without it (i.e. periodically updating the metadata is too expensive).

What you expected to happen:
All columns to be found by dask.

Minimal Complete Verifiable Example:

import pandas as pd
import dask.dataframe as dd
import fastparquet
import dask
from os import remove

def containsall(arr, elems):
    return all(e in arr for e in elems)

print(dask.__version__, fastparquet.__version__)
# OUT: 2022.01.1 0.8.0

df = pd.DataFrame([
    ('A', 'B', 1),
    ('A', 'C', 3),
], columns=['group1', 'group2', 'value'])

dd.from_pandas(df, npartitions=1).to_parquet(
    'test', 
    partition_on=['group1', 'group2'], 
    engine='fastparquet',
    write_metadata_file=False, # metadata NOT written
    overwrite=True,
    append=False,
)

# engine = 'pyarrow'  # succeeds with pyarrow
engine = 'fastparquet'
from_dd = dd.read_parquet('test', engine=engine)
from_pd = pd.read_parquet('test', engine=engine)
expected = ['group1', 'group2', 'value']

assert(containsall(from_pd.columns, expected))
assert(containsall(from_dd.columns, expected)) # fails

print('Success')
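For reference, the dataset written above ends up as a hive-style directory tree with no _metadata file at the root (the exact part-file names may vary; this is roughly what fastparquet produces):

test/
  group1=A/
    group2=B/
      part.0.parquet
    group2=C/
      part.0.parquet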

Anything else we need to know?:
I've also tried specifying the columns to load via the columns argument, but that raises an error:

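That attempt looked roughly like this (same dataset as in the example above):

# Explicitly listing the partitioned columns does not help, since dask's
# fastparquet engine never discovers 'group1' in the first place:
from_dd = dd.read_parquet(
    'test',
    engine='fastparquet',
    columns=['group1', 'group2', 'value'],
)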
ValueError: The following columns were not found in the dataset {'group1'}
The following columns were found Index(['value', 'group2'], dtype='object')

Environment:

  • Dask version: 2022.01.1
  • Python version: 3.9.5
  • Operating System: Windows 11
  • Install method (conda, pip, source): pip
  • fastparquet version : 0.8.0
@martindurant (Member)

Duplicate of dask/dask#8666?

@dakenblack (Author) commented Feb 10, 2022

Hi @martindurant, thanks for your response, but I don't think it is a duplicate because:

  • my issue relates to the situation where the metadata is not written to the dataset
  • the example in dask/dask#8666 actually works for me (even if I add write_metadata_file=False)

I've also further minimised my example.

I've also noticed that only partitioned columns with a single value are missing. If, for example, I change my dataset to the one below, my code succeeds.

df = pd.DataFrame([
    ('A', 'B', 1),
    ('A', 'C', 3),
    ('B', 'C', 3),
], columns=['group1', 'group2', 'value'])

@martindurant (Member)

@rjzamora, the fastparquet engine allows a base_path argument so that the top of the parquet dataset tree can be inferred correctly for this case, but I don't see how to pass it via dd.read_parquet.
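If reading with fastparquet directly is an option in the meantime, its ParquetFile accepts a root argument that plays the same role. A minimal workaround sketch, assuming the on-disk layout shown earlier (this bypasses dask and is not dd.read_parquet's API):

import glob
import fastparquet

# Hand fastparquet the part files explicitly and pin the top of the dataset
# tree with root=, so 'group1=A' is treated as a partition level rather than
# as part of the base path. The glob pattern assumes the two-level layout.
files = sorted(glob.glob('test/*/*/*.parquet'))
pf = fastparquet.ParquetFile(files, root='test')
print(pf.cats)       # partition fields and their values
df = pf.to_pandas()  # includes 'group1', 'group2' and 'value'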

@martindurant (Member)

The situation is that, when there is no _metadata file and the top level of partitioning has only one value, the root of the parquet dataset is not obvious.
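To make that concrete: given only the file paths below, a reader cannot tell whether the dataset root is test/ (with partitions group1 and group2) or test/group1=A/ (with the single partition group2), because every path agrees on group1=A:

test/group1=A/group2=B/part.0.parquet
test/group1=A/group2=C/part.0.parquet

With two distinct group1 values, as in the three-row example above, only the first interpretation is consistent, which is why that dataset reads correctly.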

@rjzamora (Member)

> the fastparquet engine allows a base_path argument so that the top of the parquet dataset tree can be inferred correctly for this case, but I don't see how to pass it via dd.read_parquet.

This seems like the kind of thing I was hoping 8765 could help us with.
