Fix dt regression in empty() #898

Merged 1 commit on Oct 26, 2023

martindurant
Member

Fixes #897

@martindurant
Member Author

@jrbourbeau, I'll merge this when it passes, and that should be enough to make dask CI happy.

@jrbourbeau
Member

Thanks for fixing so quickly @martindurant!

Will there be a release out with this patch soon? We use releases in most CI builds (only one build uses main for fastparquet). If not, I'll just add some skip logic.
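For reference, skip logic of that kind could look roughly like the sketch below. This is not what dask actually added: `AFFECTED_RELEASES`, `installed_version`, and `should_skip` are hypothetical names, and the version string is a placeholder, since the thread does not say which release carries the regression.

```python
from importlib import metadata

# Hypothetical placeholder: replace with the release(s) known to carry the regression.
AFFECTED_RELEASES = frozenset({"0.0.0"})


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def should_skip(dist_name, affected=AFFECTED_RELEASES, lookup=installed_version):
    """Return True when the installed release is one of the affected ones.

    `lookup` is injectable so the decision can be checked without the
    package being installed.
    """
    return lookup(dist_name) in affected
```

The resulting boolean could then be passed as the condition to `pytest.mark.skipif(..., reason=...)` on the affected tests, so builds that install fastparquet from main keep running them.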

@martindurant
Member Author

> Will there be a release out with this patch soon

Yes, since the windows-py3.12 wheel failed to build in the last round anyway.

@martindurant martindurant merged commit 89acf38 into dask:main Oct 26, 2023
21 checks passed
@martindurant martindurant deleted the dt_again branch October 26, 2023 17:31
@martindurant
Member Author

@jrbourbeau, would you mind running your main-branch CI somewhere to see if the failures go away?

@jrbourbeau
Member

Locally I'm getting the same error:

____________________________________________________________________________________________________________________________________ test_timestamp96 _____________________________________________________________________________________________________________________________________

tmpdir = local('/private/var/folders/h0/_w6tz8jd3b9bk6w7d_xpg9t40000gn/T/pytest-of-james/pytest-21/test_timestamp960')

    @FASTPARQUET_MARK
    def test_timestamp96(tmpdir):
        fn = str(tmpdir)
        df = pd.DataFrame({"a": [pd.to_datetime("now", utc=True)]})
        ddf = dd.from_pandas(df, 1)
        ddf.to_parquet(fn, engine="fastparquet", write_index=False, times="int96")
        pf = fastparquet.ParquetFile(fn)
        assert pf._schema[1].type == fastparquet.parquet_thrift.Type.INT96
>       out = dd.read_parquet(fn, engine="fastparquet", index=False).compute()

dask/dataframe/io/tests/test_parquet.py:1883:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask/base.py:342: in compute
    (result,) = compute(self, traverse=False, **kwargs)
dask/base.py:628: in compute
    results = schedule(dsk, keys, **kwargs)
dask/dataframe/io/parquet/core.py:96: in __call__
    return read_parquet_part(
dask/dataframe/io/parquet/core.py:654: in read_parquet_part
    dfs = [
dask/dataframe/io/parquet/core.py:655: in <listcomp>
    func(
dask/dataframe/io/parquet/fastparquet.py:1075: in read_partition
    return cls.pf_to_pandas(
dask/dataframe/io/parquet/fastparquet.py:1115: in pf_to_pandas
    df, views = pf.pre_allocate(size, columns, categories, index)
../../../mambaforge/envs/dask-py310/lib/python3.10/site-packages/fastparquet/api.py:797: in pre_allocate
    df, arrs = _pre_allocate(size, columns, categories, index, cats,
../../../mambaforge/envs/dask-py310/lib/python3.10/site-packages/fastparquet/api.py:1051: in _pre_allocate
    df, views = dataframe.empty(dtypes, size, cols=cols, index_names=index,
../../../mambaforge/envs/dask-py310/lib/python3.10/site-packages/fastparquet/dataframe.py:202: in empty
    values = type(bvalues)._from_sequence(values, copy=False, dtype=bvalues.dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

pandas/_libs/tslibs/tzconversion.pyx:187: ValueError

Note that the line changed in this PR is similar, but not identical, to the line where the error is being raised. Maybe both lines need the same sort of update.

@martindurant
Member Author

What's your pandas version?

@jrbourbeau
Member

In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '1.5.3'

@martindurant
Member Author

OK, then I think all the pandas versions I have locally and in the tests are too new... Hold on.

Successfully merging this pull request may close these issues.

Regression due to _from_sequence