GeoArrowEngine error when reading Parquet files #241
Comments
Can you share the versions of dask and dask-geopandas you're using? I can't reproduce it with this simple example, regardless of whether I create a client / LocalCluster:

In [15]: import dask_geopandas, geopandas

In [16]: df = dask_geopandas.from_geopandas(geopandas.read_file(geopandas.datasets.get_path("nybb")), npartitions=2)

In [17]: df.to_parquet("/tmp/out.parquet")

In [18]: dask_geopandas.read_parquet("/tmp/out.parquet/").compute()
Out[18]:
   BoroCode       BoroName     Shape_Leng    Shape_Area                                           geometry
0         5  Staten Island  330470.010332  1.623820e+09  MULTIPOLYGON (((970217.022 145643.332, 970227....
1         4         Queens  896344.047763  3.045213e+09  MULTIPOLYGON (((1029606.077 156073.814, 102957...
2         3       Brooklyn  741080.523166  1.937479e+09  MULTIPOLYGON (((1021176.479 151374.797, 102100...
3         1      Manhattan  359299.096471  6.364715e+08  MULTIPOLYGON (((981219.056 188655.316, 980940....
4         2          Bronx  464392.991824  1.186925e+09  MULTIPOLYGON (((1012821.806 229228.265, 101278...

That's with dask 2023.3.1 and dask-geopandas main.

---
>>> geopandas.__version__
'0.12.2'
>>> dask_geopandas.__version__
'v0.3.0'
>>> dask.__version__
'2023.1.1'

To be clear, with those versions I get the following related error:

Full error message:

df = dask_geopandas.from_geopandas(geopandas.read_file(geopandas.datasets.get_path("nybb")), npartitions=2)
df.to_parquet("/tmp/out.parquet")
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[11], line 3
1 df = dask_geopandas.from_geopandas(geopandas.read_file(geopandas.datasets.get_path("nybb")), npartitions=2)
----> 3 df.to_parquet("/tmp/out.parquet")
File /opt/conda/envs/alpha/lib/python3.11/site-packages/dask_geopandas/core.py:617, in GeoDataFrame.to_parquet(self, path, *args, **kwargs)
614 """See dask_geopadandas.to_parquet docstring for more information"""
615 from .io.parquet import to_parquet
--> 617 return to_parquet(self, path, *args, **kwargs)
File /opt/conda/envs/alpha/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:940, in to_parquet(df, path, engine, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, name_function, **kwargs)
931 raise ValueError(
932 "User-defined key/value metadata (custom_metadata) can not "
933 "contain a b'pandas' key. This key is reserved by Pandas, "
934 "and overwriting the corresponding value can render the "
935 "entire dataset unreadable."
936 )
938 # Engine-specific initialization steps to write the dataset.
939 # Possibly create parquet metadata, and load existing stuff if appending
--> 940 i_offset, fmd, metadata_file_exists, extra_write_kwargs = engine.initialize_write(
941 df,
942 fs,
943 path,
944 append=append,
945 ignore_divisions=ignore_divisions,
946 partition_on=partition_on,
947 division_info=division_info,
948 index_cols=index_cols,
949 schema=schema,
950 custom_metadata=custom_metadata,
951 **kwargs,
952 )
954 # By default we only write a metadata file when appending if one already
955 # exists
956 if append and write_metadata_file is None:
AttributeError: type object 'GeoArrowEngine' has no attribute 'initialize_write'

---
Strange. I can't reproduce that using a new conda env with your commands. As Joris says, this doesn't make sense, because dask-geopandas inherits from the dask Arrow engine, so it must have the method.
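One way to see what GeoArrowEngine actually inherited in the failing environment is to inspect its MRO. This is a hypothetical diagnostic, not from the thread, and it assumes the engine lives in dask_geopandas.io.parquet:

import dask_geopandas.io.parquet as dgp

# In a working environment the MRO should include dask's Arrow engine;
# in the failing one it collapses to just (GeoArrowEngine, ..., object).
print(dgp.GeoArrowEngine.__mro__)
print(hasattr(dgp.GeoArrowEngine, "initialize_write"))

---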
Hi! I was able to look into this. If pyarrow is not installed, the inheritance falls apart because of the fallback import:

dask-geopandas/dask_geopandas/io/parquet.py, lines 15 to 22 in d3e15d1
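The referenced snippet was collapsed in the capture; the pattern being described is roughly the following sketch (the exact names in the dask-geopandas source may differ):

try:
    # Importing dask's Arrow engine pulls in pyarrow, an optional dependency.
    from dask.dataframe.io.parquet.arrow import ArrowDatasetEngine
except ImportError:
    # Fallback so the module still imports without pyarrow. The subclass
    # below then inherits from plain `object`, losing every engine method,
    # which later surfaces as the AttributeError shown above.
    ArrowDatasetEngine = object

class GeoArrowEngine(ArrowDatasetEngine):
    ...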
I think some environments include pyarrow by default, so you really need a clean environment to test this. A solution would be to raise an import error (or warning) when instantiating GeoArrowEngine if pyarrow was not properly imported; see the sketch below. To reiterate: without pyarrow installed, the to_parquet call above fails; with pyarrow installed, it works.
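A guard along these lines (hypothetical code, not from the repo) would turn the confusing AttributeError into an actionable ImportError:

# Continuing the sketch above: ArrowDatasetEngine is either dask's real
# engine or the `object` fallback.
class GeoArrowEngine(ArrowDatasetEngine):
    def __new__(cls, *args, **kwargs):
        # Fail loudly at instantiation when the pyarrow-backed base class
        # could not be imported.
        if ArrowDatasetEngine is object:
            raise ImportError(
                "GeoArrowEngine requires pyarrow for Parquet IO; "
                "install pyarrow and retry."
            )
        return super().__new__(cls)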
---
Original issue description:

I am trying to read this dataset:
https://github.com/urbangrammarai/signatures_gb

I cloned it locally (the repo is about 50 GB) within a dedicated environment, loaded the libraries, and then tried to lazily read the dataset, roughly as sketched below:
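The original snippet is not available here; this is a plausible reconstruction, where the path is a placeholder and not from the report:

import dask_geopandas

# Hypothetical reconstruction of the read call: lazily open the Parquet
# data from the cloned repository (exact path unknown).
ddf = dask_geopandas.read_parquet("signatures_gb/<path-to-parquet>")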
Which returns the GeoArrowEngine error described in the issue title.
A couple of questions: