GeoArrowEngine error when reading Parquet files #241
Comments
Can you share the versions of dask and dask-geopandas you're using? I can't reproduce it with this simple example, regardless of whether I create a client / LocalCluster:

In [15]: import dask_geopandas, geopandas

In [16]: df = dask_geopandas.from_geopandas(geopandas.read_file(geopandas.datasets.get_path("nybb")), npartitions=2)

In [17]: df.to_parquet("/tmp/out.parquet")

In [18]: dask_geopandas.read_parquet("/tmp/out.parquet/").compute()
Out[18]:
   BoroCode       BoroName     Shape_Leng    Shape_Area                                           geometry
0         5  Staten Island  330470.010332  1.623820e+09  MULTIPOLYGON (((970217.022 145643.332, 970227....
1         4         Queens  896344.047763  3.045213e+09  MULTIPOLYGON (((1029606.077 156073.814, 102957...
2         3       Brooklyn  741080.523166  1.937479e+09  MULTIPOLYGON (((1021176.479 151374.797, 102100...
3         1      Manhattan  359299.096471  6.364715e+08  MULTIPOLYGON (((981219.056 188655.316, 980940....
4         2          Bronx  464392.991824  1.186925e+09  MULTIPOLYGON (((1012821.806 229228.265, 101278...

That's with dask 2023.3.1 and dask-geopandas main.

---
>>> geopandas.__version__
'0.12.2'
>>> dask_geopandas.__version__
'v0.3.0'
>>> dask.__version__
'2023.1.1'

To be clear, with those versions I get the following related error:

Full error message:

df = dask_geopandas.from_geopandas(geopandas.read_file(geopandas.datasets.get_path("nybb")), npartitions=2)
df.to_parquet("/tmp/out.parquet")
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[11], line 3
1 df = dask_geopandas.from_geopandas(geopandas.read_file(geopandas.datasets.get_path("nybb")), npartitions=2)
----> 3 df.to_parquet("/tmp/out.parquet")
File /opt/conda/envs/alpha/lib/python3.11/site-packages/dask_geopandas/core.py:617, in GeoDataFrame.to_parquet(self, path, *args, **kwargs)
614 """See dask_geopadandas.to_parquet docstring for more information"""
615 from .io.parquet import to_parquet
--> 617 return to_parquet(self, path, *args, **kwargs)
File /opt/conda/envs/alpha/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:940, in to_parquet(df, path, engine, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, name_function, **kwargs)
931 raise ValueError(
932 "User-defined key/value metadata (custom_metadata) can not "
933 "contain a b'pandas' key. This key is reserved by Pandas, "
934 "and overwriting the corresponding value can render the "
935 "entire dataset unreadable."
936 )
938 # Engine-specific initialization steps to write the dataset.
939 # Possibly create parquet metadata, and load existing stuff if appending
--> 940 i_offset, fmd, metadata_file_exists, extra_write_kwargs = engine.initialize_write(
941 df,
942 fs,
943 path,
944 append=append,
945 ignore_divisions=ignore_divisions,
946 partition_on=partition_on,
947 division_info=division_info,
948 index_cols=index_cols,
949 schema=schema,
950 custom_metadata=custom_metadata,
951 **kwargs,
952 )
954 # By default we only write a metadata file when appending if one already
955 # exists
956 if append and write_metadata_file is None:
AttributeError: type object 'GeoArrowEngine' has no attribute 'initialize_write'

---
Strange. I can't reproduce that using a new conda env with your commands. As Joris says, this doesn't make sense, because dask-geopandas inherits from the dask Arrow engine, so it must have the method.
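One way to see what GeoArrowEngine actually inherited in the failing environment is to inspect its MRO. This is a hypothetical diagnostic, not from the thread, and it assumes the engine lives in dask_geopandas.io.parquet:

import dask_geopandas.io.parquet as dgp

# In a working environment the MRO should include dask's Arrow engine;
# in the failing one it collapses to just (GeoArrowEngine, ..., object).
print(dgp.GeoArrowEngine.__mro__)
print(hasattr(dgp.GeoArrowEngine, "initialize_write"))

---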
Hi! I was able to look into this. If pyarrow is not installed, the inheritance falls apart because of the fallback import:

dask-geopandas/dask_geopandas/io/parquet.py, lines 15 to 22 in d3e15d1
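The referenced snippet was collapsed in the capture; the pattern being described is roughly the following sketch (the exact names in the dask-geopandas source may differ):

try:
    # Importing dask's Arrow engine pulls in pyarrow, an optional dependency.
    from dask.dataframe.io.parquet.arrow import ArrowDatasetEngine
except ImportError:
    # Fallback so the module still imports without pyarrow. The subclass
    # below then inherits from plain `object`, losing every engine method,
    # which later surfaces as the AttributeError shown above.
    ArrowDatasetEngine = object

class GeoArrowEngine(ArrowDatasetEngine):
    ...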
I think some environments include pyarrow by default, so you really need a clean environment to test this. A solution would be to raise an import error (or warning) when instantiating GeoArrowEngine if pyarrow was not properly imported; see the sketch below. To reiterate: without pyarrow installed, the to_parquet call above fails; with pyarrow installed, it works.
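A guard along these lines (hypothetical code, not from the repo) would turn the confusing AttributeError into an actionable ImportError:

# Continuing the sketch above: ArrowDatasetEngine is either dask's real
# engine or the `object` fallback.
class GeoArrowEngine(ArrowDatasetEngine):
    def __new__(cls, *args, **kwargs):
        # Fail loudly at instantiation when the pyarrow-backed base class
        # could not be imported.
        if ArrowDatasetEngine is object:
            raise ImportError(
                "GeoArrowEngine requires pyarrow for Parquet IO; "
                "install pyarrow and retry."
            )
        return super().__new__(cls)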
---
Original issue description:

I am trying to read this dataset:
https://github.com/urbangrammarai/signatures_gb

I cloned it locally (the repo is about 50 GB) within a dedicated environment, loaded the libraries, and then tried to lazily read the dataset, roughly as sketched below:
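The original snippet is not available here; this is a plausible reconstruction, where the path is a placeholder and not from the report:

import dask_geopandas

# Hypothetical reconstruction of the read call: lazily open the Parquet
# data from the cloned repository (exact path unknown).
ddf = dask_geopandas.read_parquet("signatures_gb/<path-to-parquet>")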
Which returns the GeoArrowEngine error described in the issue title.
A couple of questions: