Recommendation on the Arrow-specific type for the WKB geometry column? #187
Comments
I would say that it is perhaps good practice to write row groups such that the WKB column has chunks that fit into a (non-large) binary array, although I think that's difficult to guarantee (@jorisvandenbossche would know better than I would). pyarrow prefers to chunk arrays that would contain more than 2 GB of content rather than return large binary arrays, but the R bindings don't.
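To make the row-group sizing point concrete, here is a minimal pyarrow sketch (the row group size chosen below is an arbitrary illustration, not a value recommended in this thread):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A plain (non-large) binary column holding WKB blobs; the byte strings here
# are placeholders rather than valid WKB.
table = pa.table({"geometry": pa.array([b"\x01\x02", b"\x03\x04"], type=pa.binary())})

# row_group_size caps the number of rows per row group; keeping it small enough
# that the WKB column stays well under 2 GB per group makes it likely that each
# column chunk can be read back as a regular (non-large) binary array.
pq.write_table(table, "example.parquet", row_group_size=65536)
```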
I think it's possible to end up with binary, large binary, or fixed-size binary that all use a byte array in Parquet land. (I don't know if it is desirable to allow all three of those, but they are all ways to represent binary.) GeoArrow doesn't mention the fixed-size option in its spec and I think we're planning to keep it that way (or perhaps I'm forgetting a previous thread).
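A quick way to see what each of those three Arrow types turns into after a Parquet round trip is a sketch like the following (assuming pyarrow; the result on the reading side can also depend on whether the Arrow schema is stored, as discussed further down):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# binary, large_binary and fixed_size_binary(4) all hold raw bytes in Arrow,
# but are distinct types; print what each one comes back as after a round trip.
for typ in (pa.binary(), pa.large_binary(), pa.binary(4)):
    table = pa.table({"wkb": pa.array([b"\x00" * 4, b"\x01" * 4], type=typ)})
    pq.write_table(table, "roundtrip.parquet")
    back = pq.read_table("roundtrip.parquet")
    print(typ, "->", back.schema.field("wkb").type)
```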
As another data point, I also only implemented
In the absence of more informed guidance on how binary-like things get encoded in Parquet files (I'm somewhat new to this), I would hazard a guess that it is probably a good idea to support all three. It's a bit of a learning curve, but in theory
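One possible way for a reader to support all three (a sketch of my own, not something the spec prescribes) is to normalize whatever comes back to a plain binary column before decoding WKB; note that casting a large_binary chunk down to binary only works while it holds less than 2 GB of data:

```python
import pyarrow as pa

def as_plain_binary(column: pa.ChunkedArray) -> pa.ChunkedArray:
    """Normalize binary, large_binary or fixed_size_binary to plain binary.

    Hypothetical helper: assumes each chunk holds < 2 GB of data, otherwise
    the cast from large_binary would fail.
    """
    if pa.types.is_binary(column.type):
        return column
    return column.cast(pa.binary())
```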
On the reading side (and when reading using the Arrow C++ library or its bindings), it depends on whether you have a file that was written by Arrow or not (i.e. whether it includes a serialized Arrow schema in the Parquet FileMetadata, which you can also turn off with
In the first bullet point above I said that this would always map to binary, not large_binary. That's based on looking at the code (and a code comment says this, but who knows if that might be outdated). I also tried to reproduce those cases, but I didn't get it to return large binary. Now, if the Parquet file was written with the Arrow schema stored (e.g. using pyarrow, or using GDAL, as I see it has that enabled), you will get back whatever Arrow schema the data was written with, and so in that case you can get other types than just binary.

For writing, I agree with what Dewey wrote above: generally you should use a row group size such that each column chunk fits into a (non-large) binary array.
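To make the stored-schema distinction above concrete, here is a small pyarrow sketch; `store_schema` is my assumption for the (truncated) option name mentioned earlier, so treat it as illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"wkb": pa.array([b"\x01", b"\x02"], type=pa.large_binary())})

# Assumption: pyarrow's `store_schema` flag controls whether the serialized
# Arrow schema is embedded in the Parquet FileMetadata.
pq.write_table(table, "with_schema.parquet", store_schema=True)
pq.write_table(table, "without_schema.parquet", store_schema=False)

# Expected (per the discussion above): the first read restores large_binary,
# the second falls back to the default BYTE_ARRAY -> binary mapping.
print(pq.read_table("with_schema.parquet").schema.field("wkb").type)
print(pq.read_table("without_schema.parquet").schema.field("wkb").type)
```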
I've tried to sum up the outcome of that discussion in #190
The GeoParquet spec rightly specifies the type of geometry columns in terms of the Parquet type: "Geometry columns MUST be stored using the BYTE_ARRAY parquet type".
Implementations using the Arrow library (typically from C++, but possibly from Python or other languages) might use the Arrow API for Parquet reading and writing, and thus not be directly exposed to Parquet types.
I came to realize recently that the GDAL implementation would only work for Arrow::Type::BINARY, but not for Arrow::Type::LARGE_BINARY. This has been addressed, on the reading side of the driver, for GDAL 3.8.0 per OSGeo/gdal#8618.
I'm not entirely clear on how the Arrow library maps a large Parquet file whose row groups contain a WKB column with more than 2 GB of content. I assume that would be Arrow::Type::LARGE_BINARY?
So the question is whether there should be some hints in the spec for implementations using the Arrow library, on both the reading and the writing side.
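On the reading side, for example, such a hint could simply be that implementations accept both variants; a minimal pyarrow sketch (the helper function is hypothetical, only the type predicates are pyarrow's):

```python
import pyarrow as pa

def is_wkb_geometry_field(field: pa.Field) -> bool:
    """Accept a geometry column regardless of which Arrow binary flavour
    the Parquet reader materialized it as."""
    return pa.types.is_binary(field.type) or pa.types.is_large_binary(field.type)
```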