
Recommendation on the Arrow specific type for the WKB geometry column? #187

Closed · rouault opened this issue Oct 27, 2023 · 5 comments · Fixed by #190
rouault (Contributor) commented Oct 27, 2023

The GeoParquet spec rightly specifies the type of geometry columns in terms of the Parquet type: "Geometry columns MUST be stored using the BYTE_ARRAY parquet type".
Implementations using the Arrow library (typically from C++, but possibly from Python or other languages) might use the Arrow API for Parquet reading and writing, and thus not be directly exposed to Parquet types.
I recently came to realize that the GDAL implementation would only work for Arrow::Type::BINARY, but not for Arrow::Type::LARGE_BINARY. This has been addressed, on the reading side of the driver, for GDAL 3.8.0 in OSGeo/gdal#8618.
I'm not entirely clear on how the Arrow library maps a Parquet file whose row group holds a WKB column with more than 2 GB of content. I assume that would be Arrow::Type::LARGE_BINARY?
So the question is whether the spec should include some hints for implementations using the Arrow library on:

  • which Arrow types they should expect on the reading side
  • which Arrow type(s) they should use on the writing side
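For concreteness, a minimal reading-side sketch with the Arrow C++ Parquet API (not from the original comment; the function name and the "geometry" column name are placeholders, and real code should take the column name from the GeoParquet metadata):

```cpp
#include <iostream>
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

// Read a GeoParquet file and report which Arrow type the WKB column maps to.
arrow::Status InspectWkbColumn(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));

  // "geometry" is a placeholder; the actual name comes from the "geo" metadata.
  auto column = table->GetColumnByName("geometry");
  if (!column) return arrow::Status::KeyError("no geometry column");
  // Depending on how the file was written, this may be binary or large_binary.
  std::cout << "Arrow type: " << column->type()->ToString() << "\n";
  // The reader may also hand the column back as several chunks.
  std::cout << "Chunks: " << column->num_chunks() << "\n";
  return arrow::Status::OK();
}
```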
paleolimbot (Collaborator) commented:

> which Arrow type(s) they should use on the writing side

I would say that it is perhaps good practice to write row groups such that the WKB column has chunks that fit into a (non-large) binary array, although I think that is difficult to guarantee (@jorisvandenbossche would know better than I would). pyarrow prefers to chunk arrays that would contain more than 2 GB of content rather than return large binary arrays, but the R bindings don't.
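A sketch of that practice with Arrow C++ (the function name and the 64 Ki row group size are illustrative, not from the thread): capping the row group size keeps each chunk of the WKB column well below the 2 GiB offset limit of the non-large binary type.

```cpp
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

// Write a table with a bounded row group size so that each row group's WKB
// chunk fits comfortably in a (non-large) binary array.
arrow::Status WriteWithSmallRowGroups(const std::shared_ptr<arrow::Table>& table,
                                      const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/64 * 1024);
}
```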

> which Arrow types they should expect on the reading side

I think it's possible to end up with binary, large binary, or fixed-size binary, all of which use a byte array in Parquet land. (I don't know whether it is desirable to allow all three of those, but they are all ways to represent binary.) GeoArrow doesn't mention the fixed-size option in its spec, and I think we're planning to keep it that way (or perhaps I'm forgetting a previous thread).

hobu commented Oct 30, 2023

As another data point, I also only implemented arrow::Type::BINARY for PDAL, but I was unsure whether I should support all three possibilities.

paleolimbot (Collaborator) commented:

In the absence of more informed guidance on how binary-like things get encoded in Parquet files (I'm somewhat new to this), I would wager that it is probably a good idea to support all three. It's a bit of a learning curve, but in theory arrow::type_traits ( https://github.com/apache/arrow/blob/main/cpp/src/arrow/type_traits.h#L334-L372 ) is there to help support all of them with mostly the same code.
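To make that concrete, here is a sketch (not from the original comment) of TypeTraits-based dispatch over the three binary-like types; it assumes a recent Arrow C++ where GetView() returns std::string_view, and the WKB parsing itself is left as a placeholder:

```cpp
#include <cstdint>
#include <string_view>

#include <arrow/api.h>
#include <arrow/type_traits.h>

// ArrowType is BinaryType, LargeBinaryType or FixedSizeBinaryType. TypeTraits
// maps it to the matching array class; all three classes expose GetView(i).
template <typename ArrowType>
void ProcessWkbArray(const arrow::Array& array) {
  using ArrayType = typename arrow::TypeTraits<ArrowType>::ArrayType;
  const auto& wkb = static_cast<const ArrayType&>(array);
  for (int64_t i = 0; i < wkb.length(); ++i) {
    if (wkb.IsNull(i)) continue;
    std::string_view value = wkb.GetView(i);
    // ... parse WKB from value.data() / value.size() ...
  }
}

// Dispatch on the runtime type id to the shared templated implementation.
arrow::Status ProcessWkbColumn(const arrow::Array& array) {
  switch (array.type_id()) {
    case arrow::Type::BINARY:
      ProcessWkbArray<arrow::BinaryType>(array);
      return arrow::Status::OK();
    case arrow::Type::LARGE_BINARY:
      ProcessWkbArray<arrow::LargeBinaryType>(array);
      return arrow::Status::OK();
    case arrow::Type::FIXED_SIZE_BINARY:
      ProcessWkbArray<arrow::FixedSizeBinaryType>(array);
      return arrow::Status::OK();
    default:
      return arrow::Status::TypeError("Unexpected WKB column type: ",
                                      array.type()->ToString());
  }
}
```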

jorisvandenbossche (Collaborator) commented:

On the reading side (when reading with the Arrow C++ library or its bindings), it depends on whether the file was written by Arrow, i.e. whether it includes a serialized Arrow schema in the Parquet FileMetadata (which you can also turn off with store_schema=False when writing):

  • If you have a plain Parquet file without Arrow schema information: a BYTE_ARRAY Parquet column is always mapped to Arrow's binary type.
  • If you have a Parquet file with Arrow schema information: Arrow will cast the binary data to that type while reading, so in theory you can get any of the variable-size binary types (binary, large_binary, binary_view), depending on how the file was written.

The fixed_size_binary Arrow type gets written as FIXED_LEN_BYTE_ARRAY on the Parquet side, so I think it should never come back when reading BYTE_ARRAY. Thus we can clearly exclude the fixed-size binary type from handling in applications like GDAL (given we clearly say that GeoParquet should use BYTE_ARRAY).
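A round-trip sketch of both bullets with Arrow C++ (the function name is hypothetical; in C++ the serialized Arrow schema is opt-in via ArrowWriterProperties, whereas pyarrow enables it by default):

```cpp
#include <iostream>
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

// Write a table whose "geometry" column is large_binary, with or without the
// serialized Arrow schema, then report the Arrow type seen when reading back.
arrow::Status RoundTripGeometry(const std::shared_ptr<arrow::Table>& table,
                                const std::string& path, bool store_schema) {
  parquet::ArrowWriterProperties::Builder builder;
  if (store_schema) builder.store_schema();
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), sink, /*chunk_size=*/1024,
      parquet::default_writer_properties(), builder.build()));

  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> result;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&result));
  // With store_schema: large_binary (the stored Arrow schema wins).
  // Without: binary (the default mapping of BYTE_ARRAY).
  std::cout << result->GetColumnByName("geometry")->type()->ToString() << "\n";
  return arrow::Status::OK();
}
```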

> I'm not entirely clear on how the Arrow library maps a Parquet file whose row group holds a WKB column with more than 2 GB of content. I assume that would be Arrow::Type::LARGE_BINARY?

In the first bullet point above I said that this always maps to binary, not large_binary. That's based on reading the code (a code comment says as much, though that might be outdated). I also tried to reproduce those cases, and I couldn't get it to return large binary:

  • When reading a Parquet file with a big row group containing many large string values that would require large_binary, Arrow C++ seems to always read it in chunks, even when I set the batch size to a value greater than the number of rows (at that point it seems to apply some internal maximum batch size or maximum data size to process at once). So in practice, yes, the Arrow C++ Parquet reader will always give you chunked data.
  • When trying to create a Parquet file with one huge string that would require large_binary, I get the error "Parquet cannot store strings with size 2GB or more".


Now, if the Parquet file was written with the Arrow schema stored (e.g. using pyarrow, or using GDAL, which I see has that enabled), you will get back whatever Arrow schema the data was written with, and in that case you can get types other than just binary. I think it is good for GDAL to handle large_binary (as you already do), and we can probably also recommend that in general for reading.

For writing, I agree with what Dewey wrote above: generally you should use a row group size such that binary will suffice in most cases (you would already need huge WKB values to hit the 2 GB limit with a few thousand rows; for example, a row group of 65,536 rows only exceeds the 2 GiB offset limit if its WKB values average more than 32 KiB each).

jorisvandenbossche changed the title from "Recommandation on the Arrow specific type for the WKB geometry column ?" to "Recommendation on the Arrow specific type for the WKB geometry column ?" on Nov 14, 2023
rouault (Contributor, Author) commented Nov 17, 2023

I've tried to sum up the outcome of that discussion in #190
