Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Parquet][C++] Behaviour of unknown logical type when encountered in Parquet reader #41764

Open
paleolimbot opened this issue May 21, 2024 · 4 comments

Comments

@paleolimbot
Copy link
Member

Describe the enhancement requested

In apache/parquet-format#240 there is concern regarding the ability to add a new logical type (in this case GEOMETRY) in a backwards compatible way such that readers that don't yet implement support for the new logical type can still read the file.

@jorisvandenbossche found the place where the error would be thrown:

throw ParquetException("Metadata contains Thrift LogicalType that is not recognized");

I'm not sure what the best behaviour would be here: it will help drive support for new logical types to actually be written to files if it's possible to know that older readers won't choke on them. There was some indication that this would be a bug ( apache/parquet-format#240 (comment) ); however, it is definitely safer for a reader in general to error when it encounters a type that it doesn't understand. On the other hand, Arrow C++ silently drops unregistered extension types which, if I'm understanding the issue, is roughly the same.

It seems like returning NoLogicalType::Make(); would fall back to the physical type here; however, it also seems like that should be opt-in somehow and I don't see an obvious route to "type inference" options or similar at that particular place in the code.

Component(s)

Parquet

@jorisvandenbossche
Copy link
Member

You can actually reproduce this easily with the new float16 logical type, by writing it with the latest Arrow:

>>> table = pa.table({"a":np.array([0.1, 0.2], "float16")})
>>> pq.write_table(table, "/tmp/test_float16.parquet")
>>> pq.read_metadata("/tmp/test_float16.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7ff5a53d6140>
required group field_id=-1 schema {
  optional fixed_len_byte_array(2) field_id=-1 a (Float16);
}

and reading that file with an older version:

>>> pq.read_metadata("/tmp/test_float16.parquet").schema
...
OSError: Metadata contains Thrift LogicalType that is not recognized

So also regardless of a possible future geometry type, this seems like a case that could be handled more gracefully.

@PeterAronson
Copy link

I don't know if it is still the same case, but a few years ago we ran into the same problem in Java (Paquet Java/Parquet MR) with the UUID annotation, back before they supported it. It also caused an error to be thrown. So it seems like it might be the case across libraries.

@jorisvandenbossche jorisvandenbossche added this to the 18.0.0 milestone Jul 30, 2024
@raulcd raulcd modified the milestones: 18.0.0, 19.0.0 Oct 9, 2024
@raulcd
Copy link
Member

raulcd commented Oct 9, 2024

@jorisvandenbossche @paleolimbot I've moved this to 19.0.0, let me know if this is a blocker

@paleolimbot
Copy link
Member Author

No problem! There is still some testing work to do here that I haven't gotten to 🙂

@amoeba amoeba modified the milestones: 19.0.0, 20.0.0 Jan 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants