Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error Instead of Panic On Attempting to Write More Than 32769 Row Groups #6591

Closed
pacman82 opened this issue Oct 19, 2024 · 8 comments · Fixed by #6629
Closed

Error Instead of Panic On Attempting to Write More Than 32769 Row Groups #6591

pacman82 opened this issue Oct 19, 2024 · 8 comments · Fixed by #6629
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@pacman82
Copy link

Describe the bug

i16 counting row groups overflows and becomes negative causing panic

To Reproduce

Writing 32769 row groups with the file writer

Expected behavior

Maybe an error indicating that too many batches have been written would be preferable. Alternatively it would be nice if this just worked, yet I could also get behind the thinking that this may be too many row groups for a single file anyway.

Additional context

Occurred in the context of a user running odbc2parquet. His row groups were very small (15 rows) due to an issue with his row sizes, causing him to write lots of row groups into a single file. See: pacman82/odbc2parquet#652

@pacman82 pacman82 added the bug label Oct 19, 2024
@tustvold
Copy link
Contributor

The i16 is actually limit enforced by the parquet format itself - https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L940

Row groups of this size are such a bad idea, the format actively prevents it 😅

That being said we could make this an error not a panic

@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog and removed bug labels Oct 19, 2024
@tustvold tustvold changed the title i16 overflow in SerializedRowGroupWriter Error Instead of Panic On Attempting to Write More Than 32769 Row Groups Oct 19, 2024
@pacman82
Copy link
Author

No disagreement here. I am exploring opportunities to change the UX of odbc2parquet in a way to avoid this scenario entirely, but still felt that the panic should be an error.

@alamb
Copy link
Contributor

alamb commented Oct 24, 2024

I agree error vs panic is a much nicer behavior

@etseidl
Copy link
Contributor

etseidl commented Oct 25, 2024

Is this fixed by #6378? I can't reproduce with the current head because next_row_group() will no longer return a rowgroup writer with an invalid ordinal.

@alamb
Copy link
Contributor

alamb commented Oct 25, 2024

Is this fixed by #6378? I can't reproduce with the current head because next_row_group() will no longer return a rowgroup writer with an invalid ordinal.

It appears that #6378 (thanks @progval ❤️ ) was released in version 53.1.0
https://github.com/apache/arrow-rs/blob/master/CHANGELOG-old.md#5310-2024-10-02

@alamb
Copy link
Contributor

alamb commented Nov 16, 2024

label_issue.py automatically added labels {'parquet'} from #6629

@pacman82
Copy link
Author

Great stuff, thanks!

@alamb
Copy link
Contributor

alamb commented Nov 18, 2024

Kudos to @etseidl -- BTW this should be available on crates.io in about 2 days -- it is included in

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants