Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix fragment corruption in the GlobalOrderWriter #4383

Merged
merged 2 commits into from
Sep 27, 2023

Conversation

davisp
Copy link
Contributor

@davisp davisp commented Sep 26, 2023

Previously there was an issue with the GlobalOrderWriter when using a
maximum fragment size. If we resumed a write where the first tile would
cause the partially written fragment to overflow the set maximum size,
we failed to check that a single tile could fit. This resulted in a
convoluted path where dim_tiles[-1].cell_num() return a reasonable
result (instead of segfaulting or throwing an exception).

This lead us to writing the wrong last_tile_cell_num value in the
FragmentMetadata. When reading the last tile from one of these corrupted
fragments, the logic in Tile::load_chunk_data would throw an exception
when the expected size was calculated off an invalid
last_tile_cell_num which caused an exception to be thrown and rendered
the last tile of the fragment unreadable.

The fix is simply to make sure we check if a single tile can fit and if
not, to create a new fragment. For a belt and suspenders approach to
safety, we've also changed the code in WriterBase::set_coords_metadata
to use std::vector::at instead of std::vector::operator[] so that
bounds checking is performed and assertion is thrown rather than writing
a bad FragmentMetadata to disk.


TYPE: BUG
DESC: Fix fragment corruption in the GlobalOrderWriter

@shortcut-integration
Copy link

This pull request has been linked to Shortcut Story #34072: Query produces no results. Suspected MariaDB crash.

@davisp davisp requested a review from KiterLuc September 26, 2023 15:20
@davisp davisp force-pushed the pd/sc-34072/fix-global-order-writer-fragment-corruption branch from 0d33951 to a3377ea Compare September 26, 2023 15:22
@davisp
Copy link
Contributor Author

davisp commented Sep 26, 2023

When we merge this, make sure to use the rebase instead of squash so that the two commits are kept in the history as there are two subtle halves to this bug that should be called out separately.

This test case explains part of a bug in the GlobalOrderWriter. When our
fragment is full at the beginning of a write to an existing fragment, we
would accidentally call `dim_tiles[-1].cell_num()` which would actually
manage to return a reasonable value instead of segfaulting. This was due
to `num_tiles_to_write` returning 0 coupled with our failure to check
for that situation until after a call to set_coords_metadata.

Swapping from the `std::vector::operator[]` to `std::vector::at` adds
bounds checking and now throws an exception instead of returning a
garbage value. This is better since without bounds checking we were
corrupting the last tile of every fragment when setting the wrong last
tile cell num.
Previously there was an issue with the GlobalOrderWriter when using a
maximum fragment size. If we resumed a write where the first tile would
cause the partially written fragment to overflow the set maximum size,
we failed to check that a single tile could fit. This resulted in a
convoluted path where `dim_tiles[-1].cell_num()` return a reasonable
result (instead of segfaulting or throwing an exception).

This lead us to writing the wrong `last_tile_cell_num` value in the
FragmentMetadata. When reading the last tile from one of these corrupted
fragments, the logic in `Tile::load_chunk_data` would throw an exception
when the expected size was calculated off an invalid
`last_tile_cell_num` which caused an exception to be thrown and rendered
the last tile of the fragment unreadable.

The fix is simply to make sure we check if a single tile can fit and if
not, to create a new fragment. For a belt and suspenders approach to
safety, we've also changed the code in `WriterBase::set_coords_metadata`
to use `std::vector::at` instead of `std::vector::operator[]` so that
bounds checking is performed and assertion is thrown rather than writing
a bad FragmentMetadata to disk.
@davisp davisp force-pushed the pd/sc-34072/fix-global-order-writer-fragment-corruption branch from a3377ea to ad76d38 Compare September 27, 2023 14:24
@davisp davisp merged commit cc4e7d1 into dev Sep 27, 2023
52 checks passed
@davisp davisp deleted the pd/sc-34072/fix-global-order-writer-fragment-corruption branch September 27, 2023 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants