Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Backport release-2.17] Fix fragment corruption in the GlobalOrderWriter #4389

Merged
merged 2 commits into from
Sep 27, 2023

Conversation

github-actions[bot]
Copy link
Contributor

Backport cc4e7d1~2..cc4e7d1 from #4383.

This test case explains part of a bug in the GlobalOrderWriter. When our
fragment is full at the beginning of a write to an existing fragment, we
would accidentally call `dim_tiles[-1].cell_num()` which would actually
manage to return a reasonable value instead of segfaulting. This was due
to `num_tiles_to_write` returning 0 coupled with our failure to check
for that situation until after a call to set_coords_metadata.

Swapping from the `std::vector::operator[]` to `std::vector::at` adds
bounds checking and now throws an exception instead of returning a
garbage value. This is better since without bounds checking we were
corrupting the last tile of every fragment when setting the wrong last
tile cell num.

(cherry picked from commit f914fd2)
Previously there was an issue with the GlobalOrderWriter when using a
maximum fragment size. If we resumed a write where the first tile would
cause the partially written fragment to overflow the set maximum size,
we failed to check that a single tile could fit. This resulted in a
convoluted path where `dim_tiles[-1].cell_num()` return a reasonable
result (instead of segfaulting or throwing an exception).

This lead us to writing the wrong `last_tile_cell_num` value in the
FragmentMetadata. When reading the last tile from one of these corrupted
fragments, the logic in `Tile::load_chunk_data` would throw an exception
when the expected size was calculated off an invalid
`last_tile_cell_num` which caused an exception to be thrown and rendered
the last tile of the fragment unreadable.

The fix is simply to make sure we check if a single tile can fit and if
not, to create a new fragment. For a belt and suspenders approach to
safety, we've also changed the code in `WriterBase::set_coords_metadata`
to use `std::vector::at` instead of `std::vector::operator[]` so that
bounds checking is performed and assertion is thrown rather than writing
a bad FragmentMetadata to disk.

(cherry picked from commit cc4e7d1)
@KiterLuc KiterLuc closed this Sep 27, 2023
@KiterLuc KiterLuc reopened this Sep 27, 2023
@KiterLuc KiterLuc merged commit cb05807 into release-2.17 Sep 27, 2023
@KiterLuc KiterLuc deleted the backport-4383-to-release-2.17 branch September 27, 2023 23:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants