
Fix computation of GZip filter overhead. #5296

Merged: 7 commits from teo/gzip-fix into dev on Sep 11, 2024
Conversation

@teo-tsirpanis (Member) commented Sep 10, 2024

SC-48428

The overhead of the DEFLATE algorithm is stated to be $5(\lfloor\frac{n}{16383}\rfloor+1)$, but we computed it as $5\lceil\frac{n}{16383}\rceil$, which evaluates to zero when the input is empty. This caused incomplete data to be written during compression, which in turn led to failures during decompression.

This PR updates GZip::overhead to use the first definition, and also updates GZip::compress to fail when the output buffer is not large enough.
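
For illustration, here is a minimal standalone sketch (editorial, not the actual TileDB code; the 16383-byte block size and 5-byte per-block overhead are taken from the description above) comparing the two formulas:

```cpp
// Sketch only: the old ceiling-based formula reserves zero bytes for an empty
// input, while the corrected floor-plus-one formula always reserves 5 bytes
// for the final (possibly empty) DEFLATE block.
#include <cstdint>
#include <iostream>

int main() {
  auto old_overhead = [](uint64_t n) { return 5 * ((n + 16382) / 16383); };  // 5 * ceil(n / 16383)
  auto new_overhead = [](uint64_t n) { return 5 * (n / 16383 + 1); };        // 5 * (floor(n / 16383) + 1)

  for (uint64_t n : {0ull, 1ull, 16383ull, 100000ull}) {
    std::cout << "n=" << n << "  old=" << old_overhead(n)
              << "  new=" << new_overhead(n) << '\n';
  }
  // n=0 prints old=0, new=5: with the old formula the output buffer could be
  // sized too small to hold even the empty compressed stream.
}
```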

Validated by a regression test written by @davisp, which failed before this change and succeeds after it.


TYPE: BUG
DESC: Fixed GZip compression of empty data.

@@ -0,0 +1,86 @@
#include <climits>

#include <tiledb/tiledb>
Contributor:

Let's add a comment somewhere that explains that this adds a var sized attribute encoded with the GZIP filter and that we write data that will result in a 0 sized var tile.
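
A rough sketch of the setup being described, using the public TileDB C++ API (the attribute name "a" and the helper function are illustrative placeholders, not the actual test code):

```cpp
#include <tiledb/tiledb>

// Illustrative only: a var-sized string attribute filtered with GZIP.
// Writing cells whose variable part is empty yields a zero-sized var tile,
// which is the case that exercised the overhead bug.
void add_gzip_var_attribute(tiledb::Context& ctx, tiledb::ArraySchema& schema) {
  tiledb::FilterList filters(ctx);
  filters.add_filter(tiledb::Filter(ctx, TILEDB_FILTER_GZIP));

  auto attr = tiledb::Attribute::create<std::string>(ctx, "a");  // var-sized
  attr.set_filter_list(filters);
  schema.add_attribute(attr);
}
```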

Member Author:

> that will result in a 0 sized var tile

On second look, this is quite non-obvious. How about I remove this regression test and write a targeted unit test for compressing a zero-sized buffer with the GZip filter?

Contributor:

Let's keep the regression test in place and just add a quick comment.

But you are right, we should probably add a separate unit test as well.

Member Author:

Updated the test's title, hope that's sufficient.

}

}  // namespace sm
}  // namespace tiledb
Contributor:

Let's take the opportunity to change this to tiledb::sm above instead of having the two separated.

Member Author:

Done.

out_buf_dec_storage.data(), out_buf_dec_storage.size()};

GZip::decompress(&in_buf_dec, &out_buf_dec);
// Check that
Contributor:

What are we checking here?

Member Author:

The root cause of the issue: that an empty buffer is compressed and then decompressed back correctly.
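
A self-contained way to see why this matters, sketched here with raw zlib rather than TileDB's GZip filter (so it is an illustration, not the test itself): even a zero-byte input produces a non-empty compressed stream, and that is exactly the overhead the output buffer has to accommodate.

```cpp
#include <cassert>
#include <vector>
#include <zlib.h>

int main() {
  const unsigned char empty[1] = {};                 // a zero-byte logical input
  std::vector<unsigned char> compressed(compressBound(0));
  uLongf comp_len = compressed.size();
  int rc = compress2(compressed.data(), &comp_len, empty, 0, Z_DEFAULT_COMPRESSION);
  assert(rc == Z_OK);
  assert(comp_len > 0);                              // even "nothing" takes a few bytes

  unsigned char out[1];
  uLongf out_len = sizeof(out);
  rc = uncompress(out, &out_len, compressed.data(), comp_len);
  assert(rc == Z_OK);
  assert(out_len == 0);                              // and round-trips back to zero bytes
  return 0;
}
```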

@KiterLuc merged commit dbdfabc into dev on Sep 11, 2024
62 checks passed
@KiterLuc deleted the teo/gzip-fix branch on September 11, 2024 at 15:26