
Fix computation of GZip filter overhead. #5296

Merged: 7 commits from teo/gzip-fix into dev on Sep 11, 2024
Conversation

@teo-tsirpanis (Member) commented Sep 10, 2024

SC-48428

The overhead of the DEFLATE algorithm is stated to be $5(\lfloor\frac{n}{16383}\rfloor+1)$, but we computed it as $5\lceil\frac{n}{16383}\rceil$, which evaluates to zero when the input is empty. This caused incomplete data to be written during compression, which in turn led to failures during decompression.

This PR updates GZip::overhead to use the first definition, and also updates GZip::compress to fail when the output buffer is not large enough.
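
For illustration, here is a minimal standalone sketch (editorial, not the actual TileDB code; the 16383-byte block size and 5-byte per-block overhead are taken from the description above) comparing the two formulas:

```cpp
// Sketch only: the old ceiling-based formula reserves zero bytes for an empty
// input, while the corrected floor-plus-one formula always reserves 5 bytes
// for the final (possibly empty) DEFLATE block.
#include <cstdint>
#include <iostream>

int main() {
  auto old_overhead = [](uint64_t n) { return 5 * ((n + 16382) / 16383); };  // 5 * ceil(n / 16383)
  auto new_overhead = [](uint64_t n) { return 5 * (n / 16383 + 1); };        // 5 * (floor(n / 16383) + 1)

  for (uint64_t n : {0ull, 1ull, 16383ull, 100000ull}) {
    std::cout << "n=" << n << "  old=" << old_overhead(n)
              << "  new=" << new_overhead(n) << '\n';
  }
  // n=0 prints old=0, new=5: with the old formula the output buffer could be
  // sized too small to hold even the empty compressed stream.
}
```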

Validated by a regression test written by @davisp, which failed before this change and succeeds after it.


TYPE: BUG
DESC: Fixed GZip compression of empty data.

@@ -0,0 +1,86 @@
#include <climits>

#include <tiledb/tiledb>
Contributor:

Let's add a comment somewhere that explains that this adds a var sized attribute encoded with the GZIP filter and that we write data that will result in a 0 sized var tile.
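
A rough sketch of the setup being described, using the public TileDB C++ API (the attribute name "a" and the helper function are illustrative placeholders, not the actual test code):

```cpp
#include <tiledb/tiledb>

// Illustrative only: a var-sized string attribute filtered with GZIP.
// Writing cells whose variable part is empty yields a zero-sized var tile,
// which is the case that exercised the overhead bug.
void add_gzip_var_attribute(tiledb::Context& ctx, tiledb::ArraySchema& schema) {
  tiledb::FilterList filters(ctx);
  filters.add_filter(tiledb::Filter(ctx, TILEDB_FILTER_GZIP));

  auto attr = tiledb::Attribute::create<std::string>(ctx, "a");  // var-sized
  attr.set_filter_list(filters);
  schema.add_attribute(attr);
}
```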

Member Author:

> that will result in a 0 sized var tile

On second look, this is quite non-obvious. How about I remove this regression test and write a targeted unit test for compressing a zero-sized buffer with the GZip filter?

Contributor:

Let's keep the regression test in place and just add a quick comment.

But you are right, we should probably add a separate unit test as well.

Member Author:

Updated the test's title, hope that's sufficient.

}

}  // namespace sm
}  // namespace tiledb
Contributor:

Let's take the opportunity to change this to tiledb::sm above instead of having the two separated.

Member Author:

Done.

out_buf_dec_storage.data(), out_buf_dec_storage.size()};

GZip::decompress(&in_buf_dec, &out_buf_dec);
// Check that
Contributor:

What are we checking here?

Member Author:

The root cause of the issue: that an empty buffer is compressed and then decompressed back correctly.
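
A self-contained way to see why this matters, sketched here with raw zlib rather than TileDB's GZip filter (so it is an illustration, not the test itself): even a zero-byte input produces a non-empty compressed stream, and that is exactly the overhead the output buffer has to accommodate.

```cpp
#include <cassert>
#include <vector>
#include <zlib.h>

int main() {
  const unsigned char empty[1] = {};                 // a zero-byte logical input
  std::vector<unsigned char> compressed(compressBound(0));
  uLongf comp_len = compressed.size();
  int rc = compress2(compressed.data(), &comp_len, empty, 0, Z_DEFAULT_COMPRESSION);
  assert(rc == Z_OK);
  assert(comp_len > 0);                              // even "nothing" takes a few bytes

  unsigned char out[1];
  uLongf out_len = sizeof(out);
  rc = uncompress(out, &out_len, compressed.data(), comp_len);
  assert(rc == Z_OK);
  assert(out_len == 0);                              // and round-trips back to zero bytes
  return 0;
}
```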

@KiterLuc merged commit dbdfabc into dev on Sep 11, 2024
62 checks passed
@KiterLuc deleted the teo/gzip-fix branch on September 11, 2024 at 15:26