Investigate LZ4 filter compatibility #134

Open
derobins opened this issue Aug 15, 2024 · 1 comment
Labels
Filter - LZ4 Priority - 1. High 🔼 These are important issues that should be resolved in the next release

Comments

@derobins
Member

@epourmal says that the LZ4 filter is not a direct LZ4 compression filter, but instead adds extra bytes to the buffer, which makes it difficult to extract chunks and decompress them directly.

@derobins derobins added Filter - LZ4 Priority - 1. High 🔼 These are important issues that should be resolved in the next release labels Aug 15, 2024
bmribler added a commit to bmribler/hdf5_plugins that referenced this issue Nov 4, 2024
@mkitti

mkitti commented Nov 19, 2024

There is indeed a 12-byte header where the decompressed size and block size are stored as 64-bit and 32-bit big-endian integers, respectively:

rpos = (char*)*buf; /* pointer to current read position */
roBuf = (char*)outBuf; /* pointer to current write position */
/* header */
i64Buf = (uint64_t *) (roBuf);
i64Buf[0] = htobe64t((uint64_t)nbytes); /* Store decompressed size in be format */
roBuf += 8;
i32Buf = (uint32_t *) (roBuf);
i32Buf[0] = htobe32t((uint32_t)blockSize); /* Store the block size in be format */
roBuf += 4;
outSize = 12; /* size of the output buffer. Header size (12 bytes) is included */
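
For reading such a chunk outside the filter, the 12-byte header could be decoded along these lines. This is only a minimal sketch based on the layout in the excerpt above (8-byte big-endian decompressed size, then 4-byte big-endian block size); the names `lz4_h5_header`, `read_be64`, `read_be32`, and `parse_lz4_h5_header` are hypothetical and not part of the plugin.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the two header fields. */
typedef struct {
    uint64_t orig_size;  /* decompressed size, first 8 bytes, big endian */
    uint32_t block_size; /* block size, next 4 bytes, big endian */
} lz4_h5_header;

/* Assemble a big-endian 64-bit value byte by byte (host-endian independent). */
static uint64_t read_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Assemble a big-endian 32-bit value byte by byte. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the arguments are NULL or the buffer
 * is shorter than the 12-byte header. */
static int parse_lz4_h5_header(const unsigned char *buf, size_t len,
                               lz4_h5_header *out)
{
    if (buf == NULL || out == NULL || len < 12)
        return -1;
    out->orig_size  = read_be64(buf);
    out->block_size = read_be32(buf + 8);
    return 0;
}
```

Reading the bytes one at a time avoids unaligned pointer casts and does not depend on the host byte order.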

A more standard approach would have been to use the LZ4 frame format, which can optionally encode the decompressed size:
https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md

I suggest supporting the standard frame format by checking for the little-endian magic number, 0x184D2204, as documented in the frame format specification above. These four bytes could in principle be the start of an 8-byte big-endian integer describing a very large chunk, on the order of 0x04224d1800000000 bytes (roughly 300 petabytes), but I think it may be possible to employ some heuristics to distinguish these two possibilities.
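
One possible shape for such a heuristic, as a rough sketch only: treat a chunk as an LZ4 frame when it starts with the magic bytes 04 22 4D 18, and otherwise accept the 12-byte header only if the decoded size is plausible. The enum, the function name `guess_chunk_format`, and the 1 TiB cap `MAX_PLAUSIBLE_CHUNK` are assumptions made for illustration and would need tuning against real files.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical classification result. */
typedef enum { CHUNK_UNKNOWN, CHUNK_H5_HEADER, CHUNK_LZ4_FRAME } chunk_format;

/* Assumed upper bound on a believable decompressed chunk size (1 TiB). */
#define MAX_PLAUSIBLE_CHUNK ((uint64_t)1 << 40)

static chunk_format guess_chunk_format(const unsigned char *buf, size_t len)
{
    /* 0x184D2204 stored little endian, as in the LZ4 frame format. */
    static const unsigned char frame_magic[4] = {0x04, 0x22, 0x4D, 0x18};

    if (buf == NULL || len < 4)
        return CHUNK_UNKNOWN;

    /* Read as a big-endian size, these bytes would imply a chunk of at
     * least 0x04224d1800000000 bytes, so a frame is the likelier reading. */
    if (memcmp(buf, frame_magic, 4) == 0)
        return CHUNK_LZ4_FRAME;

    if (len < 12)
        return CHUNK_UNKNOWN;

    /* Decode the 8-byte big-endian decompressed size from the header. */
    uint64_t orig_size = 0;
    for (int i = 0; i < 8; i++)
        orig_size = (orig_size << 8) | buf[i];

    /* Reject sizes beyond the assumed cap rather than guessing. */
    if (orig_size > MAX_PLAUSIBLE_CHUNK)
        return CHUNK_UNKNOWN;

    return CHUNK_H5_HEADER;
}
```

Checking the magic bytes first keeps existing files readable, since any header that happened to collide with the magic would describe a chunk far larger than the cap.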

3 participants