GH-37756: [Format][Docs] Document IPC Compression #43950

Merged: 18 commits, Sep 17, 2024
Changes from 9 commits
56 changes: 56 additions & 0 deletions docs/source/format/Columnar.rst
@@ -1284,6 +1284,8 @@ We additionally provide both schema-level and field-level
``custom_metadata`` attributes allowing for systems to insert their
own application defined metadata to customize behavior.

.. _ipc-recordbatch-message:

RecordBatch message
-------------------

@@ -1385,6 +1387,60 @@ have two entries in each RecordBatch. For a RecordBatch of this schema with
buffer 13: col2 data


Compression
-----------

Record batch body buffers can be stored in one of three ways:
uncompressed, compressed with the ``lz4`` compression codec, or
compressed with the ``zstd`` compression codec. Within a single
message body, the buffers in the flat sequence must either all be
uncompressed or each be compressed separately using the same codec.

.. note::

   The ``lz4`` compression codec refers to the
   `LZ4 frame format <https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md>`_
   and should not be confused with the
   `"raw" (also called "block") format <https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md>`_.

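
As a rough illustration of the distinction in the note above, an LZ4 frame
starts with the 4-byte little-endian magic number ``0x184D2204``, while a raw
block has no such header. The sketch below checks for that magic number. The
function name ``looks_like_lz4_frame`` is hypothetical, and the check is only
a sanity test: a raw block could begin with the same bytes by coincidence,
and a frame stream may legally begin with a skippable frame instead.

```python
import struct

# Magic number defined by the LZ4 frame format, stored little-endian
# in the first 4 bytes of a frame.
LZ4_FRAME_MAGIC = 0x184D2204


def looks_like_lz4_frame(payload: bytes) -> bool:
    """Return True if ``payload`` begins with the LZ4 frame magic number.

    Hypothetical helper for illustration; it does not validate the rest
    of the frame.
    """
    if len(payload) < 4:
        return False
    (magic,) = struct.unpack_from("<I", payload, 0)
    return magic == LZ4_FRAME_MAGIC
```

When applied to an Arrow IPC body buffer, the bytes to inspect would be the
compressed payload that follows the 8-byte length prefix, not the start of
the buffer itself.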

The difference between compressed and uncompressed buffers in the
serialized form is as follows:

* If the buffers in the :ref:`ipc-recordbatch-message` are **compressed**

  - the ``data header`` includes the length and memory offset
    of each **compressed buffer** in the record batch's body

  - the ``body`` includes a flat sequence of **compressed buffers**,
    each preceded by the **length of the uncompressed buffer** as a
    64-bit little-endian signed integer stored in its first 8 bytes.
    This length may be set to ``-1`` to indicate that the bytes that
    follow are not compressed, which is useful when compression does
    not yield appreciable savings

* If the buffers in the :ref:`ipc-recordbatch-message` are **uncompressed**

  - the ``data header`` includes the length and memory offset
    of each **uncompressed buffer** in the record batch's body

  - the ``body`` includes a flat sequence of **uncompressed buffers**
    with no length prefix

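
The 8-byte length prefix described above can be sketched in a few lines. This
is a minimal illustration, not any implementation's actual code: the helper
names ``frame_buffer`` and ``unframe_buffer`` are hypothetical, and ``zlib``
is used only as a stand-in codec so the example runs with the standard
library (Arrow IPC itself allows only the LZ4 frame format and ZSTD). A
prefix of ``-1`` marks a buffer whose bytes are stored without compression.

```python
import struct
import zlib
from typing import Callable

UNCOMPRESSED_SENTINEL = -1      # prefix value: the bytes that follow are raw
PREFIX = struct.Struct("<q")    # 64-bit little-endian signed integer


def frame_buffer(raw: bytes, compress: Callable[[bytes], bytes]) -> bytes:
    """Return the 8-byte uncompressed-length prefix followed by the payload.

    If compression does not shrink the data, store the raw bytes and use
    the -1 sentinel as the prefix instead.
    """
    compressed = compress(raw)
    if len(compressed) >= len(raw):
        return PREFIX.pack(UNCOMPRESSED_SENTINEL) + raw
    return PREFIX.pack(len(raw)) + compressed


def unframe_buffer(framed: bytes, decompress: Callable[[bytes], bytes]) -> bytes:
    """Read the prefix and recover the original bytes of one body buffer."""
    (uncompressed_len,) = PREFIX.unpack_from(framed, 0)
    payload = framed[PREFIX.size:]
    if uncompressed_len == UNCOMPRESSED_SENTINEL:
        return payload
    raw = decompress(payload)
    if len(raw) != uncompressed_len:
        raise ValueError("uncompressed length does not match the prefix")
    return raw


# Round trip with the stand-in codec (zlib is NOT a valid Arrow IPC codec).
framed = frame_buffer(b"arrow" * 1000, zlib.compress)
assert unframe_buffer(framed, zlib.decompress) == b"arrow" * 1000
```

In a real message, a reader would first slice each buffer out of the body
using the offset and length from the data header; alignment padding between
buffers is omitted here.
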
.. note::

   Some Arrow implementations lack support for producing and consuming
   IPC data with compressed buffers using one or both of the codecs
   listed above. See :doc:`../status` for details.

Some applications might apply compression in the protocol they use
to store or transport Arrow IPC data. (For example, an HTTP server
might serve gzip-compressed Arrow IPC streams.) Applications that
already use compression in their storage or transport protocols
should avoid using buffer compression. Double compression typically
worsens performance and does not substantially improve compression
ratios.
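
The point about double compression can be checked with a small experiment.
The sketch below is only an illustration under stated assumptions: ``zlib``
stands in for both the buffer-level codec and the transport-level codec, and
the sample data is synthetic. Exact sizes depend on the data and the codecs
involved.

```python
import random
import zlib

# Build compressible sample data; the seed keeps the run repeatable.
rng = random.Random(0)
words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon", b"zeta"]
payload = b" ".join(rng.choice(words) for _ in range(20_000))

once = zlib.compress(payload)   # stand-in for buffer-level compression
twice = zlib.compress(once)     # stand-in for transport compression on top

# Compare the original size with the size after one and two passes.
print(len(payload), len(once), len(twice))
```

The second pass spends extra CPU time on data that is already close to
incompressible, so its output is expected to be about the same size as its
input.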


Byte Order (`Endianness`_)
---------------------------
