[FEA] Improve ORC reader filtering and performance #13882
Labels: 0 - Backlog (in queue waiting for assignment), cuIO (cuIO issue), feature request (new feature or request), libcudf (affects libcudf C++/CUDA code)
Background
libcudf includes readers and writers for two popular binary formats for columnar data: Apache Parquet and Apache ORC. Both formats were originally introduced in 2013, and both have open-source specifications (ORC, PQ) and reference implementations (ORC, PQ) maintained by Apache. ORC also serves as the foundation for Meta's DWRF variant and their new format "Alpha".
Both formats have hierarchical data layouts, support encoding and compression, include fully featured type systems, and are widely used in database systems and data warehousing. Please refer to the paper by Zeng et al. for a detailed comparison of the concepts, features, and performance of the Parquet and ORC binary formats. Some of the differences include:
- Parquet files are composed of "row groups" (~128 MB) subdivided into "pages" (~1 MB).
- ORC files are composed of "stripes" (~70 MB) subdivided into "row groups" (10K rows).
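The hierarchical layouts described above can be sketched with a few structs. This is a conceptual illustration only; the type names (`OrcRowGroup`, `OrcStripe`, `OrcFile`, `prunable_units`) are invented for this sketch and are not libcudf or Apache ORC types.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative structs (not libcudf/ORC types) mirroring the hierarchy
// described above: ORC nests fixed-row-count "row groups" (10K rows)
// inside ~70 MB "stripes".
struct OrcRowGroup { uint32_t num_rows; };
struct OrcStripe   { std::vector<OrcRowGroup> row_groups; };
struct OrcFile     { std::vector<OrcStripe> stripes; };

// The fine granularity matters for filtering: each row group is an
// independently skippable unit, so even a modest file exposes many
// pruning points.
inline std::size_t prunable_units(OrcFile const& f) {
  std::size_t n = 0;
  for (auto const& s : f.stripes) n += s.row_groups.size();
  return n;
}
```

Because an ORC row group covers only 10K rows while a Parquet row group spans on the order of 128 MB, statistics-based skipping in ORC can operate at a much finer granularity.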
Expanding functionality of the ORC reader
The libcudf Parquet reader has gained functionality in key areas, including the chunked reader (release 22.12), which controls how much of a table is materialized at once, and AST-based filtering (release 23.08), which avoids reading row groups that aren't needed. Filtered IO (including bloom filters) is even more important for ORC users thanks to the fine granularity of ORC row groups (10K rows per row group). We should align our Parquet and ORC reader designs and separate shared utilities from format-specific details wherever possible.
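The core of statistics-based filtering can be sketched as a min/max overlap test per row group. This is a hypothetical standalone sketch, not libcudf code: `RowGroupStats` and `select_row_groups` are invented names, and real readers operate on typed statistics decoded from the file footer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-row-group column statistics as stored in ORC/Parquet metadata
// (illustrative struct, not the libcudf representation).
struct RowGroupStats {
  int64_t min;
  int64_t max;
};

// Return indices of row groups whose [min, max] range overlaps [lo, hi],
// i.e. groups that *may* contain rows satisfying  lo <= x <= hi.
// A group is skipped only when its statistics prove no row can match,
// so the result is conservative: it may keep groups with no matches,
// but never drops a group that has one.
std::vector<std::size_t> select_row_groups(
    std::vector<RowGroupStats> const& stats, int64_t lo, int64_t hi) {
  std::vector<std::size_t> keep;
  for (std::size_t i = 0; i < stats.size(); ++i) {
    if (stats[i].max >= lo && stats[i].min <= hi) keep.push_back(i);
  }
  return keep;
}
```

The same overlap test works for both formats; only the decoding of the statistics from the footer metadata is format-specific, which is what motivates splitting shared pruning logic from format-specific details.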
The existing read_raw_orc_statistics function can support these steps. We may refactor some of the AST + min/max stats tools into shared utilities. Also see issue #12512, "Performance optimizations for binary format reading".
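To show how AST tools combine with min/max stats, here is a minimal sketch of an expression tree evaluated conservatively against row-group statistics. The node types (`Expr`, `LessThan`, `GreaterThan`, `And`) are invented for illustration and are not `cudf::ast` types; each node answers "may any row in this group match?", so pruning never discards a group that could contain a match.

```cpp
#include <cstdint>
#include <memory>
#include <utility>

// Per-row-group min/max statistics (illustrative, not the libcudf type).
struct Stats { int64_t min, max; };

// Base class for predicate nodes evaluated against statistics.
struct Expr {
  virtual ~Expr() = default;
  virtual bool may_match(Stats const& s) const = 0;
};

// col < literal : impossible only when even the group's minimum is too large.
struct LessThan : Expr {
  int64_t lit;
  explicit LessThan(int64_t v) : lit(v) {}
  bool may_match(Stats const& s) const override { return s.min < lit; }
};

// col > literal : impossible only when even the group's maximum is too small.
struct GreaterThan : Expr {
  int64_t lit;
  explicit GreaterThan(int64_t v) : lit(v) {}
  bool may_match(Stats const& s) const override { return s.max > lit; }
};

// Logical AND: a group may match only if both children may match.
struct And : Expr {
  std::unique_ptr<Expr> l, r;
  And(std::unique_ptr<Expr> a, std::unique_ptr<Expr> b)
    : l(std::move(a)), r(std::move(b)) {}
  bool may_match(Stats const& s) const override {
    return l->may_match(s) && r->may_match(s);
  }
};
```

Factoring this kind of stats-driven AST evaluation out of the Parquet reader is one way the shared utilities mentioned above could take shape; the raw per-stripe statistics blobs returned by read_raw_orc_statistics would first need format-specific decoding into a typed form like `Stats`.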