
GH-1452: implement Size() filter for repeated columns #3098

Open · wants to merge 12 commits into base: master

Conversation

@clairemcginty (Contributor):

Rationale for this change

This PR continues the work outlined in #1452. It implements a size() predicate for filtering on the number of elements in repeated fields:

FilterPredicate hasThreeElements = size(intColumn("my_list_field"), Operators.Size.Operator.EQ, 3);

What changes are included in this PR?

Size() and not(size()) are implemented for all list fields with a required element type. Attempting to filter on a list of optional elements will throw an exception in the schema validator. This is because the existing record-level filtering setup (IncrementallyUpdatedFilterPredicateEvaluator) only feeds non-null values to the ValueInspectors; thus, if you had an array [1, 2, null, 4], it would only count 3 elements. I can file a ticket to support this eventually, but I think we'd have to rework the FilteringRecordMaterializer to be aware of repetition/definition levels.

The list group itself can be optional or required. Null lists are treated as having size 0. Again, this is due to difficulty disambiguating them at the record-level filtering step. (Would love feedback on both these design decisions!!)

Are these changes tested?

Unit tests + tested a snapshot build locally with real datasets

Are there any user-facing changes?

New Operators API

Part of #1452

Comment on lines +240 to +283
// If all values have repetition level 0, then no array has more than 1 element
if (repetitionLevelHistogram.size() == 1
|| repetitionLevelHistogram.subList(1, repetitionLevelHistogram.size()).stream()
.allMatch(l -> l == 0)) {

// Null list fields are treated as having size 0
if (( // all lists are nulls
definitionLevelHistogram.subList(1, definitionLevelHistogram.size()).stream()
.allMatch(l -> l == 0))
|| // all lists are size 0
(definitionLevelHistogram.get(0) == 0
&& definitionLevelHistogram.subList(2, definitionLevelHistogram.size()).stream()
.allMatch(l -> l == 0))) {

final boolean blockCannotMatch =
size.filter((eq) -> eq > 0, (lt) -> false, (lte) -> false, (gt) -> gt >= 0, (gte) -> gte > 0);
return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
}

long maxDefinitionLevel = definitionLevelHistogram.get(definitionLevelHistogram.size() - 1);

// If all repetition levels are zero and all definition levels are > MAX_DEFINITION_LEVEL - 1, all lists
// are of size 1
if (definitionLevelHistogram.stream().allMatch(l -> l > maxDefinitionLevel - 1)) {
final boolean blockCannotMatch = size.filter(
(eq) -> eq != 1, (lt) -> lt <= 1, (lte) -> lte < 1, (gt) -> gt >= 1, (gte) -> gte > 1);

return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
}
}
long nonNullElementCount =
repetitionLevelHistogram.stream().mapToLong(l -> l).sum() - definitionLevelHistogram.get(0);
long numNonNullRecords = repetitionLevelHistogram.get(0) - definitionLevelHistogram.get(0);

// Given the total number of elements and non-null fields, we can compute the max size of any array field
long maxArrayElementCount = 1 + (nonNullElementCount - numNonNullRecords);
final boolean blockCannotMatch = size.filter(
(eq) -> eq > maxArrayElementCount,
(lt) -> false,
(lte) -> false,
(gt) -> gt >= maxArrayElementCount,
(gte) -> gte > maxArrayElementCount);

return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
@clairemcginty (Contributor, Author):

hopefully this is a faithful transcription of the logic outlined here: #1452 (comment)
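
For readers following along, the worst-case bound in the final branch above can be illustrated with a tiny self-contained sketch. The class and method names here are illustrative, not the PR's actual StatisticsFilter code, and it assumes a single-level (un-nested) list column:

```java
import java.util.List;

// Illustrative sketch of the final-branch bound: given the rep/def level
// histograms of a single-level list column, compute the largest size any
// one list could possibly have.
public class SizeBoundSketch {
  static long maxPossibleListSize(List<Long> repLevelHistogram, List<Long> defLevelHistogram) {
    // Every value appears once in the repetition-level histogram, so the
    // non-null element count is the histogram sum minus the null lists
    // (which sit at definition level 0).
    long nonNullElementCount =
        repLevelHistogram.stream().mapToLong(Long::longValue).sum() - defLevelHistogram.get(0);
    // Values with repetition level 0 start a new record; subtracting the
    // null lists leaves the records whose list is actually present.
    long numNonNullRecords = repLevelHistogram.get(0) - defLevelHistogram.get(0);
    // Worst case: one list absorbs every element beyond one-per-record.
    return 1 + (nonNullElementCount - numNonNullRecords);
  }

  public static void main(String[] args) {
    // 3 records, 6 elements total, no nulls: the largest list holds at most 4.
    System.out.println(maxPossibleListSize(List.of(3L, 3L), List.of(0L, 0L, 6L))); // prints 4
  }
}
```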

Member:

IIUC, these are true for only un-nested list type, right?

assertFalse(canDrop(size(nestedListColumn, Operators.Size.Operator.EQ, 0), columnMeta));
}

private static SizeStatistics createSizeStatisticsForRepeatedField(
@clairemcginty (Contributor, Author):

I'm dynamically generating SizeStatistics for each test case, which does add a lot of LOC to the file; I could also just replace it with the computed SizeStatistics for each test case if that's simpler. I just wrote it this way originally because I wasn't that confident in my ability to translate the striping algorithm by hand for all these cases 😅

public CountingValueInspector(ValueInspector delegate, Function<Long, Boolean> shouldUpdateDelegate) {
this.observedValueCount = 0;
this.delegate = delegate;
this.shouldUpdateDelegate = shouldUpdateDelegate;
@clairemcginty (Contributor, Author):

Note: the shouldUpdateDelegate is needed since we don't want to terminate prematurely with a false positive. For example, if we're filtering on size(eq(3)) but the input array has 4 elements, we want to prevent the delegated Eq from returning true after it hits the third element because it thinks the condition is satisfied.
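
A minimal way to see the failure mode described here (hypothetical sketch; the real CountingValueInspector wraps a delegate ValueInspector rather than taking a predicate):

```java
import java.util.List;
import java.util.function.LongPredicate;

// Minimal illustration of why a size predicate must be decided at record end,
// not after each element.
public class SizeEvalSketch {
  static boolean evaluateAtRecordEnd(List<Integer> elements, LongPredicate sizePredicate) {
    long observedValueCount = 0;
    for (int ignored : elements) {
      observedValueCount++;
      // A naive eq(3) delegate consulted here would report a match at the
      // third element of [1, 2, 3, 4], even though the final size is 4.
    }
    // Only once the record is complete is the count conclusive.
    return sizePredicate.test(observedValueCount);
  }

  public static void main(String[] args) {
    LongPredicate sizeEq3 = n -> n == 3;
    System.out.println(evaluateAtRecordEnd(List.of(1, 2, 3, 4), sizeEq3)); // prints false
    System.out.println(evaluateAtRecordEnd(List.of(1, 2, 3), sizeEq3));    // prints true
  }
}
```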

Member:

Perhaps this is worth a comment?

@@ -378,6 +379,11 @@ public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(Contains<T> conta
indices -> IndexIterator.all(getPageCount()));
}

@Override
public PrimitiveIterator.OfInt visit(Size size) {
return IndexIterator.all(getPageCount());
@clairemcginty (Contributor, Author):

repetitionLevelHistogram and definitionLevelHistogram are both in scope here, should I repeat the logic from StatisticsFilter or is that completely redundant?

Member:

Repeating the same logic is necessary if you want to support page-level filtering.

@clairemcginty (Contributor, Author):

cool, I'll implement it 👍 To check my understanding, the rep- and def-level histograms we have access to here are implemented as flat List<Long> and represent the levels for all pages concatenated together:

  /**
   * Contains repetition level histograms for each page
   * concatenated together.  The repetition_level_histogram field on
   * SizeStatistics contains more details.
   *
   * When present the length should always be (number of pages *
   * (max_repetition_level + 1)) elements.
   *
   * Element 0 is the first element of the histogram for the first page.
   * Element (max_repetition_level + 1) is the first element of the histogram
   * for the second page.
   **/

So I'll need to break up the flat lists into per-page histograms in order to perform per-page filtering here. But a comment in ColumnIndexBuilder indicates that we don't have access to maxRepetitionLevel here.

I guess if all histograms across all pages are the same size and we know that {rep,def}LevelHistogram.size() % pageCount == 0, I could just divide the total histogram size by pageCount to get the size of each individual histogram?
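
The splitting step proposed above could look something like this sketch (assuming, as proposed, that the flat histogram length divides evenly by the page count; class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of splitting the flat concatenated histogram into per-page
// histograms, assuming every page contributes the same number of buckets
// (max_level + 1). Illustrative only.
public class HistogramSplitSketch {
  static List<List<Long>> splitByPage(List<Long> flatHistogram, int pageCount) {
    if (flatHistogram.size() % pageCount != 0) {
      throw new IllegalArgumentException("histogram length must be a multiple of the page count");
    }
    int bucketsPerPage = flatHistogram.size() / pageCount; // inferred max_level + 1
    List<List<Long>> perPage = new ArrayList<>();
    for (int page = 0; page < pageCount; page++) {
      perPage.add(flatHistogram.subList(page * bucketsPerPage, (page + 1) * bucketsPerPage));
    }
    return perPage;
  }

  public static void main(String[] args) {
    // 2 pages, max_repetition_level = 1: [5, 2] for page 0, [4, 3] for page 1
    System.out.println(splitByPage(List.of(5L, 2L, 4L, 3L), 2)); // prints [[5, 2], [4, 3]]
  }
}
```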

@clairemcginty clairemcginty marked this pull request as ready for review December 6, 2024 18:06
Comment on lines 515 to 521
final boolean blockCannotMatch = size.filter(
(eq) -> eq < numDistinctValues,
(lt) -> lt <= numDistinctValues,
(lte) -> lte < numDistinctValues,
(gt) -> false,
(gte) -> false);

@clairemcginty (Contributor, Author):

actually now that I think about it, this isn't accurate, since we don't know the distribution of values. I guess we could combine it with SizeStatistics to get the number of elements and work out the minimum size from there.

Member:

I don't think we can trust numDistinctValues because a row group might contain only a subset of entries from the dictionary.

@clairemcginty (Contributor, Author) commented Jan 17, 2025:

Yeah... I guess we can't do much here. The most we can infer is that the row group has >= dictionary.size() values spread out over an arbitrary number of elements. So all we could do is rule out size(0) predicates when dictionary.size() > 0.

@wgtmac (Member) commented Dec 13, 2024:

Thanks for adding this! This is a large PR that I need to take some time to review.

It would be good if @emkornfield @gszadovszky could take a look to see if this is a good use case for SizeStatistics.

@clairemcginty (Contributor, Author):

> Thanks for adding this! This is a large PR that I need to take some time to review.

thanks, no rush on reviewing it! 👍

@emkornfield (Contributor):

I can try to look in more detail, but stats can certainly be used here. I imagine they are most useful for repeated fields when trying to discriminate between repeated fields that mostly have 0 or 1 element, and trying to filter out cases with > 0 or 1 elements. E.g., if all fields have 0 observed rep levels of 1, then one knows for sure all lists are of length 0 or 1 (whether there are any lists of length 0 or 1 can be determined by inspecting the def-level histogram). For larger-cardinality lists the filtering power diminishes significantly (it's hard to distinguish, based on histograms alone, many very small lists from one very large one).
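
As a small numeric illustration of the rep-level point above (hypothetical histograms, not taken from any real file):

```java
import java.util.List;

// If the histogram records zero occurrences of repetition level >= 1, no list
// ever continues past its first element, so every list has length 0 or 1.
// (Whether any list has length 0 would then be read off the def-level histogram.)
public class ShortListSketch {
  static boolean allListsAtMostOneElement(List<Long> repLevelHistogram) {
    // Bucket 0 counts list starts; buckets >= 1 count continuations.
    return repLevelHistogram.subList(1, repLevelHistogram.size()).stream()
        .allMatch(count -> count == 0);
  }

  public static void main(String[] args) {
    System.out.println(allListsAtMostOneElement(List.of(10L, 0L))); // prints true
    System.out.println(allListsAtMostOneElement(List.of(10L, 4L))); // prints false
  }
}
```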

@wgtmac wgtmac changed the title PARQUET-34: implement Size() filter for repeated columns GH-1452: implement Size() filter for repeated columns Jan 3, 2025
@wgtmac (Member) left a comment:

Thanks for the effort! I just took an initial pass on it and left a couple of questions.

@@ -505,6 +505,82 @@ public <R> R filter(
}
}

public static final class Size implements FilterPredicate, Serializable {
public enum Operator {
Member:

Should we support notEqual for completeness, though not that useful?

Contributor:

I was thinking the same for the LogicalInverter above

Comment on lines +512 to +514
LTE,
GT,
GTE
Member:

Suggested change: LTE, GT, GTE → LE, GT, GE

IIRC, these are commonly used abbreviations?

Contributor:

// the column isn't in this file, so fail eq/gt/gte targeting size > 0
final boolean blockCannotMatch =
size.filter((eq) -> eq > 0, (lt) -> false, (lte) -> false, (gt) -> gt >= 0, (gte) -> gte > 0);
return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
Member:

Shouldn't it always return BLOCK_CANNOT_MATCH? What is the value of size(null list)? null or 0?

Member:

If the behavior depends on SQL dialect, perhaps we should be conservative to return BLOCK_MIGHT_MATCH instead.

return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
}

final SizeStatistics stats = metadata.getSizeStatistics();
Member:

Should we check whether the list type can be supported before proceeding? For example, perhaps we don't want to support legacy LIST-annotated 2-level structure and unannotated repeated field according to apache/parquet-format#466.




@clairemcginty (Contributor, Author):

> Thanks for the effort! I just took an initial pass on it and left a couple of questions.

Thanks for the review!! I should have time to address everything early next week at the latest 👍

@wgtmac (Member) commented Jan 17, 2025:

BTW, the level histogram might not be available when max_level is 0, because there is only a single level (i.e. 0) and its count can be deduced from num_values of the column chunk or page. That will complicate the size filter here.
