
GH-1452: implement Size() filter for repeated columns #3098

Open · wants to merge 12 commits into base: master

Conversation

@clairemcginty (Contributor):

Rationale for this change

This PR continues the work outlined in #1452. It implements a size() predicate for filtering on the number of elements in repeated fields:

FilterPredicate hasThreeElements = size(intColumn("my_list_field"), Operators.Size.Operator.EQ, 3);

What changes are included in this PR?

Size() and not(size()) are implemented for all list fields with a required element type. Attempting to filter on a list of optional elements will throw an exception in the schema validator. This is because the existing record-level filtering setup (IncrementallyUpdatedFilterPredicateEvaluator) only feeds non-null values to the ValueInspectors; thus, if you had an array [1, 2, null, 4], it would only count 3 elements. I can file a ticket to support this eventually, but I think we'd have to rework the FilteringRecordMaterializer to be aware of repetition/definition levels.

The list group itself can be optional or required. Null lists are treated as having size 0. Again, this is due to difficulty disambiguating them at the record-level filtering step. (Would love feedback on both these design decisions!!)

Are these changes tested?

Unit tests + tested a snapshot build locally with real datasets

Are there any user-facing changes?

New Operators API

Part of #1452

Comment on lines +240 to +283
// If all values have repetition level 0, then no array has more than 1 element
if (repetitionLevelHistogram.size() == 1
|| repetitionLevelHistogram.subList(1, repetitionLevelHistogram.size()).stream()
.allMatch(l -> l == 0)) {

// Null list fields are treated as having size 0
if (( // all lists are nulls
definitionLevelHistogram.subList(1, definitionLevelHistogram.size()).stream()
.allMatch(l -> l == 0))
|| // all lists are size 0
(definitionLevelHistogram.get(0) == 0
&& definitionLevelHistogram.subList(2, definitionLevelHistogram.size()).stream()
.allMatch(l -> l == 0))) {

final boolean blockCannotMatch =
size.filter((eq) -> eq > 0, (lt) -> false, (lte) -> false, (gt) -> gt >= 0, (gte) -> gte > 0);
return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
}

long maxDefinitionLevel = definitionLevelHistogram.get(definitionLevelHistogram.size() - 1);

// If all repetition levels are zero and all definition levels are > MAX_DEFINITION_LEVEL - 1, all lists
// are of size 1
if (definitionLevelHistogram.stream().allMatch(l -> l > maxDefinitionLevel - 1)) {
final boolean blockCannotMatch = size.filter(
(eq) -> eq != 1, (lt) -> lt <= 1, (lte) -> lte < 1, (gt) -> gt >= 1, (gte) -> gte > 1);

return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
}
}
long nonNullElementCount =
repetitionLevelHistogram.stream().mapToLong(l -> l).sum() - definitionLevelHistogram.get(0);
long numNonNullRecords = repetitionLevelHistogram.get(0) - definitionLevelHistogram.get(0);

// Given the total number of elements and non-null fields, we can compute the max size of any array field
long maxArrayElementCount = 1 + (nonNullElementCount - numNonNullRecords);
final boolean blockCannotMatch = size.filter(
(eq) -> eq > maxArrayElementCount,
(lt) -> false,
(lte) -> false,
(gt) -> gt >= maxArrayElementCount,
(gte) -> gte > maxArrayElementCount);

return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
@clairemcginty (Contributor, Author):

hopefully this is a faithful transcription of the logic outlined here: #1452 (comment)
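
For readers following along, the worst-case bound in the final branch above can be illustrated with a tiny self-contained sketch. The class and method names here are illustrative, not the PR's actual StatisticsFilter code, and it assumes a single-level (un-nested) list column:

```java
import java.util.List;

// Illustrative sketch of the final-branch bound: given the rep/def level
// histograms of a single-level list column, compute the largest size any
// one list could possibly have.
public class SizeBoundSketch {
  static long maxPossibleListSize(List<Long> repLevelHistogram, List<Long> defLevelHistogram) {
    // Every value appears once in the repetition-level histogram, so the
    // non-null element count is the histogram sum minus the null lists
    // (which sit at definition level 0).
    long nonNullElementCount =
        repLevelHistogram.stream().mapToLong(Long::longValue).sum() - defLevelHistogram.get(0);
    // Values with repetition level 0 start a new record; subtracting the
    // null lists leaves the records whose list is actually present.
    long numNonNullRecords = repLevelHistogram.get(0) - defLevelHistogram.get(0);
    // Worst case: one list absorbs every element beyond one-per-record.
    return 1 + (nonNullElementCount - numNonNullRecords);
  }

  public static void main(String[] args) {
    // 3 records, 6 elements total, no nulls: the largest list holds at most 4.
    System.out.println(maxPossibleListSize(List.of(3L, 3L), List.of(0L, 0L, 6L))); // prints 4
  }
}
```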

Member:

IIUC, these are true for only un-nested list type, right?

assertFalse(canDrop(size(nestedListColumn, Operators.Size.Operator.EQ, 0), columnMeta));
}

private static SizeStatistics createSizeStatisticsForRepeatedField(
@clairemcginty (Contributor, Author):

I'm dynamically generating SizeStatistics for each test case, which does add a lot of LOC to the file; I could also just replace it with the computed SizeStatistics for each test case if that's simpler. I just wrote it this way originally because I wasn't that confident in my ability to translate the striping algorithm by hand for all these cases 😅

public CountingValueInspector(ValueInspector delegate, Function<Long, Boolean> shouldUpdateDelegate) {
this.observedValueCount = 0;
this.delegate = delegate;
this.shouldUpdateDelegate = shouldUpdateDelegate;
@clairemcginty (Contributor, Author):

Note: the shouldUpdateDelegate is needed since we don't want to terminate prematurely with a false positive. For example, if we're filtering on size(eq(3)) but the input array has 4 elements, we want to prevent the delegated Eq from returning true after it hits the third element because it thinks the condition is satisfied.
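
A minimal way to see the failure mode described here (hypothetical sketch; the real CountingValueInspector wraps a delegate ValueInspector rather than taking a predicate):

```java
import java.util.List;
import java.util.function.LongPredicate;

// Minimal illustration of why a size predicate must be decided at record end,
// not after each element.
public class SizeEvalSketch {
  static boolean evaluateAtRecordEnd(List<Integer> elements, LongPredicate sizePredicate) {
    long observedValueCount = 0;
    for (int ignored : elements) {
      observedValueCount++;
      // A naive eq(3) delegate consulted here would report a match at the
      // third element of [1, 2, 3, 4], even though the final size is 4.
    }
    // Only once the record is complete is the count conclusive.
    return sizePredicate.test(observedValueCount);
  }

  public static void main(String[] args) {
    LongPredicate sizeEq3 = n -> n == 3;
    System.out.println(evaluateAtRecordEnd(List.of(1, 2, 3, 4), sizeEq3)); // prints false
    System.out.println(evaluateAtRecordEnd(List.of(1, 2, 3), sizeEq3));    // prints true
  }
}
```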

Member:

Perhaps this is worth a comment?

@@ -378,6 +379,11 @@ public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(Contains<T> conta
indices -> IndexIterator.all(getPageCount()));
}

@Override
public PrimitiveIterator.OfInt visit(Size size) {
return IndexIterator.all(getPageCount());
@clairemcginty (Contributor, Author):

repetitionLevelHistogram and definitionLevelHistogram are both in scope here, should I repeat the logic from StatisticsFilter or is that completely redundant?

Member:

Repeating the same logic is necessary if you want to support page-level filtering.

@clairemcginty (Contributor, Author):

cool, I'll implement it 👍 To check my understanding, the rep- and def-level histograms we have access to here are implemented as flat List<Long> and represent the levels for all pages concatenated together:

  /**
   * Contains repetition level histograms for each page
   * concatenated together.  The repetition_level_histogram field on
   * SizeStatistics contains more details.
   *
   * When present the length should always be (number of pages *
   * (max_repetition_level + 1)) elements.
   *
   * Element 0 is the first element of the histogram for the first page.
   * Element (max_repetition_level + 1) is the first element of the histogram
   * for the second page.
   **/

So I'll need to break up the flat lists into per-page histograms in order to perform per-page filtering here. But a comment in ColumnIndexBuilder indicates that we don't have access to maxRepetitionLevel here.

I guess if all histograms across all pages are the same size and we know that {rep,def}LevelHistogram.size() % pageCount == 0, I could just divide the total histogram size by pageCount to get the size of each individual histogram?
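
The splitting step proposed above could look something like this sketch (assuming, as proposed, that the flat histogram length divides evenly by the page count; class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of splitting the flat concatenated histogram into per-page
// histograms, assuming every page contributes the same number of buckets
// (max_level + 1). Illustrative only.
public class HistogramSplitSketch {
  static List<List<Long>> splitByPage(List<Long> flatHistogram, int pageCount) {
    if (flatHistogram.size() % pageCount != 0) {
      throw new IllegalArgumentException("histogram length must be a multiple of the page count");
    }
    int bucketsPerPage = flatHistogram.size() / pageCount; // inferred max_level + 1
    List<List<Long>> perPage = new ArrayList<>();
    for (int page = 0; page < pageCount; page++) {
      perPage.add(flatHistogram.subList(page * bucketsPerPage, (page + 1) * bucketsPerPage));
    }
    return perPage;
  }

  public static void main(String[] args) {
    // 2 pages, max_repetition_level = 1: [5, 2] for page 0, [4, 3] for page 1
    System.out.println(splitByPage(List.of(5L, 2L, 4L, 3L), 2)); // prints [[5, 2], [4, 3]]
  }
}
```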

@clairemcginty clairemcginty marked this pull request as ready for review December 6, 2024 18:06
Comment on lines 515 to 521
final boolean blockCannotMatch = size.filter(
(eq) -> eq < numDistinctValues,
(lt) -> lt <= numDistinctValues,
(lte) -> lte < numDistinctValues,
(gt) -> false,
(gte) -> false);

@clairemcginty (Contributor, Author):

actually now that I think about it, this isn't accurate, since we don't know the distribution of values. I guess we could combine it with SizeStatistics to get the number of elements and work out the minimum size from there.

Member:

I don't think we can trust numDistinctValues because a row group might contain only a subset of entries from the dictionary.

@clairemcginty (Contributor, Author) commented Jan 17, 2025:

Yeah... I guess we can't do much here. The most we can infer is that the row group has >= dictionary.size() values spread out over an arbitrary number of elements. So all we could do is rule out size(0) predicates when dictionary.size() > 0.

@wgtmac (Member) commented Dec 13, 2024:

Thanks for adding this! This is a large PR that I need to take some time to review.

It would be good if @emkornfield @gszadovszky could take a look to see if this is a good use case for SizeStatistics.

@clairemcginty (Contributor, Author):

> Thanks for adding this! This is a large PR that I need to take some time to review.

thanks, no rush on reviewing it! 👍

@emkornfield (Contributor):

I can try to look in more detail, but stats can certainly be used here. I imagine they are most useful for repeated fields when trying to discriminate between repeated fields that mostly have 0 or 1 element, and trying to filter out cases with > 0 or 1 elements. E.g., if all fields have 0 observed rep levels of 1, then one knows for sure all lists are of length 0 or 1 (whether there are any lists of length 0 or 1 can be determined by inspecting the def-level histogram). For larger-cardinality lists the filtering power diminishes significantly (it's hard to distinguish, based on histograms alone, many very small lists from one very large one).
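
As a small numeric illustration of the rep-level point above (hypothetical histograms, not taken from any real file):

```java
import java.util.List;

// If the histogram records zero occurrences of repetition level >= 1, no list
// ever continues past its first element, so every list has length 0 or 1.
// (Whether any list has length 0 would then be read off the def-level histogram.)
public class ShortListSketch {
  static boolean allListsAtMostOneElement(List<Long> repLevelHistogram) {
    // Bucket 0 counts list starts; buckets >= 1 count continuations.
    return repLevelHistogram.subList(1, repLevelHistogram.size()).stream()
        .allMatch(count -> count == 0);
  }

  public static void main(String[] args) {
    System.out.println(allListsAtMostOneElement(List.of(10L, 0L))); // prints true
    System.out.println(allListsAtMostOneElement(List.of(10L, 4L))); // prints false
  }
}
```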

@wgtmac wgtmac changed the title PARQUET-34: implement Size() filter for repeated columns GH-1452: implement Size() filter for repeated columns Jan 3, 2025
@wgtmac (Member) left a comment:

Thanks for the effort! I just took an initial pass on it and left a couple of questions.

@@ -505,6 +505,82 @@ public <R> R filter(
}
}

public static final class Size implements FilterPredicate, Serializable {
public enum Operator {
Member:

Should we support notEqual for completeness, though not that useful?

Contributor:

I was thinking the same for the LogicalInverter above

Comment on lines +512 to +514
LTE,
GT,
GTE
Member:

Suggested change: LTE, GT, GTE → LE, GT, GE

IIRC, these are commonly used abbreviations?

Contributor:

// the column isn't in this file, so fail eq/gt/gte targeting size > 0
final boolean blockCannotMatch =
size.filter((eq) -> eq > 0, (lt) -> false, (lte) -> false, (gt) -> gt >= 0, (gte) -> gte > 0);
return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
Member:

Shouldn't it always return BLOCK_CANNOT_MATCH? What is the value of size(null list)? null or 0?

Member:

If the behavior depends on SQL dialect, perhaps we should be conservative to return BLOCK_MIGHT_MATCH instead.

return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
}

final SizeStatistics stats = metadata.getSizeStatistics();
Member:

Should we check whether the list type can be supported before proceeding? For example, perhaps we don't want to support legacy LIST-annotated 2-level structure and unannotated repeated field according to apache/parquet-format#466.




@clairemcginty (Contributor, Author):

> Thanks for the effort! I just took an initial pass on it and left a couple of questions.

Thanks for the review!! I should have time to address everything early next week at the latest 👍

@wgtmac (Member) commented Jan 17, 2025:

BTW, the level histogram might not be available when max_level is 0, because there is only a single level (i.e. 0) and its count can be deduced from num_values of the column chunk or page. That will complicate the size filter here.
