GH-1452: implement Size() filter for repeated columns #3098
base: master
Conversation
// If all values have repetition level 0, then no array has more than 1 element
if (repetitionLevelHistogram.size() == 1
    || repetitionLevelHistogram.subList(1, repetitionLevelHistogram.size()).stream()
        .allMatch(l -> l == 0)) {

  // Null list fields are treated as having size 0
  if (( // all lists are nulls
      definitionLevelHistogram.subList(1, definitionLevelHistogram.size()).stream()
          .allMatch(l -> l == 0))
      || // all lists are size 0
      (definitionLevelHistogram.get(0) == 0
          && definitionLevelHistogram.subList(2, definitionLevelHistogram.size()).stream()
              .allMatch(l -> l == 0))) {

    final boolean blockCannotMatch =
        size.filter((eq) -> eq > 0, (lt) -> false, (lte) -> false, (gt) -> gt >= 0, (gte) -> gte > 0);
    return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
  }

  long maxDefinitionLevel = definitionLevelHistogram.get(definitionLevelHistogram.size() - 1);

  // If all repetition levels are zero and all definition levels are > MAX_DEFINITION_LEVEL - 1,
  // all lists are of size 1
  if (definitionLevelHistogram.stream().allMatch(l -> l > maxDefinitionLevel - 1)) {
    final boolean blockCannotMatch = size.filter(
        (eq) -> eq != 1, (lt) -> lt <= 1, (lte) -> lte < 1, (gt) -> gt >= 1, (gte) -> gte > 1);

    return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
  }
}
long nonNullElementCount =
    repetitionLevelHistogram.stream().mapToLong(l -> l).sum() - definitionLevelHistogram.get(0);
long numNonNullRecords = repetitionLevelHistogram.get(0) - definitionLevelHistogram.get(0);

// Given the total number of elements and non-null fields, we can compute the max size of any array field
long maxArrayElementCount = 1 + (nonNullElementCount - numNonNullRecords);
final boolean blockCannotMatch = size.filter(
    (eq) -> eq > maxArrayElementCount,
    (lt) -> false,
    (lte) -> false,
    (gt) -> gt >= maxArrayElementCount,
    (gte) -> gte > maxArrayElementCount);

return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
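A quick numeric check of the bound in the hunk above (a self-contained sketch mirroring its arithmetic, not the PR's actual class): the worst case packs every "extra" element into one list on top of the single element each other non-null record contributes.

```java
import java.util.Arrays;
import java.util.List;

public class MaxListSizeBound {
    // Same arithmetic as the snippet under review: total non-null elements
    // minus one element per non-null record, plus one, bounds any single list.
    static long maxArrayElementCount(List<Long> repHistogram, List<Long> defHistogram) {
        long nonNullElementCount =
            repHistogram.stream().mapToLong(l -> l).sum() - defHistogram.get(0);
        long numNonNullRecords = repHistogram.get(0) - defHistogram.get(0);
        return 1 + (nonNullElementCount - numNonNullRecords);
    }

    public static void main(String[] args) {
        // 4 records (1 of them null) and 6 continuation elements at rep level 1:
        // 9 elements over 3 non-null lists, so no list can exceed 7 elements.
        System.out.println(
            maxArrayElementCount(Arrays.asList(4L, 6L), Arrays.asList(1L, 0L, 9L))); // 7
    }
}
```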
hopefully this is a faithful transcription of the logic outlined here: #1452 (comment)
IIUC, these are true for only un-nested list type, right?
  assertFalse(canDrop(size(nestedListColumn, Operators.Size.Operator.EQ, 0), columnMeta));
}

private static SizeStatistics createSizeStatisticsForRepeatedField(
I'm dynamically generating SizeStatistics for each test case, which does add a lot of LOC to the file. I could also just replace it with the precomputed SizeStatistics for each test case if that's simpler; I wrote it this way originally because I wasn't that confident in my ability to translate the striping algorithm by hand for all these cases 😅
public CountingValueInspector(ValueInspector delegate, Function<Long, Boolean> shouldUpdateDelegate) {
  this.observedValueCount = 0;
  this.delegate = delegate;
  this.shouldUpdateDelegate = shouldUpdateDelegate;
note: The shouldUpdateDelegate is needed since we don't want to terminate prematurely with a false positive. For example, if we're filtering on size(eq(3)) but the input array has 4 elements, we want to prevent the delegated Eq from returning true after it hits the third element because it thinks the condition is satisfied.
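A toy illustration of that false positive (illustrative only; naiveEq/gatedEq are hypothetical names, not the PR's classes): an Eq-style check that answers as soon as the running count reaches the target wrongly accepts a 4-element array for size(eq(3)), while deferring the verdict until the whole array is consumed gives the right answer.

```java
public class GateDemo {
    // Answers the moment the running count hits the target: later elements
    // are never seen, so a longer array produces a false positive.
    static boolean naiveEq(long[] elements, long target) {
        long seen = 0;
        for (long ignored : elements) {
            if (++seen == target) {
                return true; // premature verdict
            }
        }
        return false;
    }

    // Decides only once the count is final, which is what the gating
    // function in the constructor above is protecting.
    static boolean gatedEq(long[] elements, long target) {
        return elements.length == target;
    }

    public static void main(String[] args) {
        long[] four = {1, 2, 3, 4};
        System.out.println(naiveEq(four, 3)); // true  -> false positive
        System.out.println(gatedEq(four, 3)); // false -> correct
    }
}
```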
Perhaps this is worth a comment?
@@ -378,6 +379,11 @@ public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(Contains<T> conta
      indices -> IndexIterator.all(getPageCount()));
}

@Override
public PrimitiveIterator.OfInt visit(Size size) {
  return IndexIterator.all(getPageCount());
repetitionLevelHistogram and definitionLevelHistogram are both in scope here; should I repeat the logic from StatisticsFilter, or is that completely redundant?
Repeating the same logic is necessary if you want to support page-level filtering.
cool, I'll implement it 👍 To check my understanding: the rep- and def-level histograms we have access to here are implemented as flat List<Long>s and represent the levels for all pages concatenated together:
/**
 * Contains repetition level histograms for each page
 * concatenated together. The repetition_level_histogram field on
 * SizeStatistics contains more details.
 *
 * When present the length should always be (number of pages *
 * (max_repetition_level + 1)) elements.
 *
 * Element 0 is the first element of the histogram for the first page.
 * Element (max_repetition_level + 1) is the first element of the histogram
 * for the second page.
 **/
So I'll need to break up the flat lists into per-page histograms in order to perform per-page filtering here. But a comment in ColumnIndexBuilder indicates that we don't have access to maxRepetitionLevel here.
I guess, since all histograms across all pages are the same size and we know that {rep,def}LevelHistogram.size() % pageCount == 0, I could just divide the total histogram size by pageCount to get the size of each individual histogram?
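That slicing idea can be sketched as follows (a self-contained illustration under the stated assumption that every page contributes an equally sized histogram; names are hypothetical, not from the PR):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HistogramSlicer {
    // Splits the concatenated histogram into one fixed-width chunk per page;
    // the width recovered by division equals maxLevel + 1.
    static List<List<Long>> perPageHistograms(List<Long> flat, int pageCount) {
        if (pageCount <= 0 || flat.size() % pageCount != 0) {
            throw new IllegalArgumentException("histogram not evenly divisible across pages");
        }
        int width = flat.size() / pageCount;
        List<List<Long>> pages = new ArrayList<>();
        for (int i = 0; i < pageCount; i++) {
            pages.add(flat.subList(i * width, (i + 1) * width));
        }
        return pages;
    }

    public static void main(String[] args) {
        // Two pages with max_rep_level = 1: each page owns two counts.
        List<Long> flat = Arrays.asList(5L, 2L, 7L, 0L);
        System.out.println(perPageHistograms(flat, 2)); // [[5, 2], [7, 0]]
    }
}
```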
final boolean blockCannotMatch = size.filter(
    (eq) -> eq < numDistinctValues,
    (lt) -> lt <= numDistinctValues,
    (lte) -> lte < numDistinctValues,
    (gt) -> false,
    (gte) -> false);
actually now that I think about it, this isn't accurate, since we don't know the distribution of values. I guess we could combine it with SizeStatistics to get the number of elements and work out the minimum size from there.
I don't think we can trust numDistinctValues, because a row group might contain only a subset of entries from the dictionary.
yeah... I guess we can't do much here. The most we can infer is that the row group has >= dictionary.size() values spread out over an arbitrary number of elements. So all we could do is rule out size(0) predicates if dictionary.size() > 0.
Thanks for adding this! This is a large PR that I need to take some time to review. It would be good if @emkornfield @gszadovszky could take a look to see if this is a good use case for SizeStatistics.
thanks, no rush on reviewing it! 👍
I can try to look in more detail, but stats can certainly be used here. I imagine they are most useful for repeated fields when trying to discriminate between repeated fields that mostly have 0 or 1 element, and trying to filter out cases with > 0 or 1 elements. E.g. if all fields have 0 observed rep_levels of 1, then one knows for sure all lists are of length 0 or 1 (whether there are any lists of length 0 or 1 can be determined by inspecting the def level histogram). For larger-cardinality lists the filtering power diminishes significantly (it's hard to distinguish, based on histograms, between many very small lists and one very large one).
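That "length 0 or 1" inference can be checked with a tiny self-contained sketch (illustrative only, mirroring the condition from the StatisticsFilter hunk earlier in this thread): if every count above level 0 in the rep-level histogram is zero, no record ever continued a list, so every list has at most one element.

```java
import java.util.Arrays;
import java.util.List;

public class RepLevelCheck {
    // True when no value was recorded at repetition level > 0,
    // i.e. no list contains a second element.
    static boolean noListLongerThanOne(List<Long> repetitionLevelHistogram) {
        return repetitionLevelHistogram.size() == 1
            || repetitionLevelHistogram.subList(1, repetitionLevelHistogram.size()).stream()
                .allMatch(l -> l == 0);
    }

    public static void main(String[] args) {
        // 10 records, zero continuations: every list has size 0 or 1.
        System.out.println(noListLongerThanOne(Arrays.asList(10L, 0L))); // true
        // 10 records, 5 continuations: some list has at least 2 elements.
        System.out.println(noListLongerThanOne(Arrays.asList(10L, 5L))); // false
    }
}
```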
Thanks for the effort! I just took an initial pass on it and left a couple of questions.
@@ -505,6 +505,82 @@ public <R> R filter(
    }
}

public static final class Size implements FilterPredicate, Serializable {
  public enum Operator {
Should we support notEqual for completeness, though not that useful?
I was thinking the same for the LogicalInverter above
LTE,
GT,
GTE

Suggested change:

LE,
GT,
GE

IIRC, these are commonly used abbreviations?
I think every project uses a slightly different abbreviation: https://github.com/apache/iceberg/blob/c0bd4bfbceeaf3cb6e4ead675fcb47232361af3c/api/src/main/java/org/apache/iceberg/transforms/ProjectionUtil.java#L42-L61 👯
// the column isn't in this file, so fail eq/gt/gte targeting size > 0
final boolean blockCannotMatch =
    size.filter((eq) -> eq > 0, (lt) -> false, (lte) -> false, (gt) -> gt >= 0, (gte) -> gte > 0);
return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
Shouldn't it always return BLOCK_CANNOT_MATCH? What is the value of size(null list): null or 0?
If the behavior depends on SQL dialect, perhaps we should be conservative and return BLOCK_MIGHT_MATCH instead.
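To make the lambda semantics in the hunk above concrete: each lambda returns true when the block cannot match. With the column missing from the file, every record's list is empty (size 0 under this PR's convention), so only predicates satisfiable by size 0 survive. A self-contained restatement (hypothetical class name, same logic as the lambdas):

```java
public class MissingColumnDecision {
    // Each method mirrors one lambda: "true" means the block cannot match.
    static boolean blockCannotMatchEq(long eq)   { return eq > 0; }
    static boolean blockCannotMatchGt(long gt)   { return gt >= 0; }
    static boolean blockCannotMatchGte(long gte) { return gte > 0; }

    public static void main(String[] args) {
        System.out.println(blockCannotMatchEq(0)); // false: size(eq(0)) might match
        System.out.println(blockCannotMatchEq(2)); // true: no empty list has 2 elements
        System.out.println(blockCannotMatchGt(0)); // true: size 0 is never > 0
    }
}
```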
  return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
}

final SizeStatistics stats = metadata.getSizeStatistics();
Should we check whether the list type can be supported before proceeding? For example, perhaps we don't want to support legacy LIST-annotated 2-level structure and unannotated repeated field according to apache/parquet-format#466.
Thanks for the review!! I should have time to address everything early next week at the latest 👍
BTW, the level histogram might not be available when max_level is 0, because there is only a single level (i.e. 0) and its count can be deduced from the value count.
Rationale for this change

This PR continues the work outlined in #1452. It implements a size() predicate for filtering on the number of elements in repeated fields.

What changes are included in this PR?

Size() and not(size()) are implemented for all list fields with a required element type. Attempting to filter on a list of optional elements will throw an exception in the schema validator. This is because the existing record-level filtering setup (IncrementallyUpdatedFilterPredicateEvaluator) only feeds non-null values to the ValueInspectors; thus, if you had an array [1, 2, null, 4], it would only count 3 elements. I can file a ticket to support this eventually, but I think we'd have to rework the FilteringRecordMaterializer to be aware of repetition/definition levels.

The list group itself can be optional or required. Null lists are treated as having size 0. Again, this is due to the difficulty of disambiguating them at the record-level filtering step. (Would love feedback on both these design decisions!)

Are these changes tested?

Unit tests, plus a snapshot build tested locally against real datasets.

Are there any user-facing changes?

New Operators API

Part of #1452