-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce duplication in taxonomy facets; always do counts #12966
Conversation
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Net/net this looks like a great change to me -- removing tons of code dup, at a possible small perf hit due to added boxing/unboxing while collecting top N. I think the tradeoff is worth it, and we can watch the nightly benchy to see if facet performance was unduly impacted?
@@ -202,7 +202,7 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I | |||
} | |||
reuse = q.insertWithOverflow(reuse); | |||
if (q.size() == topN) { | |||
bottomCount = q.top().value; | |||
bottomCount = (int) q.top().value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm why is this cast necessary? Oh -- I see, this value
is now a Number
. Hence the warning about added boxing/unboxing in hotspots here... thanks.
Maybe defer this to a separate issue? I can see callers expecting a consistent type, though, if you cast |
I found a fun HeisenBug in one of the tests. When we iterate cursors from For those who want to follow along, here are the exact numbers we are adding in the test in two orderings which produce different results:
|
I've also run the benchmarks ( The regression in the taxo task is explained in the profiler. Boxing is not cheap: @mikecan (thank you for the review!) - how should I interpret the other tasks which show a significant change? Are they just noisy?
|
Oh the joys of floating point math.
Thank you for diving deep here and making such a simple reproduction.
Good question -- it makes no sense that e.g. |
Hmm this is sort of spooky -- should we aim to keep the specialization somehow (avoid the boxing)? Is there a middle ground where we can avoid the boxing but still remove much of / some of this duplicated code? Java is annoying sometimes :) |
What I've done is I've only taken advantage of the boxing for genericity when collecting results |
/** Intermediate result to store top children for a given path before resolving labels, etc. */ | ||
record TopChildrenForPath(Number pathValue, int childCount, TopOrdAndNumberQueue childQueue) {} | ||
|
||
private static class DimValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] should we call this just Dim
and String dimPath
instead of String dim
? I see later that we've used int dimValue
and this is getting quickly overloaded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we called it dim
and not dimPath
because it's just one label in the path, just the dimension, so it doesn't feel right to call it a path.
|
||
/** Get the aggregation value for this ordinal. */ | ||
protected Number getAggregationValue(int ordinal) { | ||
// By default, this is just the count. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the default implementation of this method and getValue
should be same as that in IntTaxonomyFacets
and FloatTaxonomyFacets
to reduce duplication further? FastTaxonomyFacets
can either extend from IntTaxonomyFacets
or do this sort of a count based customisation to these methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good point, but I think it's better for the default behaviour to be getting counts. We need the getAggregationValue
level of abstraction to be able to call getValue
with different signatures for IntTaxonomyFacets
and FloatTaxonomyFacets
.
* the aggregation values, keeping aggregation efficient. | ||
*/ | ||
protected void updateValueFromRollup(int ordinal, int childOrdinal) throws IOException { | ||
setCount(ordinal, getCount(ordinal) + rollup(childOrdinal)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we assume an aggregationFunction is passed in this parent class and implement this method similar to IntTaxonomyFacets
and FloatTaxonomyFacets
since this bit seems to be duplicated in both?
Further, FastTaxonomyFacetCounts
can either override this and do a count based updateValuefromRollup
since it doesn't use an aggregation function or even continue to extend from IntTaxonomyFacets
.
@@ -67,6 +91,17 @@ public int compare(FacetResult a, FacetResult b) { | |||
/** Maps an ordinal to its parent, or -1 if there is no parent (root node). */ | |||
final int[] parents; | |||
|
|||
/** Dense ordinal counts. */ | |||
int[] counts; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this Number[] values
so that IntTaxonomyFacets
and FloatTaxonomyFacets
don't need to define their own values
data structure and this class is generic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's important that IntTaxonomyFacets
and FloatTaxonomyFacets
have their own data structures for efficiency. This array here only keep counts and not other aggregations.
/** Apply an aggregation to the two values and return the result. */ | ||
protected Number aggregate(Number existingVal, Number newVal) { | ||
// By default, we are computing counts, so the values are interpreted as integers and summed. | ||
return (int) existingVal + (int) newVal; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use the concept of an aggregation function while combining in this method. (In line with my previous comment about making the logic for IntTaxonomyFacets
and FloatTaxonomyFacets
the default)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a tricky bit. You'll see that when we override, we do use an aggregation function, but the default implementation is to count.
float currentValue = getValue(ord); | ||
float newValue = aggregationFunction.aggregate(currentValue, value); | ||
setValue(ord, newValue); | ||
setCount(ord, getCount(ord) + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want to always track counts too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return new FacetResult(dim, path, aggregatedValue, labelValues, ordinals.size()); | ||
} | ||
|
||
private TopOrdAndNumberQueue.OrdAndValue insertIntoQueue( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! This bit was often duplicated. Can we make this a utility method or maybe even a method like insert*
method on the Queue so StringValueFacetCounts
and AbstractSortedSetDocValue
can use it too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Added to #13175, where we can target improvements related to the way we access these queues.
* Determine the top-n children for a specified dimension + path. Results are in an intermediate | ||
* form. | ||
*/ | ||
protected TopChildrenForPath getTopChildrenForPath(DimConfig dimConfig, int pathOrd, int topN) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add an abstract signature for this method to the Facets
class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to avoid making API changes in this PR. It's an interesting question whether all Facets
should have this.
bottomCount = (int) q.top().value; | ||
bottomOrd = (int) q.top().value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can remove these bottomX
optimizations here and in other places, I think insertWithOverflow
essentially does the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, opened #13175.
public abstract class TopOrdAndNumberQueue extends PriorityQueue<TopOrdAndNumberQueue.OrdAndValue> { | ||
|
||
/** Holds a single entry. */ | ||
public static final class OrdAndValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of making this class final and lessThan abstract, maybe we should make this class abstract, with abstract compare
method which we implement separately for floats/ints/multi-aggregations? This way we can use primitive types in OrdAndValue implementations and hopefully reduce some boxing costs?
|
||
LabelAndValue[] labelValues = new LabelAndValue[q.size()]; | ||
int[] ordinals = new int[labelValues.length]; | ||
Number[] values = new Number[labelValues.length]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe using Number
here and in LabelAndValue is one of the things that limits our ability to add new types of facet results, for example, multi-aggregate facets that you've mentioned. I'd suggest that we use generic <T>
(which may need to implement Comparable?) in these classes, and use Integer, Float, etc in TaxonomyFacets implementation. This would require API changes though...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can consider this separately? I hope we can avoid all API changes in this PR.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
Thank you all for reviewing! I confirmed that the performance impact was from result collection, not from the aggregations themselves, and I've managed to claw back the performance hit. Most of the improvement comes from the changes to
|
@gsmiller - I know you may not have time to review, but I want to at least notify you, since this is a big change and you've been very invovled in this area of the code. |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
Hi reviewers! This PR has become stale. Could anyone have a look at it? It has several nice improvements for taxonomy facets, with no API changes, and it sets us up to launch new features in a future release: multiple aggregations in one go and retrieving counts with aggregation facets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great -- thank you for clawing back that performance loss by adding a bit of non-generics specialization back. I like this compromise.
I left a minor comment, not a blocker for merging.
@stefanvodita I think you should merge this in a day or two if there's no more feedback? Lazy consensus ...
@Override | ||
public boolean lessThan(OrdAndValue other) { | ||
OrdAndInt otherOrdAndInt = (OrdAndInt) other; | ||
if (value < otherOrdAndInt.value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might use Integer.compare
here -- not sure if it's actually faster. You'd still need to get the result and check if it's != 0
for the tiebreak (which could also be Integer.compare
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how Integer.compare
is implemented:
public static int compare(int x, int y) {
return (x < y) ? -1 : ((x == y) ? 0 : 1);
}
And lessThan
would become:
public boolean lessThan(OrdAndValue other) {
OrdAndInt otherOrdAndInt = (OrdAndInt) other;
int cmp = Integer.compare(value, otherOrdAndInt.value);
if (cmp == 0) {
cmp = Integer.compare(otherOrdAndInt.value, ord);
}
return cmp < 0;
}
I think we end up doing more comparisons overall? I might be missing something though.
e18ff28
to
f14bcba
Compare
Thank you for reviewing @mikemccand! I had to rebase after #12966. I'll push tomorrow maybe if there are no objections. |
I did another benchmark run after the rebase just to make sure I haven't broken anything when integrating the split taxo arrays change. I see no significant changes.
|
I'm finding this difficult to port to 9x because of the way the classes have diverged and I'm not sure it's worthwhile, since a lot of the benefits here are for future development and to support API changes that would go in Lucene 10. I'll move the CHANGES entries and milestones to Lucene 10 unless anyone thinks it's worth backporting. |
Now that #12408 was backported in #13300 can we now backport this to 9.x? Or was it already done in an un-linked PR or so? Remembering to backport is proving challenging and error-proned (it always has been), not just in all of us consistently agreeing on the criteria for backport (we should always aim to backport unless it breaks non-experimental/internal public APIs?), but also in actually remembering to do it after a PR is merged to main. I wish GH provided some stronger mechanisms for us here ... |
This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets. 1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in apache#12553. 2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of apache#12576. 3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses apache#11282. 4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes apache#12585. 5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in apache#12546. 6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded. 7. Add tests which were missing for sparse/dense values and non-positive aggregations.
I was just working on it today actually and finally got it in shape: #13358. Sorry it took so long! |
#12966 (#13358) Reduce duplication in taxonomy facets; always do counts (#12966) This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets. 1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in #12553. 2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of #12576. 3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses #11282. 4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes #12585. 5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in #12546. 6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded. 7. Add tests which were missing for sparse/dense values and non-positive aggregations.
I was skeptical this would work out at first, but I think we have a successful backport in the end, so the changes will go out with 9.11. |
Note
This is a large change, refactoring most of the taxonomy facets code and changing internal behavior, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets.
What does this PR do well?
TaxonomyFacets
implementations toTaxonomyFacets
itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in [DISCUSS] Identifying Gaps in Lucene’s Faceting #12553.FloatTaxonomyFacets
, which previously used dense values always. This issue is part of Always collect sparsely in TaxonomyFacets & switch to dense if there are enough unique labels #12576.count > 0
), while previously we relied on the aggregation value to be positive. Closes Is it correct for facets to assume positive aggregation values? #12585.What's not ideal about this approach?
We could see some performance decreases. The more critical part of the work, aggregating, should be unaffected. There are a few extra method calls / dispatches / branches. Ranking and collecting results might be impacted because we are boxing / unboxing results to / fromNumber
to avoid the primitive types.The way theTopOrdAndNumberQueue
s work is a bit awkward and inefficient. It required small changes to classes outside the scope of this change. Maybe we can come up with something better.What is next?
Is it important to preserve a default aggregation value of the right type in the results (i.e.-1
for int aggregations,-1f
for float aggregations)? If not, we can make a small simplification to always return-1
.