Reduce duplication in taxonomy facets; always do counts #12966

stefanvodita · 2023-12-22T10:14:01Z

Note

This is a large change, refactoring most of the taxonomy facets code and changing internal behavior, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets.

What does this PR do well?

Moves most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in [DISCUSS] Identifying Gaps in Lucene’s Faceting #12553.
As a consequence, it introduces sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of Always collect sparsely in TaxonomyFacets & switch to dense if there are enough unique labels #12576.
It computes counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses Support getting counts from "association" facets [LUCENE-10246] #11282.
As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes Is it correct for facets to assume positive aggregation values? #12585.
It introduces the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in Compute multiple aggregations in one iteration of the match-set #12546.
It doesn't change the API. The only change in behavior users should notice is the fix for non-positive aggregation values, which were previously discarded.
It adds tests which were missing for sparse/dense values and non-positive aggregations.
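To illustrate the `count > 0` point from the list above, here is a minimal sketch. The class, method, and array names are hypothetical and are not the actual TaxonomyFacets fields. It assumes a sum aggregation in which one label received values that add up to a negative number.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; none of these names are Lucene APIs. */
class CountVersusValueCheck {

  /** Old-style check: a label counts as seen only if its aggregated value is positive. */
  static List<Integer> seenByValue(float[] values) {
    List<Integer> seen = new ArrayList<>();
    for (int ord = 0; ord < values.length; ord++) {
      if (values[ord] > 0) {
        seen.add(ord);
      }
    }
    return seen;
  }

  /** Count-based check: a label counts as seen if at least one document hit it. */
  static List<Integer> seenByCount(int[] counts) {
    List<Integer> seen = new ArrayList<>();
    for (int ord = 0; ord < counts.length; ord++) {
      if (counts[ord] > 0) {
        seen.add(ord);
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    // Ordinal 0 was never seen; ordinal 1 got values 2 and -5; ordinal 2 got values 1 and 3.
    float[] sums = {0f, -3f, 4f};
    int[] counts = {0, 2, 2};
    System.out.println("seen by value: " + seenByValue(sums));
    System.out.println("seen by count: " + seenByCount(counts));
  }
}
```

In this sketch the value-based check drops ordinal 1 even though two documents contributed to it, while the count-based check keeps it.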

What's not ideal about this approach?

We could see some performance decreases. The more critical part of the work, aggregating, should be unaffected. There are a few extra method calls / dispatches / branches. Ranking and collecting results might be impacted because we are boxing / unboxing results to / from Number to avoid the primitive types.
~~The way the TopOrdAndNumberQueues work is a bit awkward and inefficient. It required small changes to classes outside the scope of this change. Maybe we can come up with something better.~~

What is next?

I'd like to know if the approach makes sense to others.
We can try running some benchmarks to see if there are any performance changes.
Is it important to preserve a default aggregation value of the right type in the results (i.e. -1 for int aggregations, -1f for float aggregations)? If not, we can make a small simplification to always return -1.

github-actions · 2024-01-08T12:22:00Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

mikemccand

Net/net this looks like a great change to me -- removing tons of code dup, at a possible small perf hit due to added boxing/unboxing while collecting top N. I think the tradeoff is worth it, and we can watch the nightly benchy to see if facet performance was unduly impacted?

mikemccand · 2024-01-08T12:35:26Z

lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java

@@ -202,7 +202,7 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
          }
          reuse = q.insertWithOverflow(reuse);
          if (q.size() == topN) {
-            bottomCount = q.top().value;
+            bottomCount = (int) q.top().value;


Hmm why is this cast necessary? Oh -- I see, this value is now a Number. Hence the warning about added boxing/unboxing in hotspots here... thanks.

mikemccand · 2024-01-08T12:42:43Z

3. Is it important to preserve a default aggregation value of the right type in the results (i.e. -1 for int aggregations, -1f for float aggregations)? If not, we can make a small simplification to always return -1.

Maybe defer this to a separate issue? I can see callers expecting a consistent type, though, if you cast (float) Number where Number is an int, the cast would be fine.
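One caveat on casting is worth checking before relying on it. The sketch below uses hypothetical helper names and only standard Java. It assumes the default value arrives as a boxed `Integer` behind a `Number` reference. A primitive cast on a `Number` compiles to a cast to the matching wrapper followed by unboxing, so it only succeeds when the runtime type matches, whereas `Number.floatValue()` converts by value.

```java
/** Hypothetical helpers to compare two ways of reading a float out of a Number. */
class NumberCastSketch {

  /** Converts by value; works for Integer, Float, and other Number subclasses. */
  static float viaFloatValue(Number n) {
    return n.floatValue();
  }

  /** Casts to Float and unboxes; throws ClassCastException if n is not a Float. */
  static float viaCast(Number n) {
    return (float) n;
  }

  public static void main(String[] args) {
    Number boxedInt = -1; // autoboxed to Integer
    System.out.println(viaFloatValue(boxedInt));
    try {
      System.out.println(viaCast(boxedInt));
    } catch (ClassCastException e) {
      System.out.println("cast failed: the Number holds an Integer, not a Float");
    }
  }
}
```

Callers that want a consistent primitive type regardless of the boxed type may therefore prefer `intValue()` / `floatValue()` over a cast.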

stefanvodita · 2024-01-13T11:13:59Z

I found a fun Heisenbug in one of the tests. When we iterate cursors from IntFloatHashMap, the order is not deterministic. Float addition is not associative, so the result we get by aggregating the floats in the map can differ depending on the order in which we iterate. For a particular seed, running the test produced an unfavorable ordering, while running it under the debugger produced a favorable one. The test is fixed in the latest commit, and I've opened an issue to do Kahan summation over the floats instead, to reduce the error we're seeing.

For those who want to follow along, here are the exact numbers we are adding in the test in two orderings which produce different results:

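class FloatSumIsNotAssociative {
    public static void main(String[] args) {
        float x = 177182.61f;
        float y = 238089.27f;
        float z = 255214.66f;
        float acc;

        // Accumulate in the order x, y, z.
        acc = 0;
        acc += x;
        acc += y;
        acc += z;
        System.out.println(acc);

        // Accumulate the same values in the order z, y, x.
        // Rounding happens after each addition, so the result can differ.
        acc = 0;
        acc += z;
        acc += y;
        acc += x;
        System.out.println(acc);
    }
}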
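
For comparison, here is a minimal sketch of the Kahan (compensated) summation mentioned above, applied to the same three numbers. It is an illustration with hypothetical method names, not the change proposed in the follow-up issue. Kahan summation reduces the accumulated rounding error, but it does not guarantee order-independent results in general.

```java
/** Hypothetical sketch comparing naive and Kahan summation over floats. */
class KahanSumSketch {

  /** Plain left-to-right accumulation; rounds after every addition. */
  static float naiveSum(float[] values) {
    float acc = 0f;
    for (float v : values) {
      acc += v;
    }
    return acc;
  }

  /** Kahan summation: carries the rounding error of each step into the next one. */
  static float kahanSum(float[] values) {
    float sum = 0f;
    float compensation = 0f; // low-order bits lost by the previous addition
    for (float v : values) {
      float corrected = v - compensation;
      float next = sum + corrected;
      // (next - sum) is what was actually added; subtracting corrected leaves the error.
      compensation = (next - sum) - corrected;
      sum = next;
    }
    return sum;
  }

  public static void main(String[] args) {
    float[] forward = {177182.61f, 238089.27f, 255214.66f};
    float[] backward = {255214.66f, 238089.27f, 177182.61f};
    System.out.println("naive forward:  " + naiveSum(forward));
    System.out.println("naive backward: " + naiveSum(backward));
    System.out.println("kahan forward:  " + kahanSum(forward));
    System.out.println("kahan backward: " + kahanSum(backward));
  }
}
```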

stefanvodita · 2024-01-13T11:14:16Z

I've also run the benchmarks (python3 src/python/localrun.py -source wikimediumall). There is a measurable regression in the BrowseRandomLabelTaxoFacets task, but not in the other taxonomy tasks. The benchmarker also reports improvements in PKLookup, Wildcard, Respell, Fuzzy2, and Fuzzy1.

The regression in the taxo task is explained in the profiler. Boxing is not cheap:
11.24% 10402M java.lang.Integer#valueOf()

@mikemccand (thank you for the review!) - how should I interpret the other tasks which show a significant change? Are they just noisy?

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
     BrowseRandomLabelTaxoFacets        3.75      (1.8%)        3.53      (1.6%)   -6.0% (  -9% -   -2%) 0.000
          OrHighMedDayTaxoFacets        1.35      (7.4%)        1.31      (9.2%)   -2.7% ( -17% -   15%) 0.308
                          IntNRQ       21.64      (7.0%)       21.35      (7.4%)   -1.3% ( -14% -   14%) 0.561
                      AndHighLow      366.49     (11.2%)      362.21     (10.3%)   -1.2% ( -20% -   22%) 0.731
                    OrHighNotLow      271.40      (5.3%)      269.03      (4.5%)   -0.9% ( -10% -    9%) 0.573
                         LowTerm      604.77      (5.9%)      599.96      (4.8%)   -0.8% ( -10% -   10%) 0.640
                      TermDTSort      140.65      (2.3%)      139.58      (1.4%)   -0.8% (  -4% -    3%) 0.210
                     LowSpanNear        5.00      (2.8%)        4.96      (4.1%)   -0.7% (  -7% -    6%) 0.522
                    HighSpanNear        4.77      (3.0%)        4.74      (3.6%)   -0.7% (  -7% -    6%) 0.522
                     MedSpanNear       11.24      (2.1%)       11.18      (2.5%)   -0.6% (  -5% -    4%) 0.432
                       MedPhrase      242.61      (2.2%)      241.23      (2.0%)   -0.6% (  -4% -    3%) 0.386
                      HighPhrase       83.17      (2.1%)       82.75      (2.9%)   -0.5% (  -5% -    4%) 0.538
                   OrHighNotHigh      160.48      (4.5%)      159.81      (3.5%)   -0.4% (  -8% -    7%) 0.744
           HighTermDayOfYearSort      215.60      (2.2%)      214.81      (2.0%)   -0.4% (  -4% -    3%) 0.576
                 MedSloppyPhrase       14.07      (2.0%)       14.03      (2.4%)   -0.3% (  -4% -    4%) 0.655
                       LowPhrase       21.15      (1.3%)       21.09      (1.5%)   -0.3% (  -3% -    2%) 0.508
        AndHighHighDayTaxoFacets       10.49      (1.2%)       10.46      (1.6%)   -0.3% (  -3% -    2%) 0.547
                HighSloppyPhrase       13.80      (3.0%)       13.77      (3.1%)   -0.3% (  -6% -    5%) 0.791
                         MedTerm      479.88      (5.1%)      478.82      (4.8%)   -0.2% (  -9% -   10%) 0.887
                    OrHighNotMed      329.08      (4.5%)      328.39      (3.5%)   -0.2% (  -7% -    8%) 0.870
                        HighTerm      264.78      (5.3%)      264.27      (5.2%)   -0.2% ( -10% -   10%) 0.908
               HighTermMonthSort     1930.74      (4.4%)     1928.03      (5.2%)   -0.1% (  -9% -    9%) 0.926
                    OrNotHighMed      217.72      (2.9%)      217.51      (2.2%)   -0.1% (  -5% -    5%) 0.905
            MedTermDayTaxoFacets       16.72      (2.1%)       16.71      (1.7%)   -0.1% (  -3% -    3%) 0.892
       BrowseDayOfYearSSDVFacets        4.12      (2.7%)        4.11      (2.9%)   -0.1% (  -5% -    5%) 0.931
            BrowseDateTaxoFacets        4.68      (5.1%)        4.67      (4.6%)   -0.1% (  -9% -   10%) 0.970
                   OrNotHighHigh      231.09      (4.5%)      230.99      (3.5%)   -0.0% (  -7% -    8%) 0.975
         AndHighMedDayTaxoFacets       16.88      (1.1%)       16.88      (1.5%)   -0.0% (  -2% -    2%) 0.963
       BrowseDayOfYearTaxoFacets        4.76      (5.2%)        4.76      (4.6%)    0.0% (  -9% -   10%) 1.000
                    OrNotHighLow      464.54      (2.6%)      464.56      (2.3%)    0.0% (  -4% -    5%) 0.995
            HighIntervalsOrdered        1.81      (4.6%)        1.81      (5.0%)    0.0% (  -9% -   10%) 0.990
            HighTermTitleBDVSort        5.39      (4.8%)        5.40      (4.4%)    0.1% (  -8% -    9%) 0.968
           BrowseMonthSSDVFacets        4.40      (2.6%)        4.40      (2.6%)    0.1% (  -4% -    5%) 0.873
             MedIntervalsOrdered        1.84      (5.5%)        1.84      (5.8%)    0.2% ( -10% -   12%) 0.918
             LowIntervalsOrdered       32.12      (5.4%)       32.18      (5.6%)    0.2% ( -10% -   11%) 0.913
                       OrHighMed       67.77      (3.1%)       67.97      (3.4%)    0.3% (  -5% -    6%) 0.779
     BrowseRandomLabelSSDVFacets        2.89      (2.0%)        2.90      (1.4%)    0.3% (  -3% -    3%) 0.569
           BrowseMonthTaxoFacets        9.36     (10.9%)        9.40     (10.4%)    0.4% ( -18% -   24%) 0.896
               HighTermTitleSort      132.89      (1.9%)      133.56      (3.9%)    0.5% (  -5% -    6%) 0.600
                      OrHighHigh       20.24      (3.5%)       20.37      (3.9%)    0.6% (  -6% -    8%) 0.608
                      AndHighMed       81.65      (8.6%)       82.65      (9.8%)    1.2% ( -15% -   21%) 0.676
                 LowSloppyPhrase        4.92      (5.9%)        5.01      (6.4%)    1.6% ( -10% -   14%) 0.397
            BrowseDateSSDVFacets        1.20     (11.5%)        1.22      (9.1%)    2.1% ( -16% -   25%) 0.529
                         Prefix3      138.46      (4.9%)      141.54      (4.5%)    2.2% (  -6% -   12%) 0.138
                       OrHighLow      167.60      (7.5%)      171.65      (4.2%)    2.4% (  -8% -   15%) 0.211
                        PKLookup      169.39      (4.5%)      174.22      (4.5%)    2.9% (  -5% -   12%) 0.043
                     AndHighHigh       31.23      (9.5%)       32.15     (12.4%)    2.9% ( -17% -   27%) 0.399
                        Wildcard       66.79      (3.4%)       69.28      (3.6%)    3.7% (  -3% -   11%) 0.001
                         Respell       48.03      (2.0%)       50.35      (2.3%)    4.8% (   0% -    9%) 0.000
                          Fuzzy2       68.13      (1.3%)       71.67      (1.4%)    5.2% (   2% -    7%) 0.000
                          Fuzzy1       74.70      (1.5%)       79.47      (1.8%)    6.4% (   3% -    9%) 0.000

mikemccand · 2024-01-13T13:15:19Z

I found a fun HeisenBug in one of the tests.

Oh the joys of floating point math.

For those who want to follow along, here are the exact numbers we are adding in the test in two orderings which produce different results:

Thank you for diving deep here and making such a simple reproduction.

how should I interpret the other tasks which show a significant change? Are they just noisy?

Good question -- it makes no sense that e.g. Respell/Fuzzy1/2 got faster with this change, though the benchy seems to think it is significant (p=0.000). I'm not sure what to make of it!

mikemccand · 2024-01-13T13:16:26Z

The regression in the taxo task is explained in the profiler. Boxing is not cheap:
11.24% 10402M java.lang.Integer#valueOf()

Hmm this is sort of spooky -- should we aim to keep the specialization somehow (avoid the boxing)? Is there a middle ground where we can avoid the boxing but still remove much of / some of this duplicated code? Java is annoying sometimes :)

stefanvodita · 2024-01-14T06:33:27Z

I only rely on boxing for genericity when collecting results (getTop...), not while performing the aggregations themselves. Most of the taxonomy tasks are not showing a significant performance change. I wonder if the one that has slowed down spends more time collecting the aggregation values than calculating them.

Shradha26 · 2024-01-11T15:31:19Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java

+  /** Intermediate result to store top children for a given path before resolving labels, etc. */
+  record TopChildrenForPath(Number pathValue, int childCount, TopOrdAndNumberQueue childQueue) {}
+
+  private static class DimValue {


[nit] should we call this just Dim and String dimPath instead of String dim? I see later that we've used int dimValue and this is getting quickly overloaded?

I think we called it dim and not dimPath because it's just one label in the path, just the dimension, so it doesn't feel right to call it a path.

Shradha26 · 2024-01-11T15:58:35Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java

+
+  /** Get the aggregation value for this ordinal. */
+  protected Number getAggregationValue(int ordinal) {
+    // By default, this is just the count.


Can the default implementation of this method and getValue be the same as in IntTaxonomyFacets and FloatTaxonomyFacets, to reduce duplication further? FastTaxonomyFacetCounts can either extend IntTaxonomyFacets or apply this sort of count-based customisation to these methods.

It's a good point, but I think it's better for the default behaviour to be getting counts. We need the getAggregationValue level of abstraction to be able to call getValue with different signatures for IntTaxonomyFacets and FloatTaxonomyFacets.

Shradha26 · 2024-01-15T14:07:58Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java

+   * the aggregation values, keeping aggregation efficient.
+   */
+  protected void updateValueFromRollup(int ordinal, int childOrdinal) throws IOException {
+    setCount(ordinal, getCount(ordinal) + rollup(childOrdinal));


Shall we assume an aggregationFunction is passed in this parent class and implement this method similar to IntTaxonomyFacets and FloatTaxonomyFacets since this bit seems to be duplicated in both?

Further, FastTaxonomyFacetCounts can either override this with a count-based updateValueFromRollup, since it doesn't use an aggregation function, or continue to extend IntTaxonomyFacets.

Shradha26 · 2024-01-15T14:11:26Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java

@@ -67,6 +91,17 @@ public int compare(FacetResult a, FacetResult b) {
  /** Maps an ordinal to its parent, or -1 if there is no parent (root node). */
  final int[] parents;

+  /** Dense ordinal counts. */
+  int[] counts;


Can we make this Number[] values so that IntTaxonomyFacets and FloatTaxonomyFacets don't need to define their own values data structure and this class is generic?

It's important that IntTaxonomyFacets and FloatTaxonomyFacets have their own data structures for efficiency. This array only keeps counts, not other aggregations.

Shradha26 · 2024-01-15T14:15:04Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java

+  /** Apply an aggregation to the two values and return the result. */
+  protected Number aggregate(Number existingVal, Number newVal) {
+    // By default, we are computing counts, so the values are interpreted as integers and summed.
+    return (int) existingVal + (int) newVal;


Can we use the concept of an aggregation function while combining in this method? (In line with my previous comment about making the logic for IntTaxonomyFacets and FloatTaxonomyFacets the default.)

This is a tricky bit. You'll see that when we override, we do use an aggregation function, but the default implementation is to count.
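
The pattern described here can be sketched as follows. The default `aggregate` mirrors the diff hunk above, except that it uses `intValue()` instead of a cast. The subclass, the `FloatAggregation` interface, and all other names are hypothetical stand-ins, not the classes in this PR.

```java
/** Hypothetical sketch of a counting default with an aggregation-function override. */
class AggregateOverrideSketch {

  /** Stand-in for an aggregation function over float association values. */
  interface FloatAggregation {
    float aggregate(float existing, float incoming);
  }

  /** Base behavior: values are interpreted as integer counts and summed. */
  static class CountingFacets {
    protected Number aggregate(Number existingVal, Number newVal) {
      return existingVal.intValue() + newVal.intValue();
    }
  }

  /** Override: delegate to the configured aggregation function instead of counting. */
  static class FloatAssociationFacets extends CountingFacets {
    private final FloatAggregation aggregationFunction;

    FloatAssociationFacets(FloatAggregation aggregationFunction) {
      this.aggregationFunction = aggregationFunction;
    }

    @Override
    protected Number aggregate(Number existingVal, Number newVal) {
      return aggregationFunction.aggregate(existingVal.floatValue(), newVal.floatValue());
    }
  }

  public static void main(String[] args) {
    System.out.println(new CountingFacets().aggregate(2, 3));
    System.out.println(new FloatAssociationFacets(Math::max).aggregate(1.5f, 3.5f));
  }
}
```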

Shradha26 · 2024-01-15T16:04:03Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacetFloatAssociations.java

+          float currentValue = getValue(ord);
+          float newValue = aggregationFunction.aggregate(currentValue, value);
+          setValue(ord, newValue);
+          setCount(ord, getCount(ord) + 1);


Why do we want to always track counts too?

I think it has some nice advantages, e.g. it will resolve #11282 and #12585.

Shradha26 · 2024-01-15T16:08:24Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java

+    return new FacetResult(dim, path, aggregatedValue, labelValues, ordinals.size());
+  }
+
+  private TopOrdAndNumberQueue.OrdAndValue insertIntoQueue(


This is great! This bit was often duplicated. Can we make this a utility method, or maybe even an insert* method on the queue, so that StringValueFacetCounts and AbstractSortedSetDocValueFacetCounts can use it too?

Good point. Added to #13175, where we can target improvements related to the way we access these queues.

Shradha26 · 2024-01-15T16:24:44Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java

+   * Determine the top-n children for a specified dimension + path. Results are in an intermediate
+   * form.
+   */
+  protected TopChildrenForPath getTopChildrenForPath(DimConfig dimConfig, int pathOrd, int topN)


Let's add an abstract signature for this method to the Facets class?

I would like to avoid making API changes in this PR. It's an interesting question whether all Facets should have this.

epotyom · 2024-01-18T15:11:45Z

...e/facet/src/java/org/apache/lucene/facet/sortedset/AbstractSortedSetDocValueFacetCounts.java

+            bottomCount = (int) q.top().value;
+            bottomOrd = (int) q.top().value;


I wonder if we can remove these bottomX optimizations here and in other places, I think insertWithOverflow essentially does the same?

Good point, opened #13175.

epotyom · 2024-01-18T15:49:57Z

lucene/facet/src/java/org/apache/lucene/facet/TopOrdAndNumberQueue.java

+public abstract class TopOrdAndNumberQueue extends PriorityQueue<TopOrdAndNumberQueue.OrdAndValue> {
+
+  /** Holds a single entry. */
+  public static final class OrdAndValue {


Instead of making this class final and lessThan abstract, maybe we should make this class abstract, with abstract compare method which we implement separately for floats/ints/multi-aggregations? This way we can use primitive types in OrdAndValue implementations and hopefully reduce some boxing costs?
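
A rough sketch of what this suggestion could look like is below. All names are hypothetical and this is not the code that was merged. The tie-break on the ordinal is an assumption. Each entry type stores its value as a primitive and only boxes on demand.

```java
/** Hypothetical sketch: queue entries that compare primitives without boxing. */
class PrimitiveQueueEntries {

  /** Abstract entry; subclasses hold the value in a primitive field. */
  abstract static class OrdAndValue {
    final int ord;

    OrdAndValue(int ord) {
      this.ord = ord;
    }

    /** Returns true if this entry ranks below the other entry. */
    abstract boolean lessThan(OrdAndValue other);

    /** Boxes only when a caller explicitly asks for a Number, e.g. for final results. */
    abstract Number boxedValue();
  }

  static final class OrdAndInt extends OrdAndValue {
    final int value;

    OrdAndInt(int ord, int value) {
      super(ord);
      this.value = value;
    }

    @Override
    boolean lessThan(OrdAndValue other) {
      OrdAndInt o = (OrdAndInt) other;
      // Assumed tie-break: on equal values, the larger ordinal ranks lower.
      return value != o.value ? value < o.value : ord > o.ord;
    }

    @Override
    Number boxedValue() {
      return value;
    }
  }

  static final class OrdAndFloat extends OrdAndValue {
    final float value;

    OrdAndFloat(int ord, float value) {
      super(ord);
      this.value = value;
    }

    @Override
    boolean lessThan(OrdAndValue other) {
      OrdAndFloat o = (OrdAndFloat) other;
      int cmp = Float.compare(value, o.value);
      return cmp != 0 ? cmp < 0 : ord > o.ord;
    }

    @Override
    Number boxedValue() {
      return value;
    }
  }

  public static void main(String[] args) {
    System.out.println(new OrdAndInt(1, 5).lessThan(new OrdAndInt(2, 7)));
    System.out.println(new OrdAndFloat(3, 1.5f).lessThan(new OrdAndFloat(2, 1.5f)));
  }
}
```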

epotyom · 2024-01-18T15:58:16Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java

+
+    LabelAndValue[] labelValues = new LabelAndValue[q.size()];
+    int[] ordinals = new int[labelValues.length];
+    Number[] values = new Number[labelValues.length];


I believe using Number here and in LabelAndValue is one of the things that limits our ability to add new types of facet results, for example, multi-aggregate facets that you've mentioned. I'd suggest that we use generic <T> (which may need to implement Comparable?) in these classes, and use Integer, Float, etc in TaxonomyFacets implementation. This would require API changes though...

Maybe we can consider this separately? I hope we can avoid all API changes in this PR.

github-actions · 2024-02-02T00:12:44Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

stefanvodita · 2024-03-09T23:49:05Z

Thank you all for reviewing! I confirmed that the performance impact was from result collection, not from the aggregations themselves, and I've managed to claw back the performance hit. Most of the improvement comes from the changes to getTopChildrenForPath, which no longer uses intermediary Numbers. I've also integrated the performance-related suggestions from @epotyom (thank you for those!). I'll address the rest of the comments too; I just wanted to get this out while it's fresh to see if you all have more feedback on the performance front.

python3 src/python/localrun.py -source wikimediumall

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
            BrowseDateSSDVFacets        1.24      (6.6%)        1.21      (9.6%)   -2.5% ( -17% -   14%) 0.334
     BrowseRandomLabelTaxoFacets        3.76      (3.7%)        3.69      (3.5%)   -1.8% (  -8% -    5%) 0.120
                       MedPhrase       11.46      (2.8%)       11.30      (2.6%)   -1.3% (  -6% -    4%) 0.112
               HighTermMonthSort     2290.51      (4.4%)     2262.12      (4.2%)   -1.2% (  -9% -    7%) 0.360
                    OrHighNotMed      327.20      (3.3%)      323.36      (3.2%)   -1.2% (  -7% -    5%) 0.252
                    OrHighNotLow      318.99      (3.7%)      315.45      (4.2%)   -1.1% (  -8% -    7%) 0.377
                       LowPhrase        4.74      (3.1%)        4.69      (3.0%)   -1.0% (  -6% -    5%) 0.310
                   OrNotHighHigh      244.33      (3.1%)      242.52      (3.0%)   -0.7% (  -6% -    5%) 0.443
                   OrHighNotHigh      227.54      (2.9%)      225.86      (3.2%)   -0.7% (  -6% -    5%) 0.438
                    OrNotHighMed      333.78      (2.6%)      331.35      (2.8%)   -0.7% (  -5% -    4%) 0.391
                      HighPhrase       70.04      (3.2%)       69.53      (3.3%)   -0.7% (  -6% -    5%) 0.478
                     AndHighHigh       23.27      (7.9%)       23.11      (7.1%)   -0.7% ( -14% -   15%) 0.777
                        Wildcard       51.02      (4.3%)       50.71      (4.2%)   -0.6% (  -8% -    8%) 0.652
                     MedSpanNear       29.20      (3.0%)       29.05      (2.5%)   -0.5% (  -5% -    5%) 0.561
                        HighTerm      475.59      (4.1%)      473.22      (4.7%)   -0.5% (  -8% -    8%) 0.721
                        PKLookup      176.36      (3.0%)      175.50      (2.7%)   -0.5% (  -6% -    5%) 0.589
                    HighSpanNear       10.52      (2.7%)       10.47      (2.2%)   -0.4% (  -5% -    4%) 0.612
                         MedTerm      470.14      (4.4%)      468.33      (5.4%)   -0.4% (  -9% -    9%) 0.804
       BrowseDayOfYearSSDVFacets        4.08      (3.9%)        4.06      (4.2%)   -0.4% (  -8% -    8%) 0.775
                    OrNotHighLow      322.80      (2.9%)      321.71      (2.4%)   -0.3% (  -5% -    5%) 0.692
            HighIntervalsOrdered        3.60      (4.8%)        3.59      (4.8%)   -0.3% (  -9% -    9%) 0.868
                      AndHighMed       83.14      (3.5%)       82.93      (3.9%)   -0.2% (  -7% -    7%) 0.833
       BrowseDayOfYearTaxoFacets        4.69      (4.5%)        4.68      (4.4%)   -0.2% (  -8% -    9%) 0.902
            BrowseDateTaxoFacets        4.61      (4.5%)        4.60      (4.3%)   -0.1% (  -8% -    9%) 0.937
                         Respell       53.50      (2.2%)       53.46      (1.8%)   -0.1% (  -3% -    4%) 0.902
         AndHighMedDayTaxoFacets       43.57      (1.5%)       43.54      (1.6%)   -0.1% (  -3% -    3%) 0.891
                          Fuzzy1       66.17      (2.4%)       66.20      (2.0%)    0.0% (  -4% -    4%) 0.951
                      AndHighLow      525.57      (2.6%)      525.90      (4.2%)    0.1% (  -6% -    7%) 0.955
                       OrHighMed       76.00      (3.2%)       76.05      (3.9%)    0.1% (  -6% -    7%) 0.953
            HighTermTitleBDVSort        6.93      (7.3%)        6.94      (6.8%)    0.2% ( -13% -   15%) 0.943
             MedIntervalsOrdered        2.77      (3.6%)        2.78      (3.2%)    0.2% (  -6% -    7%) 0.883
                          Fuzzy2       43.83      (1.9%)       43.90      (1.7%)    0.2% (  -3% -    3%) 0.770
                     LowSpanNear        6.13      (2.1%)        6.14      (1.9%)    0.2% (  -3% -    4%) 0.785
                HighSloppyPhrase        5.52      (3.4%)        5.53      (3.7%)    0.2% (  -6% -    7%) 0.851
           BrowseMonthSSDVFacets        4.34      (5.1%)        4.35      (4.7%)    0.2% (  -9% -   10%) 0.891
                         Prefix3       68.56      (4.6%)       68.70      (6.0%)    0.2% (  -9% -   11%) 0.899
             LowIntervalsOrdered       18.33      (2.8%)       18.38      (2.5%)    0.3% (  -4% -    5%) 0.737
                 LowSloppyPhrase       20.67      (2.2%)       20.73      (1.9%)    0.3% (  -3% -    4%) 0.627
        AndHighHighDayTaxoFacets        7.57      (2.3%)        7.59      (2.5%)    0.3% (  -4% -    5%) 0.669
           HighTermDayOfYearSort      206.91      (2.9%)      207.68      (2.6%)    0.4% (  -5% -    6%) 0.670
               HighTermTitleSort      140.79      (1.6%)      141.32      (2.0%)    0.4% (  -3% -    3%) 0.508
                         LowTerm      438.67      (7.1%)      441.44      (7.9%)    0.6% ( -13% -   16%) 0.790
                 MedSloppyPhrase       21.78      (3.1%)       21.95      (3.4%)    0.8% (  -5% -    7%) 0.454
            MedTermDayTaxoFacets       21.51      (2.2%)       21.71      (1.6%)    0.9% (  -2% -    4%) 0.122
                      TermDTSort      118.13      (3.0%)      119.30      (3.4%)    1.0% (  -5% -    7%) 0.329
           BrowseMonthTaxoFacets        9.58      (8.6%)        9.68      (8.8%)    1.1% ( -14% -   20%) 0.691
     BrowseRandomLabelSSDVFacets        2.88      (2.3%)        2.91      (1.8%)    1.1% (  -2% -    5%) 0.093
                      OrHighHigh       33.81      (7.6%)       34.24      (8.4%)    1.3% ( -13% -   18%) 0.618
                       OrHighLow      319.44      (6.2%)      323.88      (3.9%)    1.4% (  -8% -   12%) 0.393
                          IntNRQ       27.52      (5.2%)       27.96      (5.9%)    1.6% (  -8% -   13%) 0.360
          OrHighMedDayTaxoFacets        2.83      (3.3%)        2.88      (5.2%)    1.6% (  -6% -   10%) 0.243

stefanvodita · 2024-03-14T13:54:06Z

@gsmiller - I know you may not have time to review, but I want to at least notify you, since this is a big change and you've been very involved in this area of the code.

github-actions · 2024-03-29T00:17:25Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

stefanvodita · 2024-03-29T07:02:50Z

Hi reviewers! This PR has become stale. Could anyone have a look at it? It has several nice improvements for taxonomy facets, with no API changes, and it sets us up to launch new features in a future release: multiple aggregations in one go and retrieving counts with aggregation facets.

mikemccand

Looks great -- thank you for clawing back that performance loss by adding a bit of non-generics specialization back. I like this compromise.

I left a minor comment, not a blocker for merging.

@stefanvodita I think you should merge this in a day or two if there's no more feedback? Lazy consensus ...

mikemccand · 2024-04-01T15:59:28Z

lucene/facet/src/java/org/apache/lucene/facet/TopOrdAndIntQueue.java

+    @Override
+    public boolean lessThan(OrdAndValue other) {
+      OrdAndInt otherOrdAndInt = (OrdAndInt) other;
+      if (value < otherOrdAndInt.value) {


You might use Integer.compare here -- not sure if it's actually faster. You'd still need to get the result and check if it's != 0 for the tiebreak (which could also be Integer.compare).

This is how Integer.compare is implemented:

public static int compare(int x, int y) { return (x < y) ? -1 : ((x == y) ? 0 : 1); }

And lessThan would become:

public boolean lessThan(OrdAndValue other) {
  OrdAndInt otherOrdAndInt = (OrdAndInt) other;
  int cmp = Integer.compare(value, otherOrdAndInt.value);
  if (cmp == 0) {
    // Tie-break on the ordinal.
    cmp = Integer.compare(otherOrdAndInt.ord, ord);
  }
  return cmp < 0;
}

I think we end up doing more comparisons overall? I might be missing something though.
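
To check that the two formulations agree, here is a small self-contained sketch. The method names are hypothetical, and the tie-break (larger ordinal ranks lower) is an assumption. It compares the branching form with the `Integer.compare` form over a small range of values and ordinals.

```java
/** Hypothetical sketch checking that the two lessThan formulations agree. */
class LessThanVariants {

  /** Branching form: at most two value comparisons, then one ordinal comparison. */
  static boolean branching(int value, int ord, int otherValue, int otherOrd) {
    if (value < otherValue) {
      return true;
    }
    if (value > otherValue) {
      return false;
    }
    return ord > otherOrd; // assumed tie-break: the larger ordinal ranks lower
  }

  /** Integer.compare form: each compare call performs up to two comparisons internally. */
  static boolean withCompare(int value, int ord, int otherValue, int otherOrd) {
    int cmp = Integer.compare(value, otherValue);
    if (cmp == 0) {
      cmp = Integer.compare(otherOrd, ord);
    }
    return cmp < 0;
  }

  /** Exhaustively compares both forms for all inputs in [-limit, limit]. */
  static boolean agreeOnRange(int limit) {
    for (int v = -limit; v <= limit; v++) {
      for (int o = -limit; o <= limit; o++) {
        for (int ov = -limit; ov <= limit; ov++) {
          for (int oo = -limit; oo <= limit; oo++) {
            if (branching(v, o, ov, oo) != withCompare(v, o, ov, oo)) {
              return false;
            }
          }
        }
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println("variants agree: " + agreeOnRange(3));
  }
}
```

The `Integer.compare` form adds a check of the result on top of the comparisons done inside `compare`, which is consistent with the concern above that it does more work overall. Whether the JIT compiles both forms to equivalent code is not something this sketch measures.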

stefanvodita · 2024-04-04T19:40:02Z

Thank you for reviewing @mikemccand! I had to rebase after #12966. I'll push tomorrow maybe if there are no objections.

stefanvodita · 2024-04-05T11:19:38Z

I did another benchmark run after the rebase just to make sure I haven't broken anything when integrating the split taxo arrays change. I see no significant changes.

python3 src/python/localrun.py -source wikimediumall

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
           BrowseMonthTaxoFacets        8.68      (8.6%)        8.41      (8.6%)   -3.1% ( -18% -   15%) 0.257
                      OrHighHigh       24.38      (4.8%)       24.09      (4.9%)   -1.2% ( -10% -    8%) 0.424
                     AndHighHigh       26.10      (4.6%)       25.80      (2.2%)   -1.1% (  -7% -    5%) 0.315
                        HighTerm      254.91      (7.0%)      252.20      (5.9%)   -1.1% ( -13% -   12%) 0.604
           HighTermDayOfYearSort      307.54      (2.0%)      305.21      (2.1%)   -0.8% (  -4% -    3%) 0.249
                    OrNotHighLow      506.28      (2.2%)      502.52      (2.6%)   -0.7% (  -5% -    4%) 0.327
                         LowTerm      497.25      (6.3%)      493.71      (5.7%)   -0.7% ( -11% -   12%) 0.709
                       OrHighMed      102.21      (3.8%)      101.52      (4.2%)   -0.7% (  -8% -    7%) 0.589
                         MedTerm      505.87      (6.8%)      502.44      (5.9%)   -0.7% ( -12% -   12%) 0.737
                      TermDTSort      130.10      (2.4%)      129.27      (2.0%)   -0.6% (  -4% -    3%) 0.359
                    OrHighNotLow      420.65      (3.9%)      418.28      (3.8%)   -0.6% (  -7% -    7%) 0.644
                      AndHighMed       89.03      (2.4%)       88.53      (1.4%)   -0.6% (  -4% -    3%) 0.365
     BrowseRandomLabelTaxoFacets        3.72      (1.8%)        3.70      (1.4%)   -0.5% (  -3% -    2%) 0.303
            HighTermTitleBDVSort       10.39      (4.7%)       10.34      (4.4%)   -0.4% (  -9% -    9%) 0.775
                         Prefix3      131.17      (2.0%)      130.64      (3.3%)   -0.4% (  -5% -    5%) 0.645
               HighTermTitleSort      155.59      (2.2%)      155.00      (2.2%)   -0.4% (  -4% -    4%) 0.590
          OrHighMedDayTaxoFacets        4.50      (5.4%)        4.49      (5.5%)   -0.4% ( -10% -   11%) 0.825
         AndHighMedDayTaxoFacets       17.89      (1.9%)       17.85      (1.5%)   -0.3% (  -3% -    3%) 0.636
            BrowseDateTaxoFacets        4.57      (1.8%)        4.56      (1.5%)   -0.3% (  -3% -    3%) 0.639
                      AndHighLow      677.34      (2.6%)      675.67      (1.8%)   -0.2% (  -4% -    4%) 0.729
                    OrHighNotMed      349.74      (3.7%)      348.93      (2.8%)   -0.2% (  -6% -    6%) 0.823
                   OrHighNotHigh      321.44      (3.1%)      320.71      (3.0%)   -0.2% (  -6% -    6%) 0.815
                   OrNotHighHigh      229.84      (2.9%)      229.33      (2.7%)   -0.2% (  -5% -    5%) 0.805
       BrowseDayOfYearTaxoFacets        4.63      (1.7%)        4.62      (1.5%)   -0.2% (  -3% -    3%) 0.675
                       OrHighLow      377.28      (1.3%)      376.48      (1.3%)   -0.2% (  -2% -    2%) 0.601
                       MedPhrase      447.55      (2.2%)      446.61      (2.6%)   -0.2% (  -4% -    4%) 0.781
        AndHighHighDayTaxoFacets        2.48      (3.9%)        2.47      (2.7%)   -0.2% (  -6% -    6%) 0.882
                    HighSpanNear        2.84      (2.2%)        2.84      (2.0%)   -0.1% (  -4% -    4%) 0.835
                        Wildcard      294.36      (2.4%)      293.99      (2.8%)   -0.1% (  -5% -    5%) 0.879
                          Fuzzy2       61.91      (1.2%)       61.85      (1.3%)   -0.1% (  -2% -    2%) 0.814
                     LowSpanNear       36.58      (1.9%)       36.56      (1.8%)   -0.1% (  -3% -    3%) 0.923
                       LowPhrase       41.87      (1.2%)       41.85      (1.6%)   -0.0% (  -2% -    2%) 0.925
            MedTermDayTaxoFacets       23.10      (2.5%)       23.10      (2.5%)    0.0% (  -4% -    5%) 0.991
                          Fuzzy1       88.20      (0.9%)       88.23      (1.3%)    0.0% (  -2% -    2%) 0.935
                         Respell       46.76      (1.8%)       46.77      (1.8%)    0.0% (  -3% -    3%) 0.950
                    OrNotHighMed      325.18      (2.3%)      325.71      (2.0%)    0.2% (  -4% -    4%) 0.811
                     MedSpanNear        6.23      (4.0%)        6.24      (3.8%)    0.2% (  -7% -    8%) 0.846
                      HighPhrase       20.42      (1.9%)       20.47      (2.8%)    0.3% (  -4% -    5%) 0.737
            HighIntervalsOrdered        9.90      (4.4%)        9.94      (2.9%)    0.4% (  -6% -    8%) 0.763
             LowIntervalsOrdered       14.11      (4.2%)       14.17      (2.4%)    0.4% (  -5% -    7%) 0.698
           BrowseMonthSSDVFacets        4.15      (1.5%)        4.17      (2.1%)    0.4% (  -3% -    4%) 0.438
                        PKLookup      190.68      (1.8%)      191.62      (1.7%)    0.5% (  -2% -    4%) 0.381
             MedIntervalsOrdered        4.54      (4.3%)        4.57      (2.9%)    0.5% (  -6% -    8%) 0.649
                HighSloppyPhrase       14.51      (2.0%)       14.62      (2.1%)    0.7% (  -3% -    4%) 0.243
     BrowseRandomLabelSSDVFacets        2.83      (6.1%)        2.85      (5.7%)    0.8% ( -10% -   13%) 0.674
                 LowSloppyPhrase       13.09      (2.1%)       13.20      (2.4%)    0.8% (  -3% -    5%) 0.231
               HighTermMonthSort     2155.96      (3.5%)     2177.02      (3.6%)    1.0% (  -5% -    8%) 0.382
       BrowseDayOfYearSSDVFacets        4.00      (2.2%)        4.05      (2.1%)    1.2% (  -3% -    5%) 0.073
                 MedSloppyPhrase       12.84      (4.2%)       13.04      (4.7%)    1.6% (  -7% -   10%) 0.260
            BrowseDateSSDVFacets        1.17      (9.3%)        1.19      (7.0%)    1.9% ( -13% -   20%) 0.458
                          IntNRQ       21.04     (26.3%)       22.13     (25.7%)    5.2% ( -37% -   77%) 0.531

stefanvodita · 2024-04-05T13:27:10Z

I'm finding this difficult to port to 9x because of the way the classes have diverged and I'm not sure it's worthwhile, since a lot of the benefits here are for future development and to support API changes that would go in Lucene 10. I'll move the CHANGES entries and milestones to Lucene 10 unless anyone thinks it's worth backporting.

mikemccand · 2024-05-10T16:28:43Z

Now that #12408 was backported in #13300 can we now backport this to 9.x? Or was it already done in an un-linked PR or so?

Remembering to backport is proving challenging and error-prone (it always has been), not just in all of us consistently agreeing on the criteria for backport (we should always aim to backport unless it breaks non-experimental/internal public APIs?), but also in actually remembering to do it after a PR is merged to main. I wish GH provided some stronger mechanisms for us here ...

This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets.

1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in apache#12553.
2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of apache#12576.
3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses apache#11282.
4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes apache#12585.
5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in apache#12546.
6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded.
7. Add tests which were missing for sparse/dense values and non-positive aggregations.
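To make the point about `count > 0` versus a positive aggregation value concrete, here is a minimal sketch. It is not code from this PR or from Lucene: the class `CountBasedFilter`, its methods, and the plain arrays indexed by ordinal are all hypothetical stand-ins for the real facet data structures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Lucene.
class CountBasedFilter {

  // Old-style check: an ordinal is treated as "seen" only if its aggregated
  // value is positive, so zero or negative aggregations are dropped.
  static List<Integer> seenByValue(float[] values) {
    List<Integer> seen = new ArrayList<>();
    for (int ord = 0; ord < values.length; ord++) {
      if (values[ord] > 0) {
        seen.add(ord);
      }
    }
    return seen;
  }

  // Count-based check: an ordinal is "seen" if at least one matching document
  // contributed to it, regardless of the sign of the aggregated value.
  static List<Integer> seenByCount(int[] counts) {
    List<Integer> seen = new ArrayList<>();
    for (int ord = 0; ord < counts.length; ord++) {
      if (counts[ord] > 0) {
        seen.add(ord);
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    // Ordinal 1 aggregates to a negative value and ordinal 3 to zero,
    // yet both were encountered (their counts are non-zero).
    float[] values = {0f, -2.5f, 3.0f, 0f};
    int[] counts = {0, 2, 1, 1};
    System.out.println(seenByValue(values)); // [2]
    System.out.println(seenByCount(counts)); // [1, 2, 3]
  }
}
```

With the value-based check, ordinals 1 and 3 would be silently discarded even though documents matched them; the count-based check keeps them, which is the behaviour change described in items 4 and 6.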

stefanvodita · 2024-05-10T22:23:01Z

I was just working on it today actually and finally got it in shape: #13358. Sorry it took so long!

#12966 (#13358) Reduce duplication in taxonomy facets; always do counts (#12966)

stefanvodita · 2024-05-14T10:07:01Z

I was skeptical this would work out at first, but I think we have a successful backport in the end, so the changes will go out with 9.11.

github-actions bot added the Stale label Jan 8, 2024

mikemccand approved these changes Jan 8, 2024

github-actions bot removed the Stale label Jan 9, 2024

Shradha26 reviewed Jan 15, 2024

epotyom reviewed Jan 18, 2024

github-actions bot added the Stale label Feb 2, 2024

stefanvodita mentioned this pull request Feb 15, 2024

Compute multiple float aggregations in one go #12547

Open

github-actions bot removed the Stale label Mar 10, 2024

stefanvodita mentioned this pull request Mar 10, 2024

Stop double-checking priority queue inserts #13175

Closed

stefanvodita mentioned this pull request Mar 22, 2024

Fix TestTaxonomyFacetValueSource.testRandom #13198

Open

github-actions bot added the Stale label Mar 29, 2024

github-actions bot removed the Stale label Mar 30, 2024

mikemccand approved these changes Apr 1, 2024

Reduce duplication in taxonomy facets; always do counts

f14bcba

stefanvodita force-pushed the assoc-facets-count branch from e18ff28 to f14bcba Compare April 4, 2024 19:38

Add CHANGES entries

417a138

stefanvodita merged commit 9ba4af7 into apache:main Apr 5, 2024
3 checks passed

This was referenced Apr 10, 2024

Don't assume non-zero aggregations in getTopDims or test checkResults #13287

Open

Backport to 9x: Initialize facet counting data structures lazily #12408 #13300

Merged

stefanvodita mentioned this pull request May 10, 2024

Backport to 9x: Reduce duplication in taxonomy facets; always do counts #12966 #13358

Merged

stefanvodita added this to the 9.11.0 milestone May 14, 2024

stefanvodita mentioned this pull request May 24, 2024

Allow users to retrieve counts from taxo association facets #13414

Merged

stefanvodita mentioned this pull request Sep 7, 2024

Have value and count in LabelAndValue only for TaxonomyFacets #13740

Closed

stefanvodita mentioned this pull request Nov 22, 2024

Taxonomy counts are incorrect due to ordinal sorting #14008

Closed

		bottomCount = (int) q.top().value;
		bottomOrd = (int) q.top().value;
