Enable comparing main facets module against sandbox facets implementation #325

epotyom · 2025-01-07T23:58:43Z

Auxiliary changes:

SearchTask#getFacetResultsMsec now includes the time to run #search, because in the sandbox module we can't measure search and facet compute times separately. This may be the right thing to do anyway, since otherwise we don't account for the time spent building doc ID sets in FacetsCollector.
Enable attaching a context to a Task, so that baseline and candidate can use different implementations for the same task and the same Lucene code.
Fix the facets result overlap check: in Python, SearchTask's equals and hash methods now compare facets requests, not results.
Add a facetsWikimediumAll config that contains all taxonomy facets tasks from wikimediumall.
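
To illustrate the overlap-check fix, here is a minimal sketch of equality and hashing keyed on the facet requests rather than on the computed results. The class name `SketchSearchTask` and its fields are hypothetical and are not the actual SearchTask in the benchmark code.

```python
class SketchSearchTask:
    """Hypothetical stand-in for a search task; not the real SearchTask."""

    def __init__(self, query, facet_requests, facet_results=None):
        self.query = query
        # What was asked for, e.g. ("Date", "Month").
        self.facet_requests = tuple(facet_requests)
        # What came back; may legitimately differ between implementations.
        self.facet_results = facet_results or {}

    def _key(self):
        # Identity depends only on the query and the requested facets.
        return (self.query, self.facet_requests)

    def __eq__(self, other):
        if not isinstance(other, SketchSearchTask):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


baseline = SketchSearchTask("body:lucene", ["Date"], {"Date": [("2010", 7)]})
candidate = SketchSearchTask("body:lucene", ["Date"], {"Date": [("2010", 8)]})
same_task = baseline == candidate
```

With this keying, the same task run by two facet implementations can be matched up first, and the results compared in a separate step.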

Command to run:

python src/python/localrunFacets.py -source facetsWikimediumAll

Results on my laptop

Report after iter 19:

                            Task  QPS classic_facets  StdDev  QPS sandbox_facets  StdDev                Pct diff  p-value
     BrowseRandomLabelTaxoFacets        5.99      (6.0%)        2.64      (1.6%)  -55.9% ( -59% -  -51%) 0.000
                        PKLookup      297.27      (5.8%)      251.45      (4.9%)  -15.4% ( -24% -   -5%) 0.000
         AndHighMedDayTaxoFacets      154.40      (3.2%)      153.00     (14.8%)   -0.9% ( -18% -   17%) 0.788
        AndHighHighDayTaxoFacets       13.13      (3.4%)       13.04     (11.7%)   -0.7% ( -15% -   14%) 0.792
          OrHighMedDayTaxoFacets       16.27      (5.5%)       19.52     (20.0%)   20.0% (  -5% -   48%) 0.000
       BrowseDayOfYearTaxoFacets        6.52      (7.7%)        9.30     (12.9%)   42.8% (  20% -   68%) 0.000
            BrowseDateTaxoFacets        6.48      (7.6%)        9.68     (13.9%)   49.3% (  25% -   76%) 0.000
           BrowseMonthTaxoFacets        6.21      (6.9%)        9.38     (15.1%)   51.0% (  27% -   78%) 0.000
            MedTermDayTaxoFacets       38.03      (3.9%)       71.68     (25.3%)   88.5% (  57% -  122%) 0.000
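
As a sanity check on the Pct diff column, the sketch below recomputes it from the printed QPS means, assuming it is the relative change of the sandbox_facets QPS against the classic_facets QPS. The function name `pct_diff` is hypothetical. Because the inputs are already rounded to two decimals, some rows may come out a few tenths of a percent away from the report.

```python
def pct_diff(baseline_qps, candidate_qps):
    """Relative change of candidate vs. baseline, in percent."""
    return 100.0 * (candidate_qps - baseline_qps) / baseline_qps


# (classic_facets QPS, sandbox_facets QPS) copied from the report above.
rows = {
    "BrowseRandomLabelTaxoFacets": (5.99, 2.64),
    "PKLookup": (297.27, 251.45),
    "MedTermDayTaxoFacets": (38.03, 71.68),
}

recomputed = {task: round(pct_diff(b, c), 1) for task, (b, c) in rows.items()}
for task, value in recomputed.items():
    print(f"{task:>32} {value:+.1f}%")
```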

I'll look into why BrowseRandomLabelTaxoFacets and PKLookup regress.


epotyom · 2025-01-09T16:25:44Z

BrowseRandomLabelTaxoFacets regresses because the RandomLabel.taxonomy field is the one that uses most of the sidecar taxonomy index, and it seems to be the only field for which dense counting (in an array) pays off: the total array (and taxonomy index) size is 1559911, and when counting is done, only 3422 (0.2%) of the elements are still zero.

We could consider implementing dense counting for the sandbox facets module, but it looks like a rare use case to me: not only do we need a field with a large number of unique values, but the query also needs to be a MatchAllDocsQuery.

In any case, the result seems to be expected. Looking into the PKLookup regression now.
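
The trade-off can be sketched as follows. This is only an illustration: the function names are hypothetical, and the sparse variant (a map keyed by ordinal) is my assumption about the alternative to dense counting, not something stated about the sandbox module's internals.

```python
from collections import defaultdict


def count_dense(ordinals, taxonomy_size):
    """Dense counting: one array slot per taxonomy ordinal."""
    counts = [0] * taxonomy_size
    for ordinal in ordinals:
        counts[ordinal] += 1
    return counts


def count_sparse(ordinals):
    """Sparse counting: only ordinals that actually occur get an entry."""
    counts = defaultdict(int)
    for ordinal in ordinals:
        counts[ordinal] += 1
    return dict(counts)


def zero_fraction(dense_counts):
    """Fraction of array slots never touched during counting."""
    return sum(1 for c in dense_counts if c == 0) / len(dense_counts)


# Toy input: ordinals seen across matching documents.
seen = [0, 2, 2, 3]
dense = count_dense(seen, taxonomy_size=5)
sparse = count_sparse(seen)

# Numbers from the comment above: 3422 zero slots out of 1559911.
reported_zero_pct = 100.0 * 3422 / 1559911
```

When almost every slot ends up non-zero, as in the reported case, the array avoids per-entry map overhead. When only a few ordinals are hit, the sparse form avoids allocating and scanning a mostly empty array.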

epotyom added 3 commits January 8, 2025 02:09

Minor follow up changes

3299a4c

localrunFacets comment updates

c446471
