Enable comparing the main facets module against the sandbox facets implementation #325

Open: wants to merge 3 commits into main

epotyom commented Jan 7, 2025

Auxiliary changes:

  • SearchTask#getFacetResultsMsec now includes the time to run #search, because in the sandbox module we can't measure search and facet computation times separately. This may be the right thing to do anyway, since otherwise we don't account for the time spent building doc ID sets in FacetsCollector.
  • Enable attaching a context to a Task, so that baseline and candidate can use different implementations for the same task and the same Lucene code
  • Fix the facet results overlap check: in Python, SearchTask's equals and hash methods compare facet requests, not results
  • Add a facetsWikimediumAll config that contains all taxonomy facets tasks from wikimediumall
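
To illustrate the overlap-check item, here is a minimal sketch of why equality on requests alone is not enough to validate results. All names (`FacetSearchTask`, `facet_mismatches`, the field names) are hypothetical and are not the actual classes in this repo. The assumption is that baseline and candidate tasks are paired by request, so the facet results have to be compared in a separate, explicit step.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class FacetSearchTask:
    """Hypothetical stand-in for a search task with facets."""
    query: str
    facet_requests: tuple          # e.g. ("Date",)
    facet_results: list = field(default_factory=list)

    def _key(self):
        # Identity of a task is its request, not what came back.
        return (self.query, self.facet_requests)

    def __eq__(self, other):
        return isinstance(other, FacetSearchTask) and self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


def facet_mismatches(baseline, candidate):
    """Pair tasks by request, then compare their facet results explicitly."""
    candidate_by_task = {task: task for task in candidate}
    mismatches = []
    for base_task in baseline:
        cand_task = candidate_by_task.get(base_task)
        if cand_task is not None and base_task.facet_results != cand_task.facet_results:
            mismatches.append((base_task, cand_task))
    return mismatches
```

With this shape, two tasks that ask for the same facets compare equal even when their counts differ, so relying on `==` alone would let a result mismatch pass unnoticed.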

Command to run:

python src/python/localrunFacets.py -source facetsWikimediumAll

Results on my laptop

Report after iter 19:

                            Task  QPS classic_facets      StdDev  QPS sandbox_facets      StdDev                Pct diff p-value
     BrowseRandomLabelTaxoFacets        5.99      (6.0%)        2.64      (1.6%)  -55.9% ( -59% -  -51%) 0.000
                        PKLookup      297.27      (5.8%)      251.45      (4.9%)  -15.4% ( -24% -   -5%) 0.000
         AndHighMedDayTaxoFacets      154.40      (3.2%)      153.00     (14.8%)   -0.9% ( -18% -   17%) 0.788
        AndHighHighDayTaxoFacets       13.13      (3.4%)       13.04     (11.7%)   -0.7% ( -15% -   14%) 0.792
          OrHighMedDayTaxoFacets       16.27      (5.5%)       19.52     (20.0%)   20.0% (  -5% -   48%) 0.000
       BrowseDayOfYearTaxoFacets        6.52      (7.7%)        9.30     (12.9%)   42.8% (  20% -   68%) 0.000
            BrowseDateTaxoFacets        6.48      (7.6%)        9.68     (13.9%)   49.3% (  25% -   76%) 0.000
           BrowseMonthTaxoFacets        6.21      (6.9%)        9.38     (15.1%)   51.0% (  27% -   78%) 0.000
            MedTermDayTaxoFacets       38.03      (3.9%)       71.68     (25.3%)   88.5% (  57% -  122%) 0.000

I'll look into why BrowseRandomLabelTaxoFacets and PKLookup regress.
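
For reading the table: the Pct diff column is consistent with the relative change in mean QPS from classic_facets (baseline) to sandbox_facets (candidate). The sketch below assumes that definition, and `pct_diff` is a hypothetical helper, not a function from the benchmark scripts. The QPS values are copied from the table above.

```python
def pct_diff(baseline_qps, candidate_qps):
    """Relative change of the candidate over the baseline, in percent."""
    return 100.0 * (candidate_qps - baseline_qps) / baseline_qps


# (classic_facets QPS, sandbox_facets QPS) taken from the report
rows = {
    "BrowseRandomLabelTaxoFacets": (5.99, 2.64),
    "PKLookup": (297.27, 251.45),
    "OrHighMedDayTaxoFacets": (16.27, 19.52),
    "MedTermDayTaxoFacets": (38.03, 71.68),
}

for task, (classic, sandbox) in rows.items():
    print(f"{task:>32} {pct_diff(classic, sandbox):+6.1f}%")
```

Negative values mean the sandbox implementation is slower for that task; positive values mean it is faster.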


epotyom commented Jan 9, 2025

BrowseRandomLabelTaxoFacets regresses because the RandomLabel.taxonomy field uses most of the sidecar taxonomy index, and it seems to be the only field for which dense counting (in an array) pays off: the total array (and taxonomy index) size is 1559911, and when counting is done only 3422 elements (0.2%) are still zero.

We can think about implementing dense counting for the sandbox facets module, but it looks like a rare use case to me: it requires not only a field with a large number of unique values, but also a MatchAllDocsQuery.
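
To make the trade-off concrete, here is a rough sketch of the two counting strategies. It is not the Lucene implementation; the function names are hypothetical and the real data structures differ. Dense counting allocates one slot per taxonomy ordinal up front, while sparse counting only stores ordinals that were actually seen.

```python
from collections import Counter


def count_dense(ordinals, taxonomy_size):
    """One array slot per ordinal: cheap increments, memory grows with the taxonomy."""
    counts = [0] * taxonomy_size
    for ordinal in ordinals:
        counts[ordinal] += 1
    return counts


def count_sparse(ordinals):
    """Hash-based counts: memory grows with the number of distinct ordinals seen."""
    return Counter(ordinals)


def zero_fraction(zero_slots, taxonomy_size):
    """Share of the dense array that was never touched."""
    return zero_slots / taxonomy_size


# Numbers from the comment above: 3422 of 1559911 slots stay zero.
print(f"{100 * zero_fraction(3422, 1559911):.2f}% of slots unused")
```

When almost every slot ends up non-zero, as in the RandomLabel.taxonomy case, the array wastes very little space and avoids hashing on every increment. When only a small part of the taxonomy is hit, the sparse form avoids allocating and scanning a large, mostly empty array.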

In any case, the result seems to be expected. Looking into the PKLookup regression now.
