Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deterministic score tiebreaker #1608

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft

Conversation

missinglink
Copy link
Member

partner PR for pelias/query#130

@orangejulius
Copy link
Member

Huh, fun. Looks like this actually fails because loading the _id field from disk is expensive:

[exception] java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [3399601624/3.1gb], which is larger than the limit of [3373164134/3.1gb]]

@missinglink
Copy link
Member Author

missinglink commented Feb 28, 2022

Wow ok, problem with using _doc instead is it changes between builds.

@missinglink
Copy link
Member Author

I'd be interested to know which query caused that, I assumed sort was after hits are pruned

@missinglink
Copy link
Member Author

Or at least the second sort criteria was only a secondary sorting based on the top $size hits of the first criteria

@missinglink
Copy link
Member Author

CircuitBreakingException

https://gitlab.com/crossref/event_data_query/-/issues/33

@missinglink
Copy link
Member Author

missinglink commented Mar 1, 2022

reference for _id https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping-id-field.html
it mentions this:

In case sorting or aggregating on the _id field is required, it is advised to duplicate the content of the _id field into another field that has doc_values enabled.

It seems that since docvalues is not enabled for _id that the entire document needs to be loaded from disk to use this one field for sorting, which is clearly prohibitively expensive.

So there's a couple options if we'd like to pursue this:

  1. use [_score, _doc], this provides a secondary scoring index based on the document insertion order. This would be deterministic within a single build but not deterministic across multiple builds. My assumption is that this is 'cheap' since this _doc integer is probably already available within the results and wouldn't need to be fetched, but I don't think it really solves the problem, or at least half-solves a problem we might not have 🤔

  2. use some fields with docvalues, such as something like [_score, layer, source, source_id] (the first two being defined in the schema as keyword_with_doc_values and the latter being keyword which would need to change). This would achieve the desired result but would potentially cause a bit of disk spin and therefore a perf hit.

[edit] reversing the fields to have the less common terms first would be preferable!

@missinglink
Copy link
Member Author

My personal preference would be either 3. abandon this and move on /or 2. use docvalues fields and hope that the fields are small enough that they are always in RAM and therefore don't suffer a perf hit.

@orangejulius
Copy link
Member

Yeah, using [_score, layer, source, source_id] makes sense: the source_id field in our elasticsearch documents has been without a purpose for quite some time, since it's duplicated in the _id field and easily calculated. But if a separate field from _id is the only way to use docvalues, then that's perfect.

@missinglink
Copy link
Member Author

Seems we can probably use only source_id since it's almost unique we won't require any further sorting conditions.
The dilemma here is that using such a field will also use the most RAM 🤔

@orangejulius
Copy link
Member

Maybe we want to try a different approach then? What if we re-sorted results in the API after retrieving them from Elasticsearch? Sort first by score, then by gid. This will always be fast because it's sorting at most ~80 records at once (with size=40 and a factor for extra records to remove dupes), and won't require any changes to our ES queries or their performance characteristics.

If sorting by source_id in Elasticsearch turned out to be free, then that would also be fine, but I suspect it's not free.

@missinglink
Copy link
Member Author

Yeah I had a similar thought, the concern there is if the top n 'hits' returned from ES to the API were non-deterministic then it wouldn't really solve the core issue

@orangejulius
Copy link
Member

Ahh, good point. It would probably handle most cases but definitely not every case.

So lets try a build with source_id docvalues enabled, and see how bad it is?

If it's really bad maybe we do it API-side for now and work on introducing more scoring differentiators into the queries in the future?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants