Bug 1931373 - Add FTS matching data #6531

Open
wants to merge 1 commit into
base: main

bendk
Contributor

@bendk bendk commented Dec 18, 2024

Added extra data to Suggestion::Fakespot to capture how the FTS match was made. The plan is to use this as a facet for our metrics to help us decide how to tune the matching logic (e.g. maybe we should not use stemming, or maybe we should require that terms are close together).

Added Suggest CLI flag to print out the full debug repr for suggestions. This provides an easy way to test the new functionality.
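
For readers following along, here is a rough sketch of what the match data could look like, inferred only from the debug output the CLI prints. The field names and the `Near`/`Medium`/`Far` variants appear in that output; the enum name `FtsTermDistance`, the derives, and the `debug_repr` helper are assumptions made for illustration and may not match the actual code in this PR.

```rust
/// How close together the matched terms were in the suggestion text.
/// The variant names come from the CLI output; the enum name is a guess.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FtsTermDistance {
    Near,
    Medium,
    Far,
}

/// Extra data describing how an FTS match was made.
#[derive(Debug, Clone, PartialEq, Eq)]
struct FtsMatchInfo {
    /// The last query term matched only as a prefix.
    prefix: bool,
    /// The match depended on stemming.
    stemming: bool,
    /// How far apart the matched terms were.
    term_distance: FtsTermDistance,
}

/// Hypothetical helper: render the derived `Debug` representation.
fn debug_repr(info: &FtsMatchInfo) -> String {
    format!("{:?}", info)
}
```

With a derived `Debug` impl, the one-line repr has the same shape as the `FtsMatchInfo { ... }` lines in the CLI output.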

Pull Request checklist

  • Breaking changes: This PR follows our breaking change policy
    • This PR follows the breaking change policy:
      • This PR has no breaking API changes, or
      • There are corresponding PRs for our consumer applications that resolve the breaking changes and have been approved
  • Quality: This PR builds and tests run cleanly
    • Note:
      • For changes that need extra cross-platform testing, consider adding [ci full] to the PR title.
      • If this pull request includes a breaking change, consider cutting a new release after merging.
  • Tests: This PR includes thorough tests or an explanation of why it does not
  • Changelog: This PR includes a changelog entry in CHANGELOG.md or an explanation of why it does not need one
    • Any breaking changes to Swift or Kotlin binding APIs are noted explicitly
  • Dependencies: This PR follows our dependency management guidelines
    • Any new dependencies are accompanied by a summary of the due diligence applied in selecting them.

Branch builds: add [firefox-android: branch-name] to the PR title.

@bendk bendk requested review from 0c0w3 and a team December 18, 2024 20:41
@bendk
Contributor Author

bendk commented Dec 18, 2024

I think this is working based on running some tests with the CLI:

I used variations on "running shoe" to check the stemming/prefix matching flags:


> cargo suggest query --fts-match-info "running shoe"

============================== Results  ==============================
* Brooks Running Shoe (http://amazon.com/dp/B0BJ131YJQ) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* Altra Running Shoe (http://amazon.com/dp/B01HNL5KI0) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* New Balance Running Shoe (http://amazon.com/dp/B01CQT3CGG) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* New Balance Trail Running Shoe (http://amazon.com/dp/B0C29H6LSW) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* HOKA ONE ONE Running Shoes (http://amazon.com/dp/B0B14G2MJN) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* WHITIN Barefoot Running Shoes | Minimalist, Zero Drop Sole (http://amazon.com/dp/B0CQX1YVK1) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* Salomon Trail Running Shoes | Gore-tex (http://amazon.com/dp/B0992GHCSX) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* SAGUARO Trail Running Shoes | Walking, Running, Minimalist (http://amazon.com/dp/B084DTQPDW) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* HUMTTO Hiking Shoes | Trail Running, Climbing, Breathable, Non-Slip (http://amazon.com/dp/B0B2VJ4L7P) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* Zappos - Official Site (https://www.zappos.com/?utm_source=admarketplace&utm_medium=sem_a&utm_campaign=Zappos&utm_term=Zappos&utm_content=319154514us46192024121815&mfadid=adm) (with icon)


> cargo suggest query --fts-match-info "running sho"

============================== Results  ==============================
* Brooks Running Shoe (http://amazon.com/dp/B0BJ131YJQ) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* Altra Running Shoe (http://amazon.com/dp/B01HNL5KI0) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* HOKA ONE ONE Running Shoes (http://amazon.com/dp/B0B14G2MJN) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* New Balance Running Shoe (http://amazon.com/dp/B01CQT3CGG) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* WHITIN Barefoot Running Shoes | Minimalist, Zero Drop Sole (http://amazon.com/dp/B0CQX1YVK1) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* Salomon Trail Running Shoes | Gore-tex (http://amazon.com/dp/B0992GHCSX) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* SAGUARO Trail Running Shoes | Walking, Running, Minimalist (http://amazon.com/dp/B084DTQPDW) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* New Balance Trail Running Shoe (http://amazon.com/dp/B0C29H6LSW) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* HUMTTO Hiking Shoes | Trail Running, Climbing, Breathable, Non-Slip (http://amazon.com/dp/B0B2VJ4L7P) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* Zappos - Official Site (https://www.zappos.com/?utm_source=admarketplace&utm_medium=sem_a&utm_campaign=Zappos&utm_term=Zappos&utm_content=319154514us46192024121815&mfadid=adm) (with icon)

> cargo suggest query --fts-match-info "run shoe"

============================== Results  ==============================
* Brooks Running Shoe (http://amazon.com/dp/B0BJ131YJQ) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* Altra Running Shoe (http://amazon.com/dp/B01HNL5KI0) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* New Balance Running Shoe (http://amazon.com/dp/B01CQT3CGG) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* New Balance Trail Running Shoe (http://amazon.com/dp/B0C29H6LSW) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* WHITIN Barefoot Running Shoes | Minimalist, Zero Drop Sole (http://amazon.com/dp/B0CQX1YVK1) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* Salomon Trail Running Shoes | Gore-tex (http://amazon.com/dp/B0992GHCSX) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* HOKA ONE ONE Running Shoes (http://amazon.com/dp/B0B14G2MJN) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* SAGUARO Trail Running Shoes | Walking, Running, Minimalist (http://amazon.com/dp/B084DTQPDW) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* HUMTTO Hiking Shoes | Trail Running, Climbing, Breathable, Non-Slip (http://amazon.com/dp/B0B2VJ4L7P) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }


> cargo suggest query --fts-match-info "run sho"
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.13s
     Running `target/debug/examples-suggest-cli query --fts-match-info 'run sho'`

============================== Results  ==============================
* Brooks Running Shoe (http://amazon.com/dp/B0BJ131YJQ) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* WHITIN Barefoot Running Shoes | Minimalist, Zero Drop Sole (http://amazon.com/dp/B0CQX1YVK1) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* Salomon Trail Running Shoes | Gore-tex (http://amazon.com/dp/B0992GHCSX) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* Altra Running Shoe (http://amazon.com/dp/B01HNL5KI0) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* HOKA ONE ONE Running Shoes (http://amazon.com/dp/B0B14G2MJN) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* SAGUARO Trail Running Shoes | Walking, Running, Minimalist (http://amazon.com/dp/B084DTQPDW) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* New Balance Running Shoe (http://amazon.com/dp/B01CQT3CGG) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* HUMTTO Hiking Shoes | Trail Running, Climbing, Breathable, Non-Slip (http://amazon.com/dp/B0B2VJ4L7P) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* New Balance Trail Running Shoe (http://amazon.com/dp/B0C29H6LSW) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }

I used these queries to test the term distance:


> cargo suggest query --fts-match-info "new shoe"

============================== Results  ==============================
* New Balance Baseball Shoe (http://amazon.com/dp/B08PCF4RWJ) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* New Balance Running Shoe (http://amazon.com/dp/B01CQT3CGG) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* New Balance Trail Running Shoe (http://amazon.com/dp/B0C29H6LSW) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* New Balance Track and Field Shoe (http://amazon.com/dp/B07HMK152T) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Medium }


> cargo suggest query --fts-match-info "mangrove holder"

============================== Results  ==============================
* Mangrove Pickleball Bag, Pickleball Backpack | Adjustable Sling, Upgraded Capacity, Safety Pocket, Water Bottle Holder (http://amazon.com/dp/B0972GTS8W) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Far }

@bendk bendk force-pushed the push-mollqwzklyms branch from 474496c to 1ffb901 Compare December 18, 2024 20:47
// All terms in a 5-term chunk
Medium,
// No 5-term chunk that contains all the terms
Far,
Contributor Author

The segments above were arbitrarily picked by me. Are there different numbers we should choose?

Contributor

I don't know, what's the intuition behind collecting this distance data?

3 seems kind of like a large distance to start with. If I'm reading the fts5 doc right, that means terms can have 3 words between them and the match will still be successful. I would think we'd want to start at 1 at most, maybe even 0?

Is there a reason for using an enum and not reporting the numeric distance itself, as like min_term_distance?

How hard would it be to add tests for non-near distances?

Contributor Author

@bendk bendk Dec 23, 2024

Good catch on this one. The docs confused me and I thought 3 meant the total number of words in the clump was 3. I changed these numbers to 1 and 3, which seems like a more reasonable start.

That said, I'm still not sure what's correct to test for. I'm open to changing "Near" to mean 0.

The one thing I don't think we can do is calculate the actual minimum term distance. AFAICT, there's no function for that; you just have to make a bunch of queries and see if they match or not. We could have variants like Adjacent=0, Medium=1 and Far=2 or greater, though.
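
To make the "bunch of queries" idea concrete, here is a minimal sketch of bucketing the distance by probing with FTS5 `NEAR(...)` expressions, using the thresholds 1 and 3 mentioned above. The `matches` closure is a stand-in for running a `MATCH` query against the database, and `near_expr` and `classify_distance` are hypothetical names, not functions from this PR. Term quoting and escaping are ignored here.

```rust
#[derive(Debug, PartialEq)]
enum TermDistance {
    Near,
    Medium,
    Far,
}

/// Build an FTS5 `NEAR(term1 term2 ..., N)` match expression.
fn near_expr(terms: &[&str], max_gap: u32) -> String {
    format!("NEAR({}, {})", terms.join(" "), max_gap)
}

/// Probe with progressively looser NEAR expressions and report the first
/// bucket that matches. `matches` is assumed to run the given expression
/// as an FTS `MATCH` query and return whether the row matched.
fn classify_distance(terms: &[&str], matches: impl Fn(&str) -> bool) -> TermDistance {
    if matches(&near_expr(terms, 1)) {
        TermDistance::Near
    } else if matches(&near_expr(terms, 3)) {
        TermDistance::Medium
    } else {
        TermDistance::Far
    }
}
```

Since each bucket costs one probe, the number of extra queries grows with the number of variants, which is one argument for keeping the enum small.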

@bendk bendk force-pushed the push-mollqwzklyms branch from 1ffb901 to 302ec01 Compare December 18, 2024 20:54

fn split_terms(phrase: &str) -> Vec<&str> {
    phrase
        .split([' ', '(', ')', ':', '^', '*', '"', ','])
Contributor Author

I added the `,` char to the list of things to split on. This way the comma in `Trail Running,` isn't included in the search terms. The FTS tokenizer ignores it, but it was messing up the stemming check logic.
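
As an illustration of the effect, here is a small self-contained sketch. Only the `split` call is visible in the diff above; the `filter` and `collect` steps are an assumption about how the rest of the function might look.

```rust
/// Split a phrase into search terms, dropping FTS syntax characters and commas.
/// Hypothetical completion: only the `.split(...)` call appears in the diff.
fn split_terms(phrase: &str) -> Vec<&str> {
    phrase
        .split([' ', '(', ')', ':', '^', '*', '"', ','])
        // Adjacent separators such as ", " produce empty fragments; skip them.
        .filter(|term| !term.is_empty())
        .collect()
}
```

With the comma in the separator list, `split_terms("Trail Running, Climbing")` yields `Trail`, `Running`, and `Climbing`, so the stemming check compares `Running` rather than `Running,`.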

@bendk bendk force-pushed the push-mollqwzklyms branch 3 times, most recently from 33d369c to e6880b6 Compare December 18, 2024 21:27
Contributor

@0c0w3 0c0w3 left a comment

Thanks, lgtm. It's too bad several extra queries may be needed to get the match info, but I'm guessing it's not a big problem (in terms of latency at least) since they should all be using indexes? You could probably do one big query with SELECT subqueries to get all the info in one, but it's probably not worth it.

// This is used when passing the keywords into an FTS search. It:
// - Strips out any `():^*"` chars. These are typically used for advanced searches, which
// we don't support and it would be weird to only support for FTS searches, which
// currently means Fakespot searches.
Contributor

Nit: I would remove the part about Fakespot so we don't need to remember to update this when we add more FTS suggestions.

pub fn sqlite_match_without_prefix_match(&self) -> &str {
    self.sqlite_match
        .strip_suffix('*')
        .unwrap_or(&self.sqlite_match)
Contributor

Instead of needing a method to compute this, couldn't you have a FtsQuery::sqlite_match_without_prefix_match string that you would initialize in new() as part of the prefix_match if-statement?
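
A rough sketch of that alternative, assuming a heavily simplified `FtsQuery`. The two field names follow the comment above, but the constructor signature and the `prefix_match` flag are placeholders for illustration, not the real API.

```rust
/// Simplified, hypothetical stand-in for the `FtsQuery` type under review.
pub struct FtsQuery {
    /// Match expression passed to SQLite; ends in `*` for prefix matches.
    pub sqlite_match: String,
    /// The same expression without the trailing prefix-match `*`.
    pub sqlite_match_without_prefix_match: String,
}

impl FtsQuery {
    /// `keywords` is assumed to be already sanitized; `prefix_match` is an
    /// assumed flag standing in for the real prefix-match condition.
    pub fn new(keywords: &str, prefix_match: bool) -> Self {
        let sqlite_match_without_prefix_match = keywords.to_string();
        // Both strings are built once, in the same if-statement.
        let sqlite_match = if prefix_match {
            format!("{}*", keywords)
        } else {
            sqlite_match_without_prefix_match.clone()
        };
        Self {
            sqlite_match,
            sqlite_match_without_prefix_match,
        }
    }
}
```

The trade-off is one extra `String` per query in exchange for not needing `strip_suffix` at each call site.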

    },
    Exposure {
        suggestion_type: String,
        score: f64,
    },
}

/// Additional data about how an FTS match was made(https://bugzilla.mozilla.org/show_bug.cgi?id=1931373)
Contributor

Nit: Missing a space in "made(" but I'd just leave out the bug URL. If we need to find the bug/PR where this change was made, we can check blame.

Added extra data to `Suggestion::Fakespot` to capture how the FTS match
was made. The plan is to use this as a facet for our metrics to help us
decide how to tune the matching logic (e.g. maybe we should not use
stemming, or maybe we should require that terms are close together).

Added Suggest CLI flag to print out the FTS match info.
@bendk bendk force-pushed the push-mollqwzklyms branch from e6880b6 to 074c301 Compare December 23, 2024 22:06
@bendk
Contributor Author

bendk commented Dec 23, 2024

I got distracted last week, but I'm picking it back up now. Thanks for the great review, I think this one is looking much better now.

@bendk
Contributor Author

bendk commented Dec 23, 2024

> Thanks, lgtm. It's too bad several extra queries may be needed to get the match info, but I'm guessing it's not a big problem (in terms of latency at least) since they should all be using indexes? You could probably do one big query with SELECT subqueries to get all the info in one, but it's probably not worth it.

I refactored this to use subqueries and I think it's actually cleaner this way, so let's go for it.

@bendk bendk requested a review from 0c0w3 December 27, 2024 16:16
Contributor

@0c0w3 0c0w3 left a comment

Thanks, lgtm! I was thinking we would do at most two SQL queries total: the main query to match suggestions, and then if that succeeds, a second query to get the match info. That way queries that don't match Fakespot at all -- which is most queries -- don't pay the extra cost of the match-info subqueries. Sorry for not making that clear. But I can't say if that would be worth it or not, so up to you if you want to stick with this or try that.
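
The two-query flow described above can be sketched as follows. Everything here is hypothetical: the closures stand in for the main suggestion query and the follow-up match-info query, and `MatchInfo` and `Suggestion` are cut-down placeholder types, not the ones in the component.

```rust
#[derive(Debug, Clone, PartialEq)]
struct MatchInfo {
    prefix: bool,
    stemming: bool,
}

#[derive(Debug, PartialEq)]
struct Suggestion {
    title: String,
    match_info: MatchInfo,
}

/// Run the main query first; only when it returns rows, run one follow-up
/// query for the match info. Queries that match nothing pay for one query only.
fn query_with_match_info(
    main_query: impl FnOnce() -> Vec<String>,
    match_info_query: impl FnOnce(&[String]) -> Vec<MatchInfo>,
) -> Vec<Suggestion> {
    let titles = main_query();
    if titles.is_empty() {
        // No match: skip the match-info query entirely.
        return Vec::new();
    }
    // Assumes the follow-up query returns one entry per title, in order.
    let infos = match_info_query(&titles);
    titles
        .into_iter()
        .zip(infos)
        .map(|(title, match_info)| Suggestion { title, match_info })
        .collect()
}
```

Whether this beats the single query with subqueries depends on how often queries match and how expensive the subqueries are, which would need measuring.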
