[DISCO-2503] Suggest: Pocket suggestion ingestion #5841
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## main #5841 +/- ##
==========================================
- Coverage 36.91% 36.83% -0.08%
==========================================
Files 346 347 +1
Lines 33430 33501 +71
==========================================
Hits 12340 12340
- Misses 21090 21161 +71
28c7e12 to 45cd72e
Thank you so much, Tif—and thanks for making the changes since yesterday. Gentle reminder to bump the schema version and migration function in I tested your patch out against the real dataset in Remote Settings, and saw some unique constraint errors on both
(I dropped the The ones with different suggestion IDs— But the ones with multiple copies of the same suggestion ID are more interesting. It really looks like their keywords arrays have duplicates!
We can fix that by using either a /cc @ncloudioj @0c0w3 for visibility, too! |
Yes, we can easily dedup keywords for each suggestion at the ingestion phase over here (BTW, Pocket suggestions are ingested by this Merino job instead of quicksuggest-rs). Do we also want to dedup across the low- and high-confidence keywords for each suggestion? @0c0w3 What do you think?
Looks good, added a few comments for your consideration.
components/suggest/src/db.rs
Outdated
@@ -123,14 +124,29 @@ impl<'a> SuggestDao<'a> {

    /// Fetches suggestions that match the given keyword from the database.
    pub fn fetch_by_keyword(&self, keyword: &str) -> Result<Vec<Suggestion>> {
        let pocket_keyword = match keyword.contains(' ') {
            true => keyword.to_string(),
            false => keyword.to_owned() + " ",
This reads as: if the keyword doesn't contain a space, we append one to the end. Hmm, why do we need this extra trailing space for Pocket keywords?
Also, a nitpick: it's a little inconsistent to use to_string() in one arm and to_owned() in the other. Maybe use Cow instead:
use std::borrow::Cow;
let mut pocket_keyword = Cow::from(keyword);
if !keyword.contains(' ') {
    pocket_keyword.to_mut().push(' ');
}
With Drew's comment, now I understand what's going on here. I'd recommend adding some comments that outline how keyword lookup works for each suggestion type; to me, it's much harder to work that out from the SQL query alone.
Oooh, I really like Nan's suggestion of using Cow that way!
But I think that this logic won't handle single-word low-confidence keywords correctly. For example, I don't think that "workaholism" will match the Pocket suggestion in the test now, but it should. Is that right?
IIUC, another way to write our algorithm for matching low-confidence keywords is:
- The head (first word) of the query must be the same as the low-confidence keyword's head.
- The tail (anything after the first word) of the query (if any) should be a prefix of the keyword's tail.
So all these queries should match the Pocket suggestion: "workaholism", "soft" (first word of "soft life"), "soft l", "soft li" (first words match; "l" and "li" are prefixes of "life").
And these shouldn't, because their first words don't match: "work", "sof", etc.
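The head/tail rule above can be sketched as a small pure function. This is only an illustration of the matching rule as described in this comment, not the code in the patch; the names split_head_tail and matches_low_confidence are hypothetical.

```rust
/// Splits a string on its first space into (head, tail).
/// The tail is empty when the string contains no space.
/// (Hypothetical helper, named for this sketch only.)
fn split_head_tail(s: &str) -> (&str, &str) {
    s.split_once(' ').unwrap_or((s, ""))
}

/// Returns true if `query` should match the low-confidence `keyword`:
/// the first words must be identical, and whatever follows the first
/// word of the query must be a prefix of the rest of the keyword.
fn matches_low_confidence(query: &str, keyword: &str) -> bool {
    let (query_head, query_tail) = split_head_tail(query);
    let (keyword_head, keyword_tail) = split_head_tail(keyword);
    query_head == keyword_head && keyword_tail.starts_with(query_tail)
}
```

An empty query tail is a prefix of every keyword tail, which is why typing only the first word is enough to match.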
One way we could make this work is to store the head and tail in separate columns when we insert them into the table. Then, when we fetch suggestions, we could split the user's query the same way, and match the head and tail prefix.
Here's a quick sketch of how that all could work! Let's say we have two suggestions, and all their low-confidence keywords start with breakfast:
title: "How to Build a Better Breakfast"
keywords: ["breakfast"]
title: "Your New Favorite Breakfast Sandwich"
keywords: ["breakfast crunch wrap", "breakfast torta"]
If we stored those suggestions and keywords in a schema like this:
CREATE TABLE suggestions(id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE keywords(keyword_head TEXT NOT NULL, keyword_tail TEXT NOT NULL, suggestion_id INTEGER NOT NULL REFERENCES suggestions(id), PRIMARY KEY (keyword_head, keyword_tail));
INSERT INTO suggestions(id, title) VALUES(1, 'How to Build a Better Breakfast');
INSERT INTO suggestions(id, title) VALUES(2, 'Your New Favorite Breakfast Sandwich');
INSERT INTO keywords(keyword_head, keyword_tail, suggestion_id) VALUES('breakfast', '', 1);
INSERT INTO keywords(keyword_head, keyword_tail, suggestion_id) VALUES('breakfast', 'crunch wrap', 2);
INSERT INTO keywords(keyword_head, keyword_tail, suggestion_id) VALUES('breakfast', 'torta', 2);
Here's how a query plan for "breakfast" would look:
sqlite> SELECT s.id, s.title FROM suggestions s JOIN keywords k ON k.suggestion_id = s.id WHERE k.keyword_head = 'breakfast' AND k.keyword_tail BETWEEN '' AND '' || x'ffff' GROUP BY s.id, s.title;
QUERY PLAN
|--SEARCH k USING INDEX sqlite_autoindex_keywords_1 (keyword_head=? AND keyword_tail>? AND keyword_tail<?)
|--SEARCH s USING INTEGER PRIMARY KEY (rowid=?)
`--USE TEMP B-TREE FOR GROUP BY
1|How to Build a Better Breakfast
2|Your New Favorite Breakfast Sandwich
(To play around with that schema in the sqlite3 shell, I saved the schema into a file, then ran sqlite3 -init file.sql to load it).
Since "Your New Favorite Breakfast Sandwich" has two keywords that start with breakfast, we need a GROUP BY or a DISTINCT to filter out duplicates, or we'll get two rows for it, one for each of its keywords. But the search over keywords can use all columns of the primary key (that's the sqlite_autoindex_keywords_1 automatic index), so the most expensive part of the query is still pretty efficient. By the time we build the temporary b-tree for the GROUP BY, we should have reduced the number of rows to a small handful.
The query plan for "breakfast cru" looks the same, and finds just the one suggestion:
sqlite> SELECT s.id, s.title FROM suggestions s JOIN keywords k ON k.suggestion_id = s.id WHERE k.keyword_head = 'breakfast' AND k.keyword_tail BETWEEN 'cru' AND 'cru' || x'ffff' GROUP BY s.id, s.title;
QUERY PLAN
|--SEARCH k USING INDEX sqlite_autoindex_keywords_1 (keyword_head=? AND keyword_tail>? AND keyword_tail<?)
|--SEARCH s USING INTEGER PRIMARY KEY (rowid=?)
`--USE TEMP B-TREE FOR GROUP BY
2|Your New Favorite Breakfast Sandwich
And so does "breakfast torta":
sqlite> SELECT s.id, s.title FROM suggestions s JOIN keywords k ON k.suggestion_id = s.id WHERE k.keyword_head = 'breakfast' AND k.keyword_tail BETWEEN 'torta' AND 'torta' || x'ffff' GROUP BY s.id, s.title;
QUERY PLAN
|--SEARCH k USING INDEX sqlite_autoindex_keywords_1 (keyword_head=? AND keyword_tail>? AND keyword_tail<?)
|--SEARCH s USING INTEGER PRIMARY KEY (rowid=?)
`--USE TEMP B-TREE FOR GROUP BY
2|Your New Favorite Breakfast Sandwich
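To make the shape of that index lookup concrete (equality on the head, then a range scan over the tail), here is a rough in-memory analogue using a sorted map. It is only a sketch: KeywordIndex, sample_index, and lookup are hypothetical names, the BTreeMap stands in for the primary-key index, and take_while with starts_with replaces the upper bound that the SQL expresses with x'ffff'.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the keywords table:
/// (keyword_head, keyword_tail) -> suggestion_id, kept in sorted order.
type KeywordIndex = BTreeMap<(String, String), u32>;

/// Builds the three keyword rows from the example above.
fn sample_index() -> KeywordIndex {
    let mut index = KeywordIndex::new();
    index.insert(("breakfast".into(), "".into()), 1);
    index.insert(("breakfast".into(), "crunch wrap".into()), 2);
    index.insert(("breakfast".into(), "torta".into()), 2);
    index
}

/// Returns the distinct suggestion IDs whose keyword head equals `head`
/// and whose keyword tail starts with `tail_prefix`.
fn lookup(index: &KeywordIndex, head: &str, tail_prefix: &str) -> Vec<u32> {
    // Seek to the first key at or after (head, tail_prefix), then walk
    // forward only while keys still share the head and the tail prefix.
    let start = (head.to_string(), tail_prefix.to_string());
    let mut ids: Vec<u32> = index
        .range(start..)
        .take_while(|((h, t), _)| h.as_str() == head && t.starts_with(tail_prefix))
        .map(|(_, id)| *id)
        .collect();
    // Plays the role of GROUP BY / DISTINCT in the SQL version.
    ids.sort_unstable();
    ids.dedup();
    ids
}
```

As in the query plans, the scan touches only the keys inside the matching range, and deduplication runs over that small result.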
Wow, that's some really cool SQL magic there! 👏
I'd like to offer another approach: How about we move this qualifying logic over to the row handler closure?
We currently only fetch data from other tables and then assemble the suggestion accordingly in that closure. I think we can also use that as the post-processor for each candidate. Specifically, the post-processor can transform, filter, and augment each row if needed. Two benefits I can see here:
- In certain cases, it's much easier to do things in native code than in SQL.
- It keeps the keywords table schema lean and generic. If we ever need to change the matching logic for low-confidence keywords, we can still do prefix queries in SQL and apply the more specific matching logic in the post-processor. No schema change needed!
WDYT?
How about we move this qualifying logic over to the row handler closure?
Yes!
Filtering in the row handler is definitely a great option, with a small result set. For example, if our query returns 5 rows, and we filter all of them out—that's OK, because we only had to scan and read 5 rows. But if our query returns hundreds or thousands of rows, filtering in the row handler isn't as efficient—we still end up scanning and reading in all those rows, and then filtering them out in code.
For low-confidence Pocket suggestions, this is a bit tricky to do naively, because we want the whole first word to match, and only match prefixes for subsequent words. If we wanted to keep the schema super generic, with a single keyword column, I think we'd need to do a prefix match unconditionally. So, if the user searches for "b" or "br", we'd return suggestions for all keywords that start with "b" or "br", and then filter out all of them, because "b" and "br" aren't whole words.
We could combine the approaches, though: a less generic schema, but also doing more work in Rust code instead of SQL:
- Split each keyword into keyword_head and keyword_tail (I'm realizing keyword_prefix and keyword_suffix might actually be better names, so I'll use that instead of "head" and "tail" now! 😅), like I suggested, and...
- Filter on just keyword_prefix in the SQL query, then filter on keyword_suffix in the row handler, like you suggested!
I think this would also line up nicely with @0c0w3's suggestion to use a single table to store the keywords, and have a column for whether it's low or high-confidence—and doing the filtering after.
So we'd have a schema like:
CREATE TABLE pocket_keywords(
    keyword_prefix TEXT NOT NULL,
    keyword_suffix TEXT NOT NULL DEFAULT '',
    confidence_type INTEGER NOT NULL,
    suggestion_id INTEGER NOT NULL REFERENCES suggestions(id),
    PRIMARY KEY (keyword_prefix, keyword_suffix, suggestion_id)
);
With the data from Drew's example like:
keyword_prefix | keyword_suffix | confidence_type
---------------+----------------+----------------
esther | the wonder pig | LOW
wildlife | stories | HIGH
(Assuming LOW and HIGH are integer constants there).
And then, in the row handler, we'd have logic like:
let suffixes_match = match row.get("confidence_type")? {
    LOW => row.get("keyword_suffix")?.starts_with(user_keyword_suffix),
    HIGH => row.get("keyword_suffix")? == user_keyword_suffix,
};
Let's say the user typed in "esther th". Splitting the user's query on the first space, the prefix is esther, and the suffix is th. The SQL query for that would look like this:
SELECT s.*, k.keyword_suffix, k.confidence_type
FROM suggestions s
JOIN pocket_keywords k ON k.suggestion_id = s.id
WHERE k.keyword_prefix = 'esther'
That SQL query would return rows for all Pocket suggestions with keywords whose first word is esther. Then, the logic in our row handler would see that it's a low-confidence keyword, and take the LOW branch. "the wonder pig" starts with "th", so it's a match for the user's query.
Now let's say they typed in "wildlife stories". The prefix is wildlife; the suffix is stories. The SQL query for that looks the same as for "esther th":
SELECT s.*, k.keyword_suffix, k.confidence_type
FROM suggestions s
JOIN pocket_keywords k ON k.suggestion_id = s.id
WHERE k.keyword_prefix = 'wildlife'
...Then our row handler sees that it's a high-confidence keyword, so it takes the HIGH branch. stories == stories, so it's a match.
Now what if they type just "wildlife" (no suffix), or "wildlife st" (suffix = st)? The SQL query for that is the same, and would return the same suggestion:
SELECT s.*, k.keyword_suffix, k.confidence_type
FROM suggestions s
JOIN pocket_keywords k ON k.suggestion_id = s.id
WHERE k.keyword_prefix = 'wildlife'
But then our row handler would see that this is a high-confidence keyword, and the suffixes don't match: st != stories. So it filters out the suggestion.
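The row-handler half of this walkthrough can be sketched without a database. The sketch below assumes the SQL query has already narrowed the rows by keyword_prefix; the Confidence enum, the Row struct, and both function names are hypothetical stand-ins rather than the component's real types.

```rust
/// Hypothetical confidence classification for a stored keyword.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Confidence {
    Low,
    High,
}

/// Hypothetical shape of one row returned by the keyword_prefix query.
struct Row {
    suggestion_id: u32,
    keyword_suffix: String,
    confidence: Confidence,
}

/// Low-confidence keywords match on a suffix prefix;
/// high-confidence keywords must match in full.
fn suffixes_match(confidence: Confidence, stored_suffix: &str, user_suffix: &str) -> bool {
    match confidence {
        Confidence::Low => stored_suffix.starts_with(user_suffix),
        Confidence::High => stored_suffix == user_suffix,
    }
}

/// Keeps only the suggestion IDs whose rows survive the suffix check.
fn filter_rows(rows: &[Row], user_suffix: &str) -> Vec<u32> {
    rows.iter()
        .filter(|row| suffixes_match(row.confidence, &row.keyword_suffix, user_suffix))
        .map(|row| row.suggestion_id)
        .collect()
}
```

With the example data, filter_rows keeps the "esther" row for the suffix "th" and drops the "wildlife" row for the suffix "st" or for an empty suffix.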
I'm really liking this hybrid approach: it lets us use a single table for Pocket keywords, it avoids fancy suffix matching and conditionals in SQL, and it still gets us great query performance, and limits the number of rows we have to scan and read.
WDYT?
Nice, that'll work like a charm!
components/suggest/src/store.rs
Outdated
dao.drop_suggestions(&record_id)?;

// Ingest (or re-ingest) all suggestions in the
// attachment.
dao.insert_pocket_suggestions(&record_id, attachment.suggestions())?;

// Remove this record's ID from the list of unparsable
// records, since we understand it now.
dao.drop_unparsable_record_id(&record_id)?;

// Advance the last fetch time, so that we can resume
// fetching after this record if we're interrupted.
dao.put_last_ingest_if_newer(record.last_modified)?;
Looks like this sequence (drop-insert-drop-put) is used for all the suggestion types. Shall we wrap it in a helper on the DAO to keep ingest_records() DRY?
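One possible shape for such a helper is a function that runs the fixed drop-insert-drop-put sequence and takes the type-specific insert step as a closure. This is a sketch under assumptions: MockDao is a hypothetical stand-in that only records calls, the String error type is a placeholder, and ingest_record is an invented name.

```rust
/// Hypothetical stand-in for the DAO; it only records the calls made on it.
#[derive(Default)]
struct MockDao {
    calls: Vec<String>,
    last_ingest: u64,
}

impl MockDao {
    fn drop_suggestions(&mut self, record_id: &str) -> Result<(), String> {
        self.calls.push(format!("drop_suggestions({})", record_id));
        Ok(())
    }

    fn drop_unparsable_record_id(&mut self, record_id: &str) -> Result<(), String> {
        self.calls.push(format!("drop_unparsable_record_id({})", record_id));
        Ok(())
    }

    fn put_last_ingest_if_newer(&mut self, last_modified: u64) -> Result<(), String> {
        self.last_ingest = self.last_ingest.max(last_modified);
        self.calls.push(format!("put_last_ingest_if_newer({})", last_modified));
        Ok(())
    }
}

/// Runs the shared sequence; only the `insert` step differs per suggestion type.
fn ingest_record<F>(
    dao: &mut MockDao,
    record_id: &str,
    last_modified: u64,
    insert: F,
) -> Result<(), String>
where
    F: FnOnce(&mut MockDao, &str) -> Result<(), String>,
{
    dao.drop_suggestions(record_id)?;
    insert(dao, record_id)?;
    dao.drop_unparsable_record_id(record_id)?;
    dao.put_last_ingest_if_newer(last_modified)?;
    Ok(())
}
```

Each suggestion type would then pass its own insert closure, so the surrounding bookkeeping lives in one place.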
components/suggest/src/schema.rs
Outdated
@@ -43,6 +57,12 @@ pub const SQL: &str = "
        ON DELETE CASCADE
    );

    CREATE TABLE pocket_custom_details(
        suggestion_id INTEGER PRIMARY KEY REFERENCES suggestions(id) ON DELETE CASCADE,
        description TEXT NOT NULL,
Looks like description is not included in the Pocket suggestion or the UDL below, and Firefox doesn't use it for now. Shall we exclude it from the schema here as well? We can always add it back later if needed.
Wow it's great to see so much progress being made on the Rust implementation. Yeah the Pocket keywords were hand-edited by a group of people and it's entirely possible some keyword lists contain duplicates. I agree we should remove the duplicates at ingestion time, which we can do as you said, @ncloudioj. I can take care of that.
It looks like this isn't a problem for the Rust implementation (please correct me if I'm wrong), so I would vote for not worrying about it. Arguably this type of duplication isn't wrong; imagine if users could turn off high-confidence keywords and the related UI but keep the low-confidence ones, for example.
Should be fixed now.
Sounds good, thanks for confirming! @0c0w3 Could you briefly outline how low/high-confidence keywords work in Firefox?
Sure, Tif and I were chatting about this last week and here's the gist: When the user triggers a suggestion by typing a high-confidence keyword, the suggestion is shown as a top pick. When they trigger it using a low-confidence keyword, it’s shown at the bottom in the usual Suggest position. Also, high-confidence keywords must be typed in full, but Firefox will start matching on a low-confidence keyword after its first word is typed. References:
Example:
Ah, I see. Thanks for sharing that here! And now I understand why the keyword lookup queries were made that way in this PR.
components/suggest/src/db.rs
Outdated
FROM suggestions s
JOIN keywords k ON k.suggestion_id = s.id
WHERE k.keyword = :keyword
UNION
Note: in SQL, UNION removes duplicates across all the SELECT rows, whereas UNION ALL does not, so UNION ALL should run faster here, as there shouldn't be any duplicates in this case.
components/suggest/src/db.rs
Outdated
FROM suggestions s
JOIN pocket_high_confidence_keywords k ON k.suggestion_id = s.id
WHERE k.keyword = :keyword
UNION
If it already finds a record in the high-confidence keywords table, doing another lookup against the low-confidence keywords table is unnecessary. It doesn't look easy to express that conditional logic inside this query, so I guess we'll live with it.
I bet there's a way to do it, maybe:
UNION
SELECT IFNULL(
(
SELECT s.id, k.rank, s.title, s.url, s.provider, true as is_top_pick
FROM suggestions s
JOIN pocket_high_confidence_keywords k ON k.suggestion_id = s.id
WHERE k.keyword = :keyword
),
(
SELECT s.id, k.rank, s.title, s.url, s.provider, false as is_top_pick
FROM suggestions s
JOIN pocket_low_confidence_keywords k ON k.suggestion_id = s.id
WHERE k.keyword BETWEEN :pocket_keyword AND :pocket_keyword || x'ffff'
)
)
The high and low keywords could also be stored in the same table, so you do one lookup to get the matching keyword and its type, whether it's high or low, and then a join to get the suggestion.
Good point, but unfortunately, both IFNULL and COALESCE in SQLite only take scalar arguments, so a subquery that selects multiple columns will fail with an error.
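If the fallback can't be expressed inside a single query, one alternative is to do it in application code: try the high-confidence lookup first and only consult the low-confidence source when nothing is found. The sketch below uses two hypothetical in-memory maps in place of the two tables, and exact-match lookups for both (the real low-confidence lookup is a prefix match), so it only illustrates the short-circuiting.

```rust
use std::collections::HashMap;

/// Hypothetical result of a keyword lookup.
#[derive(Debug, PartialEq)]
struct Hit {
    suggestion_id: u32,
    is_top_pick: bool,
}

/// Looks in `high` first; `low` is consulted only when `high` has no entry,
/// because `or_else` evaluates its closure lazily.
fn lookup_with_fallback(
    high: &HashMap<String, u32>,
    low: &HashMap<String, u32>,
    keyword: &str,
) -> Option<Hit> {
    high.get(keyword)
        .map(|&id| Hit { suggestion_id: id, is_top_pick: true })
        .or_else(|| {
            low.get(keyword)
                .map(|&id| Hit { suggestion_id: id, is_top_pick: false })
        })
}
```

The trade-off is a second round trip in the miss case, versus always running both halves of the UNION.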
components/suggest/src/db.rs
Outdated
SELECT s.id, k.rank, s.title, s.url, s.provider, false as is_top_pick
FROM suggestions s
JOIN pocket_low_confidence_keywords k ON k.suggestion_id = s.id
WHERE k.keyword BETWEEN :pocket_keyword AND :pocket_keyword || x'ffff'
LIMIT 1",
Now that we have multiple keyword sources, maybe it's time to return all the hits back (sorted by score) to the consumer instead of 1. Even returning the one with the highest score could lead to some surprises, e.g. if the user dismissed that suggestion, all other candidates for the same keyword will never get a chance to show.
Also cc @linabutler @0c0w3 for their thoughts.
51e89e2 to 528868e
Thanks @tiftran! Putting up a quick round of comments before breakfast; I'll do a more thorough pass later!
use rusqlite::types::{FromSql, FromSqlError, FromSqlResult, ToSqlOutput, ValueRef};
use rusqlite::{Result as RusqliteResult, ToSql};

/// Classification of Pocket confidence keywords.
Let's expand this docstring a bit to explain the difference between a high-confidence and low-confidence keyword, so that someone looking at it for the first time (and our future selves!) remember the semantics.
components/suggest/src/schema.rs
Outdated
CREATE TABLE pocket_keywords(
    keyword_prefix TEXT NOT NULL,
    keyword_suffix TEXT NOT NULL DEFAULT '',
    confidence_type INTEGER NOT NULL,
-     confidence_type INTEGER NOT NULL,
+     confidence INTEGER NOT NULL,
+     rank INTEGER NOT NULL
Let's rename confidence_type to confidence (to match our other integer enum field, provider), and add rank, for the position of the keyword in the list of keywords.
components/suggest/src/store.rs
Outdated
// Ingest (or re-ingest) all suggestions in the
// attachment.
block(dao, &record_id, attachment.suggestions())?;
// dao.insert_amp_wikipedia_suggestions(&record_id, attachment.suggestions())?;
- // dao.insert_amp_wikipedia_suggestions(&record_id, attachment.suggestions())?;
Nit: Let's remove this commented-out code.
components/suggest/src/store.rs
Outdated
// An AMP-Wikipedia record should always have an
// attachment with suggestions. If it doesn't, it's
// malformed, so skip to the next record.
- // An AMP-Wikipedia record should always have an
- // attachment with suggestions. If it doesn't, it's
- // malformed, so skip to the next record.
+ // A record should always have an attachment with suggestions.
+ // If it doesn't, it's malformed, so we can't ingest it.
Nit: This isn't specific to AMP-Wikipedia anymore! 🎉
components/suggest/src/schema.rs
Outdated
    keyword_suffix TEXT NOT NULL DEFAULT '',
    confidence_type INTEGER NOT NULL,
    suggestion_id INTEGER NOT NULL REFERENCES suggestions(id),
    PRIMARY KEY (keyword_prefix, keyword_suffix, suggestion_id) );
-     PRIMARY KEY (keyword_prefix, keyword_suffix, suggestion_id) );
+     PRIMARY KEY (keyword_prefix, keyword_suffix, suggestion_id)
+ );
Style nit.
components/suggest/src/db.rs
Outdated
let (keyword_prefix, keyword_suffix) = match keyword.find(' ') {
    Some(index) => {
        let (prefix, suffix) = keyword.split_at(index);
        (prefix.to_string(), suffix[1..].to_string())
    }
    None => (keyword.clone(), String::new()),
};
Let's lean on the Rust standard library here!
- let (keyword_prefix, keyword_suffix) = match keyword.find(' ') {
-     Some(index) => {
-         let (prefix, suffix) = keyword.split_at(index);
-         (prefix.to_string(), suffix[1..].to_string())
-     }
-     None => (keyword.clone(), String::new()),
- };
+ let (keyword_prefix, keyword_suffix) = keyword.split_once(' ').unwrap_or_else(|| (keyword, ""));
split_once will do the checking and slicing for you, and returns references, so we can also avoid the extra copies with to_string() and clone().
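For reference, here is how str::split_once behaves on a few edge cases, wrapped in a small helper. The helper name split_on_first_space is made up for this sketch, and it assumes the input is a &str.

```rust
/// Splits on the first space and borrows both halves from the input.
/// If there is no space, the whole input is the prefix and the suffix is empty.
/// (Hypothetical helper name, for illustration only.)
fn split_on_first_space(keyword: &str) -> (&str, &str) {
    keyword.split_once(' ').unwrap_or((keyword, ""))
}
```

Because both halves are slices of the original string, no allocation happens; a trailing space yields an empty suffix, and a leading space yields an empty prefix.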
These loops also look mostly identical! We can use some of Rust's iterator methods to cut down on the repetition:
for ((rank, keyword), confidence) in suggestion
    .high_confidence_keywords
    .iter()
    .enumerate()
    .zip(std::iter::repeat(PocketKeywordConfidence::High))
    .chain(
        suggestion
            .low_confidence_keywords
            .iter()
            .enumerate()
            .zip(std::iter::repeat(PocketKeywordConfidence::Low)),
    )
{
    // ...
}
std::iter::repeat returns an infinite iterator, and zip-ping another iterator with it will combine elements from both of them, and stop as soon as one of the iterators is exhausted.
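A standalone version of the same iterator chain, for anyone who wants to try it outside the component. The Confidence enum and the ranked_keywords function are stand-ins invented for this sketch; the real code uses PocketKeywordConfidence and the suggestion's own keyword lists.

```rust
/// Stand-in for the component's keyword confidence enum.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Confidence {
    High,
    Low,
}

/// Tags every keyword with its rank within its own list and its confidence,
/// yielding all high-confidence keywords first, then all low-confidence ones.
fn ranked_keywords<'a>(
    high: &'a [&'a str],
    low: &'a [&'a str],
) -> Vec<(usize, &'a str, Confidence)> {
    high.iter()
        .enumerate()
        // `repeat` is infinite; `zip` stops when the keyword list runs out.
        .zip(std::iter::repeat(Confidence::High))
        .chain(
            low.iter()
                .enumerate()
                .zip(std::iter::repeat(Confidence::Low)),
        )
        .map(|((rank, keyword), confidence)| (rank, *keyword, confidence))
        .collect()
}
```

Note that enumerate restarts at 0 for the second list, so ranks are per-list rather than global.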
😵 wow this is fancy
components/suggest/src/db.rs
Outdated
let (keyword_prefix, keyword_suffix) = match keyword.find(' ') {
    Some(index) => {
        let (prefix, suffix) = keyword.split_at(index);
        (prefix.to_string(), suffix[1..].to_string())
    }
    None => (keyword.clone(), String::new()),
};
Same comment here about using split_once.
components/suggest/src/db.rs
Outdated
    PocketKeywordConfidence::Low => row.get::<_, String>("keyword_suffix")?.starts_with(&keyword_suffix),
    PocketKeywordConfidence::High => row.get::<_, String>("keyword_suffix")? == keyword_suffix,
};
if suffixes_match{
- if suffixes_match{
+ if suffixes_match {
components/suggest/src/db.rs
Outdated
let (keyword_prefix, keyword_suffix) = match keyword.find(' ') {
    Some(index) => {
        let (prefix, suffix) = keyword.split_at(index);
        (prefix.to_string(), suffix[1..].to_string())
    }
    None => (keyword.to_string(), String::new()),
};
This block is used three times in this module; you can extract it into a helper function using Lina's suggestion.
components/suggest/src/db.rs
Outdated
    None => (keyword.to_string(), String::new()),
};
Ok(self.conn.query_rows_and_then_cached(
    "SELECT s.id, k.rank, s.title, s.url, s.provider, -1 as confidence_type
I'd recommend using NULL instead of -1 for the missing field. Also, how about we move this field to pocket_custom_details as it's Pocket-specific? We will need to do more in the handler, but I think it's still better than introducing a dummy column to the suggestion table.
WDYT?
sooo confidence sits in the pocket_keywords table; the -1 is used as a dummy value 'cause UNION requires the same number of columns. Is there a way to smush the two results together without having the same number of columns?
Right, it's in the pocket_keywords table. If we don't include it here, we'd have to do another lookup against the keywords table. NVM, let's use NULL as the dummy value then.
components/suggest/src/db.rs
Outdated
SuggestionProvider::Pocket => {
    let confidence_type = row.get("confidence_type")?;
    self.conn.query_row_and_then(
        "SELECT keyword_suffix FROM pocket_keywords WHERE suggestion_id = :suggestion_id AND keyword_prefix = :keyword_prefix",
- "SELECT keyword_suffix FROM pocket_keywords WHERE suggestion_id = :suggestion_id AND keyword_prefix = :keyword_prefix",
+ "SELECT k.keyword_suffix FROM pocket_keywords k
+  WHERE k.suggestion_id = :suggestion_id
+  AND k.keyword_prefix = :keyword_prefix",
🙈
e3879bc to 33d3470
This is looking great, thanks for folding in all those changes!
Looks great, thanks for cranking it out!
r=nanj with a few nits and comments for discussion.
self.conn.query_rows_and_then_cached(
"SELECT s.id, k.rank, s.title, s.url, s.provider
let (keyword_prefix, keyword_suffix) = split_keyword(keyword);
Ok(self.conn.query_rows_and_then_cached(
Had to rotate my monitor to read the whole function body ;-). Looks like we can extract some more helper functions from the pattern match expression below to make it more readable. Feel free to do it in a follow-up.
FROM suggestions s
JOIN keywords k ON k.suggestion_id = s.id
WHERE k.keyword = :keyword
LIMIT 1",
UNION ALL
SELECT s.id, k.rank, s.title, s.url, s.provider, k.confidence, k.keyword_suffix
Note that for Pocket suggestions, it will still try loading keywords from the keywords table (line 149), although there should be none. Seems like keywords loading can be skipped for AMO and Pocket suggestions.
Yes, let's get a follow-up on file to let callers specify which providers they want—we can use that to optimize which query we run, and on mobile to restrict suggestions to AMP and Wikipedia.
},
)
}
)?.into_iter().flatten().collect())
Discussion: we could provide some extra convenience by sorting the returned suggestions by certain criteria, e.g. by score or rank. For example, if a keyword matches both the high-confidence and the low-confidence keywords for a Pocket suggestion, it'd be nice to put the high-confidence one in front.
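As a rough illustration of that ordering idea, the sketch below sorts hits so that high-confidence matches come first, with rank as the tie-breaker. Hit, Confidence, and sort_hits are hypothetical names, and ordering by (confidence, rank) is only one of the criteria floated here, not a decision made in this PR.

```rust
/// Stand-in confidence enum. Deriving `Ord` orders variants by declaration,
/// so `High` sorts before `Low`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Confidence {
    High,
    Low,
}

/// Hypothetical shape of a matched suggestion.
#[derive(Debug, PartialEq)]
struct Hit {
    title: String,
    confidence: Confidence,
    rank: usize,
}

/// Sorts in place: high-confidence hits first, then by ascending rank.
/// `sort_by_key` is stable, so equal keys keep their original order.
fn sort_hits(hits: &mut [Hit]) {
    hits.sort_by_key(|hit| (hit.confidence, hit.rank));
}
```

A score-based ordering would swap the key for the score field, if and when suggestions carry one.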
The pull request has been modified, dismissing previous reviews.
components/suggest/src/pocket.rs
Outdated
/// assert_eq!(split_keyword("foo"), ("foo", ""));
/// assert_eq!(split_keyword("foo bar baz"), ("foo", "bar baz"));
/// ```
@tiftran Did you happen to get a build failure with the backticks? 😅
The triple backticks will also compile and run the code between them as a "doctest". Under the hood, rustdoc builds each doctest as a separate program, so it has to import any functions and types that it uses, like this:
/// ```
/// # use suggest::pocket::split_keyword;
/// assert_eq!(split_keyword("foo"), ("foo", ""));
/// assert_eq!(split_keyword("foo bar baz"), ("foo", "bar baz"));
/// ```
(The # hides the import from the rendered snippet).
But that runs into another issue: the pocket module is private, so the doctest can't use it. It's called out (though a bit buried) in this section:
Note that they will still link against only the public items of your crate; if you need to test private items, you need to write a unit test.
We have a few options to make this work:
- Add the ignore attribute to the code block.
- Make this module public in lib.rs, with pub mod pocket;, and add the # use above. It does mean that our Rust component's public API is different than the UniFFIed API, but given that the Rust component is never consumed directly by other Rust code, I think that's okay.
- Rewrite this doctest as a unit test. This is what keyword.rs and suggestion.rs do.
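For the third option, the doctest assertions would move into a test module next to the function, roughly like the sketch below. The body of split_keyword shown here is an assumption based on the split_once suggestion earlier in this review, and the test name is made up.

```rust
/// Splits a keyword on its first space into (prefix, suffix).
/// Assumed implementation, following the earlier split_once suggestion.
fn split_keyword(keyword: &str) -> (&str, &str) {
    keyword.split_once(' ').unwrap_or((keyword, ""))
}

// Unit tests live in the same crate, so they can reach private items
// without the module having to be made public.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn splits_on_first_space() {
        assert_eq!(split_keyword("foo"), ("foo", ""));
        assert_eq!(split_keyword("foo bar baz"), ("foo", "bar baz"));
    }
}
```

The test module is compiled only under cargo test, so it adds nothing to the shipped component.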
oooh
Wonderful! r+wc from me, too!
The pull request has been modified, dismissing previous reviews.
1df5e5a to 2e27f25
Thanks all—they're all ingested correctly from Remote Settings now! 🎉
Please fold in @ncloudioj's last suggestion, squash, and let's get this landed!
The pull request has been modified, dismissing previous reviews.
2a0dc4b to 0ec35cc
Pull Request checklist
[ci full] to the PR title.
Branch builds: add [firefox-android: branch-name] to the PR title.