[DISCO-2503] Suggest: Pocket suggestion ingestion #5841

tiftran · 2023-09-25T18:56:14Z

Pull Request checklist

Breaking changes: This PR follows our breaking change policy
- This PR follows the breaking change policy:
  - This PR has no breaking API changes, or
  - There are corresponding PRs for our consumer applications that resolve the breaking changes and have been approved
Quality: This PR builds and tests run cleanly
- Note:
  - For changes that need extra cross-platform testing, consider adding [ci full] to the PR title.
  - If this pull request includes a breaking change, consider cutting a new release after merging.
Tests: This PR includes thorough tests or an explanation of why it does not
Changelog: This PR includes a changelog entry in CHANGELOG.md or an explanation of why it does not need one
- Any breaking changes to Swift or Kotlin binding APIs are noted explicitly
Dependencies: This PR follows our dependency management guidelines
- Any new dependencies are accompanied by a summary of the due dilligence applied in selecting them.

Branch builds: add [firefox-android: branch-name] to the PR title.

codecov-commenter · 2023-09-25T19:02:46Z

Codecov Report

Attention: 104 lines in your changes are missing coverage. Please review.

Comparison is base (399fe73) 36.91% compared to head (2a0dc4b) 36.83%.
Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5841      +/-   ##
==========================================
- Coverage   36.91%   36.83%   -0.08%     
==========================================
  Files         346      347       +1     
  Lines       33430    33501      +71     
==========================================
  Hits        12340    12340              
- Misses      21090    21161      +71

Files	Coverage Δ
components/suggest/src/lib.rs	`0.00% <ø> (ø)`
components/suggest/src/provider.rs	`0.00% <0.00%> (ø)`
components/suggest/src/schema.rs	`0.00% <0.00%> (ø)`
components/suggest/src/suggestion.rs	`0.00% <0.00%> (ø)`
components/suggest/src/rs.rs	`0.00% <0.00%> (ø)`
components/suggest/src/pocket.rs	`0.00% <0.00%> (ø)`
components/suggest/src/store.rs	`0.00% <0.00%> (ø)`
components/suggest/src/db.rs	`0.00% <0.00%> (ø)`

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

linabutler · 2023-09-26T17:28:58Z

Thank you so much, Tif—and thanks for making the changes since yesterday. Gentle reminder to bump the schema version and migration function in schema.rs again 😊 I'll do another review pass later today!

I tested your patch out against the real dataset in Remote Settings, and saw some unique constraint errors on both pocket_high_confidence_keywords and pocket_low_confidence_keywords. There are only a handful of duplicates, but they are there:

sqlite> select keyword, group_concat(suggestion_id, ',') from pocket_high_confidence_keywords group by keyword having count(*) > 1;
african billionaires|71,71
april fools day|7,311
arthur brooks column|24,24
arthur c brooks|24,24
arthur c brooks column|24,24,24,24
budget-friendly sustainability|247,247
cheesy corn dip|26,26
dallas cowboy cheerleaders|111,111
festive vegan dishes|17,275
how to focus|119,119
how to get clearer skin|231,231
how to recognize burnout|232,232
how to write anything|168,168
making new friends as adults|251,251
skin maintenance|231,231
sustainable habits|247,247
the atlantic arthur brooks|24,24

(I dropped the PRIMARY KEY definition from both tables to get a database that I could query to build that list).

The ones with different suggestion IDs—april fools day, festive vegan dishes—shouldn't be a problem, because the Pocket keywords tables have PRIMARY KEY (keyword, suggestion_id), so it's OK for different suggestions to have the same keyword. We might need to do an ORDER BY score in the SELECT for fetch_by_keyword to break ties, though, or drop the LIMIT 1.

But the ones with multiple copies of the same suggestion ID are more interesting. It really looks like their keywords arrays have duplicates!

DownloadedPocketSuggestion {
  description: "The best advice for finding your people, staying close, and getting through the hard parts.",
  url: "https://getpocket.com/collections/the-grown-ups-guide-to-making-and-keeping-friends",
  title: "The Grown-Up’s Guide to Making and Keeping Friends",
  low_confidence_keywords: ["what to do if you're lonely", "dealing with loneliness", "lonely what to do", "what to do lonely"],
  high_confidence_keywords: ["how to make friends as adults", "making new friends as adults", "how to make friends as an adult", "making new friends as adults", "making new friends as an adult", "how to make friends", "making friendships", "keeping friendships", "friendship maintenance", "adult social networks", "new friends", "adult social life", "making friends", "making new friends", "how to make new friends", "how to keep friends", "how to maintain friendships", "maintain friendships", "maintaining friendships", "adult friendships", "socializing tips", "nurturing friendships", "friendship longevity", "friendships in adulthood", "grown up friendships", "grown-up friendships"],
  score: 0.25 }

We can fix that by using either a HashSet<String> instead of a Vec<String> for the keywords lists in DownloadedPocketSuggestion, or an INSERT OR IGNORE when we insert them into the tables—but I wonder if we should take care of this in quicksuggest-rs, during ingestion instead. WDYT?

/cc @ncloudioj @0c0w3 for visibility, too!

ncloudioj · 2023-09-26T18:29:34Z

but I wonder if we should take care of this in quicksuggest-rs, during ingestion instead. WDYT?

Yes, we can easily dedup keywords for each suggestion at the ingestion phase over here (BTW, Pocket suggestions are ingested by this Merino job instead of quickssugest-rs). Do we also want to do deduping among low and high-confidence keywords for each suggestion?

@0c0w3 What do you think?

ncloudioj

Looks good, added a few comments for your consideration.

ncloudioj · 2023-09-26T18:59:33Z

components/suggest/src/db.rs

@@ -123,14 +124,29 @@ impl<'a> SuggestDao<'a> {

    /// Fetches suggestions that match the given keyword from the database.
    pub fn fetch_by_keyword(&self, keyword: &str) -> Result<Vec<Suggestion>> {
+        let pocket_keyword = match keyword.contains(' ') {
+            true => keyword.to_string(),
+            false => keyword.to_owned() + " ",


This reads if the keyword doesn't have whitespace in it, then we append one to the end. Hmm, why do we need this extra trailing space for Pocket keywords?

Also, a nitpick that it's a little inconsistent to use to_string() and to_owned() in both arms. Maybe use Cow instead:

use std::borrow::Cow; let mut pocket_keyword = Cow::from(keyword); if !keyword.contains(' ') { pocket_keyword.to_mut().push(' '); }

With Drew's comment, now I understand what's going on here. I'd recommend to add some comments to outline how keyword lookup is done for each suggestion type as to me it's much harder to fathom that by looking at the SQL query.

Oooh, I really like Nan's suggestion of using Cow that way!

But I think that this logic won't handle single-word low-confidence keywords correctly. For example, I don't think that workaholism will match the Pocket suggestion in the test now—but it should. Is that right?

IIUC, another way to write our algorithm for matching low-confidence keywords is:

The head (first word) of the query must be the same as the low-confidence keyword's head.

The tail (anything after the first word) of the query—if any—should be a prefix of the keyword's tail.

So all these queries should match the Pocket suggestion: workaholism, soft (first word of soft life), soft l, soft li (first words match; l and li are prefixes of life).

And these shouldn't, because their first words don't match: work, sof, etc.

One way we could make this work is to store the head and tail in separate columns when we insert them into the table. Then, when we fetch suggestions, we could split the user's query the same way, and match the head and tail prefix.

Here's a quick sketch of how that all could work! Let's say we have two suggestions, and all their low-confidence keywords start with breakfast:

title: "How to Build a Better Breakfast" keywords: ["breakfast"] title: Your New Favorite Breakfast Sandwich keywords: ["breakfast crunch wrap", "breakfast torta"]

If we stored those suggestions and keywords in a schema like this:

CREATE TABLE suggestions(id INTEGER PRIMARY KEY, title TEXT NOT NULL); CREATE TABLE keywords(keyword_head TEXT NOT NULL, keyword_tail TEXT NOT NULL, suggestion_id INTEGER NOT NULL REFERENCES suggestions(id), PRIMARY KEY (keyword_head, keyword_tail)); INSERT INTO suggestions(id, title) VALUES(1, 'How to Build a Better Breakfast'); INSERT INTO suggestions(id, title) VALUES(2, 'Your New Favorite Breakfast Sandwich'); INSERT INTO keywords(keyword_head, keyword_tail, suggestion_id) VALUES('breakfast', '', 1); INSERT INTO keywords(keyword_head, keyword_tail, suggestion_id) VALUES('breakfast', 'crunch wrap', 2); INSERT INTO keywords(keyword_head, keyword_tail, suggestion_id) VALUES('breakfast', 'torta', 2);

Here's how a query plan for breakfast would look:

sqlite> SELECT s.id, s.title FROM suggestions s JOIN keywords k ON k.suggestion_id = s.id WHERE k.keyword_head = 'breakfast' AND k.keyword_tail BETWEEN '' AND '' || x'ffff' GROUP BY s.id, s.title; QUERY PLAN |--SEARCH k USING INDEX sqlite_autoindex_keywords_1 (keyword_head=? AND keyword_tail>? AND keyword_tail<?) |--SEARCH s USING INTEGER PRIMARY KEY (rowid=?) `--USE TEMP B-TREE FOR GROUP BY 1|How to Build a Better Breakfast 2|Your New Favorite Breakfast Sandwich

(To play around with that schema in the sqlite3 shell, I saved the schema into a file, then ran sqlite3 -init file.sql to load it).

Since Your New Favorite Breakfast Sandwich has two keywords that start with breakfast, we need a GROUP BY or a DISTINCT to filter out duplicates—or we'll get two rows for it, one for each of its keywords. But the search over keywords can use all columns of the primary key (that's the sqlite_autoindex_keywords_1 automatic index), so the most expensive part of the query is still pretty efficient. By the time we build the temporary b-tree for the GROUP BY, we should have reduced the number of rows to a small handful.

The query plan for breakfast cru looks the same, and finds just the one suggestion:

sqlite> SELECT s.id, s.title FROM suggestions s JOIN keywords k ON k.suggestion_id = s.id WHERE k.keyword_head = 'breakfast' AND k.keyword_tail BETWEEN 'cru' AND 'cru' || x'ffff' GROUP BY s.id, s.title; QUERY PLAN |--SEARCH k USING INDEX sqlite_autoindex_keywords_1 (keyword_head=? AND keyword_tail>? AND keyword_tail<?) |--SEARCH s USING INTEGER PRIMARY KEY (rowid=?) `--USE TEMP B-TREE FOR GROUP BY 2|Your New Favorite Breakfast Sandwich

And so does breakfast torta:

sqlite> SELECT s.id, s.title FROM suggestions s JOIN keywords k ON k.suggestion_id = s.id WHERE k.keyword_head = 'breakfast' AND k.keyword_tail BETWEEN 'torta' AND 'torta' || x'ffff' GROUP BY s.id, s.title; QUERY PLAN |--SEARCH k USING INDEX sqlite_autoindex_keywords_1 (keyword_head=? AND keyword_tail>? AND keyword_tail<?) |--SEARCH s USING INTEGER PRIMARY KEY (rowid=?) `--USE TEMP B-TREE FOR GROUP BY 2|Your New Favorite Breakfast Sandwich

Wow, that's some really cool SQL magic there! 👏

I'd like to offer another approach: How about we move this qualifying logic over to the row handler closure?

We currently only fetch data from other tables and then assemble the suggestion accordingly in that closure. I think we can also use that as the post-processor for each candidate. Specifically, the post-processor can transform, filter, and augment each row if needed. Two benefits I can see here:

In certain cases, it's much easier to do stuff in native code than in SQL

To keep those keywords table schema lean and generic. Imagine if we need to change the matching logic for the low-confidence keywords, we can still do prefix queries in SQL and apply more specific matching logic in the post-processor. No change is needed to the schema!

WDYT?

How about we move this qualifying logic over to the row handler closure?

Yes!

Filtering in the row handler is definitely a great option, with a small result set. For example, if our query returns 5 rows, and we filter all of them out—that's OK, because we only had to scan and read 5 rows. But if our query returns hundreds or thousands of rows, filtering in the row handler isn't as efficient—we still end up scanning and reading in all those rows, and then filtering them out in code.

For low-confidence Pocket suggestions, this is a bit tricky to do naively, because we want the whole first word to match, and only match prefixes for subsequent words. If we wanted to keep the schema super generic, with a single keyword column, I think we'd need to do a prefix match unconditionally. So, if the user searches for b or br, we'd return suggestions for all keywords that start with b or br—and then filter out all of them, because b and br aren't whole words.

We could combine the approaches, through—a less generic schema, but also doing more work in Rust code instead of SQL:

Split each keyword into keyword_head and keyword_tail (I'm realizing keyword_prefix and keyword_suffix might actually be better names, so I'll use that instead of "head" and "tail" now! 😅), like I suggested, and...

Filter on just keyword_prefix in the SQL query, then filter on keyword_suffix in the row handler, like you suggested!

I think this would also line up nicely with @0c0w3's suggestion to use a single table to store the keywords, and have a column for whether it's low or high-confidence—and doing the filtering after.

So we'd have a schema like:

CREATE TABLE pocket_keywords( keyword_prefix TEXT NOT NULL, keyword_suffix TEXT NOT NULL DEFAULT "", confidence_type INTEGER NOT NULL, suggestion_id INTEGER NOT NULL REFERENCES suggestions(id), PRIMARY KEY (keyword_prefix, keyword_suffix, suggestion_id) );

With the data from Drew's example like:

keyword_prefix | keyword_suffix | confidence_type ---------------+----------------+---------------- esther | the wonder pig | LOW wildlife | stories | HIGH

(Assuming LOW and HIGH are integer constants there).

And then, in the row handler, we'd have logic like:

let suffixes_match = match row.get("confidence_type")? { LOW => row.get("keyword_suffix")?.starts_with(user_keyword_suffix), HIGH => row.get("keyword_suffix")? == user_keyword_suffix, };

Let's say the user typed in esther th. Splitting the user's query on the first space, the prefix is esther, and the suffix is th. The SQL query for that would look like this:

SELECT s.*, k.keyword_suffix, k.confidence_type FROM suggestions s JOIN pocket_keywords k ON k.suggestion_id = s.id WHERE k.keyword_prefix = 'esther'

That SQL query would return rows for all Pocket suggestions with keywords whose first word is esther. Then, the logic in our row handler would see that it's a low-confidence keyword, and take the LOW branch. the wonder pig starts with th, so it's a match for the user's query.

Now let's say they typed in wildlife stories. The prefix is wildlife; the suffix is stories. The SQL query for that looks the same as for esther th:

SELECT s.*, k.keyword_suffix, k.confidence_type FROM suggestions s JOIN pocket_keywords k ON k.suggestion_id = s.id WHERE k.keyword_prefix = 'wildlife'

...Then our row handler sees that it's a high-confidence keyword, so it takes the HIGH branch. stories == stories, so it's a match.

Now what if they type just wildlife (no suffix), or wildlife st (suffix = st)? The SQL query for that is the same, and would return the same suggestion:

SELECT s.*, k.keyword_suffix, k.confidence_type FROM suggestions s JOIN pocket_keywords k ON k.suggestion_id = s.id WHERE k.keyword_prefix = 'wildlife'

But then our row handler would see that this is a high-confidence keyword, and the suffixes don't match: st != stories. So it filters out the suggestion.

I'm really liking this hybrid approach: it lets us use a single table for Pocket keywords, it avoids fancy suffix matching and conditionals in SQL, and it still gets us great query performance, and limits the number of rows we have to scan and read.

WDYT?

Nice, that'll work like a charm!

ncloudioj · 2023-09-26T19:13:01Z

components/suggest/src/store.rs

+                        dao.drop_suggestions(&record_id)?;
+
+                        // Ingest (or re-ingest) all suggestions in the
+                        // attachment.
+                        dao.insert_pocket_suggestions(&record_id, attachment.suggestions())?;
+
+                        // Remove this record's ID from the list of unparsable
+                        // records, since we understand it now.
+                        dao.drop_unparsable_record_id(&record_id)?;
+
+                        // Advance the last fetch time, so that we can resume
+                        // fetching after this record if we're interrupted.
+                        dao.put_last_ingest_if_newer(record.last_modified)?;


Looks like this sequence of (drop-insert-drop-put) has been used for all the suggestion types, shall we wrap them up into another helper in DAO to keep ingest_records() DRY?

ncloudioj · 2023-09-26T19:17:03Z

components/suggest/src/schema.rs

@@ -43,6 +57,12 @@ pub const SQL: &str = "
        ON DELETE CASCADE
    );

+    CREATE TABLE pocket_custom_details(
+        suggestion_id INTEGER PRIMARY KEY REFERENCES suggestions(id) ON DELETE CASCADE,
+        description TEXT NOT NULL,


Looks like description is not included in the Pocket suggestion & the udl below, also looks like Firefox doesn't use it for now. Shall we exclude it from the schema here as well? We can always add it back later if needed.

0c0w3 · 2023-09-26T21:37:05Z

Wow it's great to see so much progress being made on the Rust implementation.

Yeah the Pocket keywords were hand-edited by a group of people and it's entirely possible some keyword lists contain duplicates. I agree we should remove the duplicates at ingestion time, which we can do as you said, @ncloudioj. I can take care of that.

Do we also want to do deduping among low and high-confidence keywords for each suggestion?

It looks like this isn't a problem for the Rust implementation (please correct me if I'm wrong), so I would vote for not worrying about it. Arguably this type of duplication isn't wrong; imagine if users could turn off high-confidence keywords and the related UI but keep the low-confidence ones, for example.

0c0w3 · 2023-09-27T00:24:18Z

Yeah the Pocket keywords were hand-edited by a group of people and it's entirely possible some keyword lists contain duplicates. I agree we should remove the duplicates at ingestion time, which we can do as you said, @ncloudioj. I can take care of that.

Should be fixed now.

ncloudioj · 2023-09-27T01:06:01Z

Arguably this type of duplication isn't wrong; imagine if users could turn off high-confidence keywords and the related UI but keep the low-confidence ones, for example.

Sounds good, thanks for confirming! @0c0w3 Could you briefly outline how low/high-confidence keywords work in Firefox?

0c0w3 · 2023-09-27T01:18:03Z

Sure, Tif and I were chatting about this last week and here's the gist:

When the user triggers a suggestion by typing a high confidence keyword, the suggestion is shown as a top pick. When they trigger it using a low confidence keyword, it’s shown at the bottom in the usual Suggest position. Also, high confidence keywords must be typed in full, but Firefox will start matching on a low confidence keyword after its first word is typed.

References:

Product spec
Desktop impl
Figma
To enable on desktop (118+): browser.urlbar.pocket.featureGate = true and browser.urlbar.bestMatch.enabled = true (also make sure Suggest is enabled, it is by default if you’re in the U.S.)

Example:

High: ["wildlife stories"]
Low: ["esther the wonder pig"]

Query                   | Match
------------------------+------
esthe                   | none
esther                  | low
esther (trailing space) | low
esther t                | low
esther th               | low
(etc.)                  | low
esther the wonder pig   | low
wildlif                 | none
wildlife                | none
(etc.)                  | none
wildlife storie         | none
wildlife stories        | high

ncloudioj · 2023-09-27T03:07:46Z

Ah, I see. Thanks for sharing that here! And now I understand why the keyword lookup queries were made that way in this PR.

ncloudioj · 2023-09-27T03:27:02Z

components/suggest/src/db.rs

@@ -123,14 +124,29 @@ impl<'a> SuggestDao<'a> {

    /// Fetches suggestions that match the given keyword from the database.
    pub fn fetch_by_keyword(&self, keyword: &str) -> Result<Vec<Suggestion>> {
+        let pocket_keyword = match keyword.contains(' ') {
+            true => keyword.to_string(),
+            false => keyword.to_owned() + " ",


With Drew's comment, now I understand what's going on here. I'd recommend to add some comments to outline how keyword lookup is done for each suggestion type as to me it's much harder to fathom that by looking at the SQL query.

ncloudioj · 2023-09-27T03:57:42Z

components/suggest/src/db.rs

             FROM suggestions s
             JOIN keywords k ON k.suggestion_id = s.id
             WHERE k.keyword = :keyword
+             UNION


Note: in SQL, UNION will remove duplicates among all the SELECT rows, whereas UNION ALL will not, so UNION ALL should run faster here as there shouldn't be any duplicates in this case.

ncloudioj · 2023-09-27T04:02:57Z

components/suggest/src/db.rs

+             FROM suggestions s
+             JOIN pocket_high_confidence_keywords k ON k.suggestion_id = s.id
+             WHERE k.keyword = :keyword
+             UNION


If it already finds a record from the high confidence keywords table, doing another lookup against the low-confidence keywords table is unnecessary. Looks like it's not easy to inject that conditional query logic inside this query, guess we will live with that.

I bet there's a way to do it, maybe:

UNION SELECT IFNULL( ( SELECT s.id, k.rank, s.title, s.url, s.provider, true as is_top_pick FROM suggestions s JOIN pocket_high_confidence_keywords k ON k.suggestion_id = s.id WHERE k.keyword = :keyword ), ( SELECT s.id, k.rank, s.title, s.url, s.provider, false as is_top_pick FROM suggestions s JOIN pocket_low_confidence_keywords k ON k.suggestion_id = s.id WHERE k.keyword BETWEEN :pocket_keyword AND :pocket_keyword || x'ffff' ) )

The high and low keywords could also be stored in the same table, so you do one lookup to get the matching keyword and its type, whether it's high or low, and then a join to get the suggestion.

Good point, but unfortunately, both IFNULL and COALESCE in SQLite only take scalars as the arguments, selecting multiple columns will lead to a syntax error.

ncloudioj · 2023-09-27T04:14:53Z

components/suggest/src/db.rs

+             SELECT s.id, k.rank, s.title, s.url, s.provider, false as is_top_pick
+             FROM suggestions s
+             JOIN pocket_low_confidence_keywords k ON k.suggestion_id = s.id
+             WHERE k.keyword BETWEEN :pocket_keyword AND :pocket_keyword || x'ffff'
             LIMIT 1",


Now that we have multiple keyword sources, maybe it's time to return all the hits back (sorted by score) to the consumer instead of 1. Even returning the one with the highest score could lead to some surprises, e.g. if the user dismissed that suggestion, all other candidates for the same keyword will never get a chance to show.

Also cc @linabutler @0c0w3 for their thoughts.

linabutler

Thanks @tiftran! Putting up a quick round of comments before breakfast; I'll do a more thorough pass later!

components/suggest/src/pocket_confidence.rs

linabutler · 2023-10-03T16:46:09Z

components/suggest/src/pocket_confidence.rs

+use rusqlite::types::{FromSql, FromSqlError, FromSqlResult, ToSqlOutput, ValueRef};
+use rusqlite::{Result as RusqliteResult, ToSql};
+
+/// Classification of Pocket confidence keywords.


Let's expand this docstring a bit to explain the difference between a high-confidence and low-confidence keyword, so that someone looking at it for the first time (and our future selves!) remember the semantics.

linabutler · 2023-10-03T16:46:35Z

components/suggest/src/schema.rs

+    CREATE TABLE pocket_keywords( 
+        keyword_prefix TEXT NOT NULL, 
+        keyword_suffix TEXT NOT NULL DEFAULT '', 
+        confidence_type INTEGER NOT NULL, 


Suggested change

confidence_type INTEGER NOT NULL,

confidence INTEGER NOT NULL,

rank INTEGER NOT NULL

Let's rename confidence_type to confidence (to match our other integer enum field, provider), and add rank—for the position of the keyword in the list of keywords.

linabutler · 2023-10-03T17:03:21Z

components/suggest/src/store.rs

+            // Ingest (or re-ingest) all suggestions in the
+            // attachment.
+            block(dao, &record_id, attachment.suggestions())?;
+            // dao.insert_amp_wikipedia_suggestions(&record_id, attachment.suggestions())?;


Suggested change

// dao.insert_amp_wikipedia_suggestions(&record_id, attachment.suggestions())?;

Nit: Let's remove this commented-out code.

linabutler · 2023-10-03T17:05:38Z

components/suggest/src/store.rs

+            // An AMP-Wikipedia record should always have an
+            // attachment with suggestions. If it doesn't, it's
+            // malformed, so skip to the next record.


Suggested change

// An AMP-Wikipedia record should always have an

// attachment with suggestions. If it doesn't, it's

// malformed, so skip to the next record.

// A record should always have an attachment with suggestions.

// If it doesn't, it's malformed, so we can't ingest it.

Nit: This isn't specific to AMP-Wikipedia anymore! 🎉

linabutler · 2023-10-03T17:06:08Z

components/suggest/src/schema.rs

+        keyword_suffix TEXT NOT NULL DEFAULT '', 
+        confidence_type INTEGER NOT NULL, 
+        suggestion_id INTEGER NOT NULL REFERENCES suggestions(id), 
+        PRIMARY KEY (keyword_prefix, keyword_suffix, suggestion_id) );


Suggested change

PRIMARY KEY (keyword_prefix, keyword_suffix, suggestion_id) );

PRIMARY KEY (keyword_prefix, keyword_suffix, suggestion_id)

);

Style nit.

linabutler · 2023-10-03T17:11:31Z

components/suggest/src/db.rs

+                let (keyword_prefix, keyword_suffix) = match keyword.find(' ') {
+                    Some(index) => {
+                        let (prefix, suffix) = keyword.split_at(index);
+                        (prefix.to_string(), suffix[1..].to_string())
+                    }
+                    None => (keyword.clone(), String::new()),
+                };


Let's lean on the Rust standard library here!

Suggested change

let (keyword_prefix, keyword_suffix) = match keyword.find(' ') {

Some(index) => {

let (prefix, suffix) = keyword.split_at(index);

(prefix.to_string(), suffix[1..].to_string())

}

None => (keyword.clone(), String::new()),

};

let (keyword_prefix, keyword_suffix) = keyword.split_once(' ').unwrap_or_else(|| (keyword, ""));

split_once will do the checking and slicing for you, and returns references, so we can also avoid the extra copies with to_string() and clone().

These loops also look mostly identical! We can use some of Rust's iterator methods to cut down on the repetition:

for ((rank, keyword), confidence) in suggestion .high_confidence_keywords .iter() .enumerate() .zip(std::iter::repeat(PocketKeywordConfidence::High)) .chain( suggestion .low_confidence_keywords .iter() .enumerate() .zip(std::iter::repeat(PocketKeywordConfidence::Low)), ) { // ... }

std::iter::repeat returns an infinite iterator, and zip-ping another iterator with it will combine elements from both of them, and stop as soon as one of the iterators is exhausted.

😵 wow this is fancy

components/suggest/src/store.rs

linabutler · 2023-10-03T17:26:06Z

components/suggest/src/db.rs

+                let (keyword_prefix, keyword_suffix) = match keyword.find(' ') {
+                    Some(index) => {
+                        let (prefix, suffix) = keyword.split_at(index);
+                        (prefix.to_string(), suffix[1..].to_string())
+                    }
+                    None => (keyword.clone(), String::new()),
+                };


Same comment here about using split_once.

ncloudioj · 2023-10-03T18:42:12Z

components/suggest/src/db.rs

+                                    PocketKeywordConfidence::Low => row.get::<_, String>("keyword_suffix")?.starts_with(&keyword_suffix),
+                                    PocketKeywordConfidence::High => row.get::<_, String>("keyword_suffix")? == keyword_suffix,
+                                };
+                                if suffixes_match{


Suggested change

if suffixes_match{

if suffixes_match {

components/suggest/src/db.rs

ncloudioj · 2023-10-03T18:54:27Z

components/suggest/src/db.rs

+        let (keyword_prefix, keyword_suffix) = match keyword.find(' ') {
+            Some(index) => {
+                let (prefix, suffix) = keyword.split_at(index);
+                (prefix.to_string(), suffix[1..].to_string())
+            }
+            None => (keyword.to_string(), String::new()),
+        };


This block was used three times in this module, you can extract this as a helper function using Lina's suggestion.

ncloudioj · 2023-10-03T19:08:30Z

components/suggest/src/db.rs

+            None => (keyword.to_string(), String::new()),
+        };
+        Ok(self.conn.query_rows_and_then_cached(
+            "SELECT s.id, k.rank, s.title, s.url, s.provider, -1 as confidence_type


I'd recommend using NULL instead of -1 for the missing field. Also, how about we move this field to pocket_custom_details as it's Pocket-specific? We will need to do more in the handler, but I think it's still better than introducing a dummy column to the suggestion table.

WDYT?

sooo confidence is sits in the pocket_keywords table, the -1 is used as a dummy value cause UNION requires the same number of columns. Is there a way to smush the two results together without having the same number of columns?

Right, it's in the pocket_keywords table. If we don't include it here, we'd have to do another lookup against the keywords table. NVM, let's use NULL as the dummy value then.

ncloudioj · 2023-10-03T19:13:03Z

components/suggest/src/db.rs

+                    SuggestionProvider::Pocket => {
+                        let confidence_type = row.get("confidence_type")?;
+                        self.conn.query_row_and_then(
+                            "SELECT keyword_suffix FROM pocket_keywords WHERE suggestion_id = :suggestion_id AND keyword_prefix = :keyword_prefix", 


Suggested change

"SELECT keyword_suffix FROM pocket_keywords WHERE suggestion_id = :suggestion_id AND keyword_prefix = :keyword_prefix",

"SELECT k.keyword_suffix FROM pocket_keywords k

WHERE k.suggestion_id = :suggestion_id

AND k.keyword_prefix = :keyword_prefix",

components/suggest/src/pocket_confidence.rs

components/suggest/src/store.rs

components/suggest/src/db.rs

linabutler

This is looking great, thanks for folding in all those changes!

components/suggest/src/db.rs

components/suggest/src/schema.rs

components/suggest/src/store.rs

ncloudioj

Looks great, thanks for cranking it out!
r=nanj with a few nits and comments for discussion.

components/suggest/src/pocket.rs

ncloudioj · 2023-10-04T01:32:56Z

components/suggest/src/db.rs

-        self.conn.query_rows_and_then_cached(
-            "SELECT s.id, k.rank, s.title, s.url, s.provider
+        let (keyword_prefix, keyword_suffix) = split_keyword(keyword);
+        Ok(self.conn.query_rows_and_then_cached(


Had to rotate my monitor to read the whole function body ;-). Looks like we can extract some more helper functions from the pattern match expression below to make it more readable. Feel free to do it with a follwup.

ncloudioj · 2023-10-04T01:58:35Z

components/suggest/src/db.rs

             FROM suggestions s
             JOIN keywords k ON k.suggestion_id = s.id
             WHERE k.keyword = :keyword
-             LIMIT 1",
+             UNION ALL
+             SELECT s.id, k.rank, s.title, s.url, s.provider, k.confidence, k.keyword_suffix


Note that for Pocket suggestions, it will still try loading keywords from the keywords table (line 149), although there should be none. Seems like keywords loading can be skipped for AMO and Pocket suggestions.

Yes, let's get a follow-up on file to let callers specify which providers they want—we can use that to optimize which query we run, and on mobile to restrict suggestions to AMP and Wikipedia.

ncloudioj · 2023-10-04T02:07:41Z

components/suggest/src/db.rs

-            },
-        )
+            }
+        )?.into_iter().flatten().collect())


Discussion: we could provide some extra convenience by sorting the returning suggestions by certain criteria, e.g. by score or rank. For example, if a keyword matches both the high confidence and the low confidence keywords for a Pocket suggestion, it'll be nice to put the high confidence one in the front.

The pull request has been modified, dismissing previous reviews.

ncloudioj · 2023-10-04T13:55:09Z

components/suggest/src/pocket.rs

 /// assert_eq!(split_keyword("foo"), ("foo", ""));
 /// assert_eq!(split_keyword("foo bar baz"), ("foo", "bar baz"));
-/// ```


Oh, those backticks are needed for cargo doc to render inline snippets properly.

w/o:

w/:

@tiftran Did you happen to get a build failure with the backticks? 😅

The triple backticks will also compile and run the code between them as a "doctest". Under the hood, rustdoc builds each doctest as a separate program, so it has to import any functions and types that it uses, like this:

/// ``` /// # use suggest::pocket::split_keyword; /// assert_eq!(split_keyword("foo"), ("foo", "")); /// assert_eq!(split_keyword("foo bar baz"), ("foo", "bar baz")); /// ```

(The # hides the import from the rendered snippet).

But that runs into another issue: the pocket module is private, so the doctest can't use it. It's called out (though a bit buried) in this section:

Note that they will still link against only the public items of your crate; if you need to test private items, you need to write a unit test.

We have a few options to make this work:

Add the ignore attribute to the code block.

Make this module public in lib.rs, with pub mod pocket;, and add the # use above. It does mean that our Rust component's public API is different than the UniFFIed API—but, given that the Rust component is never consumed directly by other Rust code, I think that's okay.

Rewrite this doctest as a unit test. This is what keyword.rs and suggestion.rs does.

components/suggest/src/schema.rs

linabutler

Wonderful! r+wc from me, too!

The pull request has been modified, dismissing previous reviews.

components/suggest/src/pocket.rs

components/suggest/src/schema.rs

components/suggest/src/store.rs

linabutler

Thanks all—they're all ingested correctly from Remote Settings now! 🎉

Please fold in @ncloudioj's last suggestion, squash, and let's get this landed!

The pull request has been modified, dismissing previous reviews.

tiftran requested a review from linabutler September 25, 2023 18:56

tiftran marked this pull request as draft September 25, 2023 20:02

tiftran force-pushed the pocket-suggestions branch 2 times, most recently from 28c7e12 to 45cd72e Compare September 26, 2023 06:42

ncloudioj reviewed Sep 26, 2023

View reviewed changes

0c0w3 mentioned this pull request Sep 26, 2023

feat: modify csv uploader job to remove dupe keywords mozilla-services/merino-py#426

Merged

5 tasks

ncloudioj reviewed Sep 27, 2023

View reviewed changes

tiftran force-pushed the pocket-suggestions branch 2 times, most recently from 51e89e2 to 528868e Compare October 3, 2023 06:56

tiftran marked this pull request as ready for review October 3, 2023 16:41

linabutler reviewed Oct 3, 2023

View reviewed changes

ncloudioj reviewed Oct 3, 2023

View reviewed changes

components/suggest/src/db.rs Outdated Show resolved Hide resolved

tiftran force-pushed the pocket-suggestions branch from e3879bc to 33d3470 Compare October 3, 2023 22:32

linabutler reviewed Oct 3, 2023

View reviewed changes

ncloudioj previously approved these changes Oct 4, 2023

View reviewed changes

ncloudioj reviewed Oct 4, 2023

View reviewed changes

linabutler reviewed Oct 4, 2023

View reviewed changes

components/suggest/src/schema.rs Outdated Show resolved Hide resolved

linabutler previously approved these changes Oct 4, 2023

View reviewed changes

tiftran force-pushed the pocket-suggestions branch from 1df5e5a to 2e27f25 Compare October 4, 2023 19:19

linabutler reviewed Oct 4, 2023

View reviewed changes

components/suggest/src/pocket.rs Outdated Show resolved Hide resolved

linabutler reviewed Oct 4, 2023

View reviewed changes

components/suggest/src/schema.rs Show resolved Hide resolved

0c0w3 mentioned this pull request Oct 5, 2023

feat: modify csv uploader job to remove low keywords that dupe high keywords mozilla-services/merino-py#434

Merged

5 tasks

ncloudioj previously approved these changes Oct 5, 2023

View reviewed changes

components/suggest/src/store.rs Outdated Show resolved Hide resolved

linabutler previously approved these changes Oct 5, 2023

View reviewed changes

[DISCO-2503] Suggest: Pocket suggestion ingestion

0ec35cc

tiftran force-pushed the pocket-suggestions branch from 2a0dc4b to 0ec35cc Compare October 5, 2023 17:33

linabutler approved these changes Oct 5, 2023

View reviewed changes

linabutler added this pull request to the merge queue Oct 5, 2023

Merged via the queue into main with commit 9308fff Oct 5, 2023
16 checks passed

linabutler deleted the pocket-suggestions branch October 5, 2023 19:37

	confidence_type INTEGER NOT NULL,
	confidence INTEGER NOT NULL,
	rank INTEGER NOT NULL

	PRIMARY KEY (keyword_prefix, keyword_suffix, suggestion_id) );
	PRIMARY KEY (keyword_prefix, keyword_suffix, suggestion_id)
	);

-                            "SELECT keyword_suffix FROM pocket_keywords WHERE suggestion_id = :suggestion_id AND keyword_prefix = :keyword_prefix",
+                            "SELECT k.keyword_suffix FROM pocket_keywords k
+                             WHERE k.suggestion_id = :suggestion_id
+                             AND k.keyword_prefix = :keyword_prefix",

[DISCO-2503] Suggest: Pocket suggestion ingestion #5841

[DISCO-2503] Suggest: Pocket suggestion ingestion #5841

Conversation

tiftran commented Sep 25, 2023

Pull Request checklist

codecov-commenter commented Sep 25, 2023 • edited Loading

Codecov Report

linabutler commented Sep 26, 2023

ncloudioj commented Sep 26, 2023

ncloudioj left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

linabutler Sep 27, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

0c0w3 commented Sep 26, 2023

0c0w3 commented Sep 27, 2023

ncloudioj commented Sep 27, 2023

0c0w3 commented Sep 27, 2023 • edited Loading

ncloudioj commented Sep 27, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

linabutler left a comment

Choose a reason for hiding this comment

linabutler Oct 3, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

linabutler left a comment

Choose a reason for hiding this comment

ncloudioj left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

linabutler left a comment

Choose a reason for hiding this comment

linabutler left a comment

Choose a reason for hiding this comment

codecov-commenter commented Sep 25, 2023 •

edited

Loading

linabutler Sep 27, 2023 •

edited

Loading

0c0w3 commented Sep 27, 2023 •

edited

Loading

linabutler Oct 3, 2023 •

edited

Loading