Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISCO-2503] Suggest: Pocket suggestion ingestion #5841

Merged
merged 1 commit into from
Oct 5, 2023

Conversation

tiftran
Copy link
Contributor

@tiftran tiftran commented Sep 25, 2023

Pull Request checklist

  • Breaking changes: This PR follows our breaking change policy
    • This PR follows the breaking change policy:
      • This PR has no breaking API changes, or
      • There are corresponding PRs for our consumer applications that resolve the breaking changes and have been approved
  • Quality: This PR builds and tests run cleanly
    • Note:
      • For changes that need extra cross-platform testing, consider adding [ci full] to the PR title.
      • If this pull request includes a breaking change, consider cutting a new release after merging.
  • Tests: This PR includes thorough tests or an explanation of why it does not
  • Changelog: This PR includes a changelog entry in CHANGELOG.md or an explanation of why it does not need one
    • Any breaking changes to Swift or Kotlin binding APIs are noted explicitly
  • Dependencies: This PR follows our dependency management guidelines
    • Any new dependencies are accompanied by a summary of the due dilligence applied in selecting them.

Branch builds: add [firefox-android: branch-name] to the PR title.

@tiftran tiftran requested a review from linabutler September 25, 2023 18:56
@codecov-commenter
Copy link

codecov-commenter commented Sep 25, 2023

Codecov Report

Attention: 104 lines in your changes are missing coverage. Please review.

Comparison is base (399fe73) 36.91% compared to head (2a0dc4b) 36.83%.
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5841      +/-   ##
==========================================
- Coverage   36.91%   36.83%   -0.08%     
==========================================
  Files         346      347       +1     
  Lines       33430    33501      +71     
==========================================
  Hits        12340    12340              
- Misses      21090    21161      +71     
Files Coverage Δ
components/suggest/src/lib.rs 0.00% <ø> (ø)
components/suggest/src/provider.rs 0.00% <0.00%> (ø)
components/suggest/src/schema.rs 0.00% <0.00%> (ø)
components/suggest/src/suggestion.rs 0.00% <0.00%> (ø)
components/suggest/src/rs.rs 0.00% <0.00%> (ø)
components/suggest/src/pocket.rs 0.00% <0.00%> (ø)
components/suggest/src/store.rs 0.00% <0.00%> (ø)
components/suggest/src/db.rs 0.00% <0.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tiftran tiftran marked this pull request as draft September 25, 2023 20:02
@tiftran tiftran force-pushed the pocket-suggestions branch 2 times, most recently from 28c7e12 to 45cd72e Compare September 26, 2023 06:42
@linabutler
Copy link
Contributor

Thank you so much, Tif—and thanks for making the changes since yesterday. Gentle reminder to bump the schema version and migration function in schema.rs again 😊 I'll do another review pass later today!

I tested your patch out against the real dataset in Remote Settings, and saw some unique constraint errors on both pocket_high_confidence_keywords and pocket_low_confidence_keywords. There are only a handful of duplicates, but they are there:

sqlite> select keyword, group_concat(suggestion_id, ',') from pocket_high_confidence_keywords group by keyword having count(*) > 1;
african billionaires|71,71
april fools day|7,311
arthur brooks column|24,24
arthur c brooks|24,24
arthur c brooks column|24,24,24,24
budget-friendly sustainability|247,247
cheesy corn dip|26,26
dallas cowboy cheerleaders|111,111
festive vegan dishes|17,275
how to focus|119,119
how to get clearer skin|231,231
how to recognize burnout|232,232
how to write anything|168,168
making new friends as adults|251,251
skin maintenance|231,231
sustainable habits|247,247
the atlantic arthur brooks|24,24

(I dropped the PRIMARY KEY definition from both tables to get a database that I could query to build that list).

The ones with different suggestion IDs—april fools day, festive vegan dishes—shouldn't be a problem, because the Pocket keywords tables have PRIMARY KEY (keyword, suggestion_id), so it's OK for different suggestions to have the same keyword. We might need to do an ORDER BY score in the SELECT for fetch_by_keyword to break ties, though, or drop the LIMIT 1.

But the ones with multiple copies of the same suggestion ID are more interesting. It really looks like their keywords arrays have duplicates!

DownloadedPocketSuggestion {
  description: "The best advice for finding your people, staying close, and getting through the hard parts.",
  url: "https://getpocket.com/collections/the-grown-ups-guide-to-making-and-keeping-friends",
  title: "The Grown-Up’s Guide to Making and Keeping Friends",
  low_confidence_keywords: ["what to do if you're lonely", "dealing with loneliness", "lonely what to do", "what to do lonely"],
  high_confidence_keywords: ["how to make friends as adults", "making new friends as adults", "how to make friends as an adult", "making new friends as adults", "making new friends as an adult", "how to make friends", "making friendships", "keeping friendships", "friendship maintenance", "adult social networks", "new friends", "adult social life", "making friends", "making new friends", "how to make new friends", "how to keep friends", "how to maintain friendships", "maintain friendships", "maintaining friendships", "adult friendships", "socializing tips", "nurturing friendships", "friendship longevity", "friendships in adulthood", "grown up friendships", "grown-up friendships"],
  score: 0.25 }

We can fix that by using either a HashSet<String> instead of a Vec<String> for the keywords lists in DownloadedPocketSuggestion, or an INSERT OR IGNORE when we insert them into the tables—but I wonder if we should take care of this in quicksuggest-rs, during ingestion instead. WDYT?

/cc @ncloudioj @0c0w3 for visibility, too!

@ncloudioj
Copy link
Member

but I wonder if we should take care of this in quicksuggest-rs, during ingestion instead. WDYT?

Yes, we can easily dedup keywords for each suggestion at the ingestion phase over here (BTW, Pocket suggestions are ingested by this Merino job instead of quickssugest-rs). Do we also want to do deduping among low and high-confidence keywords for each suggestion?

@0c0w3 What do you think?

Copy link
Member

@ncloudioj ncloudioj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, added a few comments for your consideration.

@@ -123,14 +124,29 @@ impl<'a> SuggestDao<'a> {

/// Fetches suggestions that match the given keyword from the database.
pub fn fetch_by_keyword(&self, keyword: &str) -> Result<Vec<Suggestion>> {
let pocket_keyword = match keyword.contains(' ') {
true => keyword.to_string(),
false => keyword.to_owned() + " ",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads if the keyword doesn't have whitespace in it, then we append one to the end. Hmm, why do we need this extra trailing space for Pocket keywords?

Also, a nitpick that it's a little inconsistent to use to_string() and to_owned() in both arms. Maybe use Cow instead:

       
    use std::borrow::Cow;

    let mut pocket_keyword = Cow::from(keyword);
    if !keyword.contains(' ') {
        pocket_keyword.to_mut().push(' ');
    }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With Drew's comment, now I understand what's going on here. I'd recommend to add some comments to outline how keyword lookup is done for each suggestion type as to me it's much harder to fathom that by looking at the SQL query.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh, I really like Nan's suggestion of using Cow that way!

But I think that this logic won't handle single-word low-confidence keywords correctly. For example, I don't think that workaholism will match the Pocket suggestion in the test now—but it should. Is that right?

IIUC, another way to write our algorithm for matching low-confidence keywords is:

  1. The head (first word) of the query must be the same as the low-confidence keyword's head.
  2. The tail (anything after the first word) of the query—if any—should be a prefix of the keyword's tail.

So all these queries should match the Pocket suggestion: workaholism, soft (first word of soft life), soft l, soft li (first words match; l and li are prefixes of life).

And these shouldn't, because their first words don't match: work, sof, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way we could make this work is to store the head and tail in separate columns when we insert them into the table. Then, when we fetch suggestions, we could split the user's query the same way, and match the head and tail prefix.

Here's a quick sketch of how that all could work! Let's say we have two suggestions, and all their low-confidence keywords start with breakfast:

title: "How to Build a Better Breakfast"
keywords: ["breakfast"]

title: Your New Favorite Breakfast Sandwich
keywords: ["breakfast crunch wrap", "breakfast torta"]

If we stored those suggestions and keywords in a schema like this:

CREATE TABLE suggestions(id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE keywords(keyword_head TEXT NOT NULL, keyword_tail TEXT NOT NULL, suggestion_id INTEGER NOT NULL REFERENCES suggestions(id), PRIMARY KEY (keyword_head, keyword_tail));

INSERT INTO suggestions(id, title) VALUES(1, 'How to Build a Better Breakfast');
INSERT INTO suggestions(id, title) VALUES(2, 'Your New Favorite Breakfast Sandwich');

INSERT INTO keywords(keyword_head, keyword_tail, suggestion_id) VALUES('breakfast', '', 1);
INSERT INTO keywords(keyword_head, keyword_tail, suggestion_id) VALUES('breakfast', 'crunch wrap', 2);
INSERT INTO keywords(keyword_head, keyword_tail, suggestion_id) VALUES('breakfast', 'torta', 2);

Here's how a query plan for breakfast would look:

sqlite> SELECT s.id, s.title FROM suggestions s JOIN keywords k ON k.suggestion_id = s.id WHERE k.keyword_head = 'breakfast' AND k.keyword_tail BETWEEN '' AND '' || x'ffff' GROUP BY s.id, s.title;
QUERY PLAN
|--SEARCH k USING INDEX sqlite_autoindex_keywords_1 (keyword_head=? AND keyword_tail>? AND keyword_tail<?)
|--SEARCH s USING INTEGER PRIMARY KEY (rowid=?)
`--USE TEMP B-TREE FOR GROUP BY
1|How to Build a Better Breakfast
2|Your New Favorite Breakfast Sandwich

(To play around with that schema in the sqlite3 shell, I saved the schema into a file, then ran sqlite3 -init file.sql to load it).

Since Your New Favorite Breakfast Sandwich has two keywords that start with breakfast, we need a GROUP BY or a DISTINCT to filter out duplicates—or we'll get two rows for it, one for each of its keywords. But the search over keywords can use all columns of the primary key (that's the sqlite_autoindex_keywords_1 automatic index), so the most expensive part of the query is still pretty efficient. By the time we build the temporary b-tree for the GROUP BY, we should have reduced the number of rows to a small handful.

The query plan for breakfast cru looks the same, and finds just the one suggestion:

sqlite> SELECT s.id, s.title FROM suggestions s JOIN keywords k ON k.suggestion_id = s.id WHERE k.keyword_head = 'breakfast' AND k.keyword_tail BETWEEN 'cru' AND 'cru' || x'ffff' GROUP BY s.id, s.title;
QUERY PLAN
|--SEARCH k USING INDEX sqlite_autoindex_keywords_1 (keyword_head=? AND keyword_tail>? AND keyword_tail<?)
|--SEARCH s USING INTEGER PRIMARY KEY (rowid=?)
`--USE TEMP B-TREE FOR GROUP BY
2|Your New Favorite Breakfast Sandwich

And so does breakfast torta:

sqlite> SELECT s.id, s.title FROM suggestions s JOIN keywords k ON k.suggestion_id = s.id WHERE k.keyword_head = 'breakfast' AND k.keyword_tail BETWEEN 'torta' AND 'torta' || x'ffff' GROUP BY s.id, s.title;
QUERY PLAN
|--SEARCH k USING INDEX sqlite_autoindex_keywords_1 (keyword_head=? AND keyword_tail>? AND keyword_tail<?)
|--SEARCH s USING INTEGER PRIMARY KEY (rowid=?)
`--USE TEMP B-TREE FOR GROUP BY
2|Your New Favorite Breakfast Sandwich

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, that's some really cool SQL magic there! 👏

I'd like to offer another approach: How about we move this qualifying logic over to the row handler closure?

We currently only fetch data from other tables and then assemble the suggestion accordingly in that closure. I think we can also use that as the post-processor for each candidate. Specifically, the post-processor can transform, filter, and augment each row if needed. Two benefits I can see here:

  • In certain cases, it's much easier to do stuff in native code than in SQL
  • To keep those keywords table schema lean and generic. Imagine if we need to change the matching logic for the low-confidence keywords, we can still do prefix queries in SQL and apply more specific matching logic in the post-processor. No change is needed to the schema!

WDYT?

Copy link
Contributor

@linabutler linabutler Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we move this qualifying logic over to the row handler closure?

Yes!

Filtering in the row handler is definitely a great option, with a small result set. For example, if our query returns 5 rows, and we filter all of them out—that's OK, because we only had to scan and read 5 rows. But if our query returns hundreds or thousands of rows, filtering in the row handler isn't as efficient—we still end up scanning and reading in all those rows, and then filtering them out in code.

For low-confidence Pocket suggestions, this is a bit tricky to do naively, because we want the whole first word to match, and only match prefixes for subsequent words. If we wanted to keep the schema super generic, with a single keyword column, I think we'd need to do a prefix match unconditionally. So, if the user searches for b or br, we'd return suggestions for all keywords that start with b or br—and then filter out all of them, because b and br aren't whole words.

We could combine the approaches, through—a less generic schema, but also doing more work in Rust code instead of SQL:

  1. Split each keyword into keyword_head and keyword_tail (I'm realizing keyword_prefix and keyword_suffix might actually be better names, so I'll use that instead of "head" and "tail" now! 😅), like I suggested, and...
  2. Filter on just keyword_prefix in the SQL query, then filter on keyword_suffix in the row handler, like you suggested!

I think this would also line up nicely with @0c0w3's suggestion to use a single table to store the keywords, and have a column for whether it's low or high-confidence—and doing the filtering after.

So we'd have a schema like:

CREATE TABLE pocket_keywords(
    keyword_prefix TEXT NOT NULL,
    keyword_suffix TEXT NOT NULL DEFAULT "",
    confidence_type INTEGER NOT NULL,
    suggestion_id INTEGER NOT NULL REFERENCES suggestions(id),
    PRIMARY KEY (keyword_prefix, keyword_suffix, suggestion_id)
);

With the data from Drew's example like:

keyword_prefix | keyword_suffix | confidence_type
---------------+----------------+----------------
esther         | the wonder pig | LOW
wildlife       | stories        | HIGH

(Assuming LOW and HIGH are integer constants there).

And then, in the row handler, we'd have logic like:

let suffixes_match = match row.get("confidence_type")? {
    LOW => row.get("keyword_suffix")?.starts_with(user_keyword_suffix),
    HIGH => row.get("keyword_suffix")? == user_keyword_suffix,
};

Let's say the user typed in esther th. Splitting the user's query on the first space, the prefix is esther, and the suffix is th. The SQL query for that would look like this:

SELECT s.*, k.keyword_suffix, k.confidence_type
FROM suggestions s
JOIN pocket_keywords k ON k.suggestion_id = s.id
WHERE k.keyword_prefix = 'esther'

That SQL query would return rows for all Pocket suggestions with keywords whose first word is esther. Then, the logic in our row handler would see that it's a low-confidence keyword, and take the LOW branch. the wonder pig starts with th, so it's a match for the user's query.

Now let's say they typed in wildlife stories. The prefix is wildlife; the suffix is stories. The SQL query for that looks the same as for esther th:

SELECT s.*, k.keyword_suffix, k.confidence_type
FROM suggestions s
JOIN pocket_keywords k ON k.suggestion_id = s.id
WHERE k.keyword_prefix = 'wildlife'

...Then our row handler sees that it's a high-confidence keyword, so it takes the HIGH branch. stories == stories, so it's a match.

Now what if they type just wildlife (no suffix), or wildlife st (suffix = st)? The SQL query for that is the same, and would return the same suggestion:

SELECT s.*, k.keyword_suffix, k.confidence_type
FROM suggestions s
JOIN pocket_keywords k ON k.suggestion_id = s.id
WHERE k.keyword_prefix = 'wildlife'

But then our row handler would see that this is a high-confidence keyword, and the suffixes don't match: st != stories. So it filters out the suggestion.

I'm really liking this hybrid approach: it lets us use a single table for Pocket keywords, it avoids fancy suffix matching and conditionals in SQL, and it still gets us great query performance, and limits the number of rows we have to scan and read.

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, that'll work like a charm!

Comment on lines 369 to 381
dao.drop_suggestions(&record_id)?;

// Ingest (or re-ingest) all suggestions in the
// attachment.
dao.insert_pocket_suggestions(&record_id, attachment.suggestions())?;

// Remove this record's ID from the list of unparsable
// records, since we understand it now.
dao.drop_unparsable_record_id(&record_id)?;

// Advance the last fetch time, so that we can resume
// fetching after this record if we're interrupted.
dao.put_last_ingest_if_newer(record.last_modified)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this sequence of (drop-insert-drop-put) has been used for all the suggestion types, shall we wrap them up into another helper in DAO to keep ingest_records() DRY?

@@ -43,6 +57,12 @@ pub const SQL: &str = "
ON DELETE CASCADE
);

CREATE TABLE pocket_custom_details(
suggestion_id INTEGER PRIMARY KEY REFERENCES suggestions(id) ON DELETE CASCADE,
description TEXT NOT NULL,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like description is not included in the Pocket suggestion & the udl below, also looks like Firefox doesn't use it for now. Shall we exclude it from the schema here as well? We can always add it back later if needed.

@0c0w3
Copy link
Contributor

0c0w3 commented Sep 26, 2023

Wow it's great to see so much progress being made on the Rust implementation.

Yeah the Pocket keywords were hand-edited by a group of people and it's entirely possible some keyword lists contain duplicates. I agree we should remove the duplicates at ingestion time, which we can do as you said, @ncloudioj. I can take care of that.

Do we also want to do deduping among low and high-confidence keywords for each suggestion?

It looks like this isn't a problem for the Rust implementation (please correct me if I'm wrong), so I would vote for not worrying about it. Arguably this type of duplication isn't wrong; imagine if users could turn off high-confidence keywords and the related UI but keep the low-confidence ones, for example.

@0c0w3
Copy link
Contributor

0c0w3 commented Sep 27, 2023

Yeah the Pocket keywords were hand-edited by a group of people and it's entirely possible some keyword lists contain duplicates. I agree we should remove the duplicates at ingestion time, which we can do as you said, @ncloudioj. I can take care of that.

Should be fixed now.

@ncloudioj
Copy link
Member

Arguably this type of duplication isn't wrong; imagine if users could turn off high-confidence keywords and the related UI but keep the low-confidence ones, for example.

Sounds good, thanks for confirming! @0c0w3 Could you briefly outline how low/high-confidence keywords work in Firefox?

@0c0w3
Copy link
Contributor

0c0w3 commented Sep 27, 2023

Sure, Tif and I were chatting about this last week and here's the gist:

When the user triggers a suggestion by typing a high confidence keyword, the suggestion is shown as a top pick. When they trigger it using a low confidence keyword, it’s shown at the bottom in the usual Suggest position. Also, high confidence keywords must be typed in full, but Firefox will start matching on a low confidence keyword after its first word is typed.

References:

  • Product spec
  • Desktop impl
  • Figma
  • To enable on desktop (118+): browser.urlbar.pocket.featureGate = true and browser.urlbar.bestMatch.enabled = true (also make sure Suggest is enabled, it is by default if you’re in the U.S.)

Example:

  • High: ["wildlife stories"]
  • Low: ["esther the wonder pig"]
Query                   | Match
------------------------+------
esthe                   | none
esther                  | low
esther (trailing space) | low
esther t                | low
esther th               | low
(etc.)                  | low
esther the wonder pig   | low
wildlif                 | none
wildlife                | none
(etc.)                  | none
wildlife storie         | none
wildlife stories        | high

@ncloudioj
Copy link
Member

Ah, I see. Thanks for sharing that here! And now I understand why the keyword lookup queries were made that way in this PR.

@@ -123,14 +124,29 @@ impl<'a> SuggestDao<'a> {

/// Fetches suggestions that match the given keyword from the database.
pub fn fetch_by_keyword(&self, keyword: &str) -> Result<Vec<Suggestion>> {
let pocket_keyword = match keyword.contains(' ') {
true => keyword.to_string(),
false => keyword.to_owned() + " ",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With Drew's comment, now I understand what's going on here. I'd recommend to add some comments to outline how keyword lookup is done for each suggestion type as to me it's much harder to fathom that by looking at the SQL query.

FROM suggestions s
JOIN keywords k ON k.suggestion_id = s.id
WHERE k.keyword = :keyword
UNION
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: in SQL, UNION will remove duplicates among all the SELECT rows, whereas UNION ALL will not, so UNION ALL should run faster here as there shouldn't be any duplicates in this case.

FROM suggestions s
JOIN pocket_high_confidence_keywords k ON k.suggestion_id = s.id
WHERE k.keyword = :keyword
UNION
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it already finds a record from the high confidence keywords table, doing another lookup against the low-confidence keywords table is unnecessary. Looks like it's not easy to inject that conditional query logic inside this query, guess we will live with that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bet there's a way to do it, maybe:

UNION
SELECT IFNULL(
  (
    SELECT s.id, k.rank, s.title, s.url, s.provider, true as is_top_pick
    FROM suggestions s
    JOIN pocket_high_confidence_keywords k ON k.suggestion_id = s.id
    WHERE k.keyword = :keyword
  ),
  (
    SELECT s.id, k.rank, s.title, s.url, s.provider, false as is_top_pick
    FROM suggestions s
    JOIN pocket_low_confidence_keywords k ON k.suggestion_id = s.id
    WHERE k.keyword BETWEEN :pocket_keyword AND :pocket_keyword || x'ffff'
  )
)

The high and low keywords could also be stored in the same table, so you do one lookup to get the matching keyword and its type, whether it's high or low, and then a join to get the suggestion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, but unfortunately, both IFNULL and COALESCE in SQLite only take scalars as the arguments, selecting multiple columns will lead to a syntax error.

SELECT s.id, k.rank, s.title, s.url, s.provider, false as is_top_pick
FROM suggestions s
JOIN pocket_low_confidence_keywords k ON k.suggestion_id = s.id
WHERE k.keyword BETWEEN :pocket_keyword AND :pocket_keyword || x'ffff'
LIMIT 1",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have multiple keyword sources, maybe it's time to return all the hits back (sorted by score) to the consumer instead of 1. Even returning the one with the highest score could lead to some surprises, e.g. if the user dismissed that suggestion, all other candidates for the same keyword will never get a chance to show.

Also cc @linabutler @0c0w3 for their thoughts.

@tiftran tiftran force-pushed the pocket-suggestions branch 2 times, most recently from 51e89e2 to 528868e Compare October 3, 2023 06:56
@tiftran tiftran marked this pull request as ready for review October 3, 2023 16:41
Copy link
Contributor

@linabutler linabutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tiftran! Putting up a quick round of comments before breakfast; I'll do a more thorough pass later!

components/suggest/src/pocket_confidence.rs Outdated Show resolved Hide resolved
use rusqlite::types::{FromSql, FromSqlError, FromSqlResult, ToSqlOutput, ValueRef};
use rusqlite::{Result as RusqliteResult, ToSql};

/// Classification of Pocket confidence keywords.
Copy link
Contributor

@linabutler linabutler Oct 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's expand this docstring a bit to explain the difference between a high-confidence and low-confidence keyword, so that someone looking at it for the first time (and our future selves!) remember the semantics.

CREATE TABLE pocket_keywords(
keyword_prefix TEXT NOT NULL,
keyword_suffix TEXT NOT NULL DEFAULT '',
confidence_type INTEGER NOT NULL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
confidence_type INTEGER NOT NULL,
confidence INTEGER NOT NULL,
rank INTEGER NOT NULL

Let's rename confidence_type to confidence (to match our other integer enum field, provider), and add rank—for the position of the keyword in the list of keywords.

// Ingest (or re-ingest) all suggestions in the
// attachment.
block(dao, &record_id, attachment.suggestions())?;
// dao.insert_amp_wikipedia_suggestions(&record_id, attachment.suggestions())?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// dao.insert_amp_wikipedia_suggestions(&record_id, attachment.suggestions())?;

Nit: Let's remove this commented-out code.

Comment on lines 360 to 362
// An AMP-Wikipedia record should always have an
// attachment with suggestions. If it doesn't, it's
// malformed, so skip to the next record.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// An AMP-Wikipedia record should always have an
// attachment with suggestions. If it doesn't, it's
// malformed, so skip to the next record.
// A record should always have an attachment with suggestions.
// If it doesn't, it's malformed, so we can't ingest it.

Nit: This isn't specific to AMP-Wikipedia anymore! 🎉

keyword_suffix TEXT NOT NULL DEFAULT '',
confidence_type INTEGER NOT NULL,
suggestion_id INTEGER NOT NULL REFERENCES suggestions(id),
PRIMARY KEY (keyword_prefix, keyword_suffix, suggestion_id) );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
PRIMARY KEY (keyword_prefix, keyword_suffix, suggestion_id) );
PRIMARY KEY (keyword_prefix, keyword_suffix, suggestion_id)
);

Style nit.

Comment on lines 525 to 531
let (keyword_prefix, keyword_suffix) = match keyword.find(' ') {
Some(index) => {
let (prefix, suffix) = keyword.split_at(index);
(prefix.to_string(), suffix[1..].to_string())
}
None => (keyword.clone(), String::new()),
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's lean on the Rust standard library here!

Suggested change
let (keyword_prefix, keyword_suffix) = match keyword.find(' ') {
Some(index) => {
let (prefix, suffix) = keyword.split_at(index);
(prefix.to_string(), suffix[1..].to_string())
}
None => (keyword.clone(), String::new()),
};
let (keyword_prefix, keyword_suffix) = keyword.split_once(' ').unwrap_or_else(|| (keyword, ""));

split_once will do the checking and slicing for you, and returns references, so we can also avoid the extra copies with to_string() and clone().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These loops also look mostly identical! We can use some of Rust's iterator methods to cut down on the repetition:

for ((rank, keyword), confidence) in suggestion
    .high_confidence_keywords
    .iter()
    .enumerate()
    .zip(std::iter::repeat(PocketKeywordConfidence::High))
    .chain(
        suggestion
            .low_confidence_keywords
            .iter()
            .enumerate()
            .zip(std::iter::repeat(PocketKeywordConfidence::Low)),
    )
{
    // ...
}

std::iter::repeat returns an infinite iterator, and zip-ping another iterator with it will combine elements from both of them, and stop as soon as one of the iterators is exhausted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😵 wow this is fancy

components/suggest/src/store.rs Show resolved Hide resolved
Comment on lines 555 to 561
let (keyword_prefix, keyword_suffix) = match keyword.find(' ') {
Some(index) => {
let (prefix, suffix) = keyword.split_at(index);
(prefix.to_string(), suffix[1..].to_string())
}
None => (keyword.clone(), String::new()),
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about using split_once.

PocketKeywordConfidence::Low => row.get::<_, String>("keyword_suffix")?.starts_with(&keyword_suffix),
PocketKeywordConfidence::High => row.get::<_, String>("keyword_suffix")? == keyword_suffix,
};
if suffixes_match{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if suffixes_match{
if suffixes_match {

components/suggest/src/db.rs Outdated Show resolved Hide resolved
Comment on lines 128 to 134
let (keyword_prefix, keyword_suffix) = match keyword.find(' ') {
Some(index) => {
let (prefix, suffix) = keyword.split_at(index);
(prefix.to_string(), suffix[1..].to_string())
}
None => (keyword.to_string(), String::new()),
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block was used three times in this module, you can extract this as a helper function using Lina's suggestion.

None => (keyword.to_string(), String::new()),
};
Ok(self.conn.query_rows_and_then_cached(
"SELECT s.id, k.rank, s.title, s.url, s.provider, -1 as confidence_type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend using NULL instead of -1 for the missing field. Also, how about we move this field to pocket_custom_details as it's Pocket-specific? We will need to do more in the handler, but I think it's still better than introducing a dummy column to the suggestion table.

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sooo confidence is sits in the pocket_keywords table, the -1 is used as a dummy value cause UNION requires the same number of columns. Is there a way to smush the two results together without having the same number of columns?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it's in the pocket_keywords table. If we don't include it here, we'd have to do another lookup against the keywords table. NVM, let's use NULL as the dummy value then.

SuggestionProvider::Pocket => {
let confidence_type = row.get("confidence_type")?;
self.conn.query_row_and_then(
"SELECT keyword_suffix FROM pocket_keywords WHERE suggestion_id = :suggestion_id AND keyword_prefix = :keyword_prefix",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"SELECT keyword_suffix FROM pocket_keywords WHERE suggestion_id = :suggestion_id AND keyword_prefix = :keyword_prefix",
"SELECT k.keyword_suffix FROM pocket_keywords k
WHERE k.suggestion_id = :suggestion_id
AND k.keyword_prefix = :keyword_prefix",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙈

components/suggest/src/pocket_confidence.rs Outdated Show resolved Hide resolved
components/suggest/src/store.rs Outdated Show resolved Hide resolved
@tiftran tiftran force-pushed the pocket-suggestions branch from e3879bc to 33d3470 Compare October 3, 2023 22:32
Copy link
Contributor

@linabutler linabutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great, thanks for folding in all those changes!

components/suggest/src/db.rs Outdated Show resolved Hide resolved
components/suggest/src/db.rs Outdated Show resolved Hide resolved
components/suggest/src/db.rs Outdated Show resolved Hide resolved
components/suggest/src/db.rs Outdated Show resolved Hide resolved
components/suggest/src/db.rs Outdated Show resolved Hide resolved
components/suggest/src/db.rs Outdated Show resolved Hide resolved
components/suggest/src/schema.rs Show resolved Hide resolved
components/suggest/src/store.rs Show resolved Hide resolved
ncloudioj
ncloudioj previously approved these changes Oct 4, 2023
Copy link
Member

@ncloudioj ncloudioj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for cranking it out!
r=nanj with a few nits and comments for discussion.

components/suggest/src/pocket.rs Outdated Show resolved Hide resolved
self.conn.query_rows_and_then_cached(
"SELECT s.id, k.rank, s.title, s.url, s.provider
let (keyword_prefix, keyword_suffix) = split_keyword(keyword);
Ok(self.conn.query_rows_and_then_cached(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to rotate my monitor to read the whole function body ;-). Looks like we can extract some more helper functions from the pattern match expression below to make it more readable. Feel free to do it with a follwup.

FROM suggestions s
JOIN keywords k ON k.suggestion_id = s.id
WHERE k.keyword = :keyword
LIMIT 1",
UNION ALL
SELECT s.id, k.rank, s.title, s.url, s.provider, k.confidence, k.keyword_suffix
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that for Pocket suggestions, it will still try loading keywords from the keywords table (line 149), although there should be none. Seems like keywords loading can be skipped for AMO and Pocket suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's get a follow-up on file to let callers specify which providers they want—we can use that to optimize which query we run, and on mobile to restrict suggestions to AMP and Wikipedia.

},
)
}
)?.into_iter().flatten().collect())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion: we could provide some extra convenience by sorting the returning suggestions by certain criteria, e.g. by score or rank. For example, if a keyword matches both the high confidence and the low confidence keywords for a Pocket suggestion, it'll be nice to put the high confidence one in the front.

@mergify mergify bot dismissed ncloudioj’s stale review October 4, 2023 02:23

The pull request has been modified, dismissing previous reviews.

/// assert_eq!(split_keyword("foo"), ("foo", ""));
/// assert_eq!(split_keyword("foo bar baz"), ("foo", "bar baz"));
/// ```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, those backticks are needed for cargo doc to render inline snippets properly.

w/o:

Screenshot 2023-10-04 at 9 47 25 AM

w/:

Screenshot 2023-10-04 at 9 47 53 AM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tiftran Did you happen to get a build failure with the backticks? 😅

The triple backticks will also compile and run the code between them as a "doctest". Under the hood, rustdoc builds each doctest as a separate program, so it has to import any functions and types that it uses, like this:

/// ```
/// # use suggest::pocket::split_keyword;
/// assert_eq!(split_keyword("foo"), ("foo", ""));
/// assert_eq!(split_keyword("foo bar baz"), ("foo", "bar baz"));
/// ```

(The # hides the import from the rendered snippet).

But that runs into another issue: the pocket module is private, so the doctest can't use it. It's called out (though a bit buried) in this section:

Note that they will still link against only the public items of your crate; if you need to test private items, you need to write a unit test.

We have a few options to make this work:

  1. Add the ignore attribute to the code block.
  2. Make this module public in lib.rs, with pub mod pocket;, and add the # use above. It does mean that our Rust component's public API is different than the UniFFIed API—but, given that the Rust component is never consumed directly by other Rust code, I think that's okay.
  3. Rewrite this doctest as a unit test. This is what keyword.rs and suggestion.rs does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oooh

linabutler
linabutler previously approved these changes Oct 4, 2023
Copy link
Contributor

@linabutler linabutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful! r+wc from me, too! :shipit:

@mergify mergify bot dismissed linabutler’s stale review October 4, 2023 17:25

The pull request has been modified, dismissing previous reviews.

@tiftran tiftran force-pushed the pocket-suggestions branch from 1df5e5a to 2e27f25 Compare October 4, 2023 19:19
ncloudioj
ncloudioj previously approved these changes Oct 5, 2023
components/suggest/src/store.rs Outdated Show resolved Hide resolved
linabutler
linabutler previously approved these changes Oct 5, 2023
Copy link
Contributor

@linabutler linabutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks all—they're all ingested correctly from Remote Settings now! 🎉

Please fold in @ncloudioj's last suggestion, squash, and let's get this landed!

@mergify mergify bot dismissed stale reviews from ncloudioj and linabutler October 5, 2023 17:13

The pull request has been modified, dismissing previous reviews.

@tiftran tiftran force-pushed the pocket-suggestions branch from 2a0dc4b to 0ec35cc Compare October 5, 2023 17:33
@linabutler linabutler added this pull request to the merge queue Oct 5, 2023
Merged via the queue into main with commit 9308fff Oct 5, 2023
16 checks passed
@linabutler linabutler deleted the pocket-suggestions branch October 5, 2023 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants