Bug 1931373 - Add FTS matching data #6531

bendk · 2024-12-18T20:41:29Z

Added extra data to Suggestion::Fakespot to capture how the FTS match was made. The plan is to use this as a facet for our metrics to help us decide how to tune the matching logic (e.g. maybe we should not use stemming, maybe we should require that terms are close together).

Added Suggest CLI flag to print out the full debug repr for suggestions. This provides an easy way to test the new functionality.

Pull Request checklist

Breaking changes: This PR follows our breaking change policy
- This PR follows the breaking change policy:
  - This PR has no breaking API changes, or
  - There are corresponding PRs for our consumer applications that resolve the breaking changes and have been approved
Quality: This PR builds and tests run cleanly
- Note:
  - For changes that need extra cross-platform testing, consider adding [ci full] to the PR title.
  - If this pull request includes a breaking change, consider cutting a new release after merging.
Tests: This PR includes thorough tests or an explanation of why it does not
Changelog: This PR includes a changelog entry in CHANGELOG.md or an explanation of why it does not need one
- Any breaking changes to Swift or Kotlin binding APIs are noted explicitly
Dependencies: This PR follows our dependency management guidelines
- Any new dependencies are accompanied by a summary of the due diligence applied in selecting them.

Branch builds: add [firefox-android: branch-name] to the PR title.

bendk · 2024-12-18T20:47:11Z

I think this is working based on running some tests with the CLI:

I used variations on "running shoe" to check the stemming/prefix matching flags:


> cargo suggest query --fts-match-info "running shoe"

============================== Results  ==============================
* Brooks Running Shoe (http://amazon.com/dp/B0BJ131YJQ) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* Altra Running Shoe (http://amazon.com/dp/B01HNL5KI0) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* New Balance Running Shoe (http://amazon.com/dp/B01CQT3CGG) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* New Balance Trail Running Shoe (http://amazon.com/dp/B0C29H6LSW) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* HOKA ONE ONE Running Shoes (http://amazon.com/dp/B0B14G2MJN) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* WHITIN Barefoot Running Shoes | Minimalist, Zero Drop Sole (http://amazon.com/dp/B0CQX1YVK1) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* Salomon Trail Running Shoes | Gore-tex (http://amazon.com/dp/B0992GHCSX) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* SAGUARO Trail Running Shoes | Walking, Running, Minimalist (http://amazon.com/dp/B084DTQPDW) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* HUMTTO Hiking Shoes | Trail Running, Climbing, Breathable, Non-Slip (http://amazon.com/dp/B0B2VJ4L7P) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* Zappos - Official Site (https://www.zappos.com/?utm_source=admarketplace&utm_medium=sem_a&utm_campaign=Zappos&utm_term=Zappos&utm_content=319154514us46192024121815&mfadid=adm) (with icon)


> cargo suggest query --fts-match-info "running sho"

============================== Results  ==============================
* Brooks Running Shoe (http://amazon.com/dp/B0BJ131YJQ) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* Altra Running Shoe (http://amazon.com/dp/B01HNL5KI0) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* HOKA ONE ONE Running Shoes (http://amazon.com/dp/B0B14G2MJN) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* New Balance Running Shoe (http://amazon.com/dp/B01CQT3CGG) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* WHITIN Barefoot Running Shoes | Minimalist, Zero Drop Sole (http://amazon.com/dp/B0CQX1YVK1) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* Salomon Trail Running Shoes | Gore-tex (http://amazon.com/dp/B0992GHCSX) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* SAGUARO Trail Running Shoes | Walking, Running, Minimalist (http://amazon.com/dp/B084DTQPDW) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* New Balance Trail Running Shoe (http://amazon.com/dp/B0C29H6LSW) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* HUMTTO Hiking Shoes | Trail Running, Climbing, Breathable, Non-Slip (http://amazon.com/dp/B0B2VJ4L7P) (with icon)
   FtsMatchInfo { prefix: true, stemming: false, term_distance: Near }
* Zappos - Official Site (https://www.zappos.com/?utm_source=admarketplace&utm_medium=sem_a&utm_campaign=Zappos&utm_term=Zappos&utm_content=319154514us46192024121815&mfadid=adm) (with icon)

> cargo suggest query --fts-match-info "run shoe"

============================== Results  ==============================
* Brooks Running Shoe (http://amazon.com/dp/B0BJ131YJQ) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* Altra Running Shoe (http://amazon.com/dp/B01HNL5KI0) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* New Balance Running Shoe (http://amazon.com/dp/B01CQT3CGG) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* New Balance Trail Running Shoe (http://amazon.com/dp/B0C29H6LSW) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* WHITIN Barefoot Running Shoes | Minimalist, Zero Drop Sole (http://amazon.com/dp/B0CQX1YVK1) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* Salomon Trail Running Shoes | Gore-tex (http://amazon.com/dp/B0992GHCSX) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* HOKA ONE ONE Running Shoes (http://amazon.com/dp/B0B14G2MJN) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* SAGUARO Trail Running Shoes | Walking, Running, Minimalist (http://amazon.com/dp/B084DTQPDW) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }
* HUMTTO Hiking Shoes | Trail Running, Climbing, Breathable, Non-Slip (http://amazon.com/dp/B0B2VJ4L7P) (with icon)
   FtsMatchInfo { prefix: false, stemming: true, term_distance: Near }


 > cargo suggest query --fts-match-info "run sho"
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.13s
     Running `target/debug/examples-suggest-cli query --fts-match-info 'run sho'`

============================== Results  ==============================
* Brooks Running Shoe (http://amazon.com/dp/B0BJ131YJQ) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* WHITIN Barefoot Running Shoes | Minimalist, Zero Drop Sole (http://amazon.com/dp/B0CQX1YVK1) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* Salomon Trail Running Shoes | Gore-tex (http://amazon.com/dp/B0992GHCSX) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* Altra Running Shoe (http://amazon.com/dp/B01HNL5KI0) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* HOKA ONE ONE Running Shoes (http://amazon.com/dp/B0B14G2MJN) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* SAGUARO Trail Running Shoes | Walking, Running, Minimalist (http://amazon.com/dp/B084DTQPDW) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* New Balance Running Shoe (http://amazon.com/dp/B01CQT3CGG) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* HUMTTO Hiking Shoes | Trail Running, Climbing, Breathable, Non-Slip (http://amazon.com/dp/B0B2VJ4L7P) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }
* New Balance Trail Running Shoe (http://amazon.com/dp/B0C29H6LSW) (with icon)
   FtsMatchInfo { prefix: true, stemming: true, term_distance: Near }

I used these queries to test the term distance:


> cargo suggest query --fts-match-info "new shoe"

============================== Results  ==============================
* New Balance Baseball Shoe (http://amazon.com/dp/B08PCF4RWJ) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* New Balance Running Shoe (http://amazon.com/dp/B01CQT3CGG) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* New Balance Trail Running Shoe (http://amazon.com/dp/B0C29H6LSW) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Near }
* New Balance Track and Field Shoe (http://amazon.com/dp/B07HMK152T) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Medium }


> cargo suggest query --fts-match-info "mangrove holder"

============================== Results  ==============================
* Mangrove Pickleball Bag, Pickleball Backpack | Adjustable Sling, Upgraded Capacity, Safety Pocket, Water Bottle Holder (http://amazon.com/dp/B0972GTS8W) (with icon)
   FtsMatchInfo { prefix: false, stemming: false, term_distance: Far }

bendk · 2024-12-18T20:48:29Z

components/suggest/src/suggestion.rs

+    // All terms in a 5-term chunk
+    Medium,
+    // No 5-term chunk that contains all the terms
+    Far,


The segments above were arbitrarily picked by me. Are there different numbers we should choose?

I don't know, what's the intuition behind collecting this distance data?

3 seems kind of like a large distance to start with. If I'm reading the fts5 doc right, that means terms can have 3 words between them and the match will still be successful. I would think we'd want to start at 1 at most, maybe even 0?

Is there a reason for using an enum and not reporting the numeric distance itself, as like min_term_distance?

How hard would it be to add tests for non-near distances?

Good catch on this one, the docs confused me and I thought 3 meant the total number of words in the clump was 3. I changed these numbers to 1 and 3, which seems like a more reasonable start.

That said, I'm still not sure what's correct to test for, and I'm open to changing "Near" to mean 0.

The one thing I don't think we can do is calculate the actual minimum term distance. AFAICT, there's no function for that; you just have to make a bunch of queries and see if they match or not. We could have variants like Adjacent=0, Medium=1 and Far=2 or greater though.
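
For illustration, here is a minimal sketch of the "make a bunch of queries" approach described above. It is not the code from this PR. The `NEAR(..., N)` group is standard FTS5 syntax, and the thresholds 1 and 3 come from the comment above; `TermDistance`, `near_query`, `classify_distance`, and the `row_matches` closure (a stand-in for running a MATCH against the candidate row) are hypothetical names. Terms are assumed to already have `"` characters stripped.

```rust
/// Hypothetical stand-in for the term-distance enum discussed in this thread.
#[derive(Debug, PartialEq)]
enum TermDistance {
    Near,
    Medium,
    Far,
}

/// Build an FTS5 NEAR group such as `NEAR("new" "shoe", 1)`.
/// Assumes the terms contain no double quotes.
fn near_query(terms: &[&str], max_gap: u32) -> String {
    let phrases: Vec<String> = terms.iter().map(|t| format!("\"{t}\"")).collect();
    format!("NEAR({}, {})", phrases.join(" "), max_gap)
}

/// Probe with progressively looser NEAR queries. `row_matches` is assumed to
/// run the given MATCH expression against the row and report whether it hit.
fn classify_distance(terms: &[&str], row_matches: impl Fn(&str) -> bool) -> TermDistance {
    if row_matches(&near_query(terms, 1)) {
        TermDistance::Near
    } else if row_matches(&near_query(terms, 3)) {
        TermDistance::Medium
    } else {
        TermDistance::Far
    }
}
```

Each probe is a separate MATCH, so the number of thresholds directly multiplies the query cost.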

bendk · 2024-12-18T20:56:16Z

components/suggest/src/query.rs

+
+    fn split_terms(phrase: &str) -> Vec<&str> {
+        phrase
+            .split([' ', '(', ')', ':', '^', '*', '"', ','])


I added the `,` char to the list of things to split on. This way the comma in "Trail Running," isn't included in the search terms. The FTS tokenizer ignores it, but it was messing up the stemming check logic.
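
A minimal sketch of what the complete helper might look like. Only the `split` call is taken from the diff above; the `filter` and `collect` steps are assumptions about the rest of the function.

```rust
/// Split a phrase into search terms, dropping FTS operator characters and commas.
/// Empty fragments (e.g. between ", " or around "*") are discarded; that
/// filtering step is an assumption, not shown in the diff.
fn split_terms(phrase: &str) -> Vec<&str> {
    phrase
        .split([' ', '(', ')', ':', '^', '*', '"', ','])
        .filter(|term| !term.is_empty())
        .collect()
}
```

With the comma in the split set, "Trail Running, Climbing" yields `Trail`, `Running`, `Climbing` rather than a `Running,` term that would not compare equal to the stemmed title word.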

0c0w3

Thanks, lgtm. It's too bad several extra queries may be needed to get the match info, but I'm guessing it's not a big problem (in terms of latency at least) since they should all be using indexes? You could probably do one big query with SELECT subqueries to get all the info in one, but it's probably not worth it.

0c0w3 · 2024-12-19T00:43:22Z

components/suggest/src/query.rs

+        // This is used when passing the keywords into an FTS search.  It:
+        //   - Strips out any `():^*"` chars.  These are typically used for advanced searches, which
+        //     we don't support and it would be weird to only support for FTS searches, which
+        //     currently means Fakespot searches.


Nit: I would remove the part about Fakespot so we don't need to remember to update this when we add more FTS suggestions.

0c0w3 · 2024-12-19T02:43:48Z

components/suggest/src/query.rs

+    pub fn sqlite_match_without_prefix_match(&self) -> &str {
+        self.sqlite_match
+            .strip_suffix('*')
+            .unwrap_or(&self.sqlite_match)


Instead of needing a method to compute this, couldn't you have a FtsQuery::sqlite_match_without_prefix_match string that you would initialize in new() as part of the prefix_match if-statement?
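
A rough sketch of that suggestion, assuming a simplified query type. `FtsQuerySketch` and its constructor signature are hypothetical; the real `FtsQuery::new()` builds its match string differently, and only the field names `sqlite_match` and `is_prefix_query` appear in this PR.

```rust
/// Hypothetical, simplified stand-in for the query type under review.
struct FtsQuerySketch {
    sqlite_match: String,
    sqlite_match_without_prefix_match: String,
    is_prefix_query: bool,
}

impl FtsQuerySketch {
    /// `base_match` is assumed to be an already-sanitized FTS5 match expression.
    fn new(base_match: &str, prefix_match: bool) -> Self {
        // Decide once, at construction time, whether to append the `*` suffix.
        let sqlite_match = if prefix_match {
            format!("{base_match}*")
        } else {
            base_match.to_string()
        };
        Self {
            sqlite_match,
            // Stored up front, so no `strip_suffix('*')` is needed later.
            sqlite_match_without_prefix_match: base_match.to_string(),
            is_prefix_query: prefix_match,
        }
    }
}
```

The trade-off is one extra `String` per query in exchange for not re-deriving the non-prefix form from the prefix form.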

0c0w3 · 2024-12-19T05:01:28Z

components/suggest/src/suggestion.rs

+    // All terms in a 5-term chunk
+    Medium,
+    // No 5-term chunk that contains all the terms
+    Far,


I don't know, what's the intuition behind collecting this distance data?

3 seems kind of like a large distance to start with. If I'm reading the fts5 doc right, that means terms can have 3 words between them and the match will still be successful. I would think we'd want to start at 1 at most, maybe even 0?

Is there a reason for using an enum and not reporting the numeric distance itself, as like min_term_distance?

How hard would it be to add tests for non-near distances?

0c0w3 · 2024-12-19T05:14:49Z

components/suggest/src/suggestion.rs

    },
    Exposure {
        suggestion_type: String,
        score: f64,
    },
 }

+/// Additional data about how an FTS match was made(https://bugzilla.mozilla.org/show_bug.cgi?id=1931373)


Nit: Missing a space in "made(" but I'd just leave out the bug URL. If we need to find the bug/PR where this change was made, we can check blame.

bendk · 2024-12-23T22:08:06Z

I got distracted last week, but I'm picking it back up now. Thanks for the great review, I think this one is looking much better now.

bendk · 2024-12-23T22:14:46Z

Thanks, lgtm. It's too bad several extra queries may be needed to get the match info, but I'm guessing it's not a big problem (in terms of latency at least) since they should all be using indexes? You could probably do one big query with SELECT subqueries to get all the info in one, but it's probably not worth it.

I refactored this to use subqueries and I think it's actually cleaner this way, so let's go for it.

0c0w3

Thanks, lgtm! I was thinking we would do at most two SQL queries total: the main query to match suggestions, and then if that succeeds, a second query to get the match info. That way queries that don't match Fakespot at all -- which is most queries -- don't pay the extra cost of the match-info subqueries. Sorry for not making that clear. But I can't say if that would be worth it or not, so up to you if you want to stick with this or try that.

bendk · 2024-12-30T21:49:30Z

Thanks for all the comments about performance, I think that's really the main question here. I ran some benchmarks on this, expecting the difference to be relatively small, and realized it was actually quite a problem. The slowest query times were more than doubling because of these changes. That doesn't seem acceptable to me, so I went back and tried to speed things up in two ways:

- Fetch the match info after the initial query, and only fetch it for the highest-scoring suggestions. AFAIK, that's the only one we will ever record engagement/abandonment metrics for anyways, so there's no reason to spend time calculating it for other results.
- Removed the term distance calculation altogether. I tried many different ways to speed this up, but it was always very slow. I think the best version of the code still increased the query time by about 80%. I think speeding it up more would require quite a bit of work, maybe writing a custom FTS5 auxiliary function or writing our own tokenizer/stemmer and calculating the term distance from the query directly (which would also require a DFS or something similar, since terms can match in multiple locations).

With the new version, the slowest query times only increased by about 5%. Some of the query times increased more than that, but those were queries that were already running very fast, on the order of nanoseconds rather than milliseconds.
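
To make the first change concrete, here is a minimal sketch of "sort, then fetch match info for the top result only". The types and the `fetch` closure are hypothetical stand-ins (the real code works on `Suggestion::Fakespot` and queries the database); only the overall shape follows the description above.

```rust
/// Hypothetical stand-in for the match info attached to a suggestion.
#[derive(Debug, PartialEq)]
struct MatchInfoSketch {
    prefix: bool,
    stemming: bool,
}

/// Hypothetical stand-in for a Fakespot suggestion.
#[derive(Debug)]
struct SuggestionSketch {
    title: String,
    score: f64,
    match_info: Option<MatchInfoSketch>,
}

/// Sort by descending score, then run the (expensive) match-info lookup once,
/// for the highest-scoring suggestion only. `fetch` is assumed to wrap the
/// extra database queries.
fn attach_match_info_to_top(
    mut suggestions: Vec<SuggestionSketch>,
    fetch: impl FnOnce(&SuggestionSketch) -> MatchInfoSketch,
) -> Vec<SuggestionSketch> {
    suggestions.sort_by(|a, b| b.score.total_cmp(&a.score));
    if let Some(top) = suggestions.first_mut() {
        let info = fetch(top);
        top.match_info = Some(info);
    }
    suggestions
}
```

Because `fetch` is only called when there is at least one result, queries that match nothing pay no extra cost.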

0c0w3

Thanks for working all that out, sorry again for the delay. I agree with this:

Fetch the match info after the initial query, and only fetch it for the highest-scoring suggestions. AFAIK, that's the only one we will ever record engagement/abandonment metrics for anyways, so there's no reason to spend time calculating it for other results.

0c0w3 · 2025-01-06T22:24:22Z

components/suggest/src/db.rs

+        suggestion_id: usize,
+        title: &str,
+    ) -> Result<FtsMatchInfo> {
+        //let mut params: Vec<&dyn ToSql> = vec![];


Stray line?

0c0w3 · 2025-01-06T22:38:37Z

components/suggest/src/db.rs

+        let prefix = if fts_query.is_prefix_query {
+            // If the query was a prefix match query then test if the query without the prefix
+            // match would have also matched.  If not, then this counts as a prefix match.
+            let sql = "SELECT NOT EXISTS (SELECT 1 FROM fakespot_fts WHERE rowid = ? AND fakespot_fts MATCH ?)";


You could use exists() from ConnExt and that should allow you to remove the outer SELECT.

Nice, that looks much better.
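
The prefix check in the diff above boils down to the following logic, sketched here without any database access. `is_prefix_match` and the `matches_without_prefix` closure are hypothetical; the closure stands in for the existence query against `fakespot_fts`.

```rust
/// A match counts as a prefix match only if the query used prefix matching
/// and the same query without the trailing `*` would not have matched the row.
/// `matches_without_prefix` is assumed to run the existence query; it is only
/// invoked when the query was a prefix query in the first place.
fn is_prefix_match(
    is_prefix_query: bool,
    matches_without_prefix: impl FnOnce() -> bool,
) -> bool {
    is_prefix_query && !matches_without_prefix()
}
```

Taking a closure rather than a precomputed bool lets the short-circuit skip the extra query entirely for non-prefix queries.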

0c0w3 · 2025-01-06T22:59:38Z

components/suggest/src/db.rs

+                        id,
+                    ))
+                })?;
+        // Sort the results, then add the FTS match info to the first one


Could you add a bit more on the reason for including match info for only the first one? Your comment in suggestion.rs is nice and even just copy-pasting that here would be helpful IMO.

0c0w3 · 2025-01-06T23:08:15Z

components/suggest/src/schema.rs

@@ -131,14 +131,15 @@ CREATE TABLE fakespot_custom_details(

 CREATE VIRTUAL TABLE IF NOT EXISTS fakespot_fts USING FTS5(
  title,
+  suggestion_id,


Is this used anywhere?

Nope, that was a stray from one of my attempts to optimize this.

0c0w3 · 2025-01-06T23:10:06Z

components/suggest/src/suggestion.rs

@@ -96,13 +96,26 @@ pub enum Suggestion {
        icon: Option<Vec<u8>>,
        icon_mimetype: Option<String>,
        score: f64,
+        // Details about the FTS match.  For performance reasons, this is only calculated for the
+        // result with the highest score.  That's the only one that will be shown to the user and


Turbo nit: I would phrase this more as an assumption the component is making about how the consumer will use these suggestions, e.g., "we assume consumers will only show the highest-scoring suggestion."

Added extra data to `Suggestion::Fakespot` to capture how the FTS match was made. The plan is to use this as a facet for our metrics to help us decide how to tune the matching logic (e.g. maybe we should not use stemming, maybe we should require that terms are close together). Added Suggest CLI flag to print out the FTS match info.

bendk · 2025-01-07T14:38:36Z

FWIW, based on messing around a bit with the query analyzer and testing performance I found a few things:

- FTS5 virtual tables don't have an index on rowid, so filtering by rowid requires a table scan. This is really unfortunate, because what we want to do is find matches with an initial query, then ask the FTS engine if those rows would have matched with an alternate query. However, if you do that, FTS will run that second MATCH against the whole dataset. This made the query times double or more with my initial implementation.
- You can force an index by adding rowid as a column to the table, and I think this would make filtering by rowid alone efficient. However, there doesn't seem to be a way to do efficient queries that involve rowid and another term. For example, if you executed SELECT * FROM fakespot_fts WHERE fakespot_fts MATCH '(suggestion_id: 42 AND title: "run*")', it's going to perform an FTS query for both of the terms, then merge those together. What we want it to do is the rowid match first, which will result in 0 or 1 rows, then perform the title match. I couldn't figure out a way to force FTS5 to do that though.

bendk requested review from 0c0w3 and a team December 18, 2024 20:41

bendk force-pushed the push-mollqwzklyms branch from 474496c to 1ffb901 Compare December 18, 2024 20:47

bendk force-pushed the push-mollqwzklyms branch from 1ffb901 to 302ec01 Compare December 18, 2024 20:54

bendk force-pushed the push-mollqwzklyms branch 3 times, most recently from 33d369c to e6880b6 Compare December 18, 2024 21:27

0c0w3 approved these changes Dec 19, 2024

bendk force-pushed the push-mollqwzklyms branch from e6880b6 to 074c301 Compare December 23, 2024 22:06

bendk requested a review from 0c0w3 December 27, 2024 16:16

0c0w3 approved these changes Dec 27, 2024

bendk force-pushed the push-mollqwzklyms branch from 074c301 to dcfce33 Compare December 30, 2024 21:41

bendk requested a review from 0c0w3 December 30, 2024 21:49

bendk force-pushed the push-mollqwzklyms branch 2 times, most recently from 2089c84 to cad4831 Compare December 30, 2024 22:35

0c0w3 approved these changes Jan 6, 2025

bendk force-pushed the push-mollqwzklyms branch from cad4831 to 983d585 Compare January 7, 2025 14:32

bendk enabled auto-merge January 7, 2025 14:43

bendk added this pull request to the merge queue Jan 7, 2025

Merged via the queue into mozilla:main with commit f9539fb Jan 7, 2025
15 checks passed

bendk deleted the push-mollqwzklyms branch January 7, 2025 15:29
