-
Notifications
You must be signed in to change notification settings - Fork 230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1931373 - Add FTS matching data #6531
Conversation
I think this is working based on running some tests with the CLI: I used variations on "running shoe" to check the stemming/prefix matching flags:
I used these queries to test the term distance:
|
474496c
to
1ffb901
Compare
components/suggest/src/suggestion.rs
Outdated
// All terms in a 5-term chunk | ||
Medium, | ||
// No 5-term chunk that contains all the terms | ||
Far, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The segments above were arbitrarily picked by me. Are there different numbers we should choose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, what's the intuition behind collecting this distance data?
3 seems kind of like a large distance to start with. If I'm reading the fts5 doc right, that means terms can have 3 words between them and the match will still be successful. I would think we'd want to start at 1 at most, maybe even 0?
Is there a reason for using an enum and not reporting the numeric distance itself, as like min_term_distance
?
How hard would it be to add tests for non-near distances?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch on this one, the docs confused me and I thought 3
meant the total amount of words in the clump was 3. I changed these numbers to 1
and 3
, which seems like more of a reasonable start.
That said, I'm still not sure what's correct to test for, I'm open to changing "Near" to meaning 0
.
The one thing I don't think we can do is calculate actual minimum term distance number. AFAICT, there's no function for that, you just have to make a bunch of queries and see if they match or not. We have variants like Adjacent=0
, Medium=1
and Far=2 or greater
though.
1ffb901
to
302ec01
Compare
|
||
fn split_terms(phrase: &str) -> Vec<&str> { | ||
phrase | ||
.split([' ', '(', ')', ':', '^', '*', '"', ',']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the ,
char in the list of things to split on. This way the comma in Trail Running,
isn't included in the search terms. The FTS tokenizer ignores that, but it was messing up the stemming check logic.
33d369c
to
e6880b6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, lgtm. It's too bad several extra queries may be needed to get the match info, but I'm guessing it's not a big problem (in terms of latency at least) since they should all be using indexes? You could probably do one big query with SELECT
subqueries to get all the info in one, but it's probably not worth it.
components/suggest/src/query.rs
Outdated
// This is used when passing the keywords into an FTS search. It: | ||
// - Strips out any `():^*"` chars. These are typically used for advanced searches, which | ||
// we don't support and it would be weird to only support for FTS searches, which | ||
// currently means Fakespot searches. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I would remove the part about Fakespot so we don't need to remember to update this when we add more FTS suggestions.
components/suggest/src/query.rs
Outdated
pub fn sqlite_match_without_prefix_match(&self) -> &str { | ||
self.sqlite_match | ||
.strip_suffix('*') | ||
.unwrap_or(&self.sqlite_match) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of needing a method to compute this, couldn't you have a FtsQuery::sqlite_match_without_prefix_match
string that you would initialize in new()
as part of the prefix_match
if-statement?
components/suggest/src/suggestion.rs
Outdated
// All terms in a 5-term chunk | ||
Medium, | ||
// No 5-term chunk that contains all the terms | ||
Far, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, what's the intuition behind collecting this distance data?
3 seems kind of like a large distance to start with. If I'm reading the fts5 doc right, that means terms can have 3 words between them and the match will still be successful. I would think we'd want to start at 1 at most, maybe even 0?
Is there a reason for using an enum and not reporting the numeric distance itself, as like min_term_distance
?
How hard would it be to add tests for non-near distances?
components/suggest/src/suggestion.rs
Outdated
}, | ||
Exposure { | ||
suggestion_type: String, | ||
score: f64, | ||
}, | ||
} | ||
|
||
/// Additional data about how an FTS match was made(https://bugzilla.mozilla.org/show_bug.cgi?id=1931373) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Missing a space in "made(" but I'd just leave out the bug URL. If we need to find the bug/PR where this change was made, we can check blame.
e6880b6
to
074c301
Compare
I got distracted last week, but I'm picking it back up now. Thanks for the great review, I think this one is looking much better now. |
I refactored this to use subqueries and I think it's actually cleaner this way, so let's go for it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, lgtm! I was thinking we would do at most two SQL queries total: the main query to match suggestions, and then if that succeeds, a second query to get the match info. That way queries that don't match Fakespot at all -- which is most queries -- don't pay the extra cost of the match-info subqueries. Sorry for not making that clear. But I can't say if that would be worth it or not, so up to you if you want to stick with this or try that.
074c301
to
dcfce33
Compare
Thanks for all the comments about performance, I think that's really the main question here. I ran some benchmarks on this, expecting the difference to be relatively small and realized it was actually quite a problem. The slowest queries times were more than doubling because of these changes. That doesn't seem acceptable to me, so I went back and tried to speed things up in two ways:
With the new version, the slowest query times only increased by about 5%`. Some of the query times increased more than that, but those were queries that were already running very fast -- on the order of nanoseconds rather than milliseconds. |
2089c84
to
cad4831
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working all that out, sorry again for the delay. I agree with this:
Fetch the match info after the initial query, and only fetch it for the highest-scoring suggestions. AFAIK, that's the only one we will ever record engagement/abandonment metrics for anyways, so there's no reason to spend time calculating it for other results.
components/suggest/src/db.rs
Outdated
suggestion_id: usize, | ||
title: &str, | ||
) -> Result<FtsMatchInfo> { | ||
//let mut params: Vec<&dyn ToSql> = vec![]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stray line?
components/suggest/src/db.rs
Outdated
let prefix = if fts_query.is_prefix_query { | ||
// If the query was a prefix match query then test if the query without the prefix | ||
// match would have also matched. If not, then this counts as a prefix match. | ||
let sql = "SELECT NOT EXISTS (SELECT 1 FROM fakespot_fts WHERE rowid = ? AND fakespot_fts MATCH ?)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use exists()
from ConnExt and that should allow you to remove the outer SELECT
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, that looks much better.
id, | ||
)) | ||
})?; | ||
// Sort the results, then add the FTS match info to the first one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a bit more on the reason for including match info for only the first one? Your comment in suggestion.rs
is nice and even just copy-pasting that here would be helpful IMO.
components/suggest/src/schema.rs
Outdated
@@ -131,14 +131,15 @@ CREATE TABLE fakespot_custom_details( | |||
|
|||
CREATE VIRTUAL TABLE IF NOT EXISTS fakespot_fts USING FTS5( | |||
title, | |||
suggestion_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, that was a stray from one of my attempts to optimize this.
components/suggest/src/suggestion.rs
Outdated
@@ -96,13 +96,26 @@ pub enum Suggestion { | |||
icon: Option<Vec<u8>>, | |||
icon_mimetype: Option<String>, | |||
score: f64, | |||
// Details about the FTS match. For performance reasons, this is only calculated for the | |||
// result with the highest score. That's the only one that will be shown to the user and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turbo nit: I would phrase this more as assumption the component is making about how the consumer will use these suggestions, e.g., "we assume consumers will only show the highest-scoring suggestion."
Added extra data to `Suggestion::Fakespot` to capture how the FTS match was made. The plan is to use this as a facet for our metrics to help us consider how to tune the matching logic (i.e. maybe we should not use stemming, maybe we should reqiure that terms are close together). Added Suggest CLI flag to print out the FTS match info.
cad4831
to
983d585
Compare
FWIW, based on messing around a bit with the query analyzer and testing performance I found a few things:
|
Added extra data to
Suggestion::Fakespot
to capture how the FTS match was made. The plan is to use this as a facet for our metrics to help us consider how to tune the matching logic (i.e. maybe we should not use stemming, maybe we should reqiure that terms are close together).Added Suggest CLI flag to print out the full debug repr for suggestions. This provides an easy way to test the new functionality.
Pull Request checklist
[ci full]
to the PR title.Branch builds: add
[firefox-android: branch-name]
to the PR title.