-
-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add LCS Based algorithm that finds similar strings. #50
base: main
Are you sure you want to change the base?
Conversation
@dguo, are these changes sufficient? |
table[0] = table.pop().unwrap(); | ||
table.push(vec![0 as usize; left.len() + 1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means we reallocate on each iteration. Instead you should simply swap the rows: https://github.com/dguo/strsim-rs/blob/1d92c1d51c6118cd95d7417a6dcbd25abb9c36c0/src/lib.rs#L331
Then again I feel like this should be possible using a single vector as long as table[0][col]
is stored in the previous iteration before overwriting it. Similar to https://github.com/dguo/strsim-rs/blob/1d92c1d51c6118cd95d7417a6dcbd25abb9c36c0/src/lib.rs#L255-L257
|
||
#[inline] | ||
fn lcs_length(left: impl AsRef<str>, right: impl AsRef<str>) -> usize { | ||
let (left, right) = get_shorter_longer_strings(left, right); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This leads to an extra allocation of two Strings. You should be able to switch them without this. Then again I am not sure we even want to swap them, since we have a large focus on binary size.
People who care about performance and not so much about binary size should use https://docs.rs/rapidfuzz/latest/rapidfuzz/distance/lcs_seq/index.html which is significantly faster.
/// assert_eq!(0.8, lcs_normalized("night", "fight")); | ||
/// assert_eq!(1.0, lcs_normalized("ferris", "ferris")); | ||
/// ``` | ||
pub fn lcs_normalized(left: impl AsRef<str>, right: impl AsRef<str>) -> f64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This should follow the interface of other functions in the library which accept &str.
- Both the normalized and not normalized version should be public
- possibly name it lcs_seq instead of lcs, since lcs can mean both longest common subsequence and longest common substring which are different metrics with the same abbreviation
- probably we would want a generic version of the algorithms similar to other metrics.
/// assert_eq!(1.0, lcs_normalized("ferris", "ferris")); | ||
/// ``` | ||
pub fn lcs_normalized(left: impl AsRef<str>, right: impl AsRef<str>) -> f64 { | ||
let (len1, len2) = (left.as_ref().len(), right.as_ref().len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using len here is incorrect, since we operate on chars and not on bytes. It would need to use .chars().count()
#[inline] | ||
fn lcs_length(left: impl AsRef<str>, right: impl AsRef<str>) -> usize { | ||
let (left, right) = get_shorter_longer_strings(left, right); | ||
let mut table = vec![vec![0 as usize; left.len() + 1]; 2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use the char count as well.
if rletter == lletter { | ||
table[1][col + 1] = 1 + table[0][col]; | ||
} else { | ||
table[1][col + 1] = max(table[0][col + 1], table[1][col]); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In rust I would probably use something like:
if rletter == lletter { | |
table[1][col + 1] = 1 + table[0][col]; | |
} else { | |
table[1][col + 1] = max(table[0][col + 1], table[1][col]); | |
} | |
table[1][col + 1] = if rletter == lletter { | |
1 + table[0][col] | |
} else { | |
max(table[0][col + 1], table[1][col]) | |
}; |
instead.
Thank you @maxbachmann for your time. I'll fix the code soon and resolve conflicts |
This solution uses the length finding variant of LCS algorithm.
Time Complexity: O(n * m)
Memory Complexity: O(min(n, m))
The solution itself is based on a lightweight library that I've created myself some time ago. I just realised that this great library exists and wanted to contribute my solutions as I see that you do not have LCS based algorithm in your arsenal.