Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LCS Based algorithm that finds similar strings. #50

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

Ph0enixKM
Copy link

@Ph0enixKM Ph0enixKM commented Aug 9, 2022

This solution uses the length finding variant of LCS algorithm.
Time Complexity: O(n * m)
Memory Complexity: O(min(n, m))

The solution itself is based on a lightweight library that I've created myself some time ago. I just realised that this great library exists and wanted to contribute my solutions as I see that you do not have LCS based algorithm in your arsenal.

@Ph0enixKM
Copy link
Author

@dguo, are these changes sufficient?

Comment on lines +506 to +507
table[0] = table.pop().unwrap();
table.push(vec![0 as usize; left.len() + 1]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we reallocate on each iteration. Instead you should simply swap the rows: https://github.com/dguo/strsim-rs/blob/1d92c1d51c6118cd95d7417a6dcbd25abb9c36c0/src/lib.rs#L331

Then again I feel like this should be possible using a single vector as long as table[0][col] is stored in the previous iteration before overwriting it. Similar to https://github.com/dguo/strsim-rs/blob/1d92c1d51c6118cd95d7417a6dcbd25abb9c36c0/src/lib.rs#L255-L257


#[inline]
fn lcs_length(left: impl AsRef<str>, right: impl AsRef<str>) -> usize {
let (left, right) = get_shorter_longer_strings(left, right);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This leads to an extra allocation of two Strings. You should be able to switch them without this. Then again I am not sure we even want to swap them, since we have a large focus on binary size.

People who care about performance and not so much about binary size should use https://docs.rs/rapidfuzz/latest/rapidfuzz/distance/lcs_seq/index.html which is significantly faster.

/// assert_eq!(0.8, lcs_normalized("night", "fight"));
/// assert_eq!(1.0, lcs_normalized("ferris", "ferris"));
/// ```
pub fn lcs_normalized(left: impl AsRef<str>, right: impl AsRef<str>) -> f64 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This should follow the interface of other functions in the library which accept &str.
  2. Both the normalized and not normalized version should be public
  3. possibly name it lcs_seq instead of lcs, since lcs can mean both longest common subsequence and longest common substring which are different metrics with the same abbreviation
  4. probably we would want a generic version of the algorithms similar to other metrics.

/// assert_eq!(1.0, lcs_normalized("ferris", "ferris"));
/// ```
pub fn lcs_normalized(left: impl AsRef<str>, right: impl AsRef<str>) -> f64 {
let (len1, len2) = (left.as_ref().len(), right.as_ref().len());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using len here is incorrect, since we operate on chars and not on bytes. It would need to use .chars().count()

#[inline]
fn lcs_length(left: impl AsRef<str>, right: impl AsRef<str>) -> usize {
let (left, right) = get_shorter_longer_strings(left, right);
let mut table = vec![vec![0 as usize; left.len() + 1]; 2];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use the char count as well.

Comment on lines +500 to +504
if rletter == lletter {
table[1][col + 1] = 1 + table[0][col];
} else {
table[1][col + 1] = max(table[0][col + 1], table[1][col]);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In rust I would probably use something like:

Suggested change
if rletter == lletter {
table[1][col + 1] = 1 + table[0][col];
} else {
table[1][col + 1] = max(table[0][col + 1], table[1][col]);
}
table[1][col + 1] = if rletter == lletter {
1 + table[0][col]
} else {
max(table[0][col + 1], table[1][col])
};

instead.

@Ph0enixKM
Copy link
Author

Thank you @maxbachmann for your time. I'll fix the code soon and resolve conflicts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants