Add LCS Based algorithm that finds similar strings. #50

Open · wants to merge 3 commits into base: main

Changes from all commits
Cargo.toml (1 addition, 1 deletion)

@@ -4,7 +4,7 @@ version = "0.10.0"
authors = ["Danny Guo <[email protected]>"]
description = """
Implementations of string similarity metrics. Includes Hamming, Levenshtein,
-OSA, Damerau-Levenshtein, Jaro, Jaro-Winkler, and Sørensen-Dice.
+OSA, Damerau-Levenshtein, Jaro, Jaro-Winkler, Sørensen-Dice, and an LCS-based algorithm.
"""
license = "MIT"
readme = "README.md"
README.md (5 additions, 1 deletion)

@@ -12,6 +12,7 @@
- [Damerau-Levenshtein] - distance & normalized
- [Jaro and Jaro-Winkler] - this implementation of Jaro-Winkler does not limit the common prefix length
- [Sørensen-Dice]
- [LCS based algorithm] - computes the LCS length with a variant that runs in O(n * m) time and O(min(n, m)) memory

The normalized versions return values between `0.0` and `1.0`, where `1.0` means
an exact match.
@@ -39,7 +40,7 @@ extern crate strsim;

use strsim::{hamming, levenshtein, normalized_levenshtein, osa_distance,
             damerau_levenshtein, normalized_damerau_levenshtein, jaro,
-             jaro_winkler, sorensen_dice};
+             jaro_winkler, sorensen_dice, lcs_normalized};

fn main() {
    match hamming("hamming", "hammers") {
@@ -66,6 +67,8 @@ fn main() {

    assert_eq!(sorensen_dice("web applications", "applications of the web"),
               0.7878787878787878);

    assert!(lcs_normalized("foobar", "ofobar") > 0.8);
}
```

@@ -99,4 +102,5 @@ Benchmarks require a Nightly toolchain. Run `$ cargo +nightly bench`.
[Hamming]:http://en.wikipedia.org/wiki/Hamming_distance
[Optimal string alignment]:https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance
[Sørensen-Dice]:http://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
[LCS based algorithm]:https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
[Docker]:https://docs.docker.com/engine/installation/
benches/benches.rs (9 additions, 0 deletions)

@@ -97,4 +97,13 @@ mod benches {
            strsim::sorensen_dice(&a, &b);
        })
    }

    #[bench]
    fn bench_lcs_normalized(bencher: &mut Bencher) {
        let a = "Philosopher Friedrich Nietzsche";
        let b = "Philosopher Jean-Paul Sartre";
        bencher.iter(|| {
            strsim::lcs_normalized(&a, &b);
        })
    }
}
src/lib.rs (111 additions, 0 deletions)

@@ -464,6 +464,50 @@ pub fn sorensen_dice(a: &str, b: &str) -> f64 {
    (2 * intersection_size) as f64 / (a.len() + b.len() - 2) as f64
}

/// Uses the LCS algorithm to find the longest common subsequence
/// and then divides its length by the length of the longest string.
/// ```
/// use strsim::lcs_normalized;
///
/// assert_eq!(1.0, lcs_normalized("", ""));
/// assert_eq!(0.0, lcs_normalized("", "umbrella"));
/// assert_eq!(0.8, lcs_normalized("night", "fight"));
/// assert_eq!(1.0, lcs_normalized("ferris", "ferris"));
/// ```
pub fn lcs_normalized(left: impl AsRef<str>, right: impl AsRef<str>) -> f64 {
Member:

1. This should follow the interface of the other functions in the library, which accept &str.
2. Both the normalized and non-normalized versions should be public.
3. Possibly name it lcs_seq instead of lcs, since "LCS" can mean both longest common subsequence and longest common substring, which are different metrics with the same abbreviation.
4. We would probably want a generic version of the algorithm, similar to the other metrics.
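To make points 1-3 concrete, here is a rough sketch of that shape (editor's illustration, not code from the PR): `&str` parameters, both functions public, and the `lcs_seq` naming; the body is a plain full-table DP rather than the PR's two-row variant.

```rust
use std::cmp::max;

/// Raw longest-common-subsequence length, public alongside the normalized form.
/// Iterates over chars, matching the rest of the library.
pub fn lcs_seq(a: &str, b: &str) -> usize {
    let a_chars: Vec<char> = a.chars().collect();
    let b_chars: Vec<char> = b.chars().collect();
    let mut dp = vec![vec![0usize; b_chars.len() + 1]; a_chars.len() + 1];
    for (i, &ac) in a_chars.iter().enumerate() {
        for (j, &bc) in b_chars.iter().enumerate() {
            dp[i + 1][j + 1] = if ac == bc {
                dp[i][j] + 1
            } else {
                max(dp[i][j + 1], dp[i + 1][j])
            };
        }
    }
    dp[a_chars.len()][b_chars.len()]
}

/// Normalized variant: LCS length divided by the longer string's char count.
pub fn lcs_seq_normalized(a: &str, b: &str) -> f64 {
    let longest = max(a.chars().count(), b.chars().count());
    if longest == 0 {
        1.0 // two empty strings are a perfect match
    } else {
        lcs_seq(a, b) as f64 / longest as f64
    }
}
```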

    let (len1, len2) = (left.as_ref().len(), right.as_ref().len());
Member:

Using len here is incorrect, since we operate on chars and not on bytes. It would need to use .chars().count().
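For illustration (editor's note): `.len()` counts UTF-8 bytes, while `.chars()` yields Unicode scalar values, so the two disagree on non-ASCII input such as the "löwenbräu" string used in the tests below.

```rust
fn main() {
    let s = "löwenbräu";
    assert_eq!(s.len(), 11);          // UTF-8 bytes: 'ö' and 'ä' take two bytes each
    assert_eq!(s.chars().count(), 9); // chars, which is what the DP iterates over
}
```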

    let lcs_len = lcs_length(left.as_ref(), right.as_ref());
    let size = max(len1, len2);
    // Empty strings should match
    if size == 0 { 1.0 } else { lcs_len as f64 / size as f64 }
}

#[inline]
fn get_shorter_longer_strings(left: impl AsRef<str>, right: impl AsRef<str>) -> (String, String) {
    if left.as_ref().len() < right.as_ref().len() {
        (left.as_ref().to_string(), right.as_ref().to_string())
    } else {
        (right.as_ref().to_string(), left.as_ref().to_string())
    }
}

#[inline]
fn lcs_length(left: impl AsRef<str>, right: impl AsRef<str>) -> usize {
    let (left, right) = get_shorter_longer_strings(left, right);
Member:

This leads to an extra allocation of two Strings. You should be able to switch them without this. Then again, I am not sure we even want to swap them, since we have a large focus on binary size.

People who care about performance and not so much about binary size should use https://docs.rs/rapidfuzz/latest/rapidfuzz/distance/lcs_seq/index.html, which is significantly faster.
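A possible shape for the allocation-free swap (editor's sketch; the helper name is hypothetical, and it compares char counts per the comment above): reorder the borrowed `&str`s instead of building owned `String`s.

```rust
fn shorter_longer<'a>(left: &'a str, right: &'a str) -> (&'a str, &'a str) {
    // Reorder the borrows; nothing is copied or allocated.
    if left.chars().count() < right.chars().count() {
        (left, right)
    } else {
        (right, left)
    }
}
```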

    let mut table = vec![vec![0 as usize; left.len() + 1]; 2];
Member:

This should use the char count as well.
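Presumably something like this (editor's sketch of the suggested fix, wrapped in a hypothetical helper so it compiles on its own):

```rust
fn make_rows(left: &str) -> Vec<Vec<usize>> {
    // Size each row by the char count, not the byte count.
    vec![vec![0usize; left.chars().count() + 1]; 2]
}
```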

    for rletter in right.chars() {
        for (col, lletter) in left.chars().enumerate() {
            if rletter == lletter {
                table[1][col + 1] = 1 + table[0][col];
            } else {
                table[1][col + 1] = max(table[0][col + 1], table[1][col]);
            }
Member (on lines +500 to +504):

In Rust I would probably use something like:

Suggested change:

            table[1][col + 1] = if rletter == lletter {
                1 + table[0][col]
            } else {
                max(table[0][col + 1], table[1][col])
            };

instead.

        }
        table[0] = table.pop().unwrap();
        table.push(vec![0 as usize; left.len() + 1]);
Member (on lines +506 to +507):

This means we reallocate on each iteration. Instead you should simply swap the rows: https://github.com/dguo/strsim-rs/blob/1d92c1d51c6118cd95d7417a6dcbd25abb9c36c0/src/lib.rs#L331

Then again, I feel like this should be possible using a single vector, as long as table[0][col] is stored in the previous iteration before overwriting it. Similar to https://github.com/dguo/strsim-rs/blob/1d92c1d51c6118cd95d7417a6dcbd25abb9c36c0/src/lib.rs#L255-L257
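A sketch of the single-vector idea described above (editor's illustration, not the PR's code): keep one row and stash the old `table[0][col]` in a local before overwriting the cell above it.

```rust
use std::cmp::max;

// Single-row LCS length; assumes the caller passes the shorter string first.
fn lcs_length_single_row(shorter: &str, longer: &str) -> usize {
    let width = shorter.chars().count();
    let mut row = vec![0usize; width + 1];
    for lletter in longer.chars() {
        let mut diag = 0; // dp[i-1][j-1], saved before the cell above is overwritten
        for (col, sletter) in shorter.chars().enumerate() {
            let above = row[col + 1]; // dp[i-1][j]
            row[col + 1] = if sletter == lletter {
                diag + 1
            } else {
                max(above, row[col])
            };
            diag = above;
        }
    }
    row[width]
}
```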

    }
    *table[0].last().unwrap()
}

#[cfg(test)]
mod tests {
@@ -989,4 +1033,71 @@ mod tests {
            sorensen_dice("this has one extra word", "this has one word")
        );
    }

    #[test]
    fn lcs_normalized_diff_unequal_length() {
        assert!(lcs_normalized("damerau", "aderuaxyz") < 0.5);
    }

    #[test]
    fn lcs_normalized_diff_unequal_length_reversed() {
        assert!(lcs_normalized("aderuaxyz", "damerau") < 0.5);
    }

    #[test]
    fn lcs_normalized_diff_comedians() {
        assert!(lcs_normalized("Stewart", "Colbert") < 0.5);
    }

    #[test]
    fn lcs_normalized_many_transpositions() {
        assert!(lcs_normalized("abcdefghijkl", "bacedfgihjlk") < 0.7);
    }

    #[test]
    fn lcs_normalized_diff_longer() {
        let a = "The quick brown fox jumped over the angry dog.";
        let b = "Lorem ipsum dolor sit amet, dicta latine an eam.";
        assert!(lcs_normalized(a, b) < 0.4);
    }

    #[test]
    fn lcs_normalized_beginning_transposition() {
        assert!(lcs_normalized("foobar", "ofobar") > 0.8);
    }

    #[test]
    fn lcs_normalized_end_transposition() {
        assert!(lcs_normalized("specter", "spectre") > 0.8);
    }

    #[test]
    fn lcs_normalized_unrestricted_edit() {
        assert!(lcs_normalized("a cat", "an abct") > 0.5);
    }

    #[test]
    fn lcs_normalized_diff_short() {
        assert!(lcs_normalized("levenshtein", "löwenbräu") < 0.3);
    }

    #[test]
    fn lcs_normalized_for_empty_strings() {
        assert!(lcs_normalized("", "") > 0.99);
    }

    #[test]
    fn lcs_normalized_first_empty() {
        assert!(lcs_normalized("", "flower") < 0.01);
    }

    #[test]
    fn lcs_normalized_second_empty() {
        assert!(lcs_normalized("tree", "") < 0.01);
    }

    #[test]
    fn lcs_normalized_identical_strings() {
        assert!(lcs_normalized("sunglasses", "sunglasses") > 0.99);
    }
}