any comparation with lucene's Levenshtein Automaton ? #72

byronhe · 2019-08-21T07:10:31Z

http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html

wolfgarbe · 2019-08-21T21:46:46Z

Something like http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata or https://issues.apache.org/jira/browse/LUCENE-2507 ?

From what I understand from Michael McCandless post:

Prior to 4.0, FuzzyQuery took a brute force approach: it visits every single unique term in the dictionary, computes the Levenshtein distance for it, and accepts the term if the edit distance<=maximum edit distance.

From 4.0 they were pre-calculating a Levenshtein automaton, that accepts only the terms within edit distance N. And then at query time they were intersecting that automaton with all the terms in the dictionary.

This would mean that they save time, because the don't need to calculate the Levenshtein distance for every single term, but it seems they still need to intersect the pre-built Levenshtein automaton with every single term in the dictionary, whereas SymSpell has to calculate the Levenshtein distance only for 0.0042% to 0.016% of the vocabulary.

It would sure be interesting to extend the benchmark, to see how the two algorithms compare in reality.

missinglink · 2022-01-29T21:18:00Z

but it seems they still need to intersect the pre-built Levenshtein automaton with every single term in the dictionary

This is not the case if the dictionary is also indexed in a trie structure.

In that case, intersecting the automata with the dictionary is a simple recursive algorithm which only follows branches of the dictionary with corresponding transitions in the state machine.

wolfgarbe · 2022-01-29T22:21:49Z

It seems like somebody did indeed benchmark SymSpell vs. Lucene's spelling correction:

Source: https://github.com/Shivakumar-Narayanan/Spell-Check
https://docs.google.com/document/d/1QQROh8ndwBHbPwx2t1kKZcquHDDfF0Pz

LifeIsStrange mentioned this issue Nov 19, 2021

Consider contributing to the search benchmark game mikemccand/luceneutil#150

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

any comparation with lucene's Levenshtein Automaton ? #72

any comparation with lucene's Levenshtein Automaton ? #72

byronhe commented Aug 21, 2019

wolfgarbe commented Aug 21, 2019

missinglink commented Jan 29, 2022

wolfgarbe commented Jan 29, 2022

any comparation with lucene's Levenshtein Automaton ? #72

any comparation with lucene's Levenshtein Automaton ? #72

Comments

byronhe commented Aug 21, 2019

wolfgarbe commented Aug 21, 2019

missinglink commented Jan 29, 2022

wolfgarbe commented Jan 29, 2022