-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Question] Few queries regarding upcoming changes, benchmarking spell check and dictionary formation etc. #87
Comments
Current approach: Prefix Future approach: Pigeonhole partitioning
Frequency dictionaries in many other languages can be found at the FrequencyWords repository You can also use Hunspell dictionaries and combine them with Google Books Ngram data which provides representative word frequencies.
See my comment on the Jamspell repository:
A synthetic benchmark is good for performance comparison. For measuring the correction quality using real world data is much more adequate. One could additionally implement a Weighted Edit distance giving a higher priority to pairs which are close to each other on the keyboard layout or which sound similar (e.g. Soundex or other phonetic algorithms which identify different spellings of the same sound). Using more context information with word embeddings and word vectors is another promising approach:
answers to be continued .... |
@wolfgarbe This is awesome. Thanks for taking the time to answer this in a thorough fashion. |
Sorry to respond to such an old question, but is there any update on the pigeonhole partitioning implementation? Am I correct in thinking your example only works for Levenshtein distance (i.e. no transpositions)? |
No update on the pigeonhole partitioning implementation yet. SymSpell uses the Damerau-Levenshtein distance and therefore supports also the transposition of two adjacent characters. |
Thanks for getting back to me (and for the whole SymSpell implementation). Sorry, I meant your example of the pigeonhole partitioning above. For example if we have the word 'pigeonhole' and split it into 2 parts, 'pigeo' and 'nhole': if there are 3 edits that are either insert, delete or substitute then we can say that one partition will definitely have an edit distance of <=1. If, however, the characters on the boundary of the partitions are transposed, there could be the case that both partitions have an edit distance of 2. I believe that with a max edit distance of K, the word would have to be split into K+1 partitions to ensure that the one partition has a Damerau-Levenshtein distance <=1. |
Adjacent transposition
|
I'm very interested in the partitioning improvement, as I am trying to use the symspell algorithm to correct errors in quite long "words" (39 characters) and the performance and memory consumption really start to suffer there. These words are nearly random sequences (DNA), so the prefix assumption is not as useful. I only need to worry about Levenshtein distance, not Damerau-Levenshtein, so a partial solution would still be quite useful. I can sort of see how this would work but I'm curious about the complications you mentioned. I could try to implement it myself with a few pointers (likely as a fork of the Rust implementation) |
Thanks for maintaining this excellent library. I am currently contributing to the rust clone of this. I have a few queries.
Reducing memory would really help for webassembly usage in the browser.
To have maximum matches in the current frequency dictionary, what tokenizer (or just regex split) would you suggest to use?
For languages not in SCOWL, what would be an optimal way to form dictionaries?
In Jamspell, the author uses a 3 gram language model for re-ranking, is that on the roadmap and do you think it would bring in enough improvement that justifies it's use?
How to benchmark spell correction? Is synthetic benchmark on similarly tokenized dataset with random errors introduced a good way about it (when human errors are not present)
What are the metrics relevant for the real world (accuracy, false positives, precision etc) while benchmarking on such synthetic datasets?
The text was updated successfully, but these errors were encountered: