[Question] About SymSpell model and probabilistic models (Norvig, etc.) #61
Comments
Are you referring to the ITRANS scheme of Devanagari transliteration?
Character-based transliteration: Word-based transliteration
SymSpell and Norvig's spelling corrector are both word-based. Using this model (word frequencies) for transliteration would only make sense if there were ambiguity in word-based transliteration, which I don't see (but I don't know Hindi or Devanagari).
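As a rough illustration of what a pure word-frequency model can offer here, the sketch below ranks a list of romanization candidates by corpus count. It is not SymSpell's or Norvig's code; the `WORD_COUNTS` values and the `rank_by_frequency` helper are hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical toy counts; a real system would load a frequency dictionary
# built from a large romanized-Hindi corpus.
WORD_COUNTS = Counter({"milne": 120, "milane": 15, "miline": 2})


def rank_by_frequency(candidates, counts=WORD_COUNTS):
    """Return candidates sorted by descending count.

    Words missing from the dictionary count as 0. The sort is stable,
    so ties keep their original order.
    """
    return sorted(candidates, key=lambda word: -counts[word])


candidates = ["milane", "milne", "miline", "milene", "mine"]
ranked = rank_by_frequency(candidates)
print(ranked)
```

A ranking like this ignores the surrounding words entirely, so it always prefers the same candidate regardless of the sentence.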
@wolfgarbe yes, exactly, that is the main issue.
trn.transform(u'मिलने')
[u'milane', u'milne', u'miline', u'milene', u'mine']
I'm using indic-trans for this task. The problem raised by native Hindi speakers is that for words like
To use sentence-wide context to resolve ambiguity, you need n-gram probabilities (co-occurrence probabilities between multiple terms), not the single-word probabilities (word frequencies) used in SymSpell/Norvig. See also "Using N-grams to Process Hindi". A similar approach uses GloVe word vectors; see "A simple spell checker built from word vectors". While I'm planning this for a future release, SymSpell currently does not use any sentence-wide or document-wide context when selecting the appropriate spelling suggestion.
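A minimal sketch of the n-gram idea, assuming a bigram model with add-one smoothing and a Viterbi-style search over per-word candidate lists. The toy `CORPUS`, the function names, and the smoothing choice are all assumptions made for illustration; none of this is part of SymSpell or indic-trans.

```python
import math
from collections import Counter

# Hypothetical toy corpus of romanized sentences; a real model would be
# trained on a large corpus.
CORPUS = [
    "main tumse milne aaya".split(),
    "woh mujhse milne aayi".split(),
    "tumse milne ka man hai".split(),
]

unigrams = Counter(w for sent in CORPUS for w in sent)
bigrams = Counter((a, b) for sent in CORPUS for a, b in zip(sent, sent[1:]))
VOCAB_SIZE = len(unigrams)


def bigram_logprob(prev, word):
    """Add-one smoothed log P(word | prev)."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + VOCAB_SIZE))


def best_sequence(candidate_lists):
    """Pick one candidate per position so the bigram score is maximal.

    Viterbi-style dynamic programming: for each candidate at the current
    position, keep only the best-scoring path that ends in it.
    The first position gets a uniform score (no start probabilities).
    """
    best = {w: (0.0, [w]) for w in candidate_lists[0]}
    for candidates in candidate_lists[1:]:
        new_best = {}
        for word in candidates:
            score, path = max(
                (prev_score + bigram_logprob(prev, word), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[word] = (score, path + [word])
        best = new_best
    return max(best.values())[1]


# One candidate list per word in the sentence.
sentence = [["tumse"], ["milane", "milne", "miline"], ["aaya"]]
print(best_sequence(sentence))
```

With this toy corpus the context on both sides favors `milne`, because `tumse milne` and `milne aaya` are the only candidate bigrams that were actually observed.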
@wolfgarbe thanks a lot! I'm using Word2Vec for other similar tasks (top nearest words, etc.), and that is a great article, one of the first about spell checkers and the word embedding approach. My concern here is performance: Word2Vec models are typically huge binary files (word vectors of dimension, say, 100-300 for each word/subword), so you can end up with files from 1 GB to 4 GB, and that could be a problem when applying it. As for response times, FastText takes around 10-12 ms at inference; I'm not sure how that compares to SymSpell. My impression is that the Word2Vec approach could hurt performance in those cases unless you quantize (and hence shrink) the models without decreasing accuracy too much.
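For a back-of-the-envelope check of those file sizes, the raw size of a dense embedding table is roughly vocabulary size × dimension × bytes per value. The sketch below assumes a hypothetical vocabulary of one million entries. Storing one byte per value is used only as a simple stand-in for quantization; it is not how FastText's own quantization works.

```python
def embedding_size_bytes(vocab_size, dim, bytes_per_value=4):
    """Raw size of a dense embedding table (no file-format overhead)."""
    return vocab_size * dim * bytes_per_value


# Assumed vocabulary size of 1,000,000 words/subwords (hypothetical).
full_precision = embedding_size_bytes(1_000_000, 300)      # float32 values
one_byte = embedding_size_bytes(1_000_000, 300, 1)         # 1 byte per value

print(f"float32: {full_precision / 1e9:.1f} GB")
print(f"1 byte per value: {one_byte / 1e9:.1f} GB")
```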
Let me know if you find something interesting. Thanks.
I'm currently using both Hunspell and SymSpell as my main spelling correction systems. Both work fine, and SymSpell works great (quality, performance, etc.). That said, I have a question about Norvig's probabilistic spell checker, which I'll illustrate with a simple case.

In some romanized languages, there is no one-to-one relation between a term in the source script and its English (romanized) form. So when you romanize, say, Hindi, you get several possible English words as output. A typical output of such a system is 1 (Hindi) word -> N (eng) words.

Deciding which of the N words is best is typically done with an algorithm like beam search, Viterbi, etc., but in many cases the ambiguity remains. In the other direction we have eng (N) -> hi (M), so this function is not bijective at all.

Given that a spell checker has knowledge of all (or most of) the words in a language, and supposing I need context (as in this case) to go back from eng (N) -> hi (M), do you think that SymSpell or Norvig's probabilistic model could give a valid hint about the M choices (or the N choices in the opposite direction)? What's your opinion on that?