Add an API for finding all the misspelled words in a given string #27
Only semi-related, and I can open an issue for it if needed, but do we know if the spellchecker / C side uses the same encoding as JavaScript? There's one report where the correction of
Fantastic 🤘
If anything, that seems preferable to doing the splitting ourselves.
This function will return an array of character ranges, indicating where *all* of the misspelled words are in a given string.
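As a rough sketch of how a caller might consume such ranges: the `{start, end}` range shape (with `end` exclusive) and the stand-in `checkSpelling` function below are assumptions for illustration only. The stand-in uses a tiny hard-coded word set instead of the native spell-checker.

```javascript
// Hypothetical stand-in for the native API. The real function calls into the
// platform spell-checker; the {start, end} range shape is assumed here.
const KNOWN_WORDS = new Set(["the", "cat", "sat", "on", "mat"]);

function checkSpelling(text) {
  const ranges = [];
  const wordPattern = /[A-Za-z']+/g;
  let match;
  while ((match = wordPattern.exec(text)) !== null) {
    if (!KNOWN_WORDS.has(match[0].toLowerCase())) {
      // Record where the unknown word sits in the original string.
      ranges.push({ start: match.index, end: match.index + match[0].length });
    }
  }
  return ranges;
}

// One call covers the whole string; the ranges index back into it.
const text = "The caat sat on the maat";
const ranges = checkSpelling(text);
const misspelled = ranges.map((r) => text.slice(r.start, r.end));
```

Returning ranges rather than words lets the caller highlight the text in place without searching for each word again.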
]

it "accounts for UTF16 pairs", ->
  string = "😎 cat caat dog dooog"
For Mac and Windows, this didn't require any extra work, because the NSRanges returned by NSSpellChecker (and probably all NSString APIs) seem to refer to UTF16 code unit indices, as opposed to logical character indices, and the same applies for the Windows spell-check APIs.
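To illustrate the difference between the two kinds of index, here is a small sketch using the string from the test above. It relies only on standard JavaScript string behavior; the variable names are illustrative.

```javascript
const string = "😎 cat caat dog dooog";

// The emoji is a single code point but occupies two UTF-16 code units.
const emojiUnits = "😎".length;            // 2
const emojiCodePoints = [..."😎"].length;  // 1

// Position of "caat" counted in UTF-16 code units, as JS string APIs do.
const utf16Index = string.indexOf("caat"); // 7

// Position of "caat" counted in code points, for comparison.
const codePointIndex = [...string.slice(0, utf16Index)].length; // 6
```

Because JavaScript strings are indexed the same way, ranges expressed in UTF-16 code units can be passed straight to `slice` without any conversion.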
For Linux, the Hunspell library only provides a per-word spell-checking API; it doesn't handle arbitrary text. It also expects UTF8-encoded words. I deal with this by passing the string to the native spell-checkers in UTF16 (as V8 natively stores it), and for hunspell, transcoding to UTF8 one word at a time, so that I retain the UTF16 indices.
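The word-at-a-time approach can be sketched roughly as follows. This is not the native code from this PR, which is C++; it is a JavaScript illustration of the same idea. `checkWordUtf8`, `findMisspelledRanges`, the tiny dictionary, and the `\p{L}+` word pattern are all hypothetical.

```javascript
// Hypothetical per-word checker standing in for Hunspell: it accepts one
// UTF-8 encoded word at a time, never a whole text.
const DICTIONARY = new Set(["cat", "dog"]);

function checkWordUtf8(utf8Bytes) {
  return DICTIONARY.has(utf8Bytes.toString("utf8"));
}

function findMisspelledRanges(text) {
  const ranges = [];
  // With the `u` flag, match.index is still reported in UTF-16 code units.
  const wordPattern = /\p{L}+/gu;
  let match;
  while ((match = wordPattern.exec(text)) !== null) {
    // Transcode only this word to UTF-8; the surrounding text stays UTF-16,
    // so the start and end offsets never need to be converted back.
    const utf8Word = Buffer.from(match[0], "utf8");
    if (!checkWordUtf8(utf8Word)) {
      ranges.push({ start: match.index, end: match.index + match[0].length });
    }
  }
  return ranges;
}

const result = findMisspelledRanges("😎 cat caat dog dooog");
```

Transcoding the whole string to UTF-8 up front would also work, but every resulting byte offset would then have to be mapped back to a UTF-16 index.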
I think this is ready. I'd love to get somebody else's 👀 on it.
Yeah, it looks like we can now spell-check words like
Ok, seems to be working well on Windows. Gonna 🚢
Add an API for finding all the misspelled words in a given string
Fixes atom/spell-check#99
Fixes atom/spell-check#53
Supersedes atom/spell-check#100
Depends on #28
Refs atom/spell-check#53
Refs atom/atom#8908
When opening a large plain text file, Atom's spell check task takes a very long time to process the file. When I open /usr/share/dict/words, which contains 235,886 words, one per line, the spell check task runs for 95 seconds.
Source of the slowness
On Mac, spell checking is implemented by calling into the central AppleSpell process, so there is some IPC overhead for each spell-checking call. There seems to be some overhead for each spell-check call on Windows too, as I'm seeing a 2X improvement there. On Linux, our existing code was already fine.
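To make the per-call cost concrete, here is a toy sketch that counts how many times a checker is invoked when text is checked word by word versus in one pass. `makeCountingChecker`, `checkWord`, and `checkText` are hypothetical names. The counter only stands in for a boundary crossing such as an IPC round trip, and the sketch returns words rather than ranges to stay short.

```javascript
// Hypothetical checker that counts each crossing of the "native" boundary.
function makeCountingChecker(knownWords) {
  const checker = {
    calls: 0,
    // One boundary crossing per word.
    checkWord(word) {
      checker.calls += 1;
      return knownWords.has(word);
    },
    // One boundary crossing for the whole text.
    checkText(text) {
      checker.calls += 1;
      return text.split(/\s+/).filter((w) => w && !knownWords.has(w));
    },
  };
  return checker;
}

const known = new Set(["cat", "dog"]);
const input = "cat caat dog dooog";

const perWord = makeCountingChecker(known);
const perWordResult = input.split(/\s+/).filter((w) => !perWord.checkWord(w));

const batched = makeCountingChecker(known);
const batchedResult = batched.checkText(input);
```

Both strategies find the same words, but the per-word checker is called once per word while the batched one is called once in total. For a file with hundreds of thousands of words, any fixed per-call overhead is multiplied accordingly.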
Solution
This PR adds a new native API, Spellchecker.checkSpelling(string), which takes a multi-word string and returns an array of character ranges representing all of the misspelled words. This way, the spell-checking can be performed in a single shot.
TODO
Speedup
On my machine, spell-checking /usr/share/dict/words now takes about 11 seconds: ~9X faster than before. This is now short enough that my CPU fan stays quiet.
Questions
This may cause some subtle behavior change, because the platform's spell-checking library will now be in charge of partitioning the text into words, rather than handling that in JS. This doesn't seem like a huge problem to me, but maybe someone else has some insight into this.
/cc @atom/feedback