Skip to content
This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Add an API for finding all the misspelled words in a given string #27

Merged
merged 16 commits into from
Jan 11, 2016

Conversation

maxbrunsfeld
Copy link
Contributor

Fixes atom/spell-check#99
Fixes atom/spell-check#53
Supercedes atom/spell-check#100

Depends on #28
Refs atom/spell-check#53
Refs atom/atom#8908

When opening a large plain text file, Atom's spell check task takes a very long time to process the file. When I open /usr/share/dict/words, which contains 235,886 words, one per line, the spell check task runs for 95 seconds.

Source of the slowness

On Mac, spell checking is implemented by calling into the central AppleSpell process, so there is some IPC overhead for each spell-checking call.

There seem to be some overhead for each spell check call on Windows too, as I'm seeing a 2X improvement there. On Linux, our existing code was already fine.

Solution

This PR adds a new native API, Spellchecker.checkSpelling(string), which takes a multi-word string and returns an array of character ranges representing all of the misspelled words. This way, the spell-checking can be performed in a single shot.

TODO
  • Mac
  • Windows
  • Linux
  • Account for paired characters, since native spell checkers will return indices in terms of logical code points, not Javascript's 2-byte values.
Speedup

On my machine, spell-checking /usr/share/dict/words now takes about 11 seconds: ~9X faster than before. This is now short enough that my CPU fan stays quiet.

Questions

This may cause some subtle behavior change, because the platform's spell-checking library will now be in charge of partitioning the text into words, rather than handling that in JS. This doesn't seem like a huge problem to me, but maybe someone else has some insight into this.

/cc @atom/feedback

@jeancroy
Copy link

jeancroy commented Jan 6, 2016

Only semi related and I can open an issue for it if needed, but do we know if spellchecker / c side use same encoding as javascript ?

There's one report where the correction of cliche to cliché was inserted as clich� on a utf-8 document on linux.

@joshaber
Copy link

joshaber commented Jan 6, 2016

Fantastic 🤘

This may cause some subtle behavior change, because the platform's spell-checking library will now be in charge of partitioning the text into words, rather than handling that in JS.

If anything, that seems preferable than doing the splitting ourselves.

@maxbrunsfeld maxbrunsfeld force-pushed the mb-add-bulk-checking-method branch 6 times, most recently from 49dbbd4 to 531ec95 Compare January 7, 2016 20:16
@maxbrunsfeld maxbrunsfeld force-pushed the mb-add-bulk-checking-method branch 4 times, most recently from 2fa7057 to 777cf8c Compare January 7, 2016 21:23
]

it "accounts for UTF16 pairs", ->
string = "😎 cat caat dog dooog"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Mac and Windows, this didn't require any extra work, because the NSRanges returned by NSSpellChecker (and probably all NSString APIs) seem to refer to UTF16 code point indices, as opposed to logical character indices, and the same applies for the Windows spell-check APIs.

For Linux, the Hunspell library only provides a per-word spell-checking API; it doesn't handle arbitrary text. It also expects UTF8-encoded words. I deal with this by passing the string to the native spell-checkers in UTF16 (as V8 natively stores it), and for hunspell, transcoding to UTF8 one word at a time, so that I retain the UTF16 indices.

@maxbrunsfeld
Copy link
Contributor Author

I think this is ready. I'd love to get somebody else's 👀 on it.

@maxbrunsfeld
Copy link
Contributor Author

If anything, that seems preferable than doing the splitting ourselves.

Yeah, it looks like we can now spell-check words like cliché, which previously would have been discarded by our regex.

@maxbrunsfeld
Copy link
Contributor Author

Ok, seems to be working well on Windows. Gonna 🚢

maxbrunsfeld pushed a commit that referenced this pull request Jan 11, 2016
Add an API for finding all the misspelled words in a given string
@maxbrunsfeld maxbrunsfeld merged commit e511bfa into master Jan 11, 2016
@maxbrunsfeld maxbrunsfeld deleted the mb-add-bulk-checking-method branch January 11, 2016 23:33
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants