Add an API for finding all the misspelled words in a given string #27

maxbrunsfeld · 2016-01-05T00:39:36Z

Fixes atom/spell-check#99
Fixes atom/spell-check#53
Supersedes atom/spell-check#100

Depends on #28
Refs atom/spell-check#53
Refs atom/atom#8908

When opening a large plain text file, Atom's spell check task takes a very long time to process the file. When I open /usr/share/dict/words, which contains 235,886 words, one per line, the spell check task runs for 95 seconds.

Source of the slowness

On Mac, spell checking is implemented by calling into the central AppleSpell process, so there is some IPC overhead for each spell-checking call.

There seems to be some overhead for each spell-check call on Windows too, as I'm seeing a 2X improvement there. On Linux, our existing code was already fine.

Solution

This PR adds a new native API, Spellchecker.checkSpelling(string), which takes a multi-word string and returns an array of character ranges representing all of the misspelled words. This way, the spell-checking can be performed in a single shot.
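To illustrate the contract described above, here is a minimal sketch of how a caller might consume the returned ranges. The `{start, end}` range shape, the `checkSpellingStub` function, and its tiny dictionary are assumptions made for illustration; the stub only stands in for the native call and is not the implementation in this PR.

```typescript
// Assumed shape of one returned range: offsets into the checked string.
interface MisspelledRange {
  start: number; // index of the first character of the misspelled word
  end: number;   // index just past its last character
}

// Hypothetical stand-in for the native bulk call. It checks the whole
// string in one pass and returns a range for every unknown word.
function checkSpellingStub(text: string, dictionary: string[]): MisspelledRange[] {
  const ranges: MisspelledRange[] = [];
  const wordPattern = /[A-Za-z]+/g;
  let match: RegExpExecArray | null;
  while ((match = wordPattern.exec(text)) !== null) {
    if (dictionary.indexOf(match[0].toLowerCase()) === -1) {
      ranges.push({ start: match.index, end: match.index + match[0].length });
    }
  }
  return ranges;
}

// One call for the whole text, instead of one call per word.
const text = "cat caat dog dooog";
const ranges = checkSpellingStub(text, ["cat", "dog"]);

// The caller recovers the misspelled words by slicing with the ranges.
const misspelled = ranges.map((r) => text.slice(r.start, r.end));
```

The point of the shape is that the caller crosses the JS/native boundary once per buffer rather than once per word.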

TODO

Mac
Windows
Linux
Account for paired characters (surrogate pairs), since native spell checkers will return indices in terms of logical code points, not JavaScript's 2-byte UTF-16 code units.
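The paired-character concern can be sketched as follows. This is only an illustration of the index mismatch, assuming a checker that reports offsets in code points; `codePointToUtf16Offset` is a hypothetical helper, not a function from this PR.

```typescript
// Convert an offset counted in Unicode code points into an offset counted
// in UTF-16 code units, which is what JavaScript string indexing uses.
function codePointToUtf16Offset(text: string, codePointOffset: number): number {
  let utf16 = 0;
  let remaining = codePointOffset;
  while (remaining > 0 && utf16 < text.length) {
    const unit = text.charCodeAt(utf16);
    const isHighSurrogate = unit >= 0xd800 && unit <= 0xdbff;
    const next = utf16 + 1 < text.length ? text.charCodeAt(utf16 + 1) : 0;
    const isPair = isHighSurrogate && next >= 0xdc00 && next <= 0xdfff;
    // A surrogate pair is one code point but two UTF-16 code units.
    utf16 += isPair ? 2 : 1;
    remaining--;
  }
  return utf16;
}

const sample = "😎 cat caat dog dooog";

// Counted in code points, "caat" spans offsets 6 to 10.
const start = codePointToUtf16Offset(sample, 6);
const end = codePointToUtf16Offset(sample, 10);
const word = sample.slice(start, end);

// Using the code point offsets directly is off by one after the emoji.
const naive = sample.slice(6, 10);
```

Here `word` is `"caat"`, while `naive` starts one code unit too early because the emoji occupies two code units.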

Speedup

On my machine, spell-checking /usr/share/dict/words now takes about 11 seconds: ~9X faster than before. This is now short enough that my CPU fan stays quiet.

Questions

This may cause some subtle behavior change, because the platform's spell-checking library will now be in charge of partitioning the text into words, rather than handling that in JS. This doesn't seem like a huge problem to me, but maybe someone else has some insight into this.

/cc @atom/feedback

jeancroy · 2016-01-06T02:45:41Z

Only semi-related, and I can open an issue for it if needed, but do we know whether the spellchecker / C side uses the same encoding as JavaScript?

There's one report where the correction of cliche to cliché was inserted as clich� on a utf-8 document on linux.

joshaber · 2016-01-06T15:29:54Z

Fantastic 🤘

This may cause some subtle behavior change, because the platform's spell-checking library will now be in charge of partitioning the text into words, rather than handling that in JS.

If anything, that seems preferable to doing the splitting ourselves.

maxbrunsfeld · 2016-01-08T17:38:15Z

spec/spellchecker-spec.coffee

+      ]
+
+    it "accounts for UTF16 pairs", ->
+      string = "😎 cat caat dog dooog"


For Mac and Windows, this didn't require any extra work, because the NSRanges returned by NSSpellChecker (and probably all NSString APIs) seem to refer to UTF-16 code unit indices, as opposed to logical character indices, and the same applies to the Windows spell-check APIs.

For Linux, the Hunspell library only provides a per-word spell-checking API; it doesn't handle arbitrary text. It also expects UTF8-encoded words. I deal with this by passing the string to the native spell-checkers in UTF16 (as V8 natively stores it), and for hunspell, transcoding to UTF8 one word at a time, so that I retain the UTF16 indices.
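The word-at-a-time strategy described above can be sketched roughly like this. It is a simplified illustration, not the native code from this PR: `isWordCorrectUtf8` is a hypothetical stand-in for a per-word checker that expects UTF-8, the dictionary is made up, and treating every non-ASCII-letter as a separator is an assumption that ignores the apostrophe handling mentioned in the commits.

```typescript
interface MisspelledRange {
  start: number; // UTF-16 code unit offset of the word start
  end: number;   // UTF-16 code unit offset just past the word
}

// Hypothetical per-word checker that, like the library described above,
// receives a single UTF-8 encoded word.
function isWordCorrectUtf8(utf8Word: Uint8Array, dictionary: string[]): boolean {
  const decoded = new TextDecoder().decode(utf8Word);
  return dictionary.indexOf(decoded.toLowerCase()) !== -1;
}

function isAsciiLetter(unit: number): boolean {
  return (unit >= 0x41 && unit <= 0x5a) || (unit >= 0x61 && unit <= 0x7a);
}

// Walk the string by UTF-16 code unit so the recorded offsets stay valid
// for the JavaScript caller, and transcode only one word at a time.
function checkSpellingWordByWord(text: string, dictionary: string[]): MisspelledRange[] {
  const encoder = new TextEncoder();
  const ranges: MisspelledRange[] = [];
  let i = 0;
  while (i < text.length) {
    if (!isAsciiLetter(text.charCodeAt(i))) {
      i++; // separators, including both halves of a surrogate pair
      continue;
    }
    const start = i;
    while (i < text.length && isAsciiLetter(text.charCodeAt(i))) {
      i++;
    }
    const utf8Word = encoder.encode(text.slice(start, i));
    if (!isWordCorrectUtf8(utf8Word, dictionary)) {
      ranges.push({ start: start, end: i });
    }
  }
  return ranges;
}

const found = checkSpellingWordByWord("😎 cat caat dog dooog", ["cat", "dog"]);
```

Because the offsets are recorded before each word is transcoded, the UTF-8 conversion never affects the indices handed back to JavaScript.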

maxbrunsfeld · 2016-01-08T17:50:40Z

I think this is ready. I'd love to get somebody else's 👀 on it.

maxbrunsfeld · 2016-01-08T18:25:22Z

If anything, that seems preferable to doing the splitting ourselves.

Yeah, it looks like we can now spell-check words like cliché, which previously would have been discarded by our regex.
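As a rough illustration of how a JS-side word pattern can lose a word like this, here is a sketch using an ASCII-only pattern. The pattern below is an assumption chosen for the example; the actual regex used by the spell-check package is not shown in this thread.

```typescript
// Hypothetical ASCII-only word pattern, for illustration only.
const asciiWordPattern = /[A-Za-z]+/g;

// The accented letter is not matched, so the word is cut short and the
// full word "cliché" never reaches the spell checker intact.
const matches = "a cliché phrase".match(asciiWordPattern) || [];
const sawFullWord = matches.indexOf("cliché") !== -1;
```

When the platform's spell checker does the word partitioning instead, this kind of pattern no longer decides which words get checked.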

maxbrunsfeld · 2016-01-11T23:33:22Z

Ok, seems to be working well on Windows. Gonna 🚢

maxbrunsfeld mentioned this pull request Jan 5, 2016

Speed up spell checking atom/spell-check#102

Merged

maxbrunsfeld force-pushed the mb-add-bulk-checking-method branch from ad3e2d9 to 5cff726 Compare January 5, 2016 21:06

maxbrunsfeld mentioned this pull request Jan 6, 2016

high CPU usage with spell checker (Mac) atom/atom#10306

Closed

maxbrunsfeld force-pushed the master branch 8 times, most recently from 3d3dacc to 6bb8b4c Compare January 6, 2016 19:34

maxbrunsfeld force-pushed the mb-add-bulk-checking-method branch 5 times, most recently from ad37f1a to b65626a Compare January 6, 2016 21:12

maxbrunsfeld added 8 commits January 6, 2016 13:17

Add Mac impl for bulk spell-checking function

de71952

This function will return an array of character ranges, indicating where *all* of the misspelled words are in a given string.

Add stub hunspell impl for bulk spell-checking function

063ef56

Add windows impl for bulk spell-checking function

db0e388

🎨

dae76c8

Pass UTF16-encoded string to CheckSpelling

8609f76

Add real hunspell impl for bulk spell-checking function

f51154c

Fix MSVS warnings

1bc7ad2

Add spec for handling paired characters

ab01262

maxbrunsfeld force-pushed the mb-add-bulk-checking-method branch from b65626a to ab01262 Compare January 6, 2016 21:18

maxbrunsfeld added 2 commits January 6, 2016 14:00

Add test for non-word characters

0d2fe14

Handle invalid inputs to bulk spell-checking function

0824cd3

maxbrunsfeld force-pushed the mb-add-bulk-checking-method branch from 5e56852 to 0824cd3 Compare January 6, 2016 22:01

maxbrunsfeld added 2 commits January 6, 2016 19:54

In CheckSpelling, leave room for the terminating NULL

7f601b7

Clean up hunspell CheckSpelling

84fb4af

maxbrunsfeld force-pushed the mb-add-bulk-checking-method branch from db24c77 to 3085587 Compare January 7, 2016 19:09

Test hunspell implementation on Windows CI

9aff496

maxbrunsfeld force-pushed the mb-add-bulk-checking-method branch 6 times, most recently from 49dbbd4 to 531ec95 Compare January 7, 2016 20:16

Use std libraries for UTF16 -> UTF8 conversion in hunspell spellchecker

4262eb3

maxbrunsfeld force-pushed the mb-add-bulk-checking-method branch 4 times, most recently from 2fa7057 to 777cf8c Compare January 7, 2016 21:23

In hunspell, handle apostrophes, ignore words w/ non-english letters

5f11ffd

maxbrunsfeld force-pushed the mb-add-bulk-checking-method branch from 777cf8c to 5f11ffd Compare January 7, 2016 22:23

maxbrunsfeld reviewed Jan 8, 2016

3.2.0-0

b47f706

maxbrunsfeld pushed a commit that referenced this pull request Jan 11, 2016

Merge pull request #27 from atom/mb-add-bulk-checking-method

e511bfa

Add an API for finding all the misspelled words in a given string

maxbrunsfeld merged commit e511bfa into master Jan 11, 2016

maxbrunsfeld deleted the mb-add-bulk-checking-method branch January 11, 2016 23:33

winstliu mentioned this pull request Feb 2, 2016

Atom Helper uses 70% CPU after a opening large plain text file is opened atom/atom#8908

Closed

jeancroy mentioned this pull request May 2, 2016

Changed spell-checking to be plugin-based. atom/spell-check#120

Merged
