-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Quick check implementation #87
base: master
Are you sure you want to change the base?
Conversation
Latest benchmarkAll unicode-transforms-text NFD/AllChars: OK (2.76s) 5.41 ms ± 126 μs, 7% faster than baseline NFD/Deutsch: OK (0.59s) 2.27 ms ± 66 μs, 27% faster than baseline NFD/Devanagari: OK (0.33s) 5.09 ms ± 184 μs NFD/English: OK (13.85s) 1.67 ms ± 11 μs, 32% faster than baseline NFD/Japanese: OK (0.42s) 6.49 ms ± 227 μs NFD/Korean: OK (1.22s) 9.58 ms ± 99 μs NFD/Vietnamese: OK (0.30s) 4.68 ms ± 179 μs, 17% faster than baseline NFKD/AllChars: OK (1.11s) 8.68 ms ± 201 μs, 11% faster than baseline NFKD/Deutsch: OK (0.63s) 2.42 ms ± 65 μs, 32% faster than baseline NFKD/Devanagari: OK (0.35s) 5.37 ms ± 207 μs NFKD/English: OK (4.06s) 2.00 ms ± 24 μs, 34% faster than baseline NFKD/Japanese: OK (0.47s) 7.51 ms ± 238 μs, 8% slower than baseline NFKD/Korean: OK (0.64s) 10.0 ms ± 292 μs NFKD/Vietnamese: OK (0.63s) 4.95 ms ± 167 μs, 18% faster than baseline NFC/AllChars: OK (1.24s) 9.80 ms ± 92 μs, 10% slower than baseline NFC/Deutsch: OK (0.53s) 4.16 ms ± 99 μs, 14% slower than baseline NFC/Devanagari: OK (0.82s) 6.39 ms ± 92 μs, 8% slower than baseline NFC/English: OK (0.44s) 3.40 ms ± 121 μs, 18% slower than baseline NFC/Japanese: OK (0.64s) 10.1 ms ± 186 μs NFC/Korean: OK (0.31s) 4.89 ms ± 174 μs, 3% slower than baseline NFC/Vietnamese: OK (0.39s) 12.5 ms ± 463 μs, 6% slower than baseline NFKC/AllChars: OK (0.44s) 14.0 ms ± 368 μs NFKC/Deutsch: OK (0.32s) 5.28 ms ± 210 μs, 31% slower than baseline NFKC/Devanagari: OK (0.42s) 6.52 ms ± 170 μs, 13% slower than baseline NFKC/English: OK (0.47s) 3.64 ms ± 103 μs, 41% slower than baseline NFKC/Japanese: OK (0.36s) 11.7 ms ± 464 μs NFKC/Korean: OK (1.40s) 5.53 ms ± 64 μs, 18% slower than baseline NFKC/Vietnamese: OK (0.40s) 12.8 ms ± 447 μs, 6% slower than baseline So some improvements for decomposition but composition is slower. NFKC is also failling time to time when compared to ICU. Edited: use |
e188506
to
6b50a63
Compare
ab9a608
to
bf85902
Compare
@harendra-kumar I managed to obtain nice improvements: Benchmark resultsAll unicode-transforms-text NFD/AllChars: OK (1.31s) 5.13 ms ± 66 μs, 9.2 MB allocated, 815 B copied, 31 MB peak memory, 0.72x, 17% faster than baseline NFD/Deutsch: OK (1.12s) 2.19 ms ± 24 μs, 6.3 MB allocated, 807 B copied, 31 MB peak memory, 0.62x, 33% faster than baseline NFD/Devanagari: OK (1.21s) 4.70 ms ± 44 μs, 5.7 MB allocated, 799 B copied, 31 MB peak memory, 0.61x, 7% faster than baseline NFD/English: OK (1.70s) 1.65 ms ± 17 μs, 5.7 MB allocated, 791 B copied, 31 MB peak memory, 0.54x, 41% faster than baseline NFD/Japanese: OK (0.77s) 6.02 ms ± 84 μs, 9.3 MB allocated, 775 B copied, 31 MB peak memory, 0.70x, 2% faster than baseline NFD/Korean: OK (1.25s) 9.71 ms ± 154 μs, 15 MB allocated, 1.6 KB copied, 32 MB peak memory, 0.32x NFD/Vietnamese: OK (1.59s) 6.21 ms ± 44 μs, 9.3 MB allocated, 759 B copied, 32 MB peak memory, 1.12x, 11% faster than baseline NFKD/AllChars: OK (1.14s) 8.97 ms ± 101 μs, 12 MB allocated, 1.5 KB copied, 32 MB peak memory, 0.94x, 9% faster than baseline NFKD/Deutsch: OK (1.40s) 2.71 ms ± 50 μs, 6.3 MB allocated, 743 B copied, 32 MB peak memory, 0.77x, 23% faster than baseline NFKD/Devanagari: OK (0.63s) 4.91 ms ± 96 μs, 5.7 MB allocated, 735 B copied, 32 MB peak memory, 0.63x, 9% faster than baseline NFKD/English: OK (1.14s) 2.20 ms ± 22 μs, 5.7 MB allocated, 727 B copied, 32 MB peak memory, 0.72x, 27% faster than baseline NFKD/Japanese: OK (0.85s) 6.64 ms ± 117 μs, 9.7 MB allocated, 729 B copied, 32 MB peak memory, 0.70x, 3% faster than baseline NFKD/Korean: OK (1.27s) 10.0 ms ± 97 μs, 15 MB allocated, 1.5 KB copied, 32 MB peak memory, 0.33x NFKD/Vietnamese: OK (1.68s) 6.57 ms ± 45 μs, 9.3 MB allocated, 703 B copied, 32 MB peak memory, 0.62x, 8% faster than baseline NFC/AllChars: OK (1.14s) 4.51 ms ± 57 μs, 3.8 MB allocated, 347 B copied, 32 MB peak memory, 1.32x, 54% faster than baseline NFC/Deutsch: OK (1.01s) 1.98 ms ± 38 μs, 1.9 MB allocated, 227 B copied, 32 MB peak memory, 1.30x, 47% faster than baseline NFC/Devanagari: OK (1.48s) 5.77 ms ± 45 μs, 6.1 MB allocated, 679 B copied, 32 MB peak memory, 0.71x, 3% faster than baseline NFC/English: OK (2.04s) 1.97 ms ± 18 μs, 1.9 MB allocated, 223 B copied, 32 MB peak memory, 1.28x, 23% faster than baseline NFC/Japanese: OK (1.60s) 3.14 ms ± 24 μs, 1.9 MB allocated, 219 B copied, 32 MB peak memory, 1.25x, 68% faster than baseline NFC/Korean: OK (1.12s) 4.37 ms ± 56 μs, 1.9 MB allocated, 215 B copied, 32 MB peak memory, 1.10x, 2% faster than baseline NFC/Vietnamese: OK (1.56s) 12.3 ms ± 205 μs, 11 MB allocated, 1.4 KB copied, 32 MB peak memory, 0.86x, 25% faster than baseline NFKC/AllChars: OK (1.21s) 9.58 ms ± 86 μs, 13 MB allocated, 1.2 KB copied, 32 MB peak memory, 1.05x, 36% faster than baseline NFKC/Deutsch: OK (1.33s) 2.58 ms ± 27 μs, 5.7 MB allocated, 623 B copied, 32 MB peak memory, 0.92x, 40% faster than baseline NFKC/Devanagari: OK (0.73s) 5.73 ms ± 101 μs, 6.1 MB allocated, 615 B copied, 32 MB peak memory, 0.71x, 10% faster than baseline NFKC/English: OK (1.03s) 1.98 ms ± 29 μs, 1.9 MB allocated, 201 B copied, 32 MB peak memory, 1.32x, 15% faster than baseline NFKC/Japanese: OK (4.59s) 4.58 ms ± 17 μs, 6.7 MB allocated, 599 B copied, 32 MB peak memory, 0.94x, 64% faster than baseline NFKC/Korean: OK (1.36s) 5.33 ms ± 51 μs, 5.7 MB allocated, 591 B copied, 32 MB peak memory, 0.71x, 4% slower than baseline NFKC/Vietnamese: OK (0.75s) 12.1 ms ± 234 μs, 11 MB allocated, 1.3 KB copied, 32 MB peak memory, 0.84x, 31% faster than baseline Some remarks:
|
@wismill these results look great! Considering that we were already comparable or faster than text-icu in most cases. Now we are even better. Also we are now comparable in the cases where we were slower earlier (NFC and NFKC for Japanese, Deutsch and AllChars). I will take a better look when I get some time. |
I updated the benchmark results to include allocated memory & diff with |
I am looking at it, will update soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can put Data/Unicode/Internal/Char/DerivedNormalizationProperties.hs in the unicode-data package. Even though as of now it may be custom code for consumption of unicode-transforms. If other consumers of this data arrive at some later point we can try adapting the requirements of all consumers in the same module. In the best case other consumers may also find this format useful without having to make any changes it.
We can keep it Internal/experimental to begin with.
I did not review the QC data generation. We are relying on your due diligence and the tests.
ComposeStarter s -> {-# SCC compose_YesStarter #-} case quickCheck ch of | ||
-- QC = Yes, starter (includes Jamo L & Hangul syllables), | ||
-- may decompose, may compose with next | ||
QC.YesStarter -> {-# SCC compose_YesStarter_YesStarter #-} do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not looked carefully into QC. I assume that we will be never here if the two starters can combine. Perhaps that will go in the QC.combining case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QC.YesStarter
means the second starter cannot combine with the previous, but can only with the next character.
-- Pending decomposition | ||
| UC.isDecomposable mode s -> | ||
{-# SCC compose_YesStarter_Decomposable_decomp #-} | ||
go (UC.decompose mode s ++ UC.decompose mode ch) i ComposeNone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am hoping that these string appends won't be costly as the strings are small and there is only one append at max.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was similar code in the original version. I do not think it is avoidable.
-- Jamo L + jamo V | ||
| UC.jamoLFirst <= cp && cp <= UC.jamoLLast && | ||
UC.jamoVFirst <= ich && ich <= UC.jamoVLast -> | ||
pure (i, composeJamoL s ch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The where
block defining cp
can be moved here as this is the only use of cp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not manage to find the correct syntax.
UC.jamoVFirst <= ich && ich <= UC.jamoVLast -> | ||
pure (i, composeJamoL s ch) | ||
-- Hangul LV + T | ||
| UC.isHangul s && UC.isHangulLV s && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should clarify isHangulLV
documentation - it should say that the char being tested must be a Hangul char, it does not check whether it is Hangul or not.
|
||
-- Recursive decomposition | ||
go [] !i !st = pure (i, st) | ||
go (ch : rest) i st = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can check if making the list strict and adding a SPEC argument helps any of the benchmarks.
@@ -75,6 +74,9 @@ library | |||
Data.Unicode.Types | |||
Data.Text.Normalize | |||
|
|||
-- Internal | |||
Data.Unicode.Internal.Char.DerivedNormalizationProperties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can go in the other modules section below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not possible, because it is used in tests.
I see the following benchmarks getting slower:
Mainly Korean has significant regression. We can try to review and see if there are any opportunities to correct these. |
thanks
Your judgement, if you think they might be helpful keep them.
We can still have it in unicode-data. It is not necessary to use the same encoding as in the original db, but whaetver seems generally useful as per the algo should be ok.
I do not know. Unless someone requests it. |
We are faster than text-icu in all cases except the following:
We are slower only in some NFC cases and some of those are the ones that regressed with this change. So maybe we can improve those. |
I do not observe these results. Mine are all same or faster than baseline (ghc 9.2.4). Results (NNNx is compared to ICU, NNN% is against baseline)All unicode-transforms-te NFD/AllChars: OK (0.37s) 5.73 ms ± 207 μs, 0.76x, 7% less than baseline NFD/Deutsch: OK (0.76s) 2.94 ms ± 85 μs, 0.80x, 17% less than baseline NFD/Devanagari: OK (0.33s) 5.17 ms ± 189 μs, 0.66x, same as baseline NFD/English: OK (0.33s) 2.49 ms ± 90 μs, 0.79x, 18% less than baseline NFD/Japanese: OK (0.41s) 6.46 ms ± 172 μs, 0.69x, 7% more than baseline NFD/Korean: OK (0.62s) 9.86 ms ± 183 μs, 0.32x, same as baseline NFD/Vietnamese: OK (0.44s) 6.88 ms ± 216 μs, 1.21x, same as baseline NFKD/AllChars: OK (0.54s) 8.51 ms ± 271 μs, 0.85x, 10% less than baseline NFKD/Deutsch: OK (0.62s) 2.40 ms ± 81 μs, 0.67x, 26% less than baseline NFKD/Devanagari: OK (0.59s) 4.66 ms ± 109 μs, 0.60x, 13% less than baseline NFKD/English: OK (0.50s) 1.92 ms ± 45 μs, 0.61x, 30% less than baseline NFKD/Japanese: OK (0.83s) 6.56 ms ± 107 μs, 0.64x, 6% less than baseline NFKD/Korean: OK (0.61s) 9.66 ms ± 311 μs, 0.31x, same as baseline NFKD/Vietnamese: OK (0.39s) 6.06 ms ± 169 μs, 0.55x, 13% less than baseline NFC/AllChars: OK (0.45s) 3.53 ms ± 87 μs, 0.97x, 63% less than baseline NFC/Deutsch: OK (0.38s) 1.43 ms ± 48 μs, 0.88x, 62% less than baseline NFC/Devanagari: OK (0.67s) 5.23 ms ± 120 μs, 0.63x, 12% less than baseline NFC/English: OK (0.38s) 1.42 ms ± 56 μs, 0.86x, 49% less than baseline NFC/Japanese: OK (0.33s) 2.62 ms ± 90 μs, 0.98x, 75% less than baseline NFC/Korean: OK (0.53s) 4.08 ms ± 89 μs, 1.00x, 14% less than baseline NFC/Vietnamese: OK (0.73s) 11.8 ms ± 307 μs, 0.82x, 24% less than baseline NFKC/AllChars: OK (0.56s) 9.12 ms ± 202 μs, 0.97x, 36% less than baseline NFKC/Deutsch: OK (0.66s) 2.55 ms ± 56 μs, 0.87x, 37% less than baseline NFKC/Devanagari: OK (0.72s) 5.55 ms ± 94 μs, 0.68x, 7% less than baseline NFKC/English: OK (0.52s) 1.98 ms ± 45 μs, 1.23x, 30% less than baseline NFKC/Japanese: OK (0.47s) 3.77 ms ± 111 μs, 0.75x, 67% less than baseline NFKC/Korean: OK (0.30s) 4.83 ms ± 177 μs, 0.64x, 6% less than baseline NFKC/Vietnamese: OK (0.74s) 12.0 ms ± 255 μs, 0.83x, 22% less than baseline You may want to use |
It may depend on the CPU as well. I tested on Linux, AWS VM - which says "model name : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz" in "/proc/cpuinfo". Which CPU/Hardware are you testing on? |
Sure thing. I tested on a laptop equipped with AMD Ryzen 5 2500U @ 2.0GHz, on Linux. |
2955acc
to
91fdefb
Compare
I tested on a different machine: Results with GHC 9.4.1 on 4 × Intel® Core™ i5-3340M CPU @ 2.70GHzAll unicode-transforms-text NFD/AllChars: OK (2.83s) 11.0 ms ± 62 μs, 5% less than baseline NFD/Deutsch: OK (0.87s) 6.81 ms ± 121 μs, 5% less than baseline NFD/Devanagari: OK (1.56s) 12.2 ms ± 206 μs, 2% less than baseline NFD/English: OK (1.60s) 6.23 ms ± 51 μs, 4% less than baseline NFD/Japanese: OK (0.97s) 15.3 ms ± 267 μs, same as baseline NFD/Korean: OK (1.19s) 18.8 ms ± 270 μs, 4% more than baseline NFD/Vietnamese: OK (1.50s) 11.8 ms ± 93 μs, 4% less than baseline NFKD/AllChars: OK (0.98s) 15.4 ms ± 289 μs, 3% less than baseline NFKD/Deutsch: OK (1.91s) 7.67 ms ± 98 μs, 2% more than baseline NFKD/Devanagari: OK (1.59s) 12.2 ms ± 112 μs, 2% less than baseline NFKD/English: OK (0.77s) 5.99 ms ± 102 μs, 12% less than baseline NFKD/Japanese: OK (0.99s) 15.7 ms ± 192 μs, 1% less than baseline NFKD/Korean: OK (1.19s) 18.7 ms ± 363 μs, 3% more than baseline NFKD/Vietnamese: OK (2.96s) 11.6 ms ± 175 μs, 4% less than baseline NFC/AllChars: OK (5.88s) 11.5 ms ± 95 μs, 43% less than baseline NFC/Deutsch: OK (2.12s) 8.24 ms ± 111 μs, 28% less than baseline NFC/Devanagari: OK (1.93s) 15.2 ms ± 175 μs, 18% less than baseline NFC/English: OK (2.09s) 8.16 ms ± 144 μs, 21% less than baseline NFC/Japanese: OK (3.00s) 11.7 ms ± 123 μs, 50% less than baseline NFC/Korean: OK (1.80s) 14.1 ms ± 195 μs, 2% more than baseline NFC/Vietnamese: OK (0.61s) 19.5 ms ± 351 μs, 21% less than baseline NFKC/AllChars: OK (0.99s) 15.6 ms ± 220 μs, 34% less than baseline NFKC/Deutsch: OK (0.56s) 8.81 ms ± 169 μs, 22% less than baseline NFKC/Devanagari: OK (0.97s) 15.2 ms ± 169 μs, 16% less than baseline NFKC/English: OK (1.09s) 8.49 ms ± 96 μs, 16% less than baseline NFKC/Japanese: OK (0.80s) 12.6 ms ± 212 μs, 45% less than baseline NFKC/Korean: OK (0.91s) 14.6 ms ± 250 μs, 2% more than baseline NFKC/Vietnamese: OK (1.24s) 19.5 ms ± 218 μs, 21% less than baseline @harendra-kumar I observe some regressions but not as close as the one you got. Could you please re-run the benchmark carefully? |
e601c68
to
68b5bf6
Compare
@wismill can you rebase this on master? |
68b5bf6
to
c89cb0d
Compare
@harendra-kumar done |
Sorry for a long delay. I benchmarked again, the benchmarks look good. It seems tests are failing in the CIs. Can you take a look at those and make these pass? |
Implement Quick Check algorithm.
This is very much WIP; it is intended to open the discussion on the implementation.
It relies on an ongoing PR of
unicode-data
.Releaseunicode-data
with QuickCheck properties.Deletechangecabal.project
;stack.yaml
andto uses390x.yaml
unicode-data
release.Fixes #1