Correctly merge lowercase and uppercase bigrams #24
Some entries in the wordsegment/bigrams.txt file were duplicated.
Every bigram in the file had been lowercased, but because some bigrams
originally appeared in both an uppercase and a lowercase form, the same
lowercase bigram ended up listed twice. The code uses only one of these
entries, so the frequency of these bigrams was underestimated.
The attached program lowercase_ngrams.py lowercases its input and sums
the frequencies of entries that become identical. The
wordsegment/bigrams.txt file was regenerated using this program. The
wordsegment/unigrams.txt file did not have this issue, so it was not
changed.
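
The merge step can be sketched roughly as follows. This is an illustration of the idea, not the contents of lowercase_ngrams.py: the function name `merge_lowercased`, the sample counts, and the assumed line format (an n-gram, a tab, then a count) are all hypothetical.

```python
from collections import Counter


def merge_lowercased(lines):
    """Lowercase each n-gram and sum the counts of entries that collide.

    Assumes each line looks like "<ngram>\t<count>" (an assumption about
    the file format, made for this sketch).
    """
    totals = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the n-gram itself may contain spaces.
        ngram, count = line.rsplit("\t", 1)
        totals[ngram.lower()] += float(count)
    return totals


# Made-up sample data: the first two entries collide once lowercased.
sample = [
    "Hello World\t30\n",
    "hello world\t70\n",
    "of the\t500\n",
]
merged = merge_lowercased(sample)
for ngram, count in sorted(merged.items()):
    print(f"{ngram}\t{count:.0f}")
```

Keeping only one of the two colliding entries would record 30 or 70 for "hello world" in this sample; summing them gives 100, which is the kind of correction described above.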
A new test was added to tests/test_coverage.py, showing that "helloworld"
is now correctly segmented as "hello world". Previous versions left it
unsplit as "helloworld" because the frequency of the bigram was
underestimated.