Correctly merge lowercase and uppercase bigrams #24

Open
wants to merge 1 commit into
base: master
@kvakil kvakil commented Oct 11, 2019

Some entries in the wordsegment/bigrams.txt file were duplicated.
Each bigram had been lowercased, but because some bigrams occurred in
both uppercase and lowercase forms, the same lowercased bigram appeared
twice. The code uses only one of these entries, so the frequency of
these bigrams was underestimated.
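
To illustrate the failure mode, here is a minimal sketch of how duplicate keys can silently drop counts. It assumes a tab-separated `bigram<TAB>count` line format and a loader that builds a plain dict; the function name `load_naive` and the counts are invented for illustration and are not taken from the library.

```python
# Toy input: the same bigram appears twice after lowercasing
# (e.g. one line came from "Hello World", one from "hello world").
# The counts are made up for this example.
lines = [
    "hello world\t100.0",
    "hello world\t40.0",
]


def load_naive(lines):
    """Build a dict keyed by bigram; a later duplicate overwrites an earlier one."""
    return {
        key: float(count)
        for key, count in (line.split("\t") for line in lines)
    }


counts = load_naive(lines)
print(counts["hello world"])  # 40.0, not the combined 140.0
```

Only the last entry survives, so the stored frequency is lower than the true combined count.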

The attached program lowercase_ngrams.py lowercases its input and
merges the frequencies of entries that become identical. The
wordsegment/bigrams.txt file was updated using this program. The
wordsegment/unigrams.txt file did not have this issue, so it was not
changed.
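
The merging step can be sketched roughly as follows. This is not the attached lowercase_ngrams.py itself, only an illustration of the idea; it assumes tab-separated `ngram<TAB>count` lines and that merging means summing the counts, and the name `merge_lowercased` is hypothetical.

```python
from collections import defaultdict


def merge_lowercased(lines):
    """Lowercase each n-gram and sum the counts of entries that collide.

    Assumes each non-empty line looks like "ngram<TAB>count".
    """
    totals = defaultdict(float)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        ngram, count = line.split("\t")
        totals[ngram.lower()] += float(count)
    return dict(totals)


# Invented counts, for illustration only.
merged = merge_lowercased([
    "Hello World\t100.0",
    "hello world\t40.0",
    "new york\t7.0",
])
print(merged)  # {'hello world': 140.0, 'new york': 7.0}
```

Summing before writing the file leaves one line per lowercased n-gram, so a loader that keeps a single entry per key no longer loses any counts.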

A new test was added to tests/test_coverage.py, showing that "helloworld"
is now correctly segmented as "hello world". Previous versions left
this unsegmented as "helloworld" because the frequency of the bigram
was underestimated.
