For Chinese, we should be able to load user dictionary using Jieba #275

Open
aash949 opened this issue Jul 4, 2022 · 2 comments

Comments


aash949 commented Jul 4, 2022

Jieba has a function for loading a user dictionary so that word segmentation better matches your dictionary of choice (e.g. the cc-cedict dictionary). Here's the function:

jieba.load_userdict(file_name)

I am proposing that, when Jieba is initialized, we check whether there is a userdict.txt file in dbs (like frequency.txt) and, if there is, use this function to load its contents before performing any word segmentation.

I haven't written much code since university, but I'll see whether I can implement this change myself. A minimal sketch of what the check could look like is below.
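(Hedged sketch: the dbs/userdict.txt path mirrors the frequency.txt mentioned above, and the init_jieba() hook is an assumption for illustration, not the project's actual code.)

```python
import os
import jieba

# Assumed location, alongside frequency.txt in this project's dbs folder.
USERDICT_PATH = os.path.join("dbs", "userdict.txt")

def init_jieba():
    """Load the user dictionary, if one exists, before any segmentation runs."""
    if os.path.isfile(USERDICT_PATH):
        # Jieba's userdict format is one entry per line:
        # "word [frequency] [part-of-speech tag]", the last two optional.
        jieba.load_userdict(USERDICT_PATH)

init_jieba()
print(jieba.lcut("这是一个测试"))  # userdict entries now influence segmentation
```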


ghost commented Oct 4, 2022

Is there any news on this?


aash949 commented Oct 5, 2022

> Is there any news on this?

Implementing this and achieving the desired result (or at least my desired result) could be more complicated than I first thought.

If you load a user dictionary with Jieba before performing word segmentation, segmentation improves relative to your dictionary, which is nice.

However, Jieba will continue to segment words the way it thinks words should be segmented rather than according to your dictionary.

What I often find is that Jieba treats two words, each with its own entry in your dictionary, as one longer word (i.e. a portmanteau), but that longer word isn't in your dictionary. This can be a bit annoying if you would rather learn the two words and their individual meanings separately.

I think the best way to resolve this is to load your dictionary, perform word segmentation, and then check, word by word, whether each word is in your dictionary. If a word is not in your dictionary, use Jieba's del_word(word) function to delete it (it is probably two words fused together, with no dictionary entry of its own), then run word segmentation again to see whether the two words are now segmented separately, each with a dictionary entry available. Something like the sketch below.
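(Hedged sketch: segment_with_dict, the dictionary set, and the deleted-word guard are my own illustration of this loop, not an API that Jieba provides.)

```python
import jieba

def segment_with_dict(text, dictionary):
    """Segment text, deleting any multi-character token that has no entry
    in `dictionary` (a set of headwords, e.g. from cc-cedict) and retrying."""
    deleted = set()
    words = jieba.lcut(text)
    while True:
        unknown = [w for w in words
                   if len(w) > 1 and w not in dictionary and w not in deleted]
        if not unknown:
            return words
        for w in unknown:
            jieba.del_word(w)  # zero the word's frequency in Jieba's model
            deleted.add(w)
        # Re-segment; a fused compound may now split into its two parts.
        # (HMM=False stops Jieba's HMM from re-discovering deleted compounds.)
        words = jieba.lcut(text, HMM=False)
```

Note that del_word mutates Jieba's global dictionary, so deletions made for one text persist into later calls; that shared state, plus the repeated re-segmentation, is where the slowdown would come from.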

I think this would slow things down a lot though.

Perhaps I'm overthinking this.
