For Chinese, we should be able to load user dictionary using Jieba #275

Open
aash949 opened this issue Jul 4, 2022 · 2 comments

Comments


aash949 commented Jul 4, 2022

Jieba has a function for loading a user dictionary so that word segmentation better matches your dictionary of choice (e.g. the cc-cedict dictionary). Here's the function:

jieba.load_userdict(file_name)

I am proposing that, when Jieba is initialized, we check whether there is a userdict.txt file in dbs (like frequency.txt) and, if there is, use this function to load its contents before performing any word segmentation.

I haven't written much code since university, but I'll see whether I can implement this change myself. A minimal sketch of what the check could look like is below.
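(Hedged sketch: the dbs/userdict.txt path mirrors the frequency.txt mentioned above, and the init_jieba() hook is an assumption for illustration, not the project's actual code.)

```python
import os
import jieba

# Assumed location, alongside frequency.txt in this project's dbs folder.
USERDICT_PATH = os.path.join("dbs", "userdict.txt")

def init_jieba():
    """Load the user dictionary, if one exists, before any segmentation runs."""
    if os.path.isfile(USERDICT_PATH):
        # Jieba's userdict format is one entry per line:
        # "word [frequency] [part-of-speech tag]", the last two optional.
        jieba.load_userdict(USERDICT_PATH)

init_jieba()
print(jieba.lcut("这是一个测试"))  # userdict entries now influence segmentation
```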


ghost commented Oct 4, 2022

Is there any news on this?


aash949 commented Oct 5, 2022

> Is there any news on this?

Implementing this and achieving the desired result (or at least my desired result) could be more complicated than I first thought.

If you load a user dictionary with Jieba before performing word segmentation, segmentation improves relative to your dictionary, which is nice.

However, Jieba will continue to segment words the way it thinks words should be segmented rather than according to your dictionary.

What I often find is that Jieba treats two words, each with its own entry in your dictionary, as one longer word (i.e. a portmanteau), but that longer word isn't in your dictionary. This can be a bit annoying if you would rather learn the two words and their individual meanings separately.

I think the best way to resolve this is to load your dictionary, perform word segmentation, and then check, word by word, whether each word is in your dictionary. If a word is not in your dictionary, use Jieba's del_word(word) function to delete it (it is probably two words fused together, with no dictionary entry of its own), then run word segmentation again to see whether the two words are now segmented separately, each with a dictionary entry available. Something like the sketch below.
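(Hedged sketch: segment_with_dict, the dictionary set, and the deleted-word guard are my own illustration of this loop, not an API that Jieba provides.)

```python
import jieba

def segment_with_dict(text, dictionary):
    """Segment text, deleting any multi-character token that has no entry
    in `dictionary` (a set of headwords, e.g. from cc-cedict) and retrying."""
    deleted = set()
    words = jieba.lcut(text)
    while True:
        unknown = [w for w in words
                   if len(w) > 1 and w not in dictionary and w not in deleted]
        if not unknown:
            return words
        for w in unknown:
            jieba.del_word(w)  # zero the word's frequency in Jieba's model
            deleted.add(w)
        # Re-segment; a fused compound may now split into its two parts.
        # (HMM=False stops Jieba's HMM from re-discovering deleted compounds.)
        words = jieba.lcut(text, HMM=False)
```

Note that del_word mutates Jieba's global dictionary, so deletions made for one text persist into later calls; that shared state, plus the repeated re-segmentation, is where the slowdown would come from.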

I think this would slow things down a lot though.

Perhaps I'm overthinking this.
