Hello, when I run create_pretraining_data.py on my own data I get: KeyError: '##cry'. Tracing it, the error is raised in convert_by_vocab:

```python
def convert_by_vocab(vocab, items):
    """Converts a sequence of [tokens|ids] using the vocab."""
    output = []
    for i, item in enumerate(items):
        # print(i, "item:", item)  # e.g. ##期
        output.append(vocab[item])  # KeyError here when item is not in the vocab
    return output
```

My guess is that some of the tokens produced after jieba Chinese word segmentation are not in the vocabulary.

The parameters I pass to create_pretraining_data.py are:
--do_lower_case=True --max_seq_length=40 --do_whole_word_mask=True --max_predictions_per_seq=20 --masked_lm_prob=0.15 --dupe_factor=3

The vocab file is BERT's.
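For debugging, here is a minimal defensive variant of convert_by_vocab that maps out-of-vocabulary tokens to [UNK] and prints them instead of crashing. This is only a sketch to help locate the offending tokens; the function name convert_by_vocab_safe and the [UNK] fallback are my own additions, not the repo's fix.

```python
def convert_by_vocab_safe(vocab, items, unk_token="[UNK]"):
    """Converts a sequence of [tokens|ids] using the vocab.

    Debugging sketch (not the repo's code): tokens missing from the vocab,
    such as '##cry' or '##dna' produced after jieba segmentation, are
    mapped to [UNK] and reported instead of raising KeyError.
    """
    output = []
    for item in items:
        if item not in vocab:
            print("token not in vocab, mapping to %s: %r" % (unk_token, item))
            item = unk_token
        output.append(vocab[item])
    return output
```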
I ran into the same problem. The workaround I went with: for vocabulary words that mix Chinese and English, skip the wwm (whole word masking) strategy.

Code change: in the get_new_segment function, change
if segment_str in seq_cws_dict:
to
if segment_str in seq_cws_dict and len(re.findall('[a-zA-Z]', segment_str))==0:
(see the sketch after this comment).

Example of the cause:
BERT tokenization: '顺', '利', '的', '无', '创', 'dna'
after jieba-based whole-word marking: '顺', '##利', '的', '无', '##创', '##dna'
Further down the pipeline, '##dna' is not in the BERT vocab, so the KeyError is raised.
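A minimal, self-contained illustration of the proposed condition. The helper name is_pure_chinese_segment is mine for readability; in the actual patch the check is written inline inside get_new_segment, and seq_cws_dict here is just a stand-in for the jieba segmentation lookup.

```python
import re

def is_pure_chinese_segment(segment_str, seq_cws_dict):
    """Proposed check: only treat segment_str as a whole word for WWM if it
    appears in the jieba segmentation dict AND contains no ASCII letters,
    so mixed Chinese-English words keep BERT's own subword pieces."""
    return (segment_str in seq_cws_dict
            and len(re.findall('[a-zA-Z]', segment_str)) == 0)

# Illustrative values: '无创dna' contains letters, so WWM is skipped for it
# and 'dna' is never rewritten to the out-of-vocab token '##dna'.
seq_cws_dict = {"顺利": 1, "无创dna": 1}
print(is_pure_chinese_segment("顺利", seq_cws_dict))     # True  -> apply WWM
print(is_pure_chinese_segment("无创dna", seq_cws_dict))  # False -> skip WWM
```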