Optimize the text length counting algorithm to handle English paragraphs #23

Merged
merged 1 commit into from
Dec 28, 2022

Conversation

yjshi2015
Contributor

@yjshi2015 yjshi2015 commented Jun 30, 2022

This PR optimizes the text length counting algorithm to address the problem reported in issue 22.

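Intended use cases: Chinese web pages, and Chinese web pages that contain English paragraphs.
If (number of English characters in text) / len(text) > 0.5, the text is treated as predominantly English and its length is measured by word count rather than character count, which in turn corrects the "text density" metric (0.5 is an empirical value). Otherwise the original logic applies.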
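
The rule above can be sketched roughly as follows. This is only an illustration, not the code from the merged commit: the function name `count_text_units`, the definition of an "English character" as an ASCII letter, and the regex used to split words are all assumptions made for the example.

```python
import re

# Empirical threshold quoted in the PR description.
ENGLISH_RATIO_THRESHOLD = 0.5


def count_text_units(text: str, threshold: float = ENGLISH_RATIO_THRESHOLD) -> int:
    """Hypothetical helper: return a length measure for `text`.

    If more than `threshold` of the characters are ASCII letters
    (an assumed definition of "English characters"), count words;
    otherwise fall back to the plain character count.
    """
    if not text:
        return 0

    # Count ASCII letters only; digits, spaces and punctuation are excluded.
    english_chars = sum(1 for ch in text if ch.isascii() and ch.isalpha())

    if english_chars / len(text) > threshold:
        # Predominantly English: count word-like tokens.
        return len(re.findall(r"[A-Za-z0-9']+", text))

    # Predominantly non-English (e.g. Chinese): count characters.
    return len(text)


print(count_text_units("The quick brown fox jumps over the lazy dog"))
print(count_text_units("今天天气很好"))
```

With this sketch, an English sentence contributes its word count to the density calculation, while a Chinese sentence still contributes one unit per character. Note that in the English branch of this sketch any non-English characters in the same text are not counted at all.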

@Germey Germey merged commit d65ddd0 into Gerapy:master Dec 28, 2022
@Germey
Member

Germey commented Dec 28, 2022

Thanks for the contribution. Sorry, I hadn't noticed this PR until now, so the merge is late.
