English paragraphs in a Chinese detail page reduce extraction accuracy #22

Open

yjshi2015 opened this issue Jun 29, 2022 · 0 comments

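Description
Tested against the latest page from “故宫低调点” (see the attachment at the end). The extractor returns the “特别声明” (Special Statement) section instead of the actual article content.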

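Cause
That section is mostly English, so its "text density" comes out much higher than that of the Chinese nodes. English text is counted by character rather than by word: "hello world" counts as 10 (its non-whitespace characters), not 2. This gives English a clear advantage over Chinese in the count, which skews the "text density" metric and in turn the node's final score. The specific figures are as follows: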
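To make the bias concrete, here is a minimal sketch of character-based counting. The helper name `count_non_whitespace_chars` and the sample strings are illustrative assumptions, not the project's actual code; the sketch only assumes that whitespace is dropped before counting, which matches the "hello world" = 10 figure above.

```python
import re


def count_non_whitespace_chars(text: str) -> int:
    """Count every non-whitespace character (hypothetical helper)."""
    return len(re.sub(r"\s+", "", text))


# Two phrases of comparable meaning and reading length.
english = "hello world"
chinese = "你好世界"

english_count = count_non_whitespace_chars(english)  # one unit per letter
chinese_count = count_non_whitespace_chars(chinese)  # one unit per character

print(english_count, chinese_count)
```

Under this scheme the two-word English phrase outweighs the four-character Chinese phrase, so a node made up mostly of English can look denser than the real article body.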

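Proposed fix
If the page is predominantly Chinese, English paragraphs should be counted on the same basis as Chinese text, so that one standard applies: count words, not characters.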
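One possible way to put both languages on the same scale is sketched below. This is an assumption-laden illustration, not the change made in the project: `count_text_units`, the CJK range, and the word pattern are all hypothetical choices. Each CJK character counts as one unit and each run of Latin letters or digits counts as one unit.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block is treated as Chinese.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
# Assumption: a "word" is a run of ASCII letters/digits, optionally joined
# by an apostrophe or hyphen (e.g. "don't", "state-of-the-art").
LATIN_WORD = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*")


def count_text_units(text: str) -> int:
    """Count CJK characters and Latin words as one unit each.

    Punctuation and whitespace are ignored, which differs from a plain
    non-whitespace character count.
    """
    cjk_units = len(CJK_CHAR.findall(text))
    word_units = len(LATIN_WORD.findall(text))
    return cjk_units + word_units


print(count_text_units("hello world"))       # English counted by word
print(count_text_units("你好世界"))            # Chinese counted by character
print(count_text_units("你好 hello world"))   # mixed text
```

With this counting, an English paragraph no longer gains weight merely because its words are spelled with many letters. A real fix would also need to decide how to treat punctuation and other scripts.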

I reworked the two methods number_of_char and number_of_a_char along these lines and got the expected result, as follows:

Attachment
Page source; rename the extension to .html to view it.
gugong_detail.txt

yjshi2015 added the bug label

yjshi2015 assigned Germey

yjshi2015 mentioned this issue

Optimize the text character-count algorithm to handle English paragraphs #23

Merged
