English paragraphs in a Chinese detail page reduce extraction accuracy #22

Open
yjshi2015 opened this issue Jun 29, 2022 · 0 comments
Labels
bug Something isn't working

yjshi2015 commented Jun 29, 2022

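Description
Using the latest “故宫低调点” page (see the attachment at the end), the extractor returns the “特别声明” (special statement) section instead of the actual article content.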

detail_extract

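Cause
That section is mostly English, so its "text density" comes out much higher than that of the nodes containing Chinese text. English text is counted by character rather than by word: for example, "hello world" counts as 10, not 2. This gives English a clear advantage over Chinese in the count, so the text density metric is skewed, which in turn distorts the node's final score. The specific figures are as follows: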
img
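
To make the skew concrete, here is a minimal sketch of per-character counting. The helper `count_chars` is hypothetical and only mimics the behaviour described above (whitespace is skipped, which is how "hello world" comes to 10); it is not the project's actual implementation.

```python
def count_chars(text: str) -> int:
    """Count every non-whitespace character (hypothetical stand-in
    for the per-character counting described in this issue)."""
    return sum(1 for ch in text if not ch.isspace())


english = "hello world"   # two words
chinese = "你好世界"        # roughly the same meaning, four characters

print(count_chars(english))  # 10
print(count_chars(chinese))  # 4
```

Two English words outweigh four Chinese characters by more than a factor of two, so a node made up of English text gets an inflated count before any density is computed.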

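Proposed fix
If the page is mainly Chinese, English paragraphs should be counted by the same standard as the Chinese text: by number of words rather than by number of characters.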

I reworked the two methods number_of_char and number_of_a_char along these lines and got the expected result, as shown here:
img_1
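
The counting rule proposed here could look roughly like the sketch below. It is not the actual change made to number_of_char or number_of_a_char; `count_units`, the regular expressions, and the choice to ignore punctuation are all assumptions made for illustration. Each CJK ideograph counts as one unit and each run of Latin letters or digits counts as one unit.

```python
import re

# Assumption: "Chinese character" means the CJK Unified Ideographs block.
_CJK = r"[\u4e00-\u9fff]"
# Assumption: an English word is a run of ASCII letters/digits, optionally
# joined by an apostrophe or hyphen (e.g. "don't", "well-known").
_WORD = r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*"
_TOKEN = re.compile(f"{_CJK}|{_WORD}")


def count_units(text: str) -> int:
    """Count one unit per CJK ideograph and one per Latin word.

    Punctuation and whitespace are ignored in this sketch.
    """
    return len(_TOKEN.findall(text))


print(count_units("hello world"))       # 2
print(count_units("你好世界"))            # 4
print(count_units("故宫 hello world"))   # 4
```

With this rule an English paragraph no longer outscores a Chinese paragraph of similar length. Deciding whether a page is "mainly Chinese" before switching rules is left out of the sketch.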

Attachment
Page source; change the file extension to html to view it.
gugong_detail.txt
